SQL Console: Query Hugging Face datasets in your browser
You can run SQL queries on the dataset in the browser using the SQL Console. The SQL Console is powered by DuckDB WASM and runs entirely in the browser. You can access the SQL Console from the dataset page by clicking on the SQL Console badge.
To learn more about the SQL Console, see the SQL Console blog post.
Through the SQL Console, you can:
- Run DuckDB SQL queries on the dataset (checkout SQL Snippets for useful queries)
- Share results of the query with others via a link (check out this example)
- Download the results of the query to a parquet file
- Embed the results of the query in your own webpage using an iframe
Examples
Filtering
The SQL Console makes filtering datasets really easy. For example, if you want to filter the SkunkworksAI/reasoning-0.01
dataset for instructions and responses with a reasoning length of at least 10, you can use the following query:
In the query, we can use the len
function to get the length of the reasoning_chains
column and the bar
function to create a bar chart of the reasoning lengths.
SELECT len(reasoning_chains) AS reason_len, bar(reason_len, 0, 100), *
FROM train
WHERE reason_len > 10
ORDER BY reason_len DESC
The bar function is a neat built-in DuckDB function that creates a bar chart of the reasoning lengths.
Histogram
Many dataset authors choose to include statistics about the distribution of the data in the dataset. Using the DuckDB histogram
function, we can plot a histogram of a column’s values.
For example, to plot a histogram of the reason_len
column in the SkunkworksAI/reasoning-0.01
dataset, you can use the following query:
Learn more about the `histogram` function and parameters here.
FROM histogram(train, len(reasoning_chains))
Regex Matching
One of the most powerful features of DuckDB is the deep support for regular expressions. You can use the regexp
function to match patterns in your data.
Using the regexp_matches function, we can filter the SkunkworksAI/reasoning-0.01
dataset for instructions that contain markdown code blocks.
Learn more about the DuckDB regex functions here.
SELECT *
FROM train
WHERE regexp_matches(instruction, '```[a-z]*\n')
limit 100
Leakage Detection
Leakage detection is the process of identifying whether data in a dataset is present in multiple splits, for example, whether the test set is present in the training set.
Learn more about leakage detection here.
WITH
overlapping_rows AS (
SELECT COALESCE(
(SELECT COUNT(*) AS overlap_count
FROM train
INTERSECT
SELECT COUNT(*) AS overlap_count
FROM test),
0
) AS overlap_count
),
total_unique_rows AS (
SELECT COUNT(*) AS total_count
FROM (
SELECT * FROM train
UNION
SELECT * FROM test
) combined
)
SELECT
overlap_count,
total_count,
CASE
WHEN total_count > 0 THEN (overlap_count * 100.0 / total_count)
ELSE 0
END AS overlap_percentage
FROM overlapping_rows, total_unique_rows;