Query datasets
Querying datasets is a fundamental step in data analysis. Here, weβll guide you through querying datasets using various methods.
There are several ways to select your data.
Using the FROM
syntax:
FROM 'hf://datasets/jamescalam/world-cities-geo/train.jsonl' SELECT city, country, region LIMIT 3;
ββββββββββββββββββ¬ββββββββββββββ¬ββββββββββββββββ
β city β country β region β
β varchar β varchar β varchar β
ββββββββββββββββββΌββββββββββββββΌββββββββββββββββ€
β Kabul β Afghanistan β Southern Asia β
β Kandahar β Afghanistan β Southern Asia β
β Mazar-e Sharif β Afghanistan β Southern Asia β
ββββββββββββββββββ΄ββββββββββββββ΄ββββββββββββββββ
Using the SELECT
and FROM
syntax:
SELECT city, country, region FROM 'hf://datasets/jamescalam/world-cities-geo/train.jsonl' USING SAMPLE 3;
ββββββββββββ¬ββββββββββ¬βββββββββββββββββ
β city β country β region β
β varchar β varchar β varchar β
ββββββββββββΌββββββββββΌβββββββββββββββββ€
β Wenzhou β China β Eastern Asia β
β Valdez β Ecuador β South America β
β Aplahoue β Benin β Western Africa β
ββββββββββββ΄ββββββββββ΄βββββββββββββββββ
Count all JSONL files matching a glob pattern:
SELECT COUNT(*) FROM 'hf://datasets/jamescalam/world-cities-geo/*.jsonl';
ββββββββββββββββ
β count_star() β
β int64 β
ββββββββββββββββ€
β 9083 β
ββββββββββββββββ
You can also query Parquet files using the read_parquet
function (or its alias parquet_scan
). This function, along with other parameters, provides flexibility in handling Parquet files specially if they dont have a .parquet
extension. Letβs explore these functions using the auto-converted Parquet files from the same dataset.
Select using read_parquet function:
SELECT * FROM read_parquet('hf://datasets/jamescalam/world-cities-geo@~parquet/default/**/*.parquet') LIMIT 3;
ββββββββββββββββββ¬ββββββββββββββ¬ββββββββββββββββ¬ββββββββββββ¬βββββββββββββ¬βββββββββββββ¬βββββββββββββββββββββ¬ββββββββββββββββββββ¬βββββββββββββββββββββ
β city β country β region β continent β latitude β longitude β x β y β z β
β varchar β varchar β varchar β varchar β double β double β double β double β double β
ββββββββββββββββββΌββββββββββββββΌββββββββββββββββΌββββββββββββΌβββββββββββββΌβββββββββββββΌβββββββββββββββββββββΌββββββββββββββββββββΌβββββββββββββββββββββ€
β Kabul β Afghanistan β Southern Asia β Asia β 34.5166667 β 69.1833344 β 1865.546409629258 β 4906.785732164055 β 3610.1012966606136 β
β Kandahar β Afghanistan β Southern Asia β Asia β 31.61 β 65.6999969 β 2232.782351694877 β 4945.064042683584 β 3339.261233224765 β
β Mazar-e Sharif β Afghanistan β Southern Asia β Asia β 36.7069444 β 67.1122208 β 1986.5057687360124 β 4705.51748048584 β 3808.088900172991 β
ββββββββββββββββββ΄ββββββββββββββ΄ββββββββββββββββ΄ββββββββββββ΄βββββββββββββ΄βββββββββββββ΄βββββββββββββββββββββ΄ββββββββββββββββββββ΄βββββββββββββββββββββ
Read all files that match a glob pattern and include a filename column specifying which file each row came from:
SELECT city, country, filename FROM read_parquet('hf://datasets/jamescalam/world-cities-geo@~parquet/default/**/*.parquet', filename = true) LIMIT 3;
ββββββββββββββββββ¬ββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β city β country β filename β
β varchar β varchar β varchar β
ββββββββββββββββββΌββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Kabul β Afghanistan β hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet β
β Kandahar β Afghanistan β hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet β
β Mazar-e Sharif β Afghanistan β hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet β
ββββββββββββββββββ΄ββββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Get metadata and schema
The parquet_metadata function can be used to query the metadata contained within a Parquet file.
SELECT * FROM parquet_metadata('hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet');
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ¬βββββββββββββββ¬βββββββββββββββββββββ¬ββββββββββββββ
β file_name β row_group_id β row_group_num_rows β compression β
β varchar β int64 β int64 β varchar β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββΌβββββββββββββββΌβββββββββββββββββββββΌββββββββββββββ€
β hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet β 0 β 1000 β SNAPPY β
β hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet β 0 β 1000 β SNAPPY β
β hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet β 0 β 1000 β SNAPPY β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ΄βββββββββββββββ΄βββββββββββββββββββββ΄ββββββββββββββ
Fetch the column names and column types:
DESCRIBE SELECT * FROM 'hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet';
βββββββββββββββ¬ββββββββββββββ¬ββββββββββ¬ββββββββββ¬ββββββββββ¬ββββββββββ
β column_name β column_type β null β key β default β extra β
β varchar β varchar β varchar β varchar β varchar β varchar β
βββββββββββββββΌββββββββββββββΌββββββββββΌββββββββββΌββββββββββΌββββββββββ€
β city β VARCHAR β YES β β β β
β country β VARCHAR β YES β β β β
β region β VARCHAR β YES β β β β
β continent β VARCHAR β YES β β β β
β latitude β DOUBLE β YES β β β β
β longitude β DOUBLE β YES β β β β
β x β DOUBLE β YES β β β β
β y β DOUBLE β YES β β β β
β z β DOUBLE β YES β β β β
βββββββββββββββ΄ββββββββββββββ΄ββββββββββ΄ββββββββββ΄ββββββββββ΄ββββββββββ
Fetch the internal schema (excluding the file name):
SELECT * EXCLUDE (file_name) FROM parquet_schema('hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet');
βββββββββββββ¬βββββββββββββ¬ββββββββββββββ¬ββββββββββββββββββ¬βββββββββββββββ¬βββββββββββββββββ¬ββββββββ¬ββββββββββββ¬βββββββββββ¬βββββββββββββββ
β name β type β type_length β repetition_type β num_children β converted_type β scale β precision β field_id β logical_type β
β varchar β varchar β varchar β varchar β int64 β varchar β int64 β int64 β int64 β varchar β
βββββββββββββΌβββββββββββββΌββββββββββββββΌββββββββββββββββββΌβββββββββββββββΌβββββββββββββββββΌββββββββΌββββββββββββΌβββββββββββΌβββββββββββββββ€
β schema β β β REQUIRED β 9 β β β β β β
β city β BYTE_ARRAY β β OPTIONAL β β UTF8 β β β β StringType() β
β country β BYTE_ARRAY β β OPTIONAL β β UTF8 β β β β StringType() β
β region β BYTE_ARRAY β β OPTIONAL β β UTF8 β β β β StringType() β
β continent β BYTE_ARRAY β β OPTIONAL β β UTF8 β β β β StringType() β
β latitude β DOUBLE β β OPTIONAL β β β β β β β
β longitude β DOUBLE β β OPTIONAL β β β β β β β
β x β DOUBLE β β OPTIONAL β β β β β β β
β y β DOUBLE β β OPTIONAL β β β β β β β
β z β DOUBLE β β OPTIONAL β β β β β β β
βββββββββββββ΄βββββββββββββ΄ββββββββββββββ΄ββββββββββββββββββ΄βββββββββββββββ΄βββββββββββββββββ΄ββββββββ΄ββββββββββββ΄βββββββββββ΄βββββββββββββββ€
Get statistics
The SUMMARIZE
command can be used to get various aggregates over a query (min, max, approx_unique, avg, std, q25, q50, q75, count). It returns these statistics along with the column name, column type, and the percentage of NULL values.
SUMMARIZE SELECT latitude, longitude FROM 'hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet';
βββββββββββββββ¬ββββββββββββββ¬βββββββββββββββ¬ββββββββββββββ¬ββββββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ¬ββββββββ¬ββββββββββββββββββ
β column_name β column_type β min β max β approx_unique β avg β std β q25 β q50 β q75 β count β null_percentage β
β varchar β varchar β varchar β varchar β int64 β varchar β varchar β varchar β varchar β varchar β int64 β decimal(9,2) β
βββββββββββββββΌββββββββββββββΌβββββββββββββββΌββββββββββββββΌββββββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββΌββββββββΌββββββββββββββββββ€
β latitude β DOUBLE β -54.8 β 67.8557214 β 7324 β 22.5004568364307 β 26.770454684690925 β 6.089858461951687 β 29.321258648324747 β 44.90191158328915 β 9083 β 0.00 β
β longitude β DOUBLE β -175.2166595 β 179.3833313 β 7802 β 14.699333721953098 β 63.93672742608224 β -6.877990418604821 β 19.12963979385393 β 43.873513093419966 β 9083 β 0.00 β
βββββββββββββββ΄ββββββββββββββ΄βββββββββββββββ΄ββββββββββββββ΄ββββββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ΄ββββββββ΄ββββββββββββββββββ