Combine datasets and export
In this section, we’ll demonstrate how to combine two datasets and export the result. The first dataset is in CSV format, and the second dataset is in Parquet format. Let’s start by examining our datasets:
The first will be TheFusion21/PokemonCards:
FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' LIMIT 3;
┌─────────┬──────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────┬───────┬─────────────────┐
│   id    │      image_url       │                                                                caption                                                                 │    name    │  hp   │    set_name     │
│ varchar │       varchar        │                                                                varchar                                                                 │  varchar   │ int64 │     varchar     │
├─────────┼──────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┼───────┼─────────────────┤
│ pl3-1   │ https://images.pok…  │ A Basic, SP Pokemon Card of type Darkness with the title Absol G and 70 HP of rarity Rare Holo from the set Supreme Victors. It has …  │ Absol G    │    70 │ Supreme Victors │
│ ex12-1  │ https://images.pok…  │ A Stage 1 Pokemon Card of type Colorless with the title Aerodactyl and 70 HP of rarity Rare Holo evolved from Mysterious Fossil from … │ Aerodactyl │    70 │ Legend Maker    │
│ xy5-1   │ https://images.pok…  │ A Basic Pokemon Card of type Grass with the title Weedle and 50 HP of rarity Common from the set Primal Clash and the flavor text: It… │ Weedle     │    50 │ Primal Clash    │
└─────────┴──────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────┴───────┴─────────────────┘
And the second one will be wanghaofan/pokemon-wiki-captions:
FROM 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' LIMIT 3;
┌──────────────────────┬───────────┬──────────┬─────────────────────────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────┐
│        image         │  name_en  │ name_zh  │                           text_en                           │                                              text_zh                                               │
│ struct(bytes blob,…  │  varchar  │ varchar  │                           varchar                           │                                              varchar                                               │
├──────────────────────┼───────────┼──────────┼─────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {'bytes': \x89PNG\…  │ abomasnow │ 暴雪王   │ Grass attributes,Blizzard King standing on two feet, with … │ 草属性，双脚站立的暴雪王，全身白色的绒毛，淡紫色的眼睛，几缕长条装的毛皮盖着它的嘴巴               │
│ {'bytes': \x89PNG\…  │ abra      │ 凯西     │ Super power attributes, the whole body is yellow, the head… │ 超能力属性，通体黄色，头部外形类似狐狸，尖尖鼻子，手和脚上都有三个指头，长尾巴末端带着一个褐色圆环 │
│ {'bytes': \x89PNG\…  │ absol     │ 阿勃梭鲁 │ Evil attribute, with white hair, blue-gray part without ha… │ 恶属性，有白色毛发，没毛发的部分是蓝灰色，头右边类似弓的角，红色眼睛                               │
└──────────────────────┴───────────┴──────────┴─────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────┘
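Now, let’s combine these two datasets by joining on the Pokémon name. The wiki dataset stores its names in lowercase (name_en), so the join condition lowercases the name column of the cards dataset: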
SELECT a.image_url
     , a.caption AS card_caption
     , a.name
     , a.hp
     , b.text_en AS wiki_caption
FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' a
JOIN 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' b
  ON LOWER(a.name) = b.name_en
LIMIT 3;
┌──────────────────────┬──────────────────────┬────────────┬───────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│      image_url       │     card_caption     │    name    │  hp   │                                                                wiki_caption                                                                 │
│       varchar        │       varchar        │  varchar   │ int64 │                                                                   varchar                                                                   │
├──────────────────────┼──────────────────────┼────────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ https://images.pok…  │ A Stage 1 Pokemon …  │ Aerodactyl │    70 │ A Pokémon with rock attributes, gray body, blue pupils, purple inner wings, two sharp claws on the wings, jagged teeth, and an arrow-like … │
│ https://images.pok…  │ A Basic Pokemon Ca…  │ Weedle     │    50 │ Insect-like, caterpillar-like in appearance, with a khaki-yellow body, seven pairs of pink gastropods, a pink nose, a sharp poisonous need… │
│ https://images.pok…  │ A Basic Pokemon Ca…  │ Caterpie   │    50 │ Insect attributes, caterpillar appearance, green back, white abdomen, Y-shaped red antennae on the head, yellow spindle-shaped tail, two p… │
└──────────────────────┴──────────────────────┴────────────┴───────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
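The join condition lowercases the card name because the wiki dataset stores its names in lowercase. As a rough illustration of that behaviour, the sketch below reproduces the idea on a few made-up rows. It uses Python's built-in sqlite3 module rather than DuckDB and the real datasets, so the table names `cards` and `wiki` and the placeholder captions are assumptions for this example only.

```python
import sqlite3

# In-memory toy tables standing in for the two datasets (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cards (name TEXT, hp INTEGER)")
conn.execute("CREATE TABLE wiki (name_en TEXT, text_en TEXT)")
conn.executemany(
    "INSERT INTO cards VALUES (?, ?)",
    [("Weedle", 50), ("Absol G", 70), ("Aerodactyl", 70)],
)
conn.executemany(
    "INSERT INTO wiki VALUES (?, ?)",
    [
        ("weedle", "placeholder caption"),
        ("absol", "placeholder caption"),
        ("aerodactyl", "placeholder caption"),
    ],
)


def count_matches(condition: str) -> int:
    """Count the rows produced by an inner join on the given condition."""
    query = f"SELECT COUNT(*) FROM cards a JOIN wiki b ON {condition}"
    return conn.execute(query).fetchone()[0]


# Comparing the raw names finds nothing, because the case differs.
exact_matches = count_matches("a.name = b.name_en")
# Lowercasing the card name first lets 'Weedle' match 'weedle', and so on.
lowercased_matches = count_matches("LOWER(a.name) = b.name_en")

print(exact_matches, lowercased_matches)  # prints: 0 2
```

Note that "Absol G" still has no counterpart after lowercasing, so the inner join drops it. The same applies to the real query: only cards whose lowercased name equals a name_en value appear in the result.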
We can export the result to a Parquet file using the COPY command:
COPY (SELECT a.image_url
           , a.caption AS card_caption
           , a.name
           , a.hp
           , b.text_en AS wiki_caption
      FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' a
      JOIN 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' b
        ON LOWER(a.name) = b.name_en)
TO 'output.parquet' (FORMAT PARQUET);
Let’s validate the new Parquet file:
SELECT COUNT(*) FROM 'output.parquet';
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│         9460 │
└──────────────┘
Finally, let’s push the resulting dataset to the Hub. You can upload your Parquet file through the Hub UI, the huggingface_hub client library, or other tools; see the Hub documentation on uploading datasets for more information.
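For the huggingface_hub route, a minimal sketch might look like the following. It assumes the library is installed and that you are authenticated (for example through a saved login or a token). The helper name `upload_parquet` and the repository name in the usage comment are hypothetical placeholders, not something defined by this guide.

```python
from typing import Optional


def upload_parquet(local_path: str, repo_id: str, token: Optional[str] = None) -> None:
    """Upload a local Parquet file to a dataset repository on the Hub."""
    # Imported inside the function so this snippet can be loaded
    # even in environments where huggingface_hub is not installed.
    from huggingface_hub import HfApi

    api = HfApi(token=token)
    # Create the dataset repository if it does not exist yet.
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    # Upload the file under the same name at the root of the repository.
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo="output.parquet",
        repo_id=repo_id,
        repo_type="dataset",
    )


# Example call (placeholder repository name, requires network access and credentials):
# upload_parquet("output.parquet", "your-username/pokemon-combined")
```

Passing `repo_type="dataset"` matters here, since the Hub otherwise treats the target as a model repository.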
And that’s it! You’ve successfully combined two datasets, exported the result, and uploaded it to the Hugging Face Hub.