Explore, Curate and Vector Search Any Hugging Face Dataset with Nomic Atlas
The best way to improve AI systems is to explore and refine the data that powers them. We're excited to announce Nomic's official data connector to Hugging Face Datasets.
The AI community has uploaded many fascinating datasets to Hugging Face, contributed by researchers, developers, and hobbyists working on AI training and evaluation. Nomic's official Hugging Face connector lets you import, explore, and curate any of these Hugging Face datasets in Atlas with only a few clicks. This makes it easy for anyone to see what's in these datasets, to create embeddings from them, and to search and organize them in new ways.
Importing Hugging Face Datasets into Atlas
When creating a new dataset in Atlas, you can choose Connectors as your upload option.
This will show you the list of available data connectors that integrate directly with the Atlas data upload pipeline.
Here's a video showing how to use the integration to connect directly with Hugging Face and upload one of our example datasets to Atlas.
First, select the dataset you want to import, either by clicking on one of our recommended example datasets, or searching for any dataset on Hugging Face.
We load the Datasets Viewer data preview from Hugging Face directly on the Atlas upload page so you can preview the data before uploading:
Next, choose the field to embed: this is the column that determines how the data gets arranged into semantically related clusters in the Atlas data map. We automatically select the best field for embedding from the dataset you choose, but you can pick a different one.
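To make the idea of "choosing a field to embed" concrete, here is a minimal sketch of one possible heuristic: pick the text column with the longest average value. This is an assumption for illustration only, not how Atlas actually selects the field, and the `pick_text_field` helper and the sample rows are hypothetical.

```python
from statistics import mean


def pick_text_field(rows):
    """Return the all-string column with the longest average text length.

    Hypothetical heuristic for illustration; not Atlas' actual selection logic.
    """
    if not rows:
        raise ValueError("rows must be non-empty")
    scores = {}
    for column in rows[0]:
        values = [row.get(column) for row in rows]
        # Only consider columns where every value is a string.
        if all(isinstance(value, str) for value in values):
            scores[column] = mean(len(value) for value in values)
    if not scores:
        raise ValueError("no all-string column found")
    return max(scores, key=scores.get)


# Made-up rows shaped like a movie review dataset.
rows = [
    {"id": "r1", "label": 0, "text": "a sprawling, overlong epic that never finds its footing"},
    {"id": "r2", "label": 1, "text": "charming and brisk, with a terrific lead performance"},
]

best_field = pick_text_field(rows)
print(best_field)
```

A rule like this tends to favor free-text columns over short IDs and numeric labels, which is usually what you want to embed, but it is always worth checking the choice by hand.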
Then, give the dataset a name and an optional description.
Finally, click Create Dataset and you're done! The data will be ingested into Atlas, and you'll get an email when your data map is ready!
What can you do with the Hugging Face Datasets connector to Nomic Atlas?
With Atlas, you can:
• Explore an entire Hugging Face dataset in a data map.
• Generate and download embeddings from any dataset.
• Analyze datasets with powerful tools like vector search and topic modeling.
• Easily deduplicate your Hugging Face datasets.
• Go multiplayer with tagging, data collaboration, and share links.
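As a rough illustration of what embedding-based deduplication means, the sketch below keeps an item only if its embedding is not too similar to one already kept. The two-dimensional vectors, the 0.95 threshold, and the `deduplicate` helper are all assumptions made up for this example; they are not Atlas' implementation or its defaults.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(items, threshold=0.95):
    """Greedily keep items whose embedding is below `threshold` similarity
    to every item kept so far. Hypothetical helper for illustration."""
    kept = []
    for text, vec in items:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return [text for text, _ in kept]


# Toy (text, embedding) pairs; real embeddings have hundreds of dimensions.
items = [
    ("great movie", [1.0, 0.0]),
    ("great movie!", [0.99, 0.05]),
    ("terrible pacing", [0.0, 1.0]),
]

unique_texts = deduplicate(items)
print(unique_texts)
```

The greedy pass compares each item against everything kept so far, which is fine for a toy example but would need an approximate nearest-neighbor index at the scale of a real dataset.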
Here are some incredible public datasets available on Hugging Face that you can now import into Atlas with just a few clicks for exploration.
Rotten Tomatoes movie reviews
The video above showed us preparing this dataset of Rotten Tomatoes movie reviews for Atlas.
Here's what it looks like to explore this dataset once the upload to Atlas & embedding of the review texts finishes (takes a few minutes for these 50k points).
In this clip, we're showing what it looks like to perform a vector search for the query "this film could have been a lot shorter" (because, let's be honest, a lot of movies these days could be a lot shorter).
We can then use the Atlas UI to zoom in on a cluster of reviews that are semantically related to our query!
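Under the hood, a vector search ranks items by how close their embeddings are to the embedding of the query. Here is a minimal sketch of that idea using cosine similarity. The three-dimensional vectors and review snippets are invented for illustration (real embeddings come from an embedding model and are much larger), and `vector_search` is a hypothetical helper, not an Atlas API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_search(query_vec, items, k=2):
    """Return the texts of the k items most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy (text, embedding) pairs standing in for embedded reviews.
reviews = [
    ("drags on for an extra forty minutes", [0.9, 0.1, 0.0]),
    ("tight, fast-paced thriller", [-0.8, 0.2, 0.1]),
    ("would have worked better as a short film", [0.8, 0.3, 0.1]),
    ("gorgeous cinematography", [0.0, 0.1, 0.9]),
]

# Pretend this is the embedding of "this film could have been a lot shorter".
query = [1.0, 0.2, 0.0]

results = vector_search(query, reviews, k=2)
print(results)
```

With these made-up vectors, the two reviews complaining about length rank highest, which mirrors how the query in the clip lands on a cluster of "too long" reviews.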
US Public Domain newspaper articles
This dataset is a sample from the archives of US newspapers digitized by the Library of Congress for the Chronicling America digital library, available on Hugging Face here.
As of January 2024, the collection contains nearly 21 million unique newspaper and periodical editions published from 1690 to 1963! Here we're just exploring a subset of 50k rows:
The text for each data point was created with OCR (Optical Character Recognition) from scans of the newspaper images. As a result, each OCR text may not perfectly reflect the text of the original article: some typos have been introduced into the dataset.
We can use the clustering that Atlas performs to easily recognize which points likely contain typos, and use tagging in Atlas to mark them.
In this clip, we're zooming in on a cluster that Atlas labeled Housing. Taking a closer look, we can see each article is from the Classified section of the paper where people were advertising their homes and home appliances for rent or sale.
Atlas grouped these points together based on the similar semantic content in each text, even though some of the points contain typos of "Classified". We can identify them as such with an OCR-typo tag to mark them for data cleaning later on.
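If you later want to find these points programmatically, for example after exporting the tagged data, a fuzzy string match is one simple option. The sketch below uses Python's standard `difflib` to flag texts containing a near-miss spelling of "classified". The sample articles, the 0.8 cutoff, and the `has_ocr_variant` helper are assumptions for illustration; this is not how Atlas clusters or tags data.

```python
import difflib
import re


def has_ocr_variant(text, target="classified", cutoff=0.8):
    """Return True if any word in `text` is close to, but not exactly,
    `target`. Hypothetical helper for illustration."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    for word in words:
        if word == target:
            continue  # A correct spelling is not a typo.
        ratio = difflib.SequenceMatcher(None, word, target).ratio()
        if ratio >= cutoff:
            return True
    return False


# Made-up snippets in the style of OCR'd classified ads.
articles = [
    "CLASSIFIED: house for rent, four rooms",
    "Clasified. Gas stove for sale, nearly new",
    "C1assified advertisements: rooms to let",
    "The senate met on Tuesday to debate the tariff",
]

# Attach an "OCR-typo" tag to every article with a near-miss spelling.
tags = [["OCR-typo"] if has_ocr_variant(text) else [] for text in articles]
print(tags)
```

A word-level check like this only catches typos in one known word; the advantage of clustering on embeddings is that it groups such articles together without needing to know in advance which word was garbled.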
OpenAssistant Conversations
The OpenAssistant dataset was the product of a massive global crowd-sourced effort facilitated by the LAION non-profit organization to collect and open-source a large multilingual dataset for fine-tuning chat assistants in the early days of LLM research (spring 2023, eons ago in AI-timescales). You can read their research paper for the project on arXiv.
The dataset contains conversations in many languages, including English, Spanish, Russian, German, Chinese, French, Thai, Portuguese (Brazil), Catalan, Korean, Ukrainian, Italian, and Japanese!
So let's choose the multilingual embedding model option when uploading to Atlas. This option uses the gte-multilingual-base model from Alibaba to assign embedding vectors to text, which should capture the similarity of content regardless of the language used.
Once the data map is ready, we can explore the chat conversations grouped into clusters that contain text talking about similar concepts across all the different supported languages.
For example, we can perform a vector search for math and find a cluster of chat responses discussing calculus in English, Spanish, French, and Russian, all situated near each other.
Conclusion
The new Hugging Face integration with Nomic Atlas lets more people get real value from massive AI datasets with just a few clicks. Atlas makes it easier for users of all backgrounds to accomplish important data processing and analysis workflows in minutes. With it, you can:
• Spot data quality issues revealed by Atlas' visual clustering
• Curate data with semantic deduplication and collaborative tagging
• Search over millions of points for semantically related data
• Generate, explore, and export vector embeddings seamlessly
• Share interactive data maps with your team
Head over to Atlas, sign up for a free account, and try the Hugging Face integration for yourself. We can't wait to see what insights you'll discover!