Polars

Polars is an in-memory DataFrame library on top of an OLAP query engine. It is fast, easy to use, and open source.

Starting from version 1.2.0, Polars provides native support for the Hugging Face file system. As a result, the Polars query optimizer (e.g. predicate and projection pushdown) applies to data on the Hub, and Polars loads only the data needed to complete the query. This significantly speeds up reading, especially for large datasets (see optimizations).

You can use Hugging Face paths (hf://) to access data on the Hub.

Getting started

To get started, you can simply pip install Polars into your environment:

pip install polars

Once you have installed Polars, you can directly query a dataset based on a Hugging Face URL. No other dependencies are needed for this.

import polars as pl

pl.read_parquet("hf://datasets/roneneldan/TinyStories/data/train-00000-of-00004-2d5a1467fff1081b.parquet")

Polars provides two APIs: a lazy API (scan_parquet) and an eager API (read_parquet). We recommend the eager API for interactive workloads and the lazy API when performance matters, because the lazy API lets Polars optimize the whole query before any data is read. For more information on the topic, check out the Polars user guide.

Polars supports globbing, so you can read multiple files at once into a single DataFrame.

pl.read_parquet("hf://datasets/roneneldan/TinyStories/data/train-*.parquet")

Hugging Face URLs

A Hugging Face URL can be constructed from the username and dataset name like this:

hf://datasets/{username}/{dataset}/{path_to_file}

The path may include globbing patterns such as **/*.parquet to query all the files matching the pattern. Additionally, for unsupported file formats you can query the auto-converted Parquet files that Hugging Face provides on the @~parquet branch:

hf://datasets/{username}/{dataset}@~parquet/{path_to_file}
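The two URL forms above can be assembled with a small helper. This is only a sketch: build_hf_dataset_url and its revision parameter are hypothetical names introduced here, not part of Polars or any Hugging Face library.

```python
from typing import Optional


def build_hf_dataset_url(
    username: str,
    dataset: str,
    path_to_file: str,
    revision: Optional[str] = None,
) -> str:
    """Build an hf:// dataset URL (hypothetical helper for illustration).

    `revision` is appended after "@", e.g. "~parquet" for the
    auto-converted Parquet branch described above.
    """
    repo = f"{username}/{dataset}"
    if revision is not None:
        repo = f"{repo}@{revision}"
    # Strip a leading slash so the path joins cleanly.
    return f"hf://datasets/{repo}/{path_to_file.lstrip('/')}"


# Plain dataset path with a glob pattern.
train_url = build_hf_dataset_url("roneneldan", "TinyStories", "data/train-*.parquet")

# Same dataset, pointing at the auto-converted Parquet branch.
parquet_branch_url = build_hf_dataset_url(
    "roneneldan", "TinyStories", "**/*.parquet", revision="~parquet"
)
```

The resulting strings can be passed directly to pl.read_parquet or pl.scan_parquet.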
