Transforming your dataset
On this page we'll guide you through some of the most common operations used when doing data analysis. This is only a small subset of what's possible in Polars. For more information, please visit the Documentation.
For the examples we will use the Common Crawl statistics dataset. These statistics include the number of pages, the distribution of top-level domains, crawl overlaps, and more. For more detailed information and graphs, please visit their official statistics page.
Reading
import polars as pl
df = pl.read_csv(
    "hf://datasets/commoncrawl/statistics/tlds.csv",
    try_parse_dates=True,
)
df.head(3)
┌─────┬────────┬───────────────────┬────────────┬───┬───────┬──────┬───────┬─────────┐
│     ┆ suffix ┆ crawl             ┆ date       ┆ … ┆ pages ┆ urls ┆ hosts ┆ domains │
│ --- ┆ ---    ┆ ---               ┆ ---        ┆   ┆ ---   ┆ ---  ┆ ---   ┆ ---     │
│ i64 ┆ str    ┆ str               ┆ date       ┆   ┆ i64   ┆ i64  ┆ f64   ┆ f64     │
╞═════╪════════╪═══════════════════╪════════════╪═══╪═══════╪══════╪═══════╪═════════╡
│ 0   ┆ a.se   ┆ CC-MAIN-2008-2009 ┆ 2009-01-12 ┆ … ┆ 18    ┆ 18   ┆ 1.0   ┆ 1.0     │
│ 1   ┆ a.se   ┆ CC-MAIN-2009-2010 ┆ 2010-09-25 ┆ … ┆ 3462  ┆ 3259 ┆ 166.0 ┆ 151.0   │
│ 2   ┆ a.se   ┆ CC-MAIN-2012      ┆ 2012-11-02 ┆ … ┆ 6957  ┆ 6794 ┆ 172.0 ┆ 150.0   │
└─────┴────────┴───────────────────┴────────────┴───┴───────┴──────┴───────┴─────────┘
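For large files you can also scan the dataset lazily, deferring the work until you actually collect a result. A minimal sketch, assuming scan_csv accepts the same hf:// path as read_csv:
import polars as pl

# Lazily scan the CSV; no work happens until .collect() is called
lf = pl.scan_csv(
    "hf://datasets/commoncrawl/statistics/tlds.csv",
    try_parse_dates=True,
)
df = lf.head(3).collect()  # materialize only the first three rows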
Selecting columns
The dataset contains some columns we don't need. To remove them, we will use the select method:
df = df.select("suffix", "crawl", "date", "tld", "pages", "domains")
df.head(3)
┌────────┬───────────────────┬────────────┬─────┬───────┬─────────┐
│ suffix ┆ crawl             ┆ date       ┆ tld ┆ pages ┆ domains │
│ ---    ┆ ---               ┆ ---        ┆ --- ┆ ---   ┆ ---     │
│ str    ┆ str               ┆ date       ┆ str ┆ i64   ┆ f64     │
╞════════╪═══════════════════╪════════════╪═════╪═══════╪═════════╡
│ a.se   ┆ CC-MAIN-2008-2009 ┆ 2009-01-12 ┆ se  ┆ 18    ┆ 1.0     │
│ a.se   ┆ CC-MAIN-2009-2010 ┆ 2010-09-25 ┆ se  ┆ 3462  ┆ 151.0   │
│ a.se   ┆ CC-MAIN-2012      ┆ 2012-11-02 ┆ se  ┆ 6957  ┆ 150.0   │
└────────┴───────────────────┴────────────┴─────┴───────┴─────────┘
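select also accepts expressions and selectors rather than plain column names. As a sketch, the polars.selectors module (shipped with Polars) can pull out all numeric columns at once:
import polars.selectors as cs

# Keep only the numeric columns; selectors compose like expressions
numeric_df = df.select(cs.numeric())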
Filtering
We can filter the dataset using the filter method. This method accepts complex expressions, but let's start simple by filtering based on the crawl date:
import datetime
df = df.filter(pl.col("date") >= datetime.date(2020, 1, 1))
You can combine multiple predicates with the & or | operators:
df = df.filter(
    (pl.col("date") >= datetime.date(2020, 1, 1)) |
    pl.col("crawl").str.contains("CC")
)
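For range predicates, is_between can replace a pair of comparisons. A minimal sketch; the date bounds and the df_2020 name are illustrative:
# Keep rows whose crawl date falls within 2020 (bounds are illustrative)
df_2020 = df.filter(
    pl.col("date").is_between(datetime.date(2020, 1, 1), datetime.date(2020, 12, 31))
)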
Transforming
To add new columns to the dataset, use with_columns. In the example below we calculate the average number of pages per domain and store it in a new column, pages_per_domain, named with the alias method. The entire statement within with_columns is called an expression. Read more about expressions and how to use them in the Polars user guide.
df = df.with_columns(
    (pl.col("pages") / pl.col("domains")).alias("pages_per_domain")
)
df.sample(3)
┌────────┬─────────────────┬────────────┬─────┬───────┬─────────┬──────────────────┐
│ suffix ┆ crawl           ┆ date       ┆ tld ┆ pages ┆ domains ┆ pages_per_domain │
│ ---    ┆ ---             ┆ ---        ┆ --- ┆ ---   ┆ ---     ┆ ---              │
│ str    ┆ str             ┆ date       ┆ str ┆ i64   ┆ f64     ┆ f64              │
╞════════╪═════════════════╪════════════╪═════╪═══════╪═════════╪══════════════════╡
│ net.bt ┆ CC-MAIN-2014-41 ┆ 2014-10-06 ┆ bt  ┆ 4     ┆ 1.0     ┆ 4.0              │
│ org.mk ┆ CC-MAIN-2016-44 ┆ 2016-10-31 ┆ mk  ┆ 1445  ┆ 430.0   ┆ 3.360465         │
│ com.lc ┆ CC-MAIN-2016-44 ┆ 2016-10-31 ┆ lc  ┆ 1     ┆ 1.0     ┆ 1.0              │
└────────┴─────────────────┴────────────┴─────┴───────┴─────────┴──────────────────┘
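with_columns accepts any number of expressions, so several derived columns can be added in a single pass. A minimal sketch as a variant of the statement above; the log10_pages column and the df_extra variable are purely illustrative:
# Add two derived columns at once; an existing column with the same
# name would be overwritten
df_extra = df.with_columns(
    (pl.col("pages") / pl.col("domains")).alias("pages_per_domain"),
    pl.col("pages").log(10).alias("log10_pages"),  # illustrative
)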
Aggregation & Sorting
To aggregate data you can use the group_by, agg, and sort methods. Within the aggregation context you can combine expressions to create powerful statements that are still easy to read.
First, we aggregate all the data to the top-level domain tld per scraped date:
df = df.group_by("tld", "date").agg(
    pl.col("pages").sum(),
    pl.col("domains").sum(),
)
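Aggregations can mix arbitrary expressions; for instance, pl.len() (in recent Polars versions) counts the rows behind each group. A minimal sketch, assuming raw_df is a hypothetical copy of the frame as it stood before the aggregation above:
# raw_df: hypothetical copy of the pre-aggregation frame
counts = raw_df.group_by("tld", "date").agg(
    pl.col("pages").sum(),
    pl.len().alias("rows_in_group"),  # rows behind each (tld, date) group
)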
Now we can calculate several statistics per top-level domain:
- Number of unique scrape dates
- Average number of domains in the scraped period
- Average growth rate in terms of number of pages
df = df.group_by("tld").agg(
    pl.col("date").unique().count().alias("number_of_scrapes"),
    pl.col("domains").mean().alias("avg_number_of_domains"),
    pl.col("pages").sort_by("date").pct_change().mean().alias("avg_page_growth_rate"),
)
df = df.sort("avg_number_of_domains", descending=True)
df.head(10)
βββββββ¬ββββββββββββββββββββ¬ββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββ
β tld β number_of_scrapes β avg_number_of_domains β avg_percent_change_in_number_oβ¦ β
β --- β --- β --- β --- β
β str β u32 β f64 β f64 β
βββββββͺββββββββββββββββββββͺββββββββββββββββββββββββͺββββββββββββββββββββββββββββββββββ‘
β com β 101 β 1.9571e7 β 0.022182 β
β de β 101 β 1.8633e6 β 0.5232 β
β org β 101 β 1.5049e6 β 0.019604 β
β net β 101 β 1.5020e6 β 0.021002 β
β cn β 101 β 1.1101e6 β 0.281726 β
β ru β 101 β 1.0561e6 β 0.416303 β
β uk β 101 β 827453.732673 β 0.065299 β
β nl β 101 β 710492.623762 β 1.040096 β
β fr β 101 β 615471.594059 β 0.419181 β
β jp β 101 β 615391.455446 β 0.246162 β
βββββββ΄ββββββββββββββββββββ΄ββββββββββββββββββββββββ΄ββββββββββββββββββββββββββββββββββ
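sort can also order by several keys at once, with a per-key direction. A minimal sketch using the columns computed above:
# Order by domain count first, then break ties alphabetically by tld
df = df.sort(["avg_number_of_domains", "tld"], descending=[True, False])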