Data taken from kaggle
- Written in Rust (-> Fast and memory efficient)
- Relies on Arrow for Memory Mapping and column oriented data storage and access
- Offers a Python wrapper (pip installable, everything can be done in python)
- Easy to get started and use (does not require additional components like PySpark or Dask do)
- Multi-threaded compared to single-threaded numpy and pandas (Dask attempts to parallelize single threaded operations)
- (Semi-) Lazy execution
- Query optimizations (e.g. predicate pushdown, projection pushdown, β¦)
- Operations run in parallelizable contexts -> each column operation runs in parallel
- Proper NaN dtypes (unlike pandas where pd.NA is float)
- Distributes the work locally. For very big datasets this might be a limitation
=> Runs 5x-20x faster than pandas at 50%-75% lower memory consumption
H2O Database-like ops benchmark
Dota 2 Datensatz von Kaggle im format .csv
match.csv
(2.6MB) - Information about the played matches (50K matches)
match_id start_time duration tower_status_radiant ...
0 0 1446750112 2375 1982 ...
1 1 1446753078 2582 0 ...
2 2 1446764586 2716 256 ...
3 3 1446765723 3085 4 ...
cluster_regions.csv
(1KB) - Information about the clusters (geographic regions)
cluster region
0 111 US WEST
1 112 US WEST
2 113 US WEST
3 121 US EAST
purchase_log.csv
(289.8MB) - Informations about bought items (~18Mio transactions)
βββββββββββ¬βββββββ¬ββββββββββββββ¬βββββββββββ
β item_id β time β player_slot β match_id β
β --- β --- β --- β --- β
β i64 β i64 β i64 β i64 β
βββββββββββͺβββββββͺββββββββββββββͺβββββββββββ‘
β 44 β -81 β 0 β 0 β
βββββββββββΌβββββββΌββββββββββββββΌβββββββββββ€
β 29 β -63 β 0 β 0 β
βββββββββββΌβββββββΌββββββββββββββΌβββββββββββ€
β 43 β 6 β 0 β 0 β
βββββββββββ΄βββββββ΄ββββββββββββββ΄βββββββββββ
item_ids.csv
(3KB) - Informations about item names and mappings
item_id item_name
0 1 blink
1 2 blades_of_attack
2 3 broadsword
3 4 chainmail
players.csv
(126.9MB) - Informations about the players (500K entries, 10 for each match)
match_id account_id hero_id player_slot gold gold_spent ...
0 0 0 86 0 3261 10960 ...
1 0 1 51 1 2954 17760 ...
2 0 0 83 2 110 12195 ...
3 0 2 11 3 1179 22505 ...
Note: Anonymous users have the value of 0 for account_id
Compare polars and pandas using a set of typical operations (read data, join, groupby, filter, select, ...)
# API in some cases equal/similar to pandas
pl_match = pl.read_csv(f"{DATA_DIR}/match.csv")
match_with_region = pl_match.join(pl_cluster_regions, how="left", on="cluster").drop(
"cluster"
)
# Polars
purchases_with_item_names = (
pl_purchase_log.join(pl_item_id_names, how="left", on="item_id")
.drop("item_id")
.groupby(["match_id", "player_slot", "item_name"])
.agg(pl.col("time").list().keep_name()) # In other cases close to pyspark
)
# Pandas
purchases_with_item_names = (
pd.merge(pd_purchase_log, pd_item_id_names, how="left", on="item_id")
.drop(columns="item_id")
.groupby(["match_id", "player_slot", "item_name"])["time"]
.apply(list)
.reset_index()
)
Ο -> predicate pushdown (filter etc. on scan level)
Ο -> projection pushdown (column selection on scan level)
aggregate pushdown (aggregations on scan level)
# fetch takes a number of rows from the initial files and run the code
# Example: 18Mio entries in the largest file 'purchase_log'
# -> 100 Mio limit fetches all data
# -> 1 Mio limit fetches a fraction of the data
pl_final_fetch = pl_final_lazy.fetch(n_rows=int(1_000_000))
pl_final_fetch.shape
>>> (15305, 86)
df
ββββββββ¬ββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββ
β nrs β names β random β random2 β groups β
β --- β --- β --- β --- β --- β
β i64 β str β f64 β f64 β str β
ββββββββͺββββββββͺβββββββββββͺβββββββββββͺβββββββββ‘
β 1 β foo β 0.154163 β 0.900715 β A β
ββββββββΌββββββββΌβββββββββββΌβββββββββββΌβββββββββ€
β 2 β ham β 0.74 β 0.033421 β A β
ββββββββΌββββββββΌβββββββββββΌβββββββββββΌβββββββββ€
β 3 β spam β 0.263315 β 0.956949 β B β
ββββββββΌββββββββΌβββββββββββΌβββββββββββΌβββββββββ€
β null β foo β 0.533739 β 0.137209 β C β
ββββββββΌββββββββΌβββββββββββΌβββββββββββΌβββββββββ€
β 5 β foo β 0.014575 β 0.283828 β A β
ββββββββΌββββββββΌβββββββββββΌβββββββββββΌβββββββββ€
β 6 β spam β 0.918747 β 0.606083 β A β
ββββββββ΄ββββββββ΄βββββββββββ΄βββββββββββ΄βββββββββ
df[
[
pl.col("random").count().alias("count_method"),
pl.count("random").alias("count_function"),
]
]
ββββββββββββββββ¬βββββββββββββββββ
β count_method β count_function β
β --- β --- β
β u32 β u32 β
ββββββββββββββββͺβββββββββββββββββ‘
β 6 β 6 β
ββββββββββββββββ΄βββββββββββββββββ
Polars expressions such as sum() can be used in three different contexts.
- selection: df.select([..])
- groupby / aggregation: df.groupby(..).agg([..])
- hstack / add columns: df.with_columns([..])
# in any case there are multiple ways to select columns
df[
[
pl.sum("random").alias("sum_function"),
pl.sum(pl.Float64),
# pl.sum("^random.*$")
# pl.all().exclude(["nrs", "names", "groups"]).sum()
]
]
ββββββββββββββββ¬βββββββββββ¬βββββββββββ
β sum_function β random β random2 β
β --- β --- β --- β
β f64 β f64 β f64 β
ββββββββββββββββͺβββββββββββͺβββββββββββ‘
β 2.624589 β 2.624589 β 2.918206 β
ββββββββββββββββ΄βββββββββββ΄βββββββββββ
df.select(
[
pl.when(pl.col("random") > 0.5)
.then(0)
.otherwise(pl.col("random") * pl.sum("nrs"))
.alias("binary_function")
]
)
βββββββββββββββββββ
β binary_function β
β --- β
β f64 β
βββββββββββββββββββ‘
β 2.620768 β
βββββββββββββββββββ€
β 0.0 β
βββββββββββββββββββ€
β 4.476355 β
βββββββββββββββββββ€
β 0.0 β
βββββββββββββββββββ€
β 0.247774 β
βββββββββββββββββββ€
β 0.0 β
βββββββββββββββββββ
df.select(
[
pl.all(),
pl.when(pl.col("names") == "foo")
.then("cat_1")
.when(pl.col("names") == "spam")
.then("cat_2")
.otherwise("other")
.alias("name_category"),
]
)
ββββββββ¬ββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββ¬ββββββββββββββββ
β nrs β names β random β random2 β groups β name_category β
β --- β --- β --- β --- β --- β --- β
β i64 β str β f64 β f64 β str β str β
ββββββββͺββββββββͺβββββββββββͺβββββββββββͺβββββββββͺββββββββββββββββ‘
β 1 β foo β 0.154163 β 0.900715 β A β cat_1 β
ββββββββΌββββββββΌβββββββββββΌβββββββββββΌβββββββββΌββββββββββββββββ€
β 2 β ham β 0.74 β 0.033421 β A β other β
ββββββββΌββββββββΌβββββββββββΌβββββββββββΌβββββββββΌββββββββββββββββ€
β 3 β spam β 0.263315 β 0.956949 β B β cat_2 β
ββββββββΌββββββββΌβββββββββββΌβββββββββββΌβββββββββΌββββββββββββββββ€
β null β foo β 0.533739 β 0.137209 β C β cat_1 β
ββββββββΌββββββββΌβββββββββββΌβββββββββββΌβββββββββΌββββββββββββββββ€
β 5 β foo β 0.014575 β 0.283828 β A β cat_1 β
ββββββββΌββββββββΌβββββββββββΌβββββββββββΌβββββββββΌββββββββββββββββ€
β 6 β spam β 0.918747 β 0.606083 β A β cat_2 β
ββββββββ΄ββββββββ΄βββββββββββ΄βββββββββββ΄βββββββββ΄ββββββββββββββββ
df[
[
pl.col("*").exclude("^random.*$"), # alternatively: pl.all().exclude(...)
pl.col("names").list().over("groups").alias("names/groups"),
pl.col("names").unique().over("groups").alias("unique_names/groups"),
]
]
ββββββββ¬ββββββββ¬βββββββββ¬βββββββββββββββββββββββββββββββ¬βββββββββββββββββββββββββ
β nrs β names β groups β names/groups β unique_names/groups β
β --- β --- β --- β --- β --- β
β i64 β str β str β list [str] β list [str] β
ββββββββͺββββββββͺβββββββββͺβββββββββββββββββββββββββββββββͺβββββββββββββββββββββββββ‘
β 1 β foo β A β ["foo", "ham", "foo" "spam"] β ["ham", "spam", "foo"] β
ββββββββΌββββββββΌβββββββββΌβββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββ€
β 2 β ham β A β ["foo", "ham", "foo" "spam"] β ["ham", "spam", "foo"] β
ββββββββΌββββββββΌβββββββββΌβββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββ€
β 3 β spam β B β ["spam"] β ["spam"] β
ββββββββΌββββββββΌβββββββββΌβββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββ€
β null β foo β C β ["foo"] β ["foo"] β
ββββββββΌββββββββΌβββββββββΌβββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββ€
β 5 β foo β A β ["foo", "ham", "foo" "spam"] β ["ham", "spam", "foo"] β
ββββββββΌββββββββΌβββββββββΌβββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββ€
β 6 β spam β A β ["foo", "ham", "foo" "spam"] β ["ham", "spam", "foo"] β
ββββββββ΄ββββββββ΄βββββββββ΄βββββββββββββββββββββββββββββββ΄βββββββββββββββββββββββββ