Time Series benchmark #135

Open · MarcoGorelli opened this issue Sep 4, 2024 · 0 comments

MarcoGorelli (Contributor) commented Sep 4, 2024

The M5 Forecasting Competition was held on Kaggle in 2020, and the top solutions generally featured a lot of heavy feature engineering.

Doing that feature engineering in pandas was quite slow, so I'm benchmarking how much faster Polars would have been at the same task.
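
To give a flavour of the operations involved: they're mostly grouped lags and rolling statistics over each item's sales history. Here's a minimal sketch of one such feature in both libraries, assuming a hypothetical long-format frame with columns `id`, `d` (day number) and `sales` - the actual queries are in the notebook linked below:

```python
import pandas as pd
import polars as pl


def pandas_lag_roll(df: pd.DataFrame) -> pd.DataFrame:
    # Per-item 28-day lag of sales, then a 7-day rolling mean of that lag -
    # the shape of feature the winning M5 solutions computed many times over.
    df = df.sort_values(["id", "d"]).copy()
    df["lag_28"] = df.groupby("id")["sales"].shift(28)
    df["roll_mean_7"] = df.groupby("id")["lag_28"].transform(
        lambda s: s.rolling(7).mean()
    )
    return df


def polars_lag_roll(df: pl.DataFrame) -> pl.DataFrame:
    # The same two features as window expressions - no Python-level
    # per-group loop, and the expressions can run in parallel.
    return (
        df.sort("id", "d")
        .with_columns(pl.col("sales").shift(28).over("id").alias("lag_28"))
        .with_columns(
            pl.col("lag_28")
            .rolling_mean(window_size=7)
            .over("id")
            .alias("roll_mean_7")
        )
    )
```

Multiply that by dozens of lag/window combinations and you get the workload being benchmarked.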

I think this is good to benchmark, as:

- the competition was run on real-world Walmart data
- the operations we're benchmarking are from the winning solution, so evidently they were doing something right

I think this reflects the kind of gains that people doing applied data science can expect from using Polars.

Here's a notebook with the queries + data: https://www.kaggle.com/code/marcogorelli/m5-forecasting-feature-engineering-benchmark/notebook

Run with `SMALL=True` for testing, then `SMALL=False` to run with the original, full-size dataset.


Anyone fancy translating the queries to SQL so we could check DuckDB too? My intuition is that this wouldn't be DuckDB's forte - which is fine, DuckDB is incredibly good at many other things - but a friendly comparison on this kind of benchmark would give a more complete picture than "DuckDB scales better than Polars because TPC-H!"
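
If someone does pick this up, here's a sketch of how the same lag-plus-rolling-mean feature might look as a DuckDB window query, run through the Python API against a pandas frame in scope (column names match the hypothetical sketch above, not necessarily the notebook):

```python
import duckdb
import pandas as pd

# Tiny stand-in frame; DuckDB's Python API can scan local pandas frames by name.
sales = pd.DataFrame({"id": ["A"] * 60, "d": range(60), "sales": range(60)})

result = duckdb.sql(
    """
    WITH lagged AS (
        SELECT id, d, sales,
               LAG(sales, 28) OVER (PARTITION BY id ORDER BY d) AS lag_28
        FROM sales
    )
    SELECT id, d, sales, lag_28,
           AVG(lag_28) OVER (
               PARTITION BY id ORDER BY d
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS roll_mean_7
    FROM lagged
    ORDER BY id, d
    """
).df()
```

One caveat: `AVG` ignores NULLs, so the first few windows differ from pandas' `rolling(7).mean()`, which is NaN until the window is full - any translation would need to decide how to handle those edges.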
