Time Series benchmark #135

Open · MarcoGorelli opened this issue Sep 4, 2024 · 0 comments

MarcoGorelli (Contributor) commented Sep 4, 2024

The M5 Forecasting Competition was held on Kaggle in 2020, and the top solutions generally featured a lot of heavy feature engineering.

Doing that feature engineering in pandas was quite slow, so I'm benchmarking how much faster Polars would have been at the same task.
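
To give a flavour of the operations involved: they're mostly grouped lags and rolling statistics over each item's sales history. Here's a minimal sketch of one such feature in both libraries, assuming a hypothetical long-format frame with columns `id`, `d` (day number) and `sales` - the actual queries are in the notebook linked below:

```python
import pandas as pd
import polars as pl


def pandas_lag_roll(df: pd.DataFrame) -> pd.DataFrame:
    # Per-item 28-day lag of sales, then a 7-day rolling mean of that lag -
    # the shape of feature the winning M5 solutions computed many times over.
    df = df.sort_values(["id", "d"]).copy()
    df["lag_28"] = df.groupby("id")["sales"].shift(28)
    df["roll_mean_7"] = df.groupby("id")["lag_28"].transform(
        lambda s: s.rolling(7).mean()
    )
    return df


def polars_lag_roll(df: pl.DataFrame) -> pl.DataFrame:
    # The same two features as window expressions - no Python-level
    # per-group loop, and the expressions can run in parallel.
    return (
        df.sort("id", "d")
        .with_columns(pl.col("sales").shift(28).over("id").alias("lag_28"))
        .with_columns(
            pl.col("lag_28")
            .rolling_mean(window_size=7)
            .over("id")
            .alias("roll_mean_7")
        )
    )
```

Multiply that by dozens of lag/window combinations and you get the workload being benchmarked.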

I think this is good to benchmark, as:

- the competition was run on real-world Walmart data
- the operations we're benchmarking are from the winning solution, so evidently they were doing something right

I think this reflects the kind of gains that people doing applied data science can expect from using Polars.

Here's a notebook with the queries + data: https://www.kaggle.com/code/marcogorelli/m5-forecasting-feature-engineering-benchmark/notebook

Run with `SMALL=True` for testing, then `SMALL=False` to run with the original, full-size dataset.


Anyone fancy translating the queries to SQL so we could check DuckDB too? My intuition is that this wouldn't be DuckDB's forte - which is fine, DuckDB is incredibly good at many other things - but a friendly comparison on this kind of benchmark would give a more complete picture than "DuckDB scales better than Polars because TPC-H!"
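
If someone does pick this up, here's a sketch of how the same lag-plus-rolling-mean feature might look as a DuckDB window query, run through the Python API against a pandas frame in scope (column names match the hypothetical sketch above, not necessarily the notebook):

```python
import duckdb
import pandas as pd

# Tiny stand-in frame; DuckDB's Python API can scan local pandas frames by name.
sales = pd.DataFrame({"id": ["A"] * 60, "d": range(60), "sales": range(60)})

result = duckdb.sql(
    """
    WITH lagged AS (
        SELECT id, d, sales,
               LAG(sales, 28) OVER (PARTITION BY id ORDER BY d) AS lag_28
        FROM sales
    )
    SELECT id, d, sales, lag_28,
           AVG(lag_28) OVER (
               PARTITION BY id ORDER BY d
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS roll_mean_7
    FROM lagged
    ORDER BY id, d
    """
).df()
```

One caveat: `AVG` ignores NULLs, so the first few windows differ from pandas' `rolling(7).mean()`, which is NaN until the window is full - any translation would need to decide how to handle those edges.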
