-
-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(python): Use polars parquet reader for delta scan #19103
base: main
Are you sure you want to change the base?
refactor(python): Use polars parquet reader for delta scan #19103
Conversation
Nice, I didn't know we could bring our own readers. |
Absolutely we can, this is actually encouraged. Polars parquet reader is also lots faster than pyarrow. The only thing is, this intermediate stage will be bound to same protocol support as the pyarrow scanner. At some point we need to finish a full native reader polars and delta-kernel-rs, some preliminary work I did here: https://github.com/ion-elgreco/polars-deltalake/tree/feat/delta_io_plugin Just hard to find time nowadays. A dev from the core team who is working on parquet could probably do this easier since that dev is deep into the polars rust code ^^ |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #19103 +/- ##
=======================================
Coverage 79.77% 79.78%
=======================================
Files 1531 1531
Lines 208561 208522 -39
Branches 2913 2922 +9
=======================================
- Hits 166377 166366 -11
+ Misses 41633 41600 -33
- Partials 551 556 +5 ☔ View full report in Codecov by Sentry. |
@ritchie46 one test fails on windows: https://github.com/pola-rs/polars/actions/runs/11192235502/job/31116136984?pr=19103#step:11:202 when it encounters a hive path, I am not seeing this on linux though |
@nameexhaustion can you take a look at this one later? |
The Windows test should be fixed after a rebase on main |
Great, I'm currently out so will take a look at it somewhere mid November |
Am I correct that this doesn't use the delta log for partition pruning? It seems we ought to go through the io plugin framework and then use I haven't worked with that io plugin framework to figure out how to avoid the following being a two step process. For instance suppose what I want is df = scan_delta(dt).filter(pl.col('node_id')==something).collect() I can two step that like this files = pl.from_arrow(dt.get_add_actions()).filter((pl.col('min').struct.field('node_id')<=something) & (pl.col('max').struct.field('node_id')>=something))['path']
df=pl.scan_parquet([dt.table_uri+'/'+x for x in files]).filter(pl.col('node_id')==something).collect() |
This an intermediate stage until we have something working with delta-kernel-rs.
Couple odd things: