polars.LazyPolarsDataset .collect() streaming #519

Closed
butterlyn opened this issue Jan 21, 2024 · 9 comments
Labels
Community Issue/PR opened by the open-source community datasets enhancement New feature or request good first issue Good for newcomers

Comments

@butterlyn

Description

Enable Polars streaming by default when saving polars.LazyPolarsDataset

Context

Enables larger-than-memory data processing, one of the main advantages of using Polars LazyFrames.

Possible Implementation

.collect(streaming=True) in polars.LazyPolarsDataset

If streaming cannot be performed for whatever reason, Polars disables streaming automatically at runtime, so having streaming as the default behaviour should be okay.

Possible Alternatives

Add a flag to enable/disable streaming through data catalog load_args. However, this may be problematic given streaming is not an argument of LazyFrame.sink_csv(), but rather an argument of LazyFrame.collect().

@astrojuanlu
Member

Thanks @butterlyn for this feature request!

I'm going to suggest a third alternative, which is adding a dataset-level property, like this:

ds:
    type: polars.LazyPolarsDataset
    streaming: true

how does that sound?

@astrojuanlu astrojuanlu added enhancement New feature or request Community Issue/PR opened by the open-source community datasets labels Jan 24, 2024
@butterlyn
Author

butterlyn commented Jan 24, 2024

@astrojuanlu Love the idea! That'd be perfect

@astrojuanlu astrojuanlu added the good first issue Good for newcomers label Jan 24, 2024
@astrojuanlu
Member

This one is actually easy I'd say :) It requires adding a new argument to the initialiser:

def __init__(  # noqa: PLR0913
    self,
    *,
    filepath: str,
    file_format: str,
    load_args: Optional[dict[str, Any]] = None,

And then storing it in an internal property, and using it where appropriate.
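A minimal sketch of what that could look like, assuming a simplified stand-in class rather than the real kedro-datasets implementation. The class name `LazyPolarsDatasetSketch`, the `streaming` keyword, and the `_streaming` attribute are hypothetical; `collect(streaming=...)` is the call proposed earlier in this thread, and newer Polars releases may spell it differently.

```python
from typing import Any, Optional


class LazyPolarsDatasetSketch:
    """Simplified stand-in for a dataset class; not the kedro-datasets code."""

    def __init__(
        self,
        *,
        filepath: str,
        file_format: str,
        load_args: Optional[dict[str, Any]] = None,
        streaming: bool = False,  # hypothetical dataset-level flag
    ) -> None:
        self._filepath = filepath
        self._file_format = file_format
        self._load_args = dict(load_args or {})
        # Store the flag so the save path can use it later.
        self._streaming = streaming

    def save(self, lazy_frame: Any) -> None:
        # `lazy_frame` is expected to behave like a polars.LazyFrame.
        collected = lazy_frame.collect(streaming=self._streaming)
        # Look up write_csv, write_parquet, ... on the collected DataFrame.
        writer = getattr(collected, f"write_{self._file_format}")
        writer(self._filepath)
```

With a catalog entry like the one above, `streaming: true` would reach the initialiser as a keyword argument, and leaving it out would keep today's behaviour.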

@astrojuanlu
Member

In fact, I'm thinking: rather than using .collect() and then .write_*, shouldn't we use .sink_* directly? cc @cpinon-grd (comes from https://linen-slack.kedro.org/t/16374083/hey-team-is-there-any-way-to-store-a-lazypolarsdataframe-wit#76e13870-de5a-4a1d-86c3-f0c30f2ebf25)
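To make the difference concrete, here is a rough sketch of the two save paths. The helper `save_lazy` and its `use_sink` parameter are made up for illustration. It only assumes that `lazy_frame` behaves like a polars.LazyFrame (which exposes `sink_csv`, `sink_parquet`, `sink_ipc` and `sink_ndjson`) and that the collected DataFrame exposes the matching `write_*` methods.

```python
from typing import Any


def save_lazy(
    lazy_frame: Any, filepath: str, file_format: str, use_sink: bool = True
) -> str:
    """Write a lazy frame to disk and return the name of the method used."""
    sink = getattr(lazy_frame, f"sink_{file_format}", None)
    if use_sink and sink is not None:
        # Let Polars write the query result itself, without an explicit collect().
        sink(filepath)
        return f"sink_{file_format}"
    # Fallback: materialise the whole result, then write the DataFrame.
    frame = lazy_frame.collect()
    getattr(frame, f"write_{file_format}")(filepath)
    return f"write_{file_format}"
```

Which engine `sink_*` relies on internally depends on the Polars version, so treat this as an illustration of the two code paths, not a recommendation.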

@MatthiasRoels
Contributor

As per my comment here, I wouldn't recommend using streaming or sink_* methods. Even when using .collect(streaming=True), it is explicitly mentioned in the docs that streaming mode is considered unstable.

@astrojuanlu
Member

astrojuanlu commented Jun 27, 2024

Streaming functionality is indeed considered unstable: pola-rs/polars#13948

But as far as I understand, sink_* methods in non-streaming mode are okay?

@astrojuanlu
Member

Let's close this issue in favour of #702. So no streaming=True, but let's continue the discussion on using the lazy methods for LazyPolarsDataset.

@astrojuanlu astrojuanlu closed this as not planned Jun 27, 2024
@cpinon-grd

cpinon-grd commented Jun 28, 2024

Hey! If I'm not wrong, processing larger than memory datasets is one of the key features of Polars. Polars docs state:

With the lazy API Polars doesn't run each query line-by-line but instead processes the full query end-to-end. To get the most out of Polars it is important that you use the lazy API because:

  • the lazy API allows Polars to apply automatic query optimization with the query optimizer
  • the lazy API allows you to work with larger than memory datasets using streaming
  • the lazy API can catch schema errors before processing the data

Isn't it a bit weird that in order to "get the most out of Polars", the Polars team recommends an unstable solution? If using streaming mode is unstable, what is the "recommended"/"your go to" solution?

@MatthiasRoels
Contributor

MatthiasRoels commented Jun 28, 2024

If you use the Lazy API, you already get some optimisations such as predicate and projection pushdown. This means that you only read the rows/columns you need into memory (as opposed to the full dataset).
