Use pl.sink_* in LazyPolarsDataset._save #702
Comments
I just tried to implement it myself and encountered a problem. kedro-plugins/kedro-datasets/kedro_datasets/polars/lazy_polars_dataset.py Lines 207 to 236 in 58ef629
I would hold off such a change for now. As mentioned in the docs:
On top of that, I heard the Polars team is working on completely rewriting their streaming engine. So I would just stick with the current implementation...
Indeed, streaming mode is unstable, but lazy non-streaming methods are considered stable, as far as I understand? About what to do with remote storage, maybe we can offer a
Actually, the Regular methods on the other hand are stable. As a matter of fact, almost all eager methods use the corresponding lazy method under the hood (e.g.
Description
When passing a lazy DataFrame to LazyPolarsDataset, it is currently collected into an eager DataFrame before writing it using the appropriate pl.write_* function. This can be skipped by writing the lazy DataFrame using pl.sink_*.
Context
In some cases, it may be faster to collect the lazy DataFrame in streaming mode. Additionally, it is not always possible to collect the entire DataFrame (e.g., if the data is too large). Using pl.sink_*, the entire data set does not need to be loaded.
Possible Implementation
In the _save function, the input DataFrame could first be coerced into a lazy DataFrame and then written to disk using pl.sink_*.
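A minimal sketch of that idea, not the actual kedro-datasets code. The helper name sink_frame and the SINK_FORMATS tuple are hypothetical. It assumes a local file path (no fsspec or remote-storage handling) and relies on the fact that both polars DataFrame and LazyFrame expose .lazy(), and that LazyFrame has sink_parquet, sink_ipc, sink_csv and sink_ndjson:

```python
from pathlib import Path
from typing import Any, Union

# Formats for which a polars LazyFrame exposes a sink_* method.
SINK_FORMATS = ("parquet", "ipc", "csv", "ndjson")


def sink_frame(
    data: Any,
    path: Union[str, Path],
    file_format: str,
    **save_args: Any,
) -> None:
    """Hypothetical helper: write `data` through LazyFrame.sink_<file_format>.

    `data` is expected to be a polars DataFrame or LazyFrame.
    Assumption: `path` points to the local filesystem.
    """
    if file_format not in SINK_FORMATS:
        raise ValueError(f"No sink method for format {file_format!r}")

    # DataFrame.lazy() wraps the eager frame; LazyFrame.lazy() returns itself.
    lazy_frame = data.lazy()

    # Look up e.g. lazy_frame.sink_parquet and write without an explicit collect().
    sink = getattr(lazy_frame, f"sink_{file_format}")
    sink(str(path), **save_args)
```

With Polars installed, a call might look like sink_frame(pl.LazyFrame({"a": [1, 2]}), "out.parquet", "parquet"). Formats without a sink method would still need the existing collect-then-write path, and remote paths are left out of this sketch on purpose.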
Possible Alternatives