Branch: dev
build_dataset runs out of memory when aggregating timestamps into buckets.
```
2024-08-23 13:21:58.081 | DEBUG | EventStream.data.dataset_base:__init__:479 - Built events and measurements dataframe
2024-08-23 13:21:58.085 | DEBUG | EventStream.data.dataset_polars:_agg_by_time:642 - Collecting events DF. Not using streaming here as it sometimes causes segfaults.
2024-08-23 13:22:06.849 | DEBUG | EventStream.data.dataset_polars:_agg_by_time:649 - Aggregating timestamps into buckets
fish: Job 1, 'PYTHONPATH=".;" python scripts…' terminated by signal SIGKILL (Forced quit)
```
The relevant code is in `EventStream/data/dataset_polars.py`, lines 648 to 676 at commit `ce9e2c3`.
To replicate, run `generate_synthetic_data` with `n_subjects` > 50,000 and then run `build_dataset`.

For my dataset (about 5 million subjects), the workaround was to use a compute instance with more memory during the `build_dataset` phase.

This is partly a polars issue: I tried limiting the number of threads and enabling streaming, and neither made a difference. A refactor of `_agg_by_time` would be nice to have, but is not a must.
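To illustrate the direction such a refactor could take, here is a minimal stdlib-only sketch of memory-bounded time bucketing (this is not the library's actual code; the event layout, the counting aggregation, and the one-hour bucket width are all assumptions). The point is that iterating event-by-event keeps peak memory proportional to the number of distinct `(subject, bucket)` keys rather than to the number of events:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def bucket_timestamps(events, bucket=timedelta(hours=1)):
    """Aggregate (subject_id, timestamp) pairs into fixed-width time buckets.

    Streams over events one at a time, so peak memory is bounded by the
    number of distinct (subject, bucket) keys, not the event count.
    """
    counts = defaultdict(int)
    epoch = datetime(1970, 1, 1)
    for subject_id, ts in events:
        # Truncate the timestamp down to the start of its bucket.
        offset = (ts - epoch) // bucket
        counts[(subject_id, epoch + offset * bucket)] += 1
    return dict(counts)

events = [
    (1, datetime(2024, 8, 23, 13, 21, 58)),
    (1, datetime(2024, 8, 23, 13, 45, 0)),
    (1, datetime(2024, 8, 23, 14, 2, 0)),
]
buckets = bucket_timestamps(events)
# The first two events fall in the 13:00 bucket, the third in 14:00.
```

In polars terms, the same truncation could be expressed lazily (e.g. via `dt.truncate` before the group-by), but as noted above, streaming the actual pipeline was not a viable fix here.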