[ENH] enable large data use cases - decouple data input from `pandas`, allow `polars`, `dask`, and/or `spark` #1685
Comments
@fkiraly
@Spinachboul, this is a very complex issue because it relies on at least a redesign of the API. You can work on this, but first we need an API written down that would allow multiple backends. Feel free to suggest something, or participate in the discussions in #1736 until we have consolidated this! We will also use some of the meet-ups for the pytorch-forecasting rework.
@fkiraly Ohk then! I am not quite familiar with the abbreviation DSIPTS, could you please give some reference for it?
There is a link directly at the top of the mentioned issue #1736
Primarily
oh ok, thanks!!
One addition from my side. We should at least implement one dataset that is able to read data by itself. So that the user does not need to provide a data object during initialisation. This would also improve or enable usage with large datasets.
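To make the suggestion above concrete, here is a minimal sketch of a dataset that reads its own data, assuming one CSV file per time series in a directory. The class name `LazyCsvSeriesDataset`, the `value` column, and the file layout are hypothetical and not part of pytorch-forecasting. It uses only the standard library; since a map-style dataset only needs `__len__` and `__getitem__`, an object like this could be handed to a `torch` `DataLoader` in principle.

```python
import csv
from pathlib import Path


class LazyCsvSeriesDataset:
    """Hypothetical map-style dataset: one CSV file per time series.

    Only the file paths are kept in memory. A file is parsed when its
    index is requested, so the user never passes a data object at init.
    """

    def __init__(self, directory, value_column="value"):
        # Assumption: every *.csv file in `directory` holds one series.
        self.paths = sorted(Path(directory).glob("*.csv"))
        self.value_column = value_column

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        # Read the file lazily, at access time.
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle)
            values = [float(row[self.value_column]) for row in reader]
        return {"series_id": path.stem, "y": values}
```

Whether per-series files, a partitioned Parquet layout, or a database would be the right storage choice is an open design question; the sketch only shows the "dataset owns the reading" idea.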
This PR carries out a clean-up refactor of `TimeSeriesDataSet`. No changes are made to the logic. This is in preparation for major work items impacting the logic, e.g., removal of the `pandas` coupling (see #1685), or a 2.0 rework (see #1736). In general, a clean state would make these easier. Work carried out:
* clear and complete docstrings, in numpydoc format
* separating logic, e.g., for parameter checking, data formatting, default handling
* reducing cognitive complexity and max indentations, addressing "code smells"
* linting
agree!
A key limitation of the current architecture seems to be the reliance on `pandas` for the input, which limits usability in large data cases.

While `torch` with appropriate backends should be able to handle large data, `pandas` as a container choice, in particular the current instantiation which seems to rely on in-memory data, will prove to be the bottleneck.

We should therefore consider and implement support for data backends that scale better, such as `polars`, `dask`, or `spark`, and see how easy it is to get the `pandas` `pyarrow` integration to work.

Architecturally, I think we should:
* make `pandas` one of multiple potential data soft dependencies

The key entry point for this extension or refactor is `TimeSeriesDataSet`, which requires `pandas` objects to be passed.