
Generate batches lazily #111

Closed · maxrjones opened this issue Oct 19, 2022 · 5 comments
Labels: enhancement (New feature or request)

Comments

@maxrjones
Member

Is your feature request related to a problem?

This relates to several existing issues in the xbatcher repository, including #30, #37, and #109. As mentioned in #109 (comment), iterating over the entire dataset to build the batches can be prohibitively slow and memory intensive for large datasets.

Describe the solution you'd like

Based on past discussions with @jhamman as well as TJ's comment, we should consider a solution in which the batch generator lazily constructs batches without iterating through the entire Dataset/DataArray. One key design decision to reconsider is whether the _batches dictionary needs to store the sliced Datasets/DataArrays. Alternatively, as TJ and Joe have mentioned, the _batches attribute could store only the indices associated with each batch; the Dataset/DataArray would then be constructed when __getitem__() or __iter__() is called. In the simplest case, I expect the indices could be stored as a dict of dimension/slice pairs that can be unpacked and passed to .isel() (e.g., selector = {dim: sl for dim, sl in zip(dims, slices)}). As another possibility, it could be worthwhile to explore whether a custom xarray index can be used to define batches.
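
A minimal sketch of what the index-only approach could look like, assuming fixed-length slices along each input dim; the class name, attribute names, and the handling of trailing partial batches are illustrative rather than xbatcher's actual internals:

```python
import itertools

import xarray as xr


class LazyBatchGenerator:
    """Hypothetical generator that stores only {dim: slice} selectors per batch."""

    def __init__(self, ds: xr.Dataset, input_dims: dict):
        self.ds = ds
        # Build the index information up front; no data is sliced yet.
        self._batch_selectors = list(self._build_selectors(input_dims))

    def _build_selectors(self, input_dims):
        # One list of slices per input dim; trailing partial batches are dropped.
        per_dim = [
            [
                slice(start, start + length)
                for start in range(0, self.ds.sizes[dim] - length + 1, length)
            ]
            for dim, length in input_dims.items()
        ]
        # The Cartesian product over dims gives one {dim: slice} dict per batch.
        for slices in itertools.product(*per_dim):
            yield dict(zip(input_dims, slices))

    def __len__(self):
        return len(self._batch_selectors)

    def __getitem__(self, idx):
        # The sliced Dataset is only constructed here, on demand.
        return self.ds.isel(self._batch_selectors[idx])

    def __iter__(self):
        for idx in range(len(self)):
            yield self[idx]
```

With this layout, constructing the generator only builds lightweight selector dicts, and a sliced Dataset is materialized only when a batch is actually requested (e.g., batch = gen[0]).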

One complicating factor is that the indices associated with a batch depend on concat_input_dims, such that two schemas seem to be required for defining the coordinates/dimensions associated with a batch. Similar to the suggestion in #93 (comment), I wonder whether batch generation for concatenated input dimensions should be separated from the core BatchGenerator class.

Describe alternatives you've considered

Optimize the code under the assumption that all batches will be generated eagerly when creating a BatchGenerator object, for example by stacking non-input dims before iterating over batch_dims and input_dims. This would still not work well for large datasets.
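
For illustration only, the eager alternative might look roughly like the following; the dimension names, the choice of input dims, and the batch size of 16 are made up for this example and do not reflect xbatcher's actual implementation:

```python
import numpy as np
import xarray as xr

# Toy dataset: "x" and "y" are the input dims (kept whole in every sample),
# while "time" and "member" are non-input dims that become samples.
ds = xr.Dataset(
    {"var": (("time", "member", "x", "y"), np.random.rand(50, 4, 8, 8))}
)

# Stack the non-input dims into a single "sample" dimension up front.
stacked = ds.stack(sample=("time", "member"))

# Eagerly materialize every batch at generator-creation time; for large
# datasets this loop is the slow, memory-hungry step described above.
batch_size = 16
batches = [
    stacked.isel(sample=slice(i, i + batch_size))
    for i in range(0, stacked.sizes["sample"], batch_size)
]
```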

Additional context

This relates to #30 because I expect that the proposal to define a batch based on its indexes will be simpler if samples are sliced first and then combined into batches, rather than the current behavior of slicing the dataset into batches and then slicing samples out of each batch.

@tjvandal
Contributor

I made a similar implementation in a commit yesterday on a forked branch. It works well for my use case at the moment and seems to pass the tests. I moved the concat_input_dims handling into __getitem__(). I have not yet been able to get the asv benchmarking running to check the performance on other use cases, particularly the concat_input_dims option.

tjvandal@3cdf034

@maxrjones
Member Author

tjvandal@3cdf034

Nice, want to open a PR for these changes? I can run the asv benchmarks, which are currently run locally rather than as part of the CI.

@tjvandal
Contributor

Yes, I just opened a PR for this.

@jhamman
Contributor

jhamman commented Oct 21, 2022

@maxrjones - do you think we should close this issue or is there more you'd like to do here?

maxrjones added the enhancement label on Oct 21, 2022
@maxrjones
Member Author

do you think we should close this issue or is there more you'd like to do here?

#112 fixed the main point of this issue, so closing now. Related tasks (e.g., improving benchmarks) can be addressed separately.
