Current state
Currently, xbatcher v0.3.0's BatchGenerator is an all-in-one class that does too many things, and more features are planned. The 400+ lines of code at https://github.com/xarray-contrib/xbatcher/blob/v0.3.0/xbatcher/generators.py are not easy for people to understand and contribute to without spending a few hours. To make things more maintainable and future-proof, we might need a major refactor.
Proposal
Split BatchGenerator into 2 (or more) subcomponents. Specifically:
A Slicer that does the slicing/subsetting/cropping/tiling/chipping from a multi-dimensional xarray object.
A Batcher that groups together the pieces from the Slicer into batches of data.
These are the parameters from the current BatchGenerator that would be handled by each component (a minimal sketch of the split follows this list):
Slicer:
input_dims
input_overlap
Batcher:
batch_dims
concat_input_dims
preload_batch
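To make the division concrete, here is a minimal, illustrative sketch of the two stages as plain Python generators. The names slicer and batcher and the overall pipeline shape are hypothetical, not a proposed final API:

```python
import itertools

import numpy as np
import xarray as xr


def slicer(ds, input_dims, input_overlap=None):
    """Yield fixed-size chips from ``ds``, stepping by size minus overlap."""
    input_overlap = input_overlap or {}
    steps = {dim: size - input_overlap.get(dim, 0) for dim, size in input_dims.items()}
    starts = [
        range(0, ds.sizes[dim] - input_dims[dim] + 1, steps[dim])
        for dim in input_dims
    ]
    for origin in itertools.product(*starts):
        indexer = {
            dim: slice(start, start + input_dims[dim])
            for dim, start in zip(input_dims, origin)
        }
        yield ds.isel(indexer)


def batcher(chips, batch_size):
    """Group chips from the slicer into lists of up to ``batch_size``."""
    batch = []
    for chip in chips:
        batch.append(chip)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller batch
        yield batch


# Synthetic example dataset, just for illustration
ds = xr.Dataset({"air": (("time", "lat", "lon"), np.random.rand(20, 50, 50))})
for batch in batcher(slicer(ds, input_dims={"lat": 10, "lon": 10}), batch_size=8):
    ...  # feed the batch to an ML model
```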
Benefits
A NaN checker could be inserted in between Slicer and Batcher (see the filter sketch after this list)
In torchdata, it is possible to have the Slicer run in parallel with the Batcher. E.g. with a batch_size of 128, the Slicer would load up to 128 chips and pass them on to the Batcher, which feeds them to the ML model while the next round of data processing happens, all without loading everything into memory.
In Cache batches #109, the proposal was to cache things after the Batcher, once the batches have already been generated. Sometimes, though, people might want to treat batch_size as a hyperparameter in their ML experimentation, in which case the cache should be placed after the Slicer instead.
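As an illustration of the first benefit, a NaN checker could be just another generator composed between the two components; this reuses the hypothetical slicer/batcher sketch above, and a cache stage could slot into the same position:

```python
def drop_nan_chips(chips):
    """Filter stage between Slicer and Batcher: skip chips containing NaNs."""
    for chip in chips:
        if not chip["air"].isnull().any():
            yield chip


# Any number of stages can be composed between the two components, e.g. a
# cache keyed on the Slicer output, so that batch_size can be tuned freely.
pipeline = batcher(
    drop_nan_chips(slicer(ds, input_dims={"lat": 10, "lon": 10})),
    batch_size=8,
)
```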
Cons
May result in the current one-liner becoming a multi-liner
Could lead to some backwards incompatibility/breaking changes
Thanks for opening this issue @weiji14! Great idea for a refactor to simplify the code base, promote new contributions, and help solve the web of existing issues!
I think when using concat_input_dims=False, the division between Slicer and Batcher that you suggested makes a lot of sense and would be relatively simple to decouple (at least for those who've spent the time getting familiar with the current implementation).
When using concat_input_dims=True, it's a bit more complicated because batch_dims can impact slicing. Specifically, the input dataset is sliced on the union of input_dims and batch_dims in that case. There are a few options to account for this:
1. Break backwards compatibility by never slicing on batch_dims, even when concat_input_dims=True
2. Include batch_dims in the Slicer as well
3. Add a third component that handles slicing on batch_dims between the Slicer and Batcher components
4. Handle the additional slicing in the Batcher for this edge case
I expect that option 3 (a separate component for this edge case) would make the most sense. I'll work on this a bit now.
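For illustration only, under the same hypothetical generator pipeline sketched above, option 3 could be a small middle stage that subdivides each chip along batch_dims before batching:

```python
def batch_dims_slicer(chips, batch_dims):
    """Middle stage (option 3): further slice each chip along batch_dims.

    Only needed for the concat_input_dims=True edge case, where the current
    BatchGenerator slices on the union of input_dims and batch_dims.
    """
    for chip in chips:
        yield from slicer(chip, input_dims=batch_dims)


pipeline = batcher(
    batch_dims_slicer(
        slicer(ds, input_dims={"lat": 10, "lon": 10}),
        batch_dims={"time": 5},
    ),
    batch_size=8,
)
```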
I think this setup would mimic what I'm doing now with my rolling/batching scheme outside of xbatcher. The important thing there is that I can explicitly control the batch sizes, even with predicates involved.
If we include predicates, though, we need a map that can "unbatch" the results, because the mapping back may not be straightforward, especially if there are overlaps between the result chips. See #43
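One possible sketch of such a map, building on the hypothetical slicer above: have the slicing stage also yield each chip's integer indexer, so that batched predictions can later be scattered back onto the source grid (overlapping chips would additionally need a reduction rule, e.g. averaging):

```python
def indexed_slicer(ds, input_dims, input_overlap=None):
    """Like the slicer sketch above, but also yields each chip's indexer so
    results can be "unbatched" back onto the source grid."""
    input_overlap = input_overlap or {}
    steps = {dim: size - input_overlap.get(dim, 0) for dim, size in input_dims.items()}
    starts = [
        range(0, ds.sizes[dim] - input_dims[dim] + 1, steps[dim])
        for dim in input_dims
    ]
    for origin in itertools.product(*starts):
        indexer = {
            dim: slice(start, start + input_dims[dim])
            for dim, start in zip(input_dims, origin)
        }
        yield indexer, ds.isel(indexer)


# Scatter per-chip results back onto the source grid.
result = np.full((ds.sizes["lat"], ds.sizes["lon"]), np.nan)
for indexer, chip in indexed_slicer(ds, input_dims={"lat": 10, "lon": 10}):
    prediction = float(chip["air"].mean())  # stand-in for a model prediction
    result[indexer["lat"], indexer["lon"]] = prediction
```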