Prepending batch dimensions to match Keras interfaces #39
base: main
Conversation
Thanks a lot for getting this PR started, Chris!
It's not possible for me to know that just by looking at the code. That's why we have tests! As you can see, your changes have caused existing tests to fail. That's because the generated batch dataset is compared with an "expected" dataset; since the expected dataset does not have the extra dimension, the test fails.

This raises the following question: should this feature be optional? If so, we need to create a new option for it. Just changing the behavior of the code in this way is a "breaking change" to our API and could cause problems for other xbatcher users. Tests help guard against this. Changing the API is not out of the question--this is a very new and experimental package--but it would need to be discussed with the other developers and motivated by a clear argument. If we do add an option for this feature (e.g. ...)

You can run the test suite yourself locally as pytest -v. (You'll need pytest installed.)
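For readers unfamiliar with the suite, the failing comparison is roughly of this shape -- a minimal, hypothetical sketch, not the actual test code:

    import numpy as np
    import xarray as xr
    import xbatcher

    def test_batch_matches_expected_slice():
        # Hypothetical fixture; the real suite builds its own test data.
        ds = xr.Dataset({"foo": (("x",), np.arange(100))})
        bgen = xbatcher.BatchGenerator(ds, input_dims={"x": 10})
        for n, batch in enumerate(bgen):
            expected = ds.isel(x=slice(10 * n, 10 * (n + 1)))
            # Prepending an extra batch dimension makes this comparison
            # fail, because the expected slice has no such dimension.
            assert batch.equals(expected)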
Sounds good, I'll try adding the squeeze option and see what happens. I'm guessing xbatcher supports non-Keras use cases as well?
Still need to add unit test(s).
Codecov Report

@@           Coverage Diff            @@
##              main      #39   +/-  ##
========================================
  Coverage   100.00%  100.00%
========================================
  Files            2        2
  Lines           77       82    +5
  Branches        18       20    +2
========================================
+ Hits            77       82    +5

Continue to review the full report at Codecov.
Doing some cleanup at the moment.
@maxrjones, since you're a bit more familiar with Keras, could you take a quick look at this PR in relation to #35 and #36 when you have time? This PR actually doesn't conflict too much with recent changes, though the tests might need some refactoring in line with #124.

That said, I'm wondering how useful the squeeze_batch_dim option is compared to just calling xr.DataArray.squeeze or xr.DataArray.expand_dims manually. I suppose that adding this extra squeeze_batch_dim option increases usability, but it can also be a source of extra confusion, especially since the required shape of an input tensor depends heavily on which model architecture people are using.
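For concreteness, a minimal sketch of the manual alternative mentioned above (toy data and dim names assumed):

    import numpy as np
    import xarray as xr

    batch = xr.DataArray(np.zeros((1, 16, 16)), dims=("sample", "y", "x"))

    # Drop an unwanted size-1 sample dimension by hand...
    squeezed = batch.squeeze("sample")           # dims: ("y", "x")

    # ...or restore a leading axis when a model expects one.
    restored = squeezed.expand_dims("sample")    # dims: ("sample", "y", "x")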
@@ -105,6 +117,7 @@ def __init__(
         batch_dims={},
         concat_input_dims=False,
         preload_batch=True,
+        squeeze_batch_dim=True,
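If the option lands as proposed, usage would presumably look something like the sketch below; note that squeeze_batch_dim is only proposed in this PR and is not part of released xbatcher:

    import numpy as np
    import xarray as xr
    import xbatcher

    ds = xr.Dataset({"foo": (("time", "y", "x"), np.zeros((10, 16, 16)))})

    # All dims are input dims, so the existing behavior would squeeze
    # each batch down to (time, y, x) with no leading batch dimension.
    bgen = xbatcher.BatchGenerator(
        ds,
        input_dims={"time": 10, "y": 16, "x": 16},
        squeeze_batch_dim=False,  # proposed here: keep a size-1 "batch" dim
    )
    for batch in bgen:
        print(batch.dims)  # expected: a size-1 "batch" dim plus time/y/x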
Is it OK to set the default to True? I.e., squeeze the batch dimension if all dims are included in the input dims. This might break existing behaviour, so I'm wondering if the default should be False instead for backward compatibility.
It would definitely break existing behavior. I'm not really sure which would be more intuitive--I think that's kind of subjective--but personally I would agree with you.

The problem is that the existing behavior already squeezes the array, so I think it would be kind of a pain to have to use xr.DataArray.expand_dims, because you don't really know which dimension is being squeezed (it's not there anymore). You'd probably end up digging through the code and iterating a few times, like I did in this scenario. Also, the fact that this behavior yields arrays of different dimensionality just from a change in the batch dims breaks, I think, the grammar being established here. Your proposal of False as the default plus xr.DataArray.squeeze is much more logical.
> The problem is that the existing behavior already squeezes the array, so I think it would be kind of a pain to have to use xr.DataArray.expand_dims

Yeah, I also find it unintuitive that xbatcher squeezes any non-input_dims into a dim called sample (edit: there's actually a related issue at #127). So another way to think of this squeeze_batch_dim feature is that xbatcher will just return the cropped/sliced/chipped array without squeezing the dims. This is more important with higher-dimensional (>3D) arrays, because sometimes you do want to preserve the extra dims.
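To make the higher-dimensional case concrete, a rough sketch of the stacking behavior described above (toy data; exact output layout may differ):

    import numpy as np
    import xarray as xr
    import xbatcher

    ds = xr.Dataset(
        {"foo": (("time", "level", "y", "x"), np.zeros((4, 5, 16, 16)))}
    )

    # With only y/x as input dims, current xbatcher stacks the remaining
    # dims ("time", "level") into a single "sample" dim of size 4 * 5 = 20,
    # so the per-batch identity of "time" and "level" is lost.
    bgen = xbatcher.BatchGenerator(ds, input_dims={"y": 16, "x": 16})
    batch = next(iter(bgen))
    print(batch.sizes)  # e.g. {"sample": 20, "y": 16, "x": 16}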
@@ -90,6 +98,10 @@ class BatchGenerator:
     preload_batch : bool, optional
         If ``True``, each batch will be loaded into memory before reshaping /
         processing, triggering any dask arrays to be computed.
+    squeeze_batch_dim : bool, optional
+        If ``False`` and all dims are input dims, each batch's dataset will have a
+        "batch" dimension of size 1 prepended to the array. This functionality is
Some suggestions on the documentation: maybe L86 and L87 could be edited to indicate that the squeeze behavior is controllable with squeeze_batch_dim?

Also, this sentence might be a bit confusing to read. An extra dimension of size 1 is added/prepended only when batch_dims is None or unset; for cases where len(batch_dims) >= 1, the squeezing/collapsing of dimensions still happens. I'm wondering what a good way is to reword this to make it clearer what is happening and why this option exists.
The problem only appears in this one corner case, as far as I could tell, so my solution was coded to apply only when there were no batch_dims (at least, that was my intention). If you're changing the default behavior, though, you can probably make this simpler, and then the docs would be less confusing too.
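For illustration, a hypothetical sketch of the kind of guard being described (the helper name and structure are assumed; this is not the PR's actual code):

    import xarray as xr

    def maybe_prepend_batch_dim(
        batch: xr.Dataset, batch_dims: dict, squeeze_batch_dim: bool
    ) -> xr.Dataset:
        # Hypothetical helper: prepend a size-1 "batch" dimension only in
        # the corner case where no batch_dims were given and squeezing has
        # been disabled; otherwise return the batch unchanged.
        if not squeeze_batch_dim and len(batch_dims) == 0:
            return batch.expand_dims("batch")
        return batch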
Actually, if we go with the proposal of default False ...
Hmm, let's get a second opinion from someone who uses Keras. If they agree, we can close this pull request?
Sorry for taking so long to respond here. My understanding of the conversation above is that the purpose of this PR can be broken down into two issues:
If either of these points would benefit from some in-depth discussion, we could put this PR on the agenda for next Monday's Pangeo ML working group meeting.
Hi all, just wondering what the status of this is? I'm writing some xbatcher example notebooks and I'm running into this issue again.
Apologies again for the delay--my attention was drawn elsewhere due to illness, travel, and some other urgent tasks. If I recall correctly, we proposed at our last meeting to always have a defined axis order, including a sample dimension, and to provide the option to specify that no transposing/stacking should occur using ...
See issues #36 and #38.

@jhamman @rabernat I'm not sure if this solves the problem in all cases; can you have a look?