Prepending batch dimensions to match Keras interfaces #39
base: main
Conversation
Thanks a lot for getting this PR started, Chris!
It's not possible for me to know that just by looking at the code. That's why we have tests! As you can see, your changes have caused existing tests to fail. That's because the generated batch dataset is compared with an "expected" dataset; since the expected dataset does not have the extra dimension, the test fails.

This raises the following question: should this feature be optional? If so, we need to create a new option for it. Just changing the behavior of the code in this way is a "breaking change" to our API and could cause problems for other xbatcher users. Tests help guard against this. Changing the API is not out of the question--this is a very new and experimental package--but it would need to be discussed with the other developers and motivated by a clear argument. If we do add an option for this feature (e.g. ...)

You can run the test suite yourself locally as pytest -v. (You'll need pytest installed.)
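For readers unfamiliar with the suite, the failing comparison is roughly of this shape -- a minimal, hypothetical sketch, not the actual test code:

    import numpy as np
    import xarray as xr
    import xbatcher

    def test_batch_matches_expected_slice():
        # Hypothetical fixture; the real suite builds its own test data.
        ds = xr.Dataset({"foo": (("x",), np.arange(100))})
        bgen = xbatcher.BatchGenerator(ds, input_dims={"x": 10})
        for n, batch in enumerate(bgen):
            expected = ds.isel(x=slice(10 * n, 10 * (n + 1)))
            # Prepending an extra batch dimension makes this comparison
            # fail, because the expected slice has no such dimension.
            assert batch.equals(expected)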
Sounds good, I'll try adding the squeeze option and see what happens. I'm guessing xbatcher supports non-Keras use cases as well?
Still need to add unit test(s).
Codecov Report

@@           Coverage Diff            @@
##              main      #39   +/-  ##
========================================
  Coverage   100.00%  100.00%
========================================
  Files            2        2
  Lines           77       82    +5
  Branches        18       20    +2
========================================
+ Hits            77       82    +5

Continue to review the full report at Codecov.
Doing some cleanup at the moment.
@maxrjones, since you're a bit more familiar with Keras, could you take a quick look at this PR in relation to #35 and #36 when you have time? This PR actually doesn't conflict too much with recent changes, though the tests might need some refactoring in line with #124.

That said, I'm wondering how useful the squeeze_batch_dim option is compared to just calling xr.DataArray.squeeze or xr.DataArray.expand_dims manually. I suppose that adding this extra squeeze_batch_dim option increases usability, but it can also be a source of extra confusion, especially since the required shape of an input tensor depends heavily on which model architecture people are using.
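For concreteness, a minimal sketch of the manual alternative mentioned above (toy data and dim names assumed):

    import numpy as np
    import xarray as xr

    batch = xr.DataArray(np.zeros((1, 16, 16)), dims=("sample", "y", "x"))

    # Drop an unwanted size-1 sample dimension by hand...
    squeezed = batch.squeeze("sample")           # dims: ("y", "x")

    # ...or restore a leading axis when a model expects one.
    restored = squeezed.expand_dims("sample")    # dims: ("sample", "y", "x")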
@@ -105,6 +117,7 @@ def __init__(
         batch_dims={},
         concat_input_dims=False,
         preload_batch=True,
+        squeeze_batch_dim=True,
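If the option lands as proposed, usage would presumably look something like the sketch below; note that squeeze_batch_dim is only proposed in this PR and is not part of released xbatcher:

    import numpy as np
    import xarray as xr
    import xbatcher

    ds = xr.Dataset({"foo": (("time", "y", "x"), np.zeros((10, 16, 16)))})

    # All dims are input dims, so the existing behavior would squeeze
    # each batch down to (time, y, x) with no leading batch dimension.
    bgen = xbatcher.BatchGenerator(
        ds,
        input_dims={"time": 10, "y": 16, "x": 16},
        squeeze_batch_dim=False,  # proposed here: keep a size-1 "batch" dim
    )
    for batch in bgen:
        print(batch.dims)  # expected: a size-1 "batch" dim plus time/y/x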
Is it OK to set the default to True? I.e., squeeze the batch dimension if all dims are included in the input dims. This might break existing behaviour, so I'm wondering if the default should be False instead for backward compatibility.
It would definitely break existing behavior. I'm not really sure which would be more intuitive--I think that's kind of subjective--but personally I would agree with you.

The problem is that the existing behavior already squeezes the array, so I think it would be kind of a pain to have to use xr.DataArray.expand_dims, because you don't really know which dimension is being squeezed (it's not there anymore). You'd probably end up digging through the code and iterating a few times, like I did in this scenario. Also, the fact that this behavior yields arrays of different dimensionality just from a change in the batch dims breaks, I think, the grammar being established here. Your proposal of False as the default plus xr.DataArray.squeeze is much more logical.
> The problem is that the existing behavior already squeezes the array, so I think it would be kind of a pain to have to use xr.DataArray.expand_dims

Yeah, I also find it unintuitive that xbatcher squeezes any non-input_dims into a dim called sample (edit: there's actually a related issue at #127). So another way to think of this squeeze_batch_dim feature is that xbatcher will just return the cropped/sliced/chipped array without squeezing the dims. This is more important with higher-dimensional (>3D) arrays, because sometimes you do want to preserve the extra dims.
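To make the higher-dimensional case concrete, a rough sketch of the stacking behavior described above (toy data; exact output layout may differ):

    import numpy as np
    import xarray as xr
    import xbatcher

    ds = xr.Dataset(
        {"foo": (("time", "level", "y", "x"), np.zeros((4, 5, 16, 16)))}
    )

    # With only y/x as input dims, current xbatcher stacks the remaining
    # dims ("time", "level") into a single "sample" dim of size 4 * 5 = 20,
    # so the per-batch identity of "time" and "level" is lost.
    bgen = xbatcher.BatchGenerator(ds, input_dims={"y": 16, "x": 16})
    batch = next(iter(bgen))
    print(batch.sizes)  # e.g. {"sample": 20, "y": 16, "x": 16}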
@@ -90,6 +98,10 @@ class BatchGenerator:
     preload_batch : bool, optional
         If ``True``, each batch will be loaded into memory before reshaping /
         processing, triggering any dask arrays to be computed.
+    squeeze_batch_dim : bool, optional
+        If ``False`` and all dims are input dims, each batch's dataset will have a
+        "batch" dimension of size 1 prepended to the array. This functionality is
Some suggestions on the documentation: maybe L86 and L87 could be edited to indicate that the squeeze behavior is controllable with squeeze_batch_dim?

Also, this sentence might be a bit confusing to read. An extra dimension of size 1 is added/prepended only when batch_dims is None or unset; for cases where len(batch_dims) >= 1, the squeezing/collapsing of dimensions still happens. I'm wondering what a good way is to reword this to make it clearer what is happening and why this option exists.
The problem only appears in this one corner case, as far as I could tell, so my solution was coded to apply only when there were no batch_dims (at least, that was my intention). If you're changing the default behavior, though, you can probably make this simpler, and then the docs would be less confusing too.
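For illustration, a hypothetical sketch of the kind of guard being described (the helper name and structure are assumed; this is not the PR's actual code):

    import xarray as xr

    def maybe_prepend_batch_dim(
        batch: xr.Dataset, batch_dims: dict, squeeze_batch_dim: bool
    ) -> xr.Dataset:
        # Hypothetical helper: prepend a size-1 "batch" dimension only in
        # the corner case where no batch_dims were given and squeezing has
        # been disabled; otherwise return the batch unchanged.
        if not squeeze_batch_dim and len(batch_dims) == 0:
            return batch.expand_dims("batch")
        return batch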
Actually, if we go with the proposal of default False ...
Hmm, let's get a second opinion from someone who uses Keras. If they agree, we can close this pull request?
Sorry for taking so long to respond here. My understanding of the conversation above is that the purpose of this PR can be broken down into two issues:
If either of these points would benefit from some in-depth discussion, we could put this PR on the agenda for next Monday's Pangeo ML working group meeting.
Hi all, just wondering what the status of this is? I'm writing some xbatcher example notebooks and I'm running into this issue again.
Apologies again for the delay--my attention was drawn elsewhere due to illness, travel, and some other urgent tasks. If I recall correctly, we proposed at our last meeting to always have a defined axis order, including a sample dimension, and to provide the option to specify that no transposing/stacking should occur using ...
See issues #36 and #38.

@jhamman @rabernat I'm not sure if this solves the problem in all cases; can you have a look?