
Should there always be a sample dimension? #127

Open
maxrjones opened this issue Nov 18, 2022 · 6 comments
Labels
question Further information is requested

Comments

@maxrjones
Member

What is your issue?

As shown in the code below, there are a couple of cases in which the batch generator will not include a sample dimension in the returned dataset/dataarray:

  1. If the number of dimensions in the original dataset does not exceed the number specified in input_dims by more than one. In this case, the original dimension names will be retained as long as concat_input_dims=False.
  2. If all dimensions are included in input_dims and concat_input_dims=True. In this case, an extra dimension called input_batch is created along with the original dimensions appended with _input.

```python
# Add a sample dimension if there's anything to get stacked
if (
    generator.concat_input_dims
    and (len(generator.ds.dims) - len(generator.input_dims)) == 0
):
    expected_dims = {**{"input_batch": expected_sample_length}, **expected_dims}
elif (
    generator.concat_input_dims
    or (len(generator.ds.dims) - len(generator.input_dims)) > 1
):
    expected_dims = {**{"sample": expected_sample_length}, **expected_dims}
else:
    expected_dims = {
        **non_specified_ds_dims,
        **non_input_batch_dims,
        **expected_dims,
    }
```

Would it be better to always include a sample dimension in the output batches, for example by renaming the excluded dimension to sample for case 1?
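For case 1, the proposed change could be as simple as a rename of the excluded dimension. A minimal sketch of the idea (the dimension names, sizes, and slicing here are made up for illustration, not xbatcher internals):

```python
import numpy as np
import xarray as xr

# Hypothetical case-1 setup: input_dims covers 'y' and 'x', leaving 'time'
# as the single excluded dimension that acts as the batch dimension.
da = xr.DataArray(np.zeros((4, 16, 16)), dims=("time", "y", "x"))

# One batch as currently returned (original dimension name retained):
batch = da.isel(y=slice(0, 8), x=slice(0, 8))
print(batch.dims)  # ('time', 'y', 'x')

# Versus renaming the excluded dimension, so every batch consistently
# exposes a 'sample' dimension regardless of the input data's names:
batch = batch.rename({"time": "sample"})
print(batch.dims)  # ('sample', 'y', 'x')
```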

@maxrjones maxrjones added the question Further information is requested label Nov 18, 2022
@cmdupuis3

cmdupuis3 commented Jan 18, 2023

I'm looking at this again, and I think the way to understand my PoV is that I want the least friction between the for batch in bgen line and where I call my NN model (e.g., model.fit(...)). In my situation (case 1 I think), the lack of a sample dimension forces you to create a new sample dimension, and I would argue that this is non-trivial, for a few reasons:

  1. It's not obvious to a first-time user that there is a dimension mismatch
  2. When a first-time user's training loop inevitably fails, it's not obvious why there is a dimension mismatch
  3. Creating a new sample dimension requires you to name it something. Ideally it would have the same name as the dimension that was discarded, but in practice, it's probably arbitrary. However, these are things that I don't think a user should even need to think about.
  4. This situation seemingly creates an arbitrary state that the user must manage on their own.

Additionally, as per @weiji14's comments a while back, it's a much lower cognitive load to return an array with an extra sample dimension and use np.squeeze than it is to recreate a dimension inside the training loop.
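The asymmetry is easy to see with plain NumPy (a toy shape, purely for illustration): dropping a length-1 sample dimension is a one-liner, while recreating a missing one requires the user to know which axis the framework expects, which is exactly the friction described above:

```python
import numpy as np

batch = np.zeros((1, 16, 16, 2))  # batch WITH a leading sample dimension

# Dropping an unwanted sample dimension is trivial:
squeezed = np.squeeze(batch, axis=0)
print(squeezed.shape)  # (16, 16, 2)

# Recreating one inside the training loop forces the user to pick the
# right axis and manage it themselves:
restored = np.expand_dims(squeezed, axis=0)
print(restored.shape)  # (1, 16, 16, 2)
```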

@cmdupuis3

As it so happens, there is a bug with xr.Dataset.expand_dims. You can see the bug report here: pydata/xarray#7456

@cmdupuis3

It seems the xarray documentation is partially at fault, at least, according to them.

Apparently, there is no combination of xr.Dataset.expand_dims and xr.Dataset.transpose that will put the sample dimension of the Dataset in the first position. You can use axis=0 to do this for the Dataset's internal arrays, but this transposition does not affect the Dataset itself. According to pydata/xarray#7456, this is the desired behavior (believe it or not). Supposedly, the positions of the Dataset dimensions are arbitrary, but Keras disagrees :(
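A small reproduction of the distinction being described (variable and dimension names here are made up): transpose reorders each variable's axes, but the Dataset-level dims mapping has no user-controlled ordering.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"v": (("y", "x"), np.zeros((2, 3)))})

# expand_dims inserts 'sample' at axis 0 of each variable; transpose then
# enforces the per-variable axis order that Keras/PyTorch expect:
ds2 = ds.expand_dims("sample").transpose("sample", "y", "x")
print(ds2["v"].dims)  # ('sample', 'y', 'x') - the variable itself is fine

# The Dataset-level dims mapping, however, keeps its own ordering, which
# is the "arbitrary position" behavior noted in pydata/xarray#7456:
print(dict(ds2.sizes))
```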

@maxrjones
Member Author

Thanks for your thoughts! I commented on the other thread about the expand_dims/transpose behavior, but more generally you're correct that xbatcher will need to be responsible for the correct ordering of dimensions, as xarray's data model is generally agnostic to axis order.

@cmdupuis3

TBH I actually blame Keras/PyTorch for caring about axis order. So passé!

@cmdupuis3

Thinking about this some more, the current behavior does make sense if we're not considering an ML context. Like, if you wanted to sample a bunch of patches and average each of them, a sample dimension wouldn't make sense.

I'm thinking that we could have BatchGenerator wrappers for the ML libs, and then we can append a sample dimension there. I had a look at the existing ones, but I think they don't have this.
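Such a wrapper could be very thin. A hypothetical sketch (the function name and the dim argument are made up for illustration, not an existing xbatcher API):

```python
import xarray as xr

def ml_batches(bgen, dim="sample"):
    """Hypothetical ML-facing wrapper: yield each batch from an existing
    batch generator with a guaranteed leading sample dimension."""
    for batch in bgen:
        if dim not in batch.dims:
            # insert a length-1 sample dimension at axis 0
            batch = batch.expand_dims(dim)
        yield batch
```

The plain BatchGenerator would keep its current, ML-agnostic behavior, and only the wrapper would opinionate the output shape.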
