AutoEncoderKL output tensor dimension mismatch with Input #498

Open
shankartmv opened this issue Jul 11, 2024 · 3 comments

@shankartmv
I am trying to train an AutoencoderKL model on RGB images with dimensions (3, 1225, 966). Here is the code I use (similar to what is in tutorials/generative/2d_ldm/2d_ldm_tutorial.ipynb):
import torch
from generative.networks.nets import AutoencoderKL  # MONAI Generative Models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

autoencoderkl = AutoencoderKL(
    spatial_dims=2,
    in_channels=3,
    out_channels=3,
    num_channels=(128, 256, 384),
    latent_channels=8,
    num_res_blocks=1,
    attention_levels=(False, False, False),
    with_encoder_nonlocal_attn=False,
    with_decoder_nonlocal_attn=False,
)
autoencoderkl = autoencoderkl.to(device)

The error is raised at the reconstruction-loss line of the training loop (the "Train Model" cell, as in the tutorial notebook):

recons_loss = F.l1_loss(reconstruction.float(), images.float())
RuntimeError: The size of tensor a (964) must match the size of tensor b (966) at non-singleton dimension 3
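For reference, a minimal forward-pass check (added here for illustration, not part of the original report) reproduces the mismatch without any training, assuming the GenerativeModels AutoencoderKL forward pass returns (reconstruction, z_mu, z_sigma):

# Illustrative sketch: push one random image through the untrained autoencoder
# and compare shapes before any loss is computed (names follow the snippet above).
images = torch.rand(1, 3, 1225, 966, device=device)
with torch.no_grad():
    reconstruction, z_mu, z_sigma = autoencoderkl(images)
print(images.shape)          # torch.Size([1, 3, 1225, 966])
print(reconstruction.shape)  # torch.Size([1, 3, 1224, 964])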

Using the torchinfo package, I was able to print the model summary, and the discrepancy shows up in the upsampling layers.

===================================================================================================================
Layer (type:depth-idx) Input Shape Output Shape Param #
===================================================================================================================
AutoencoderKL [1, 3, 1225, 966] [1, 3, 1224, 964] --
├─Encoder: 1-1 [1, 3, 1225, 966] [1, 8, 306, 241] --
│ └─ModuleList: 2-1 -- -- --
│ │ └─Convolution: 3-1 [1, 3, 1225, 966] [1, 128, 1225, 966] 3,584
│ │ └─ResBlock: 3-2 [1, 128, 1225, 966] [1, 128, 1225, 966] 295,680
│ │ └─Downsample: 3-3 [1, 128, 1225, 966] [1, 128, 612, 483] 147,584
│ │ └─ResBlock: 3-4 [1, 128, 612, 483] [1, 256, 612, 483] 919,040
│ │ └─Downsample: 3-5 [1, 256, 612, 483] [1, 256, 306, 241] 590,080
│ │ └─ResBlock: 3-6 [1, 256, 306, 241] [1, 384, 306, 241] 2,312,576
│ │ └─GroupNorm: 3-7 [1, 384, 306, 241] [1, 384, 306, 241] 768
│ │ └─Convolution: 3-8 [1, 384, 306, 241] [1, 8, 306, 241] 27,656
├─Convolution: 1-2 [1, 8, 306, 241] [1, 8, 306, 241] --
│ └─Conv2d: 2-2 [1, 8, 306, 241] [1, 8, 306, 241] 72
├─Convolution: 1-3 [1, 8, 306, 241] [1, 8, 306, 241] --
│ └─Conv2d: 2-3 [1, 8, 306, 241] [1, 8, 306, 241] 72
├─Convolution: 1-4 [1, 8, 306, 241] [1, 8, 306, 241] --
│ └─Conv2d: 2-4 [1, 8, 306, 241] [1, 8, 306, 241] 72
├─Decoder: 1-5 [1, 8, 306, 241] [1, 3, 1224, 964] --
│ └─ModuleList: 2-5 -- -- --
│ │ └─Convolution: 3-9 [1, 8, 306, 241] [1, 384, 306, 241] 28,032
│ │ └─ResBlock: 3-10 [1, 384, 306, 241] [1, 384, 306, 241] 2,656,512
│ │ └─Upsample: 3-11 [1, 384, 306, 241] [1, 384, 612, 482] 1,327,488
│ │ └─ResBlock: 3-12 [1, 384, 612, 482] [1, 256, 612, 482] 1,574,912
│ │ └─Upsample: 3-13 [1, 256, 612, 482] [1, 256, 1224, 964] 590,080
│ │ └─ResBlock: 3-14 [1, 256, 1224, 964] [1, 128, 1224, 964] 476,288
│ │ └─GroupNorm: 3-15 [1, 128, 1224, 964] [1, 128, 1224, 964] 256
│ │ └─Convolution: 3-16 [1, 128, 1224, 964] [1, 3, 1224, 964] 3,459
===================================================================================================================
Total params: 10,954,211
Trainable params: 10,954,211
Non-trainable params: 0
Total mult-adds (Units.TERABYTES): 3.20
===================================================================================================================
Input size (MB): 14.20
Forward/backward pass size (MB): 26803.57
Params size (MB): 43.82
Estimated Total Size (MB): 26861.59
===================================================================================================================
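The summary is consistent with each Downsample block roughly halving a spatial dimension (with floor rounding) while each Upsample block exactly doubles it, so any dimension that is not divisible by 4 (two downsampling levels) cannot round-trip to its original size. A small illustrative check (not part of the issue, and assuming exactly this halve-with-floor / double behaviour):

# Pure arithmetic sketch of the encoder/decoder spatial sizes.
def roundtrip_size(d: int, n_downsamples: int = 2) -> int:
    for _ in range(n_downsamples):
        d = d // 2                     # encoder: floor halving per Downsample
    return d * 2 ** n_downsamples      # decoder: exact doubling per Upsample

for d in (1225, 966, 1024, 720):
    print(d, "->", roundtrip_size(d))
# 1225 -> 1224, 966 -> 964, 1024 -> 1024, 720 -> 720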

@shankartmv (Author)

After some debugging I figured out a way to get around this problem: by resizing my images to 1024×720 (both dimensions divisible by 4), the input and output shapes of the AutoencoderKL (obtained from the torchinfo summary) are consistent. Still, I would like to know the reason behind this error.

@xmhGit

xmhGit commented Aug 6, 2024

I believe this is caused by downsampling and then upsampling data whose spatial dimensions are not powers of 2.

@virginiafdez (Contributor)

I think this happens because the downsamplings divide the spatial dimensions by 2 and the upsamplings multiply them back, so unless you play around with the paddings and strides to make sure things end up the same size, you can run into errors. I would recommend simply padding your inputs to a size that remains divisible by 2 at every downsampling level.
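As an illustration of that suggestion (a sketch only, not an official fix): pad each spatial dimension up to the next multiple of 2 raised to the number of downsampling levels (here, a multiple of 4) before feeding the batch to the autoencoder; MONAI's DivisiblePad transform offers similar behaviour as a transform. A plain-PyTorch version:

# Hedged sketch: zero-pad the spatial dims of an NCHW batch to a multiple of k.
import torch
import torch.nn.functional as F

def pad_to_multiple(x: torch.Tensor, k: int = 4) -> torch.Tensor:
    h, w = x.shape[-2:]
    pad_h = (-h) % k
    pad_w = (-w) % k
    # F.pad pads the last dims first: (w_left, w_right, h_top, h_bottom)
    return F.pad(x, (0, pad_w, 0, pad_h))

images = torch.rand(1, 3, 1225, 966)
print(pad_to_multiple(images).shape)   # torch.Size([1, 3, 1228, 968])

With the padded size divisible by 4, the reconstruction comes back at the padded size and can be cropped to the original resolution before computing the loss, if needed.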

virginiafdez self-assigned this Oct 25, 2024