AutoEncoderKL output tensor dimension mismatch with Input #498

Open
shankartmv opened this issue Jul 11, 2024 · 3 comments

@shankartmv
I am trying to train an AutoencoderKL model on RGB images with dimensions (3, 1225, 966). Here is the code I use (similar to what is in tutorials/generative/2d_ldm/2d_ldm_tutorial.ipynb):
import torch
from generative.networks.nets import AutoencoderKL  # MONAI Generative Models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

autoencoderkl = AutoencoderKL(
    spatial_dims=2,
    in_channels=3,
    out_channels=3,
    num_channels=(128, 256, 384),
    latent_channels=8,
    num_res_blocks=1,
    attention_levels=(False, False, False),
    with_encoder_nonlocal_attn=False,
    with_decoder_nonlocal_attn=False,
)
autoencoderkl = autoencoderkl.to(device)

The error is raised at the reconstruction-loss line of the training loop (the "Train Model" cell, as in the tutorial notebook):

recons_loss = F.l1_loss(reconstruction.float(), images.float())
RuntimeError: The size of tensor a (964) must match the size of tensor b (966) at non-singleton dimension 3
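For reference, a minimal forward-pass check (added here for illustration, not part of the original report) reproduces the mismatch without any training, assuming the GenerativeModels AutoencoderKL forward pass returns (reconstruction, z_mu, z_sigma):

# Illustrative sketch: push one random image through the untrained autoencoder
# and compare shapes before any loss is computed (names follow the snippet above).
images = torch.rand(1, 3, 1225, 966, device=device)
with torch.no_grad():
    reconstruction, z_mu, z_sigma = autoencoderkl(images)
print(images.shape)          # torch.Size([1, 3, 1225, 966])
print(reconstruction.shape)  # torch.Size([1, 3, 1224, 964])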

Using the torchinfo package, I was able to print the model summary, and the discrepancy shows up in the upsampling layers.

===================================================================================================================
Layer (type:depth-idx) Input Shape Output Shape Param #
===================================================================================================================
AutoencoderKL [1, 3, 1225, 966] [1, 3, 1224, 964] --
├─Encoder: 1-1 [1, 3, 1225, 966] [1, 8, 306, 241] --
│ └─ModuleList: 2-1 -- -- --
│ │ └─Convolution: 3-1 [1, 3, 1225, 966] [1, 128, 1225, 966] 3,584
│ │ └─ResBlock: 3-2 [1, 128, 1225, 966] [1, 128, 1225, 966] 295,680
│ │ └─Downsample: 3-3 [1, 128, 1225, 966] [1, 128, 612, 483] 147,584
│ │ └─ResBlock: 3-4 [1, 128, 612, 483] [1, 256, 612, 483] 919,040
│ │ └─Downsample: 3-5 [1, 256, 612, 483] [1, 256, 306, 241] 590,080
│ │ └─ResBlock: 3-6 [1, 256, 306, 241] [1, 384, 306, 241] 2,312,576
│ │ └─GroupNorm: 3-7 [1, 384, 306, 241] [1, 384, 306, 241] 768
│ │ └─Convolution: 3-8 [1, 384, 306, 241] [1, 8, 306, 241] 27,656
├─Convolution: 1-2 [1, 8, 306, 241] [1, 8, 306, 241] --
│ └─Conv2d: 2-2 [1, 8, 306, 241] [1, 8, 306, 241] 72
├─Convolution: 1-3 [1, 8, 306, 241] [1, 8, 306, 241] --
│ └─Conv2d: 2-3 [1, 8, 306, 241] [1, 8, 306, 241] 72
├─Convolution: 1-4 [1, 8, 306, 241] [1, 8, 306, 241] --
│ └─Conv2d: 2-4 [1, 8, 306, 241] [1, 8, 306, 241] 72
├─Decoder: 1-5 [1, 8, 306, 241] [1, 3, 1224, 964] --
│ └─ModuleList: 2-5 -- -- --
│ │ └─Convolution: 3-9 [1, 8, 306, 241] [1, 384, 306, 241] 28,032
│ │ └─ResBlock: 3-10 [1, 384, 306, 241] [1, 384, 306, 241] 2,656,512
│ │ └─Upsample: 3-11 [1, 384, 306, 241] [1, 384, 612, 482] 1,327,488
│ │ └─ResBlock: 3-12 [1, 384, 612, 482] [1, 256, 612, 482] 1,574,912
│ │ └─Upsample: 3-13 [1, 256, 612, 482] [1, 256, 1224, 964] 590,080
│ │ └─ResBlock: 3-14 [1, 256, 1224, 964] [1, 128, 1224, 964] 476,288
│ │ └─GroupNorm: 3-15 [1, 128, 1224, 964] [1, 128, 1224, 964] 256
│ │ └─Convolution: 3-16 [1, 128, 1224, 964] [1, 3, 1224, 964] 3,459
===================================================================================================================
Total params: 10,954,211
Trainable params: 10,954,211
Non-trainable params: 0
Total mult-adds (Units.TERABYTES): 3.20
===================================================================================================================
Input size (MB): 14.20
Forward/backward pass size (MB): 26803.57
Params size (MB): 43.82
Estimated Total Size (MB): 26861.59
===================================================================================================================
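The summary is consistent with each Downsample block roughly halving a spatial dimension (with floor rounding) while each Upsample block exactly doubles it, so any dimension that is not divisible by 4 (two downsampling levels) cannot round-trip to its original size. A small illustrative check (not part of the issue, and assuming exactly this halve-with-floor / double behaviour):

# Pure arithmetic sketch of the encoder/decoder spatial sizes.
def roundtrip_size(d: int, n_downsamples: int = 2) -> int:
    for _ in range(n_downsamples):
        d = d // 2                     # encoder: floor halving per Downsample
    return d * 2 ** n_downsamples      # decoder: exact doubling per Upsample

for d in (1225, 966, 1024, 720):
    print(d, "->", roundtrip_size(d))
# 1225 -> 1224, 966 -> 964, 1024 -> 1024, 720 -> 720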

@shankartmv (Author)

After some debugging I figured out a way to get around this problem: by resizing my images to 1024×720 (both dimensions divisible by 4), the input and output shapes of the AutoencoderKL (obtained from the torchinfo summary) are consistent. Still, I would like to know the reason behind this error.

@xmhGit

xmhGit commented Aug 6, 2024

I believe this is caused by downsampling and then upsampling data whose spatial dimensions are not powers of 2.

@virginiafdez (Contributor)

I think this happens because the downsamplings divide the spatial dimensions by 2 and the upsamplings multiply them back, so unless you play around with the paddings and strides to make sure things end up the same size, you can run into errors. I would recommend simply padding your inputs to a size that remains divisible by 2 at every downsampling level.
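As an illustration of that suggestion (a sketch only, not an official fix): pad each spatial dimension up to the next multiple of 2 raised to the number of downsampling levels (here, a multiple of 4) before feeding the batch to the autoencoder; MONAI's DivisiblePad transform offers similar behaviour as a transform. A plain-PyTorch version:

# Hedged sketch: zero-pad the spatial dims of an NCHW batch to a multiple of k.
import torch
import torch.nn.functional as F

def pad_to_multiple(x: torch.Tensor, k: int = 4) -> torch.Tensor:
    h, w = x.shape[-2:]
    pad_h = (-h) % k
    pad_w = (-w) % k
    # F.pad pads the last dims first: (w_left, w_right, h_top, h_bottom)
    return F.pad(x, (0, pad_w, 0, pad_h))

images = torch.rand(1, 3, 1225, 966)
print(pad_to_multiple(images).shape)   # torch.Size([1, 3, 1228, 968])

With the padded size divisible by 4, the reconstruction comes back at the padded size and can be cropped to the original resolution before computing the loss, if needed.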

virginiafdez self-assigned this Oct 25, 2024