a question about max_attn_resolution and crossattn layer numbers #3

Open
yangyichu opened this issue Jul 12, 2024 · 1 comment

@yangyichu

I see that in the provided example checkpoint, max_attn_resolution is set to 16. During encoding, the image passes through down-blocks at 64x64, 32x32, 16x16, and 16x16, so cross_attn is added twice (after each 16x16 down-block). Yet during decoding, the image passes through 16x16, 32x32, 64x64, and 64x64, so cross_attn is added only once. Is this expected behavior (resulting in an asymmetric encoder and decoder structure)?

@Manchery
Collaborator

Hi, thank you for your interest in our work!

You are correct. There are indeed two cross-attention blocks in the encoder but only one in the decoder. This wasn't an intentional design choice. Initially, the cross-attention mechanism was supposed to be applied to multi-scale features; however, I set max_attn_resolution to 16 mainly to save memory. Despite this, the current architecture performs well in practice. I will run experiments with more cross-attention blocks (e.g., setting max_attn_resolution to 32) to see whether this further improves performance. Thank you for bringing this to my attention!
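
To make the counting concrete, here is a minimal sketch (not the repository's actual code) of how gating cross-attention on max_attn_resolution yields two attention blocks on the encoder side but only one on the decoder side, given the resolution schedules described in this thread. The function name build_stages and the explicit resolution lists are illustrative assumptions.

```python
# Minimal sketch: count cross-attention blocks per side when attention is
# only inserted at resolutions <= max_attn_resolution. Illustrative only;
# names and schedules are assumptions, not the repository's actual code.

def build_stages(resolutions, max_attn_resolution):
    """Return (resolution, has_cross_attn) for each block in the schedule."""
    return [(res, res <= max_attn_resolution) for res in resolutions]


if __name__ == "__main__":
    max_attn_resolution = 16

    # Encoder down-blocks as described in the issue: 64x64 -> 32x32 -> 16x16 -> 16x16.
    encoder = build_stages([64, 32, 16, 16], max_attn_resolution)
    # Decoder up-blocks: 16x16 -> 32x32 -> 64x64 -> 64x64.
    decoder = build_stages([16, 32, 64, 64], max_attn_resolution)

    # Two encoder blocks sit at 16x16, but only one decoder block does,
    # hence the asymmetric number of cross-attention blocks.
    print("encoder cross-attn blocks:", sum(a for _, a in encoder))  # -> 2
    print("decoder cross-attn blocks:", sum(a for _, a in decoder))  # -> 1
```

With max_attn_resolution raised to 32, the same gating rule would add cross-attention at the 32x32 blocks as well, at the cost of higher memory usage.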
