
RuntimeError: The size of tensor a (4096) must match the size of tensor b (500) at non-singleton dimension 2 #67

Open
henanjun opened this issue Jul 12, 2022 · 3 comments


@henanjun

I tried to run inference on a new image of size (2048, 2048), and it raised this error.

HugeBob commented Jul 21, 2022

I am having the same error with images of size 1920x1080, but the message reads "The size of tensor a (1980) must match the size of tensor b (500) at non-singleton dimension 2".

shariqfarooq123 (Owner) commented Oct 25, 2022

This is because there are only 500 learned positional encodings. If you try to infer on an image at a much higher resolution than the default model resolution, the number of tokens in the transformer exceeds 500, and you get the error above.

Proposed resolutions:

  1. (Recommended) Resize your image down to the model resolution (NYU: 640x480, KITTI: 1241x376) and upsample the result (e.g. with bilinear interpolation) back to your resolution of choice.
  2. Interpolate the positional encodings to the required size.
  3. Manually remove the positional encodings from the architecture and check the result. In my experience, positional encodings don't add much to the performance.
  4. If you have a custom high-resolution depth dataset, fine-tune with a larger number of positional encodings (>500; total = H×W/256, where 256 = 16×16 = patch_size × patch_size).
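The recommended option 1 can be sketched in PyTorch. This is a generic sketch, not the repo's API: `infer_at_model_resolution` is a hypothetical helper, and it assumes a model that maps an (N, C, H, W) image tensor to an (N, 1, H', W') depth map.

```python
import torch
import torch.nn.functional as F

def infer_at_model_resolution(model, image, model_size=(480, 640)):
    """Run `model` at its training resolution, then upsample the
    prediction back to the input resolution (option 1 above)."""
    _, _, h, w = image.shape                      # image: (N, C, H, W)
    # Resize the input down to the resolution the model was trained on
    small = F.interpolate(image, size=model_size,
                          mode="bilinear", align_corners=False)
    with torch.no_grad():
        depth = model(small)                      # assumed (N, 1, h', w')
    # Upsample the prediction back to the original resolution
    return F.interpolate(depth, size=(h, w),
                         mode="bilinear", align_corners=False)
```

Bilinear upsampling of the depth map is a reasonable default; for crisper edges you could also try guided filtering, but that is outside the scope of this issue.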

@zydmtaichi


Hi @shariqfarooq123,
could you please share more details about resolution 3? I want to keep the images at their original resolution, but I'm still confused about how to remove the positional encodings from the repo's inference code.
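For keeping the original image resolution, option 2 (interpolating the learned positional embeddings) can be sketched generically as below. The function name and the (seq_len, embed_dim) parameter shape are assumptions for illustration, not the repo's actual API; check the real shape of the positional-encoding parameter in the model before adapting this.

```python
import torch
import torch.nn.functional as F

def resize_positional_embeddings(pos_emb, new_len):
    """Linearly interpolate a learned positional-embedding table from
    its trained sequence length to `new_len` (option 2 above).

    Assumes pos_emb has shape (seq_len, embed_dim) -- an assumption,
    verify against the actual parameter in the repo."""
    # F.interpolate with mode="linear" expects input of shape (N, C, L)
    as_1d = pos_emb.t().unsqueeze(0)              # (1, embed_dim, seq_len)
    resized = F.interpolate(as_1d, size=new_len,
                            mode="linear", align_corners=False)
    return resized.squeeze(0).t()                 # (new_len, embed_dim)
```

For example, a table trained with 500 positions could be stretched to H×W/256 positions for a larger input before loading it back into the model's state dict.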
