Dear Sir,
I am working on transformer models for multi-label image classification, and your paper "ML-Decoder: Scalable and Versatile Classification Head" caught my attention.
However, there is one point in the paper and code that I couldn't understand: the query embeddings are initialized with random values and set as non-learnable. What is the logic behind this? In other works I have seen, the queries are usually derived from some meaningful input, such as image or text embeddings, rather than being random. Is it reasonable to extract the relationship between the image embeddings and fixed random vectors via cross-attention? Could you explain the fundamental idea behind this design? I would be grateful for your help in understanding this issue.
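To make my question concrete, here is a minimal sketch of what I understand the head to be doing. The module, its names, and the one-query-per-class simplification are my own, not the actual repository code:

```python
import torch
import torch.nn as nn

class MinimalDecoderHead(nn.Module):
    """Hypothetical simplification: one fixed random query per class."""

    def __init__(self, num_classes: int, embed_dim: int, num_heads: int = 8):
        super().__init__()
        # Queries are randomly initialized and then frozen (non-learnable).
        self.query_embed = nn.Embedding(num_classes, embed_dim)
        self.query_embed.requires_grad_(False)
        # Cross-attention: the queries attend over the image token embeddings.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Each attended query is projected to a single per-class logit.
        self.classifier = nn.Linear(embed_dim, 1)

    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (batch, num_tokens, embed_dim) from the backbone.
        b = image_tokens.size(0)
        queries = self.query_embed.weight.unsqueeze(0).expand(b, -1, -1)
        attended, _ = self.cross_attn(queries, image_tokens, image_tokens)
        return self.classifier(attended).squeeze(-1)  # (batch, num_classes)
```

My confusion is about the `requires_grad_(False)` step: since the queries receive no gradients, they remain the same random vectors throughout training, and I don't see why attending with them is meaningful.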
Yours sincerely.