save_visual_results in visualBERT #36

guanhdrmq · 2023-08-28T03:03:16Z

Hi Hila,
Could you point me to where the save_visual_results function is defined? I am using ViLT as the multimodal transformer, but I cannot use num_tokens = image_attn_blocks[0].attn_probs.shape[-1] to set the number of tokens. For example, with ViLT on the VQA task and a 384*384 image, the mixed vision-and-text sequence has 185 tokens including the cls token: 144 vision tokens and 40 text tokens (the max length).
Thanks very much
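
To make the token arithmetic above concrete, here is a minimal sketch of how the 185 tokens break down and how a per-token score vector could be split into text and image parts. It is only an illustration under assumptions: the 32-pixel patch size is inferred from 144 = 12 * 12 on a 384*384 image, the ordering (text tokens first, then the cls token, then image patches) is assumed and should be checked against the actual model, and `vilt_token_layout` and `split_scores` are hypothetical helper names, not functions from this repository.

```python
def vilt_token_layout(image_size=384, patch_size=32, max_text_len=40, num_cls=1):
    """Count tokens for a square image split into square patches.

    patch_size=32 is an assumption inferred from 144 = 12 * 12.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    num_image_tokens = side * side
    return {
        "patches_per_side": side,
        "num_image_tokens": num_image_tokens,
        "num_text_tokens": max_text_len,
        "num_cls_tokens": num_cls,
        "total_tokens": max_text_len + num_cls + num_image_tokens,
    }


def split_scores(scores, layout):
    """Split a flat list of per-token scores into text, cls, and image parts.

    Assumed ordering (verify for your model): text, then cls, then image patches.
    The image part is returned as a row-major grid of patch scores.
    """
    if len(scores) != layout["total_tokens"]:
        raise ValueError("scores length does not match the token layout")
    n_text = layout["num_text_tokens"]
    n_cls = layout["num_cls_tokens"]
    side = layout["patches_per_side"]
    text = scores[:n_text]
    cls = scores[n_text:n_text + n_cls]
    image = scores[n_text + n_cls:]
    grid = [image[r * side:(r + 1) * side] for r in range(side)]
    return text, cls, grid


layout = vilt_token_layout()
print(layout["num_image_tokens"], layout["total_tokens"])  # 144 185

# Dummy scores standing in for one row of a relevance matrix.
text_scores, cls_scores, image_grid = split_scores(list(range(185)), layout)
print(len(text_scores), len(cls_scores), len(image_grid), len(image_grid[0]))
```

With a layout like this, the number of image tokens comes from the image and patch sizes rather than from the last dimension of attn_probs, which in a single-stream model covers text and image tokens together.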
