
Readability of CLIP notebook #28

Open
wants to merge 1 commit into main
Conversation

josh-freeman
Contributor

- Deleted the modification of `clip.clip.MODELS`, as it is no longer needed (all models are now there by default; see the last commit).

- 17 lines of code were duplicated in `interpret`. I replaced them with a function, and added to that function an assertion that checks whether the model comes from Hila Chefer's version of CLIP or an equivalent (https://github.com/hila-chefer/Transformer-MM-Explainability/tree/main/CLIP).

- Added a function `mask_from_relevance` for users who want to display the attention mask in a way other than `show_cam_on_image`. A sketch of what both helpers might look like follows below.
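
A minimal sketch of the kind of helpers described above, purely for illustration: the names, signatures, and the gradient-weighted rollout loop are assumptions based on the existing CLIP notebook, not necessarily what this PR implements. The shared loop over the attention blocks is factored into one function that first asserts the blocks expose `attn_probs` (only Hila Chefer's fork of CLIP caches them), and `mask_from_relevance` returns the relevance map at image resolution without the colour overlay that `show_cam_on_image` applies:

```python
import torch
import torch.nn.functional as F


def relevance_from_blocks(attn_blocks, one_hot, batch_size, device):
    """Gradient-weighted attention rollout over a list of residual attention
    blocks (the loop the PR description says was duplicated in `interpret`)."""
    # Only Hila Chefer's fork of CLIP caches the attention maps on each block.
    assert all(hasattr(blk, "attn_probs") for blk in attn_blocks), (
        "model must come from https://github.com/hila-chefer/"
        "Transformer-MM-Explainability/tree/main/CLIP (attn_probs missing)"
    )
    num_tokens = attn_blocks[0].attn_probs.shape[-1]
    R = torch.eye(num_tokens, device=device).unsqueeze(0).expand(batch_size, -1, -1)
    for blk in attn_blocks:
        grad = torch.autograd.grad(one_hot, [blk.attn_probs], retain_graph=True)[0].detach()
        cam = blk.attn_probs.detach()
        cam = (grad * cam).reshape(batch_size, -1, num_tokens, num_tokens)
        cam = cam.clamp(min=0).mean(dim=1)   # keep positive contributions, average over heads
        R = R + torch.bmm(cam, R)            # rollout update
    return R


def mask_from_relevance(image_relevance, image_size=224):
    """Turn the per-patch relevance of the [CLS] token into a 2-D mask at image
    resolution, without the colour overlay that `show_cam_on_image` applies."""
    side = int(image_relevance.numel() ** 0.5)   # e.g. 7x7 patches for ViT-B/32 at 224 px
    mask = image_relevance.reshape(1, 1, side, side).float()
    mask = F.interpolate(mask, size=image_size, mode="bilinear", align_corners=False)
    mask = (mask - mask.min()) / (mask.max() - mask.min() + 1e-8)   # normalise to [0, 1]
    return mask.squeeze()
```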
@josh-freeman
Contributor Author

Oh, I forgot to mention: I also added a bit of documentation to the `interpret` function.

@hila-chefer
Owner

Hi @josh-freeman, thanks for your contribution to this repo! It'll take me some time to review and approve your PR since it contains a significant number of changes; I'll get to it ASAP.

@josh-freeman
Contributor Author

No worries. I'm pretty sure most of the reported diff is an artifact of something like a CRLF-to-LF line-ending conversion; I'm surprised it says I changed that much.

@guanhdrmq
Copy link

Dear all,

I have a problem with ViLT. I am trying to reproduce the VisualBERT example and implement it for ViLT. Could you point me to where the `save_visual_results` function is defined? I use ViLT as the multimodal transformer, but I cannot use `num_tokens = image_attn_blocks[0].attn_probs.shape[-1]` to set the number of tokens. For example, for ViLT on the VQA task with a 384×384 image, the number of mixed vision and text tokens is 185 including the CLS token: 144 vision tokens and 40 text tokens (the max length).

Thanks very much
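
For reference, the token bookkeeping in the question above works out as follows (a minimal sketch; the 144/40 split is taken from the comment, and the [CLS]-first ordering is an assumption, not ViLT's documented layout):

```python
import torch

# Counts stated above: ViLT on VQA with a 384x384 image and 32x32 patches gives
# (384 // 32) ** 2 = 144 image tokens, plus 40 text tokens (max length) and one
# [CLS] token -> 185 joint tokens.
IMAGE_SIZE, PATCH_SIZE, TEXT_MAX_LEN = 384, 32, 40
num_image_tokens = (IMAGE_SIZE // PATCH_SIZE) ** 2    # 144
num_tokens = 1 + TEXT_MAX_LEN + num_image_tokens      # 185

# Hypothetical relevance row of the [CLS] token over all joint tokens, e.g. one
# row of an attention-rollout matrix of shape (num_tokens, num_tokens).
cls_relevance = torch.rand(num_tokens)

# Split it back into modalities (ordering [CLS], text, image is an assumption).
text_relevance = cls_relevance[1:1 + TEXT_MAX_LEN]    # 40 values
image_relevance = cls_relevance[1 + TEXT_MAX_LEN:]    # 144 values
patch_grid = image_relevance.reshape(IMAGE_SIZE // PATCH_SIZE, IMAGE_SIZE // PATCH_SIZE)  # 12x12 map
```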
