Review of the article "When and why vision-language models behave like bags-of-words, and what to do about it?" by Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou.
- ARO benchmark (Attribution, Relation, and Order)
- NegCLIP
- CLIP
- Evaluation of CLIP and NegCLIP on ARO and CIFAR100 (see the evaluation sketch after this list)
- NegCLIP_FTXT - a method we propose, derived from NegCLIP, for fine-tuning CLIP so that it discriminates better between captions and their permuted versions (a sketch of such a training objective appears at the end of this document)
- Evaluation of NegCLIP_FTXT on ARO and CIFAR100
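As a concrete illustration of the evaluations above, the sketch below scores each image against its original caption and a word-order-permuted version, and reports how often the original caption wins (the ARO-style protocol). This is a minimal sketch assuming OpenAI's `clip` package; the `aro_samples` iterable and its fields are hypothetical placeholders, not the actual ARO data loaders.

```python
# Minimal sketch, assuming OpenAI's `clip` package (https://github.com/openai/CLIP).
# `aro_samples` is a hypothetical iterable of (image_path, true_caption, permuted_caption).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def caption_vs_permutation_accuracy(aro_samples):
    """Fraction of samples where the image is closer to its true caption
    than to the word-order-permuted caption."""
    correct, total = 0, 0
    with torch.no_grad():
        for image_path, true_caption, permuted_caption in aro_samples:
            image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
            texts = clip.tokenize([true_caption, permuted_caption]).to(device)

            image_feat = model.encode_image(image)
            text_feats = model.encode_text(texts)

            # Cosine similarity between the image and each of the two captions
            image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
            text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
            sims = (image_feat @ text_feats.T).squeeze(0)

            correct += int(sims[0] > sims[1])  # did the true caption rank first?
            total += 1
    return correct / total
```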
The work is summarized in a poster: poster.pdf
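Below is a minimal sketch of the kind of training objective referred to in the NegCLIP_FTXT item: a CLIP-style contrastive loss where each caption's word-order permutation is appended to the text batch as an explicit hard negative. It assumes OpenAI's `clip` package; `permute_caption`, the optimizer, and the batch variables are hypothetical placeholders, and this is not the exact NegCLIP_FTXT recipe from the poster (for instance, only the image-to-text direction of the loss is shown).

```python
# Minimal sketch, assuming OpenAI's `clip` package. Not the exact NegCLIP_FTXT
# recipe: permuted captions are simply appended to the text batch as hard negatives.
import random
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # GPU weights load in fp16; use fp32 for fine-tuning

def permute_caption(caption: str) -> str:
    """Hypothetical hard-negative generator: shuffle the caption's word order."""
    words = caption.split()
    random.shuffle(words)
    return " ".join(words)

def training_step(images, captions, optimizer):
    """images: preprocessed tensor of shape (B, 3, H, W); captions: list of B strings."""
    negatives = [permute_caption(c) for c in captions]
    texts = clip.tokenize(captions + negatives).to(device)   # 2B texts
    images = images.to(device)

    image_feats = F.normalize(model.encode_image(images), dim=-1)  # (B, d)
    text_feats = F.normalize(model.encode_text(texts), dim=-1)     # (2B, d)

    logits = model.logit_scale.exp() * image_feats @ text_feats.T  # (B, 2B)

    # Image i's positive is caption i; its permuted caption sits at index B + i
    # and acts as an explicit hard negative in the same softmax.
    labels = torch.arange(len(captions), device=device)
    loss = F.cross_entropy(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```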