
Best Practice

We strongly recommend using VLMEvalKit for its useful features and ready-to-use LVLM implementations.

MMIU

Quick Start | HomePage | arXiv | Dataset | Citation

This repository is the official implementation of MMIU.

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Fanqing Meng*, Jin Wang*, Chuanhao Li*, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang#, Wenqi Shao#
* MFQ, WJ, and LCH contribute equally.
# SWQ ([email protected]) and ZKP ([email protected]) are corresponding authors.

💡 News

Introduction

The Multimodal Multi-image Understanding (MMIU) benchmark is a comprehensive evaluation suite designed to assess LVLMs across a wide range of multi-image tasks. MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions, making it the most extensive benchmark of its kind.
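To make the data format concrete, the sketch below shows a hypothetical multiple-choice sample and one way it could be flattened into the image-paths-plus-question list used by the evaluation code in the Quick Start section. The field names (images, options, answer, etc.) are illustrative assumptions, not the exact schema of the released dataset.

# Hypothetical MMIU-style sample; field names are illustrative, not the official schema.
sample = {
    "task": "image_ordering",                      # one of the 52 tasks
    "images": ["frames/001.jpg", "frames/002.jpg", "frames/003.jpg"],
    "question": "Which option gives the correct temporal order of the images?",
    "options": {"A": "1-2-3", "B": "3-2-1", "C": "2-1-3", "D": "1-3-2"},
    "answer": "A",
}

def build_model_input(sample: dict) -> list:
    """Flatten a sample into image paths followed by the question text,
    matching the `tmp = image_paths + [question]` convention in Quick Start."""
    option_text = "\n".join(f"{k}. {v}" for k, v in sample["options"].items())
    prompt = f"{sample['question']}\n{option_text}\nAnswer with the option letter."
    return sample["images"] + [prompt]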

Evaluation Results Overview

  • The closed-source proprietary model GPT-4o from OpenAI takes the leading position on MMIU, surpassing other models such as InternVL2-pro, InternVL1.5-chat, Claude3.5-Sonnet, and Gemini1.5 Flash. Note that InternVL2-pro is the best-performing open-source model.

  • Some powerful LVLMs, such as InternVL1.5 and GLM-4V, whose pre-training data contain no multi-image content, even outperform many multi-image models that undergo multi-image supervised fine-tuning (SFT), indicating that strong single-image understanding is the foundation of multi-image comprehension.

  • By comparing performance at the level of image relationships, we find that LVLMs excel at understanding semantic content in multi-image scenarios but are weaker at comprehending temporal and spatial relationships across images.

  • The analysis based on the task map reveals that models perform better on high-level, in-domain understanding tasks such as video captioning, but struggle with out-of-domain tasks such as 3D perception (e.g., 3D detection) and temporal reasoning (e.g., image ordering).

  • The task learning-difficulty analysis shows that tasks involving ordering, retrieval, and massive numbers of images cannot be overfitted by simple SFT, suggesting that additional pre-training data or training techniques are needed for improvement.

🏆 Leaderboard

Rank Model Score
1 GPT4o 55.72
2 Gemini 53.41
3 Claude3 53.38
4 InternVL2 50.30
5 Mantis 45.58
6 Gemini1.0 40.25
7 internvl1.5-chat 37.39
8 Llava-interleave 32.37
9 idefics2_8b 27.80
10 glm-4v-9b 27.02
11 deepseek_vl_7b 24.64
12 XComposer2_1.8b 23.46
13 deepseek_vl_1.3b 23.21
14 flamingov2 22.26
15 llava_next_vicuna_7b 22.25
16 XComposer2 21.91
17 MiniCPM-Llama3-V-2_5 21.61
18 llava_v1.5_7b 19.19
19 sharegpt4v_7b 18.52
20 sharecaptioner 16.10
21 qwen_chat 15.92
22 monkey-chat 13.74
23 idefics_9b_instruct 12.84
24 qwen_base 5.16
- Frequency Guess 31.5
- Random Guess 27.4

🚀 Quick Start

We mainly use the VLMEvalKit framework for testing, with a few models tested separately. Each group below requires a specific pinned version of transformers (a minimal version-check sketch follows these lists). Specifically, for multi-image models, we include the following:

transformers == 4.33.0

  • XComposer2
  • XComposer2_1.8b
  • qwen_base
  • idefics_9b_instruct
  • qwen_chat
  • flamingov2

transformers == 4.37.0

  • deepseek_vl_1.3b
  • deepseek_vl_7b

transformers == 4.40.0

  • idefics2_8b

For single-image models, we include the following:

transformers == 4.33.0

  • sharecaptioner
  • monkey-chat

transformers == 4.37.0

  • sharegpt4v_7b
  • llava_v1.5_7b
  • glm-4v-9b

transformers == 4.40.0

  • llava_next_vicuna_7b
  • MiniCPM-Llama3-V-2_5
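
Since these groups pin different transformers versions, a quick guard against running a model in the wrong environment can save a failed evaluation. The mapping below is a minimal sketch assuming the version pins listed above; it is not part of the official scripts, and only a few models are shown as examples.

# Minimal environment guard (a sketch, not part of the official MMIU scripts).
# Assumes the transformers version pins listed above (4.33.0 / 4.37.0 / 4.40.0).
import transformers

REQUIRED_TRANSFORMERS = {
    "XComposer2": "4.33.0",
    "qwen_chat": "4.33.0",
    "deepseek_vl_7b": "4.37.0",
    "glm-4v-9b": "4.37.0",
    "idefics2_8b": "4.40.0",
    "llava_next_vicuna_7b": "4.40.0",
}

def check_transformers_version(model_name: str) -> None:
    """Raise if the installed transformers version does not match the pin for this model."""
    required = REQUIRED_TRANSFORMERS.get(model_name)
    if required and transformers.__version__ != required:
        raise RuntimeError(
            f"{model_name} expects transformers=={required}, "
            f"but transformers=={transformers.__version__} is installed."
        )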

We use the VLMEvalKit framework for testing; you can refer to the code in VLMEvalKit/test_models.py. Additionally, for closed-source models, please replace the following part of the code, following the example here:

response = model.generate(tmp) # tmp = image_paths + [question]
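
As a concrete illustration, the sketch below shows one way a closed-source model could implement the generate(tmp) interface used above, here with OpenAI's GPT-4o API via the official openai Python client. The wrapper class and its details are assumptions for illustration, not the repository's actual implementation.

# Illustrative wrapper for a closed-source model behind the `model.generate(tmp)` call above,
# where tmp = image_paths + [question]. This is a sketch, not the repository's implementation.
import base64
from openai import OpenAI  # assumes the official openai>=1.0 Python client

class GPT4oWrapper:
    def __init__(self, model_name: str = "gpt-4o"):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model_name = model_name

    def generate(self, tmp: list) -> str:
        image_paths, question = tmp[:-1], tmp[-1]
        content = [{"type": "text", "text": question}]
        for path in image_paths:
            with open(path, "rb") as f:
                b64 = base64.b64encode(f.read()).decode("utf-8")
            content.append(
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
            )
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": content}],
        )
        return response.choices[0].message.content

Usage then mirrors the line above: response = GPT4oWrapper().generate(image_paths + [question]).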

For other open-source models, we have provided reference code for Mantis and InternVL1.5-chat. For LLaVA-Interleave, please refer to the original repository.
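
Since MMIU questions are multiple-choice, the final step is mapping each model response to an option letter and computing accuracy. The sketch below shows one simple heuristic for doing that; the exact answer-matching rules used for the leaderboard may differ.

# Simple multiple-choice scoring sketch; the official answer-matching rules may differ.
import re

def extract_choice(response: str, letters: str = "ABCD") -> str | None:
    """Heuristically pull the first option letter out of a free-form model response."""
    match = re.search(rf"\b([{letters}])\b", response.strip().upper())
    return match.group(1) if match else None

def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of responses whose extracted option letter matches the ground truth."""
    correct = sum(extract_choice(p) == a for p, a in zip(predictions, answers))
    return correct / len(answers)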

💐 Acknowledgement

We express sincere gratitude to the following projects:

  • VLMEvalKit provides useful out-of-the-box tools and implements many advanced LVLMs. Thanks for their selfless dedication.
  • The InternVL team for providing APIs.

📧 Contact

If you have any questions, feel free to contact Fanqing Meng at [email protected].

🖊️ Citation

If you find MMIU useful for your project or research, please kindly use the following BibTeX entry to cite our paper. Thanks!

@article{meng2024mmiu,
  title={MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models},
  author={Meng, Fanqing and Wang, Jin and Li, Chuanhao and Lu, Quanfeng and Tian, Hao and Liao, Jiaqi and Zhu, Xizhou and Dai, Jifeng and Qiao, Yu and Luo, Ping and others},
  journal={arXiv preprint arXiv:2408.02718},
  year={2024}
}
