Run Luo1,2*, Haonan Zhang3*, Longze Chen1,2*, Ting-En Lin3*,
Xiong Liu3, Yuchuan Wu3, Min Yang1,2🌟, Yongbin Li3🌟,
Minzheng Wang2, Pengpeng Zeng4, Lianli Gao5, Heng Tao Shen4,
Yunshui Li1,2, Xiaobo Xia6, Fei Huang3, Jingkuan Song4🌟
* Equal contribution 🌟 Corresponding author
1 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Alibaba Group
4 Tongji University
5 Independent Researcher
6 The University of Sydney
- [11/10]🔥MMEvol is coming! We release the code, models, and data for MMEvol!
- [09/09]🔥MMEvol is coming! We release the paper for MMEvol!
Please follow the instructions below to install the required packages.
- Clone this repository
git clone https://github.com/RainBowLuoCS/MMEvol.git
cd MMEvol
- Install Package
conda create -n llava-next python=3.10 -y
conda activate llava-next
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
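After installation, a quick import check can confirm that the key dependencies are in place. This is a minimal sketch, assuming the "[train]" extras above were installed and a CUDA GPU is available:

```python
# Sanity check that the core training dependencies are importable.
# Assumes the "[train]" extras and flash-attn were installed as above.
import torch
import flash_attn

print("torch:", torch.__version__)
print("flash-attn:", flash_attn.__version__)
print("CUDA available:", torch.cuda.is_available())
```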
The hyperparameters used in pretraining and finetuning are provided below.
Stage | Global Batch Size | LLM lr | Projector lr | Vision Tower lr | Epochs | Max length | Weight decay |
---|---|---|---|---|---|---|---|
PT | 256 | 0 | 1e-3 | 0 | 1 | 4096 | 0 |
FT | 128 | 2e-5 | 2e-5 | 2e-6 | 1 | 4096 | 0 |
Below are the pretrained projector weights and the instruction-tuned checkpoints.
Model | Pretrained Projector | Base LLM | PT Data | IT Data | Download |
---|---|---|---|---|---|
MMEvol-Qwen2-7B | mm_projector | Qwen2-7B | LLaVA-Pretrain | MMEvol | ckpt |
MMEvol-LLaMA3-8B | mm_projector | LLaMA3-8B | LLaVA-Pretrain | MMEvol | ckpt |
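For a quick local test of a downloaded checkpoint, the sketch below assumes this codebase keeps LLaVA's loader interface (llava.model.builder.load_pretrained_model); the checkpoint path is a placeholder for wherever you saved the weights:

```python
# Minimal sketch of loading a downloaded checkpoint with the LLaVA-style loader.
# Assumes llava.model.builder.load_pretrained_model is available in this repo;
# "checkpoints/MMEvol-Qwen2-7B" is a placeholder for your local download path.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "checkpoints/MMEvol-Qwen2-7B"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path, None, get_model_name_from_path(model_path)
)
print("context length:", context_len)
```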
Benchmarks supported by VLMEvalKit (OpenCompass)
Model | MME_C | MMStar | HallBench | MathVista_mini | MMMU_val | AI2D | POPE | BLINK | RWQA |
---|---|---|---|---|---|---|---|---|---|
MMEvol-LLaMA3-8B | 47.8 | 50.1 | 62.3 | 50.0 | 40.8 | 73.9 | 86.8 | 46.4 | 62.6 |
MMEvol-Qwen2-7B | 55.8 | 51.6 | 64.1 | 52.4 | 45.1 | 74.7 | 87.8 | 47.7 | 63.9 |
Benchmarks not supported by VLMEvalKit (VQA datasets)
Model | VQA_v2 | GQA | MIA | MMSInst |
---|---|---|---|---|
MMEvol-LLaMA3-8B | 83.4 | 65.0 | 78.8 | 32.3 |
MMEvol-Qwen2-7B | 83.1 | 65.5 | 77.6 | 41.8 |
Please follow LLaVA to prepare the corresponding images and data.
datasets
├── json
│ ├── allava_vflan.json
│ ├── arxivqa.json
│ ├── cambrain_math_code.json
│ ├── data_engine.json
│ ├── shargpt_40k.json
│ ├── tabmwp.json
│ ├── wizardlm_143k.json
│ ├── mmevol_seed_no_evol_163k.json
│ ├── mmevol_evol_480k.json
│ └── mix_evol_sft.json
├── ai2d
│ ├── abc_images
│ ├── annotations
│ ├── images
│ ├── questions
│ └── categories.json
├── alfword
│ ├── alf-image-id-0
│ ├── alf-image-id-1
│ ├── alf-image-id-2
│ ├── alf-image-id-3
│ └── alf-image-id-4
├── allava_vflan
│ └── images
├── arxivqa
│ └── images
├── chartqa
│ ├── test
│ ├── train
│ └── val
├── coco
│ ├── train2014
│ ├── train2017
│ ├── val2014
│ └── val2017
├── clevr
│ ├── CLEVR_GoGenT_v1.0
│ └── CLEVR_v1.0
├── data_engine
│ ├── partI
│ ├── partII
│ └── partIII
├── design2code
│ └── images
├── docvqa
│ ├── test
│ ├── train
│ └── val
├── dvqa
│ └── images
├── geo170k
│ ├── images/geo3k
│ └── images/geoqa_plus
├── geoqa+
│ └── images
├── gpt4v-dataset
│ └── images
├── gqa
│ └── images
├── hfdata
│ └── ....
├── llava
│ └── llava_pretrain/images
├── llavar
│ └── finetune
├── mathvision
│ └── images
├── ocr_vqa
│ └── images
├── Q-Instruct-DB
│ ├── livefb_liveitw_aigc
│ └── spqa_koniq
├── sam
│ └── images
├── scienceqa
│ └── images
├── share_textvqa
│ └── images
├── synthdog-en
│ └── images
├── tabmwp
│ └── tables
├── textbookqa
│ └── tqa_train_val_test
├── textvqa
│ └── train_images
├── vg
│ ├── VG_100K
│ └── VG_100K_2
├── vizwiz
│ └── train
├── web-celebrity
│ └── images
├── web-landmark
│ └── images
└── wikiart
    └── images
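Before training, it can help to verify that this layout is in place. Below is a small sketch, assuming the default datasets/ root shown above; the folder list is a subset of the tree and can be extended:

```python
# Check that the expected top-level dataset folders exist before training.
# DATA_ROOT and EXPECTED mirror the layout above; adjust them to your setup.
from pathlib import Path

DATA_ROOT = Path("datasets")
EXPECTED = [
    "json", "ai2d", "allava_vflan", "arxivqa", "chartqa", "coco", "clevr",
    "docvqa", "dvqa", "geo170k", "gqa", "llava", "ocr_vqa", "sam",
    "scienceqa", "share_textvqa", "synthdog-en", "textvqa", "vg", "vizwiz",
]

missing = [name for name in EXPECTED if not (DATA_ROOT / name).is_dir()]
print("missing folders:", missing or "none")
```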
mmevol_evol_480k.json is the 480K evolved data generated from the seed data mmevol_seed_no_evol_163k.json. You can freely combine it with other data such as allava_vflan.json for instruction tuning (IT) according to your preferences, or directly use our mixed mix_evol_sft.json for training.
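As a sketch of how a custom mix could be assembled, assuming each JSON file is a list of LLaVA-style conversation records (mix_custom_sft.json is a hypothetical output name to point the finetune script at):

```python
# Concatenate several instruction-tuning JSON files into one custom mix.
# Assumes each file is a JSON list of conversation records (LLaVA format);
# mix_custom_sft.json is a hypothetical output name.
import json

SOURCES = [
    "datasets/json/mmevol_evol_480k.json",
    "datasets/json/allava_vflan.json",
]

mixed = []
for path in SOURCES:
    with open(path) as f:
        mixed.extend(json.load(f))

with open("datasets/json/mix_custom_sft.json", "w") as f:
    json.dump(mixed, f)

print(f"wrote {len(mixed)} samples")
```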
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions here and organize the data following the Preparation section before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).
bash scripts/v1_6/train/llama3/pretrain.sh
bash scripts/v1_6/train/qwen2/pretrain.sh
Please make sure you have downloaded and organized the data following the Preparation section before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).
bash scripts/v1_6/train/llama3/finetune.sh
bash scripts/v1_6/train/qwen2/finetune.sh
First, enter the vlmevalkit directory and install all dependencies:
cd vlmevalkit
pip install -r requirements.txt
Then, run script/run_inference.sh, which receives three input parameters in sequence: MODELNAME, DATALIST, and MODE. MODELNAME represents the name of the model, DATALIST represents the datasets used for inference, and MODE represents the evaluation mode:
chmod +x ./script/run_inference.sh
./script/run_inference.sh $MODELNAME $DATALIST $MODE
The two available choices for MODELNAME are listed in vlmeval/config.py:
ungrouped = {
'MMEvol-Llama3-V-1_6': partial(LLaVA_Llama3_V, model_path="checkpoints/xxx/checkpoint-14000"),
'MMEvol-Qwen2-V-1_6': partial(LLaVA_Qwen2_V, model_path="checkpoints/xxx/checkpoint-14000"),
}
All available choices for DATALIST are listed in vlmeval/utils/dataset_config.py. When evaluating on a single dataset, pass the dataset name directly without quotation marks; when evaluating on multiple datasets, separate the names with spaces and wrap the whole list in quotation marks:
$DATALIST="MME MMMU_DEV_VAL MathVista_MINI RealWorldQA MMStar AI2D_TEST HallusionBench POPE BLINK"
To score each benchmark directly, set MODE=all. If only inference results are required, set MODE=infer. To reproduce the results in the tables above (the columns from MME_C through RWQA), run the script with the following settings:
# run on all 9 datasets
./script/run_inference.sh MMEvol-Llama3-V-1_6 "MME MMMU_DEV_VAL MathVista_MINI RealWorldQA MMStar AI2D_TEST HallusionBench POPE BLINK" all
# The following are instructions for running on a single dataset
# MME
./script/run_inference.sh MMEvol-Llama3-V-1_6 MME all
# MMMU_DEV_VAL
./script/run_inference.sh MMEvol-Llama3-V-1_6 MMMU_DEV_VAL all
# MathVista_MINI
./script/run_inference.sh MMEvol-Llama3-V-1_6 MathVista_MINI all
.....
# NOTE: run llava/eval/blink_eval.py separately for BLINK evaluation.
python llava/eval/blink_eval.py
For the VQAv2 and GQA datasets, please follow LLaVA for evaluation.
For MIA and MMSInst, first download the datasets and then run the following scripts for evaluation:
cd mmevol
# test
python llava/eval/model_vqa_mia.py
python llava/eval/model_vqa_mminst.py
# eval
python llava/eval/mia_eval.py
python llava/eval/mminst_eval.py
Tongyi-ConvAI generated this dataset for multi-modal supervised fine-tuning; it was used to train the Evol-Llama3-8B-Instruct and Evol-Qwen2-7B models reported in our paper. To create it, we first selected a 163K seed instruction-tuning dataset (SEED-163K) for Evol-Instruct, and then enhanced data quality through an iterative process that combines fine-grained perception, cognitive reasoning, and interaction evolution. This process yields a more complex and diverse image-text instruction dataset, which in turn equips MLLMs with enhanced capabilities. Below we show the detailed data distribution of SEED-163K, which is prepared for the multi-round evolution described above. More details can be found in our paper.
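As a purely conceptual illustration of such a multi-round evolution loop (not the actual MMEvol pipeline: evolve_sample stands in for an LLM call with one of the three evolution prompts, and the round count, strategy labels, and output file name are illustrative assumptions; any filtering or elimination steps are omitted):

```python
# Conceptual sketch of iterating evolution strategies over a seed set.
# evolve_sample is a placeholder for an LLM rewrite; nothing here reproduces
# the actual MMEvol prompts or filtering.
import json
import random

STRATEGIES = ["fine-grained perception", "cognitive reasoning", "interaction"]

def evolve_sample(sample: dict, strategy: str) -> dict:
    """Placeholder: would rewrite the instruction with an LLM using `strategy`."""
    evolved = dict(sample)
    evolved["evolution_trace"] = evolved.get("evolution_trace", []) + [strategy]
    return evolved

with open("datasets/json/mmevol_seed_no_evol_163k.json") as f:
    pool = json.load(f)

for _ in range(3):  # illustrative number of evolution rounds
    pool = [evolve_sample(s, random.choice(STRATEGIES)) for s in pool]

with open("evolved_sketch.json", "w") as f:  # hypothetical output name
    json.dump(pool, f)
```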
- Release MMEvol-10M
- Release training & evaluation code
- Release model weight
- Release evolved dataset MMEvol-480K
If you find this repo useful for your research, please consider citing the paper:
@article{luo2024mmevol,
title={MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct},
author={Luo, Run and Zhang, Haonan and Chen, Longze and Lin, Ting-En and Liu, Xiong and Wu, Yuchuan and Yang, Min and Wang, Minzheng and Zeng, Pengpeng and Gao, Lianli and others},
journal={arXiv preprint arXiv:2409.05840},
year={2024}
}
If you have any questions, please consider the following contacts for help:
- Run Luo — [email protected]
- Haonan Zhang — [email protected]
- LLaVA: the codebase we built upon. Thanks for their brilliant contributions to the community! We just can't wait to use LLaVA-NeXT.
- VLMEvalKit: the amazing open-source suite for evaluating various LMMs!