Run Luo1,2*, Haonan Zhang3*, Longze Chen1,2*, Ting-En Lin3*,
Xiong Liu3, Yuchuan Wu3, Min Yang1,2†, Yongbin Li3†,
Minzheng Wang2, Pengpeng Zeng4, Lianli Gao5, Heng Tao Shen4,
Yunshui Li1,2, Xiaobo Xia6, Fei Huang3, Jingkuan Song4†
* Equal contribution † Corresponding author
1 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Alibaba Group
4 Tongji University
5 Independent Researcher
6 The University of Sydney
- [11/10] 🔥 MMEvol is coming! We release the code, models, and data for MMEvol!
- [09/09] 🔥 MMEvol is coming! We release the paper for MMEvol!
Please follow the instructions below to install the required packages.
- Clone this repository
git clone https://github.com/RainBowLuoCS/MMEvol.git
cd MMEvol
- Install Package
conda create -n llava-next python=3.10 -y
conda activate llava-next
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
The hyperparameters used for both pretraining (PT) and finetuning (FT) are provided below.
Hyperparameter | Global Batch Size | LLM lr | Projector lr | Vision Tower lr | Epochs | Max length | Weight decay |
---|---|---|---|---|---|---|---|
PT | 256 | 0 | 1e-3 | 0 | 1 | 4096 | 0 |
FT | 128 | 2e-5 | 2e-5 | 2e-6 | 1 | 4096 | 0 |
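The global batch size in the table is normally the product of the number of GPUs, the per-device batch size, and the gradient accumulation steps. The sketch below shows that arithmetic only; the GPU counts and per-device batch sizes are hypothetical values chosen for illustration, not settings taken from the training scripts.

```python
def gradient_accumulation_steps(global_batch_size: int,
                                num_gpus: int,
                                per_device_batch_size: int) -> int:
    """Return the accumulation steps needed to reach the global batch size."""
    samples_per_step = num_gpus * per_device_batch_size
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not a multiple of "
            f"{samples_per_step} samples per optimizer micro-step"
        )
    return global_batch_size // samples_per_step


# Hypothetical hardware: 8 GPUs, 8 samples per GPU for pretraining.
print(gradient_accumulation_steps(256, num_gpus=8, per_device_batch_size=8))  # 4
# Hypothetical hardware: 8 GPUs, 4 samples per GPU for finetuning.
print(gradient_accumulation_steps(128, num_gpus=8, per_device_batch_size=4))  # 4
```

If you change the number of GPUs, adjust the per-device batch size or the accumulation steps so that the product still matches the table.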
The pretrained weights and instruction-tuning weights are listed below.
Model | Pretrained Projector | Base LLM | PT Data | IT Data | Download |
---|---|---|---|---|---|
MMEvol-Qwen2-7B | mm_projector | Qwen2-7B | LLaVA-Pretrain | MMEvol | ckpt |
MMEvol-LLaMA3-8B | mm_projector | LLaMA3-8B | LLaVA-Pretrain | MMEvol | ckpt |
Benchmarks supported by VLMEvalKit (OpenCompass)
Model | MME_C | MMStar | HallBench | MathVista_mini | MMMU_val | AI2D | POPE | BLINK | RWQA |
---|---|---|---|---|---|---|---|---|---|
MMEvol-LLaMA3-8B | 47.8 | 50.1 | 62.3 | 50.0 | 40.8 | 73.9 | 86.8 | 46.4 | 62.6 |
MMEvol-Qwen2-7B | 55.8 | 51.6 | 64.1 | 52.4 | 45.1 | 74.7 | 87.8 | 47.7 | 63.9 |
Benchmarks not supported by VLMEvalKit (VQADataSet)
Model | VQA_v2 | GQA | MIA | MMSInst |
---|---|---|---|---|
MMEvol-LLaMA3-8B | 83.4 | 65.0 | 78.8 | 32.3 |
MMEvol-Qwen2-7B | 83.1 | 65.5 | 77.6 | 41.8 |
Please follow LLaVA to prepare the corresponding images and data.
datasets
├── json
│   ├── allava_vflan.json
│   ├── arxivqa.json
│   ├── cambrain_math_code.json
│   ├── data_engine.json
│   ├── shargpt_40k.json
│   ├── tabmwp.json
│   ├── wizardlm_143k.json
│   ├── mmevol_seed_no_evol_163k.json
│   ├── mmevol_evol_480k.json
│   └── mix_evol_sft.json
├── ai2d
│   ├── abc_images
│   ├── annotations
│   ├── images
│   ├── questions
│   └── categories.json
├── alfword
│   ├── alf-image-id-0
│   ├── alf-image-id-1
│   ├── alf-image-id-2
│   ├── alf-image-id-3
│   └── alf-image-id-4
├── allava_vflan
│   └── images
├── arxivqa
│   └── images
├── chartqa
│   ├── test
│   ├── train
│   └── val
├── coco
│   ├── train2014
│   ├── train2017
│   ├── val2014
│   └── val2017
├── clevr
│   ├── CLEVR_GoGenT_v1.0
│   └── CLEVR_v1.0
├── data_engine
│   ├── partI
│   ├── partII
│   └── partIII
├── design2code
│   └── images
├── docvqa
│   ├── test
│   ├── train
│   └── val
├── dvqa
│   └── images
├── geo170k
│   ├── images/geo3k
│   └── images/geoqa_plus
├── geoqa+
│   └── images
├── gpt4v-dataset
│   └── images
├── gqa
│   └── images
├── hfdata
│   └── ....
├── llava
│   └── llava_pretrain/images
├── llavar
│   └── finetune
├── mathvision
│   └── images
├── ocr_vqa
│   └── images
├── Q-Instruct-DB
│   ├── livefb_liveitw_aigc
│   └── spqa_koniq
├── sam
│   └── images
├── scienceqa
│   └── images
├── share_textvqa
│   └── images
├── synthdog-en
│   └── images
├── tabmwp
│   └── tables
├── textbookqa
│   └── tqa_train_val_test
├── textvqa
│   └── train_images
├── vg
│   ├── VG_100K
│   └── VG_100K_2
├── vizwiz
│   └── train
├── web-celebrity
│   └── images
├── web-landmark
│   └── images
└── wikiart
    └── images
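Before launching training, it can help to confirm that the layout above is in place. The following is a minimal sketch, assuming the data lives under a local directory named datasets; the function name missing_entries and the short EXPECTED list are illustrative choices, not part of the repository, and the list covers only a few of the entries from the tree.

```python
from pathlib import Path

# A small, illustrative subset of the paths shown in the tree above.
EXPECTED = [
    "json/mix_evol_sft.json",
    "coco/train2017",
    "gqa/images",
    "llava/llava_pretrain/images",
]


def missing_entries(root, expected):
    """Return the relative paths from `expected` that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


# Report anything that is missing under ./datasets (assumed location).
for rel in missing_entries("datasets", EXPECTED):
    print(f"missing: datasets/{rel}")
```

Extend EXPECTED with whichever subsets you actually plan to train on.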
mmevol_evol_480k.json contains the 480k samples evolved from the seed data in mmevol_seed_no_evol_163k.json. For instruction tuning (IT), you can freely combine it with other data such as allava_vflan.json, or directly use our mixed file mix_evol_sft.json.
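If you want to build your own mixture rather than use mix_evol_sft.json, the files can be concatenated and shuffled. This is a sketch under the assumption that each JSON file holds a single top-level list of samples (the usual LLaVA-style layout); the function name, the output file name, and the fixed seed are hypothetical.

```python
import json
import random


def merge_instruction_files(paths, out_path, seed=0):
    """Concatenate several JSON sample lists, shuffle them, and write one file.

    Returns the number of samples written.
    """
    merged = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        if not isinstance(records, list):
            raise ValueError(f"{path} does not contain a top-level JSON list")
        merged.extend(records)

    # Shuffle with a fixed seed so the mixture is reproducible.
    random.Random(seed).shuffle(merged)

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False)
    return len(merged)


# Example call (paths follow the tree above; the output name is made up):
# merge_instruction_files(
#     ["datasets/json/mmevol_evol_480k.json", "datasets/json/allava_vflan.json"],
#     "datasets/json/my_mix.json",
# )
```

Point the data path in the finetuning script at the merged file to train on it.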
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions here and organize the data following Preparation before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).
bash scripts/v1_6/train/llama3/pretrain.sh
bash scripts/v1_6/train/qwen2/pretrain.sh
Please make sure you download and organize the data following Preparation before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).
bash scripts/v1_6/train/llama3/finetune.sh
bash scripts/v1_6/train/qwen2/finetune.sh
First, enter the vlmevalkit directory and install all dependencies:
cd vlmevalkit
pip install -r requirements.txt
Then, run script/run_inference.sh, which receives three input parameters in sequence: MODELNAME, DATALIST, and MODE. MODELNAME represents the name of the model, DATALIST represents the datasets used for inference, and MODE represents the evaluation mode:
chmod +x ./script/run_inference.sh
./script/run_inference.sh $MODELNAME $DATALIST $MODE
The two available choices for MODELNAME are listed in vlmeval/config.py:
ungrouped = {
'MMEvol-Llama3-V-1_6': partial(LLaVA_Llama3_V, model_path="checkpoints/xxx/checkpoint-14000"),
'MMEvol-Qwen2-V-1_6': partial(LLaVA_Qwen2_V, model_path="checkpoints/xxx/checkpoint-14000"),
}
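Each entry in this dictionary maps a model name to a functools.partial that pre-binds the checkpoint path, so the model is only constructed when the entry is called. The sketch below illustrates that pattern with a hypothetical stand-in class (DummyModel), a made-up registry name, and a made-up checkpoint path; it does not reproduce the actual classes in vlmeval/config.py.

```python
from functools import partial


class DummyModel:
    """Hypothetical stand-in for a wrapper class such as LLaVA_Llama3_V."""

    def __init__(self, model_path, **kwargs):
        self.model_path = model_path
        self.kwargs = kwargs


# Name -> constructor with the checkpoint path already bound.
registry = {
    "demo-model": partial(DummyModel, model_path="checkpoints/demo"),
}


def build_model(name, **overrides):
    """Look up a model by name and construct it, optionally overriding arguments."""
    if name not in registry:
        raise KeyError(f"unknown model {name!r}; choices: {sorted(registry)}")
    # Keyword arguments passed here take precedence over those bound in partial.
    return registry[name](**overrides)


model = build_model("demo-model")
print(model.model_path)  # checkpoints/demo
```

In practice this means that pointing an entry at your own checkpoint only requires editing the model_path argument.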
All available choices for DATALIST are listed in vlmeval/utils/dataset_config.py. When evaluating on a single dataset, pass the dataset name directly without quotation marks; when evaluating on multiple datasets, separate the dataset names with spaces and wrap the whole list in quotation marks:
DATALIST="MME MMMU_DEV_VAL MathVista_MINI RealWorldQA MMStar AI2D_TEST HallusionBench POPE BLINK"
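The quoting rule exists because the script expects the whole dataset list as one positional argument. As a rough sketch, the same invocation can be assembled from Python, where the list is joined into a single argument explicitly; the helper name build_inference_command is hypothetical, and the accepted modes are the two described in this section (all and infer).

```python
import shlex


def build_inference_command(model_name, datasets, mode="all"):
    """Return the argument list for script/run_inference.sh."""
    if mode not in {"all", "infer"}:
        raise ValueError(f"unsupported mode: {mode!r}")
    if not datasets:
        raise ValueError("at least one dataset name is required")
    # All dataset names travel as ONE argument, separated by spaces.
    return ["./script/run_inference.sh", model_name, " ".join(datasets), mode]


cmd = build_inference_command("MMEvol-Llama3-V-1_6", ["MME", "POPE"], "all")
# shlex.join shows the command as it would be typed in a shell.
print(shlex.join(cmd))  # ./script/run_inference.sh MMEvol-Llama3-V-1_6 'MME POPE' all
```

The returned list can be passed to subprocess.run(cmd, check=True) from inside the vlmevalkit directory.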
To run inference and score each benchmark directly, set MODE=all. If only inference results are required, set MODE=infer. To reproduce the results in the VLMEvalKit table above (columns MME_C through RWQA), run the script with the following settings:
# run on all 9 datasets
./script/run_inference.sh MMEvol-Llama3-V-1_6 "MME MMMU_DEV_VAL MathVista_MINI RealWorldQA MMStar AI2D_TEST HallusionBench POPE BLINK" all
# The following are instructions for running on a single dataset
# MME
./script/run_inference.sh MMEvol-Llama3-V-1_6 MME all
# MMMU_DEV_VAL
./script/run_inference.sh MMEvol-Llama3-V-1_6 MMMU_DEV_VAL all
# MathVista_MINI
./script/run_inference.sh MMEvol-Llama3-V-1_6 MathVista_MINI all
.....
# NOTE: BLINK must be evaluated separately with llava/eval/blink_eval.py.
python llava/eval/blink_eval.py
For the VQA and GQA datasets, please follow LLaVA for evaluation.
For MIA and MMSInst, first download the datasets and then run the following scripts for evaluation:
cd mmevol
# test
python llava/eval/model_vqa_mia.py
python llava/eval/model_vqa_mminst.py
# eval
python llava/eval/mia_eval.py
python llava/eval/mminst_eval.py
This dataset was generated by Tongyi-ConvAI for multimodal supervised fine-tuning. It was used to train Evol-Llama3-8B-Instruct and Evol-Qwen2-7B, as reported in our paper. To create it, we first selected a 163K seed instruction-tuning dataset for Evol-Instruct, then enhanced data quality through an iterative process that involves a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution. This process produces a more complex and diverse image-text instruction dataset, which in turn gives MLLMs enhanced capabilities. Below we show the detailed data distribution of SEED-163K, which is prepared for the multi-round evolution described above. More details can be found in our paper.
- Release MMEvol-10M
- Release training & evaluation code
- Release model weight
- Release evolved dataset MMEvol-480K
If you find this repo useful for your research, please consider citing the paper:
@article{luo2024mmevol,
title={Mmevol: Empowering multimodal large language models with evol-instruct},
author={Luo, Run and Zhang, Haonan and Chen, Longze and Lin, Ting-En and Liu, Xiong and Wu, Yuchuan and Yang, Min and Wang, Minzheng and Zeng, Pengpeng and Gao, Lianli and others},
journal={arXiv preprint arXiv:2409.05840},
year={2024}
}
If you have any questions, please reach out to the following contacts for help:
- Run Luo: [email protected]
- Haonan Zhang: [email protected]
- LLaVA: the codebase we built upon. Thanks for their brilliant contributions to the community! We just can't wait to use LLaVA-NeXT.
- VLMEvalKit: the amazing open-source suite for evaluating various LMMs!