- The LLaVA-PT data comes from LLaVA.
- The Hybrid-FT data comes from SViT, LVIS, LRV, and MIMIC-IT.
- The LLaVA-FT data comes from LLaVA.
- Download the training annotations from Baidu Disk, Google Disk, Peking University Disk, or Hugging Face.

We also provide the processed data as follows; the links below point to Baidu Disk.
| Data group | Usage | Link |
|---|---|---|
| LLaVA-PT | Stage 1 | LLaVA 1.5-558k |
| Hybrid-FT | Stage 2 | SViT-157k, LVIS-220k, LRV-331k, MIMIC-IT-256k |
| LLaVA-FT | Stage 3 | LLaVA 1.5-mix-665k |
For those who cannot easily access Baidu Disk, the same data can be downloaded from Hugging Face.
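If you take the Hugging Face route, a download sketch like the following may help; the repo id below is a placeholder, so substitute the dataset actually linked above.

```bash
# Sketch only: <org>/<dataset-repo> is a placeholder for the Hugging Face
# dataset linked above; --repo-type dataset targets a dataset repo.
huggingface-cli download <org>/<dataset-repo> \
  --repo-type dataset \
  --local-dir ./train_data
```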
After downloading all of them, organize the data as follows in `IMAGE_FOLDER` (a sketch of building this layout follows the tree):

```
IMAGE_FOLDER
├── llava_image
├── llava_image_tune
├── lvis_tune
├── lrv_tune
├── svit_tune
└── mimicit_tune
    └── LA
```
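A minimal sketch of producing that layout, assuming the downloads are zip archives named after the folders (the archive names are assumptions; use whatever filenames your download actually produced):

```bash
# Archive names are assumptions; substitute the filenames you actually downloaded.
mkdir -p IMAGE_FOLDER
for f in llava_image llava_image_tune lvis_tune lrv_tune svit_tune mimicit_tune; do
  unzip "${f}.zip" -d IMAGE_FOLDER/
done
# mimicit_tune should end up containing the LA subfolder shown above.
```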
Specify your `IMAGE_FOLDER` and `JSON_FOLDER` according to the data preparation.
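For example, near the top of each training script you would point these variables at your local paths (the values below are placeholders):

```bash
# Placeholder paths; point these at the folders from the data preparation step.
IMAGE_FOLDER="/path/to/IMAGE_FOLDER"
JSON_FOLDER="/path/to/json_annotations"
```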
For training at 384 resolution, we use `google/siglip-so400m-patch14-384` as the `image_tower`. Notably, if you pass `--image_tower google/siglip-so400m-patch14-384`, you should upgrade `transformers` to 4.37.0.
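For example:

```bash
pip install "transformers==4.37.0"
```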
- Stage 1 pretraining script: `pretrain.sh`.
- Stage 2 tuning script: `finetune.sh`.
- Stage 3 MoE-tuning script: `finetune_moe.sh` (a launch sketch follows this list).
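The stages are meant to run in order. A minimal launch sketch; the script locations are assumptions, so adjust them to where the scripts live in this repo:

```bash
# Script paths are assumptions; use the actual locations in this repo.
bash pretrain.sh       # Stage 1: pretraining
bash finetune.sh       # Stage 2: tuning
bash finetune_moe.sh   # Stage 3: MoE-tuning
```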