This repository hosts the code, data, and model weights of LLaVA-UHD, a novel framework that enables Large Multimodal Models (LMMs) to efficiently perceive images of any aspect ratio and at high resolution. Notably, our model, built on LLaVA-1.5 (336×336), supports images with 6 times larger resolution (i.e., 672×1088) and achieves an accuracy improvement of 5.7 on TextVQA. Moreover, the model can be trained efficiently in academic settings, within ~1 day on 8 A100 GPUs. Visit our 📃 paper here!
LLaVA-UHD includes three key components to deal with native-resolution images:
- An image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding.
- A novel compression module (spatially constrained resampler) that further condenses image tokens from visual encoders.
- A spatial schema to organize slice tokens for LLMs.
Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 8 benchmarks.
- Better and more robust performance on limited training datasets
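The image modularization idea above can be illustrated with a small sketch. It is not the repository's implementation: the function names, the 336-pixel slice size, the cap of 6 slices, and the search over neighbouring slice counts are all assumptions made for illustration. It picks a slice grid whose aspect ratio is close to the image's and then derives the crop box of each slice.

```python
import math


def choose_grid(width, height, slice_size=336, max_slices=6):
    """Pick a (columns, rows) grid whose aspect ratio best matches the image.

    Hypothetical helper: slice_size and max_slices are illustrative defaults.
    """
    # Ideal slice count: image area divided by the encoder's native input area.
    ideal = math.ceil((width * height) / (slice_size * slice_size))
    ideal = max(1, min(ideal, max_slices))

    best_grid, best_error = (1, 1), float("inf")
    # Also try the neighbouring counts, so that a count with few
    # factorizations (e.g. a prime) does not force a badly shaped grid.
    for count in (ideal - 1, ideal, ideal + 1):
        if count < 1 or count > max_slices:
            continue
        for cols in range(1, count + 1):
            if count % cols:
                continue
            rows = count // cols
            # Log-ratio distance between image and grid aspect ratios.
            error = abs(math.log(width / height) - math.log(cols / rows))
            if error < best_error:
                best_grid, best_error = (cols, rows), error
    return best_grid


def slice_boxes(width, height, cols, rows):
    """Return (left, top, right, bottom) crop boxes in row-major order."""
    return [
        (c * width // cols, r * height // rows,
         (c + 1) * width // cols, (r + 1) * height // rows)
        for r in range(rows)
        for c in range(cols)
    ]


cols, rows = choose_grid(672, 1088)
print(cols, rows, slice_boxes(672, 1088, cols, rows))
```

For a 672×1088 image (width × height), this sketch selects a grid of 2 columns by 3 rows, so each slice stays close to the encoder's native 336×336 input while the image's aspect ratio is roughly preserved.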
- [2024/07/29] 🔥 LLaVA-UHD achieves performance improvements over LLaVA-1.5 on 8 common benchmarks. Our novel projector, the spatially constrained resampler, achieves high feature compression and efficient convergence. Model checkpoints are available on Hugging Face.
- [2024/07/01] 📢 LLaVA-UHD is accepted by ECCV 2024.
- To reproduce the results of the paper, please set up the Python environment using the following code:
conda create -n llava-uhd python=3.10
conda activate llava-uhd
pip install -r requirements.txt
sh install.sh
- Download the checkpoints of CLIP-ViT-L/14 and Vicuna-13B-v1.5, and put them into ./pretrained_models. In the Vicuna-13B-v1.5 checkpoint directory, set 'do_sample' to 'true' in 'generation_config.json'; otherwise an error occurs when saving training checkpoints.
If something goes wrong, please refer to the issues of LLaVA or open an issue in our repository.
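The 'do_sample' change can be made by hand in a text editor, or with a short script such as the sketch below. The helper name is hypothetical, and the checkpoint directory shown in the usage comment is an assumed name; use the directory where you actually stored Vicuna-13B-v1.5.

```python
import json
from pathlib import Path


def enable_do_sample(checkpoint_dir):
    """Set "do_sample": true in the checkpoint's generation_config.json.

    Returns True if the file was modified, False if it was already set.
    """
    config_path = Path(checkpoint_dir) / "generation_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    if config.get("do_sample") is True:
        return False
    # Python's True is serialized as lowercase `true` in JSON.
    config["do_sample"] = True
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True


# Usage (the directory name below is an assumption, not a fixed path):
# enable_do_sample("./pretrained_models/vicuna-13b-v1.5")
```

All other keys in the file are left unchanged; only the indentation of the JSON may differ after rewriting.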
- Pretraining Data: Download the 558K subset of the LAION-CC-SBU dataset with BLIP captions that we use in the paper here, and put the data into ./playground/data. You can also refer to the documentation of LLaVA for details on data organization.
- Fine-tuning Data: Please download the annotation file of our final instruction-tuning data mixture, llava_v1_5_mix665k.json, and download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script (we save all files as .jpg)
- TextCaps: train_val_images
- VisualGenome: part1, part2
Download the dataset images as in the fine-tuning process of LLaVA-1.5, and place them in ./playground/data.
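Before starting a long training run, it may help to confirm that the image folders are where the training script expects them. The sketch below assumes the directory layout described in the LLaVA-1.5 data documentation (coco/train2017, gqa/images, and so on); the helper name and the folder list are assumptions, so adjust them if your layout differs.

```python
from pathlib import Path

# Assumed layout, following the LLaVA-1.5 data documentation.
EXPECTED_IMAGE_DIRS = (
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
)


def missing_image_dirs(data_root="./playground/data", expected=EXPECTED_IMAGE_DIRS):
    """Return the expected image folders that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).is_dir()]


# Usage:
# for rel in missing_image_dirs():
#     print("missing:", rel)
```

An empty list means every assumed folder is present; it does not verify that the downloads inside them are complete.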
Please refer to train.sh for the pretraining and fine-tuning scripts (see the comments in the file). If you want to run pretraining, fine-tuning, and evaluation end to end, please run the following command.
sh train.sh
The evaluation script is eval.sh; you can run it as follows:
sh eval.sh dir_name_in_checkpoints_new
# e.g. sh eval.sh llava-uhd-144-13b
# llava-uhd-144-13b is the dir_name stored in the path of ./checkpoints_new
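If you are unsure which directory name to pass to eval.sh, a small helper like the following sketch can list the candidates under ./checkpoints_new and build the corresponding command. Both function names are hypothetical, and the sketch only assumes what is stated above: that eval.sh takes a directory name stored under ./checkpoints_new.

```python
from pathlib import Path


def list_checkpoint_names(checkpoints_root="./checkpoints_new"):
    """Return the sorted names of the sub-directories of checkpoints_root."""
    root = Path(checkpoints_root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def eval_command(dir_name):
    """Build the argument list for `sh eval.sh <dir_name>`."""
    return ["sh", "eval.sh", dir_name]


# Usage (runs the evaluation for every checkpoint directory found):
# import subprocess
# for name in list_checkpoint_names():
#     subprocess.run(eval_command(name), check=True)
```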
For details of data organization, please refer to here. We provide the same script to complete the testing.
If you find LLaVA-UHD useful for your research and applications, please cite using this BibTeX:
@inproceedings{guo2024llava-uhd,
title={{LLaVA-UHD}: an LMM Perceiving Any Aspect Ratio and High-Resolution Images},
author={Guo, Zonghao and Xu, Ruyi and Yao, Yuan and Cui, Junbo and Ni, Zanlin and Ge, Chunjiang and Chua, Tat-Seng and Liu, Zhiyuan and Huang, Gao},
booktitle={ECCV},
year={2024}
}