# Awesome Data-Model Co-Development of MLLMs

Welcome to the "Awesome List" for data-model co-development of Multi-Modal Large Language Models (MLLMs), a continually updated resource tailored for the open-source community. This compilation features cutting-edge research and insightful articles on improving MLLMs through data-model co-development, with each entry tagged according to the taxonomy proposed in our data-model co-development survey, as illustrated below.

*Overview of Our Taxonomy*

Soon we will provide a dynamic table of contents, with search, filter, and sort features, to help readers navigate the materials more easily.

Due to the rapid development in the field, this repository and our paper are continuously being updated and synchronized with each other. Please feel free to make pull requests or open issues to contribute to this list and add more related resources!

## Candidate Co-Development Tags

These tags correspond to the taxonomy in our paper, and each work may be assigned more than one tag.

### Data4Model: Scaling

#### For Scaling Up of MLLMs: Larger Datasets

| Section Title | Tag |
| --- | --- |
| Data Acquisition | |
| Data Augmentation | |
| Data Diversity | |

#### For Scaling Effectiveness of MLLMs: Better Subsets

| Section Title | Tag |
| --- | --- |
| Data Condensation | |
| Data Mixture | |
| Data Packing | |
| Cross-Modal Alignment | |

### Data4Model: Usability

#### For Instruction Responsiveness of MLLMs

| Section Title | Tag |
| --- | --- |
| Prompt Design | |
| ICL Data | |
| Human-Behavior Alignment Data | |

#### For Reasoning Ability of MLLMs

| Section Title | Tag |
| --- | --- |
| Data for Single-Hop Reasoning | |
| Data for Multi-Hop Reasoning | |

#### For Ethics of MLLMs

| Section Title | Tag |
| --- | --- |
| Data Toxicity | |
| Data Privacy and Intellectual Property | |

#### For Evaluation of MLLMs

| Section Title | Tag |
| --- | --- |
| Benchmarks for Multi-Modal Understanding | |
| Benchmarks for Multi-Modal Generation | |
| Benchmarks for Multi-Modal Retrieval | |
| Benchmarks for Multi-Modal Reasoning | |

### Model4Data: Synthesis

| Section Title | Tag |
| --- | --- |
| Model as a Data Creator | |
| Model as a Data Mapper | |
| Model as a Data Filter | |
| Model as a Data Evaluator | |

### Model4Data: Insights

| Section Title | Tag |
| --- | --- |
| Model as a Data Navigator | |
| Model as a Data Extractor | |
| Model as a Data Analyzer | |
| Model as a Data Visualizer | |

## Paper List

| Title | Tags | Back Reference (In Paper) |
| --- | --- | --- |
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance Sec. 1, Sec. 3.1, Sec. 3.1.1, Sec. 3.1.3, Sec. 3.2, Sec. 3.2.4, Sec. 6.2, Sec. 8.2.1, Table 2
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning Sec. 5.1
Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain Sec. 4.3.1
Probing Heterogeneous Pretraining Datasets with Small Curated Datasets
ChartLlama: A Multimodal LLM for Chart Understanding and Generation Sec. 5.1, Sec. 6.3, Sec. 6.4
VideoChat: Chat-Centric Video Understanding Sec. 5.1, Sec. 5.2
Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex Sec. 5.2
3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding Sec. 5.1
GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting Sec. 3.1.1, Sec. 5.2
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation Sec. 3.1.1
Audio Retrieval with WavText5K and CLAP Training Sec. 3.1.1, Sec. 3.1.3, Sec. 4.4.3
The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering Sec. 3.2.1, Sec. 5.3, Sec. 8.3.3
Demystifying CLIP Data 3.2.2
Learning Transferable Visual Models From Natural Language Supervision Sec. 2.1, Sec. 3.1.1, Sec. 3.2.2
DataComp: In search of the next generation of multimodal datasets Sec. 1, Sec. 3.1.1, Sec. 3.1.3, Sec. 3.2, Sec. 3.2.1, Sec. 3.2.4, Sec. 4.4.1, Sec. 5.3, Sec. 8.1, Sec. 8.3.3, Table 2
Beyond neural scaling laws: beating power law scaling via data pruning Sec. 3.2.1
Flamingo: a visual language model for few-shot learning Sec. 3.1.3, Sec. 3.2.2
Quality not quantity: On the interaction between dataset design and robustness of clip Sec. 3.2.2
VBench: Comprehensive Benchmark Suite for Video Generative Models Sec. 4.4.2
EvalCraftr: Benchmarking and Evaluating Large Video Generation Models Sec. 4.4.2
Training Compute-Optimal Large Language Models Sec. 3.1
NExT-GPT: Any-to-Any Multimodal LLM Sec. 1, Sec. 2.1, Sec. 3.1.1
ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization Sec. 3.1.1, Sec. 3.2.4
ChartReformer: Natural Language-Driven Chart Image Editing Sec. 3.1.1, Sec. 6.4
GroundingGPT: Language Enhanced Multi-modal Grounding Model Sec. 4.1.2
Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic Sec. 4.1.1
Kosmos-2: Grounding Multimodal Large Language Models to the World Sec. 4.1.1
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters Sec. 3.2.1, Sec. 5.1, Sec. 5.3, Sec. 8.3.3
Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training Sec. 3.2.1, Sec. 8.3.3
Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation Sec. 3.1.1, Sec. 3.1.3, Sec. 4.1.3, Sec. 5.1, Sec. 5.4, Sec. 8.2.3, Sec. 8.3.3, Sec. 8.3.4
3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset Sec. 4.4.1
Structured Packing in LLM Training Improves Long Context Utilization Sec. 3.2.3
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models Sec. 3.2.3
MoDE: CLIP Data Experts via Clustering Sec. 3.2.3
Efficient Multimodal Learning from Data-centric Perspective Sec. 1, Sec. 2.1, Sec. 3.2.1
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs Sec. 3.1.2
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | | Sec. 4.4.1 |
| SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | | Sec. 4.4.1 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | | Sec. 3.1.1 |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | | Sec. 4.4.2 |
| FunQA: Towards Surprising Video Comprehension | | Sec. 4.2.1, Sec. 4.4.4 |
| OneChart: Purify the Chart Structural Extraction via One Auxiliary Token | | Sec. 4.4.1, Sec. 5.1, Sec. 6.3 |
| ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning | | Sec. 4.4.4, Sec. 6.3 |
| StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding | | Sec. 3.1.1, Sec. 4.2.1, Sec. 6.3 |
| MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning | | Sec. 3.1.1, Sec. 4.4.1 |
| ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning | | Sec. 3.1.3, Sec. 4.4.4, Sec. 5.1, Sec. 6.3 |
| WorldGPT: Empowering LLM as Multimodal World Model | | Sec. 4.4.2 |
| List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | | Sec. 3.1.1, Sec. 3.2.2, Sec. 4.1.2 |
| TextSquare: Scaling up Text-Centric Visual Instruction Tuning | | Sec. 3.1.1, Sec. 5.1, Sec. 5.3, Sec. 5.4, Sec. 8.3.3, Table 2 |
| ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction | | Sec. 3.1.1, Sec. 4.4.1 |
| How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning? | | Sec. 6.1 |
| Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want | | Sec. 4.1.1 |
| Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution | | Sec. 3.2.3 |
| Fewer Truncations Improve Language Modeling | | Sec. 3.2.3 |
| MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale | | Sec. 4.2.2, Sec. 5.2 |
| AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception | | Sec. 5.2 |
| UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark | Data Augmentation | Sec. 4.4.1, Sec. 5.1 |
| Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives | | Sec. 3.1.2, Sec. 5.1 |
| Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | | Sec. 4.1.1, Sec. 4.3.1, Sec. 5.4 |
| TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models | | Sec. 3.1.1 |
| The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative | | Sec. 4.3.1 |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | | Sec. 3.1.1, Sec. 5.2 |
| MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria | | Sec. 4.1.3, Sec. 4.4.2, Sec. 5.4, Sec. 8.2.3, Table 2 |
| MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models | | Sec. 4.3.1, Sec. 4.4.2 |
| Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models | | Sec. 3.1.3, Sec. 4.1.2, Sec. 4.2.2 |
| M3DBench: Let’s Instruct Large Models with Multi-modal 3D Prompts | | Sec. 4.4.1 |
| MoqaGPT: Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model | | Sec. 5.2 |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | | Sec. 3.1.2, Sec. 6.3 |
| mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding | | Sec. 6.3 |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | | Sec. 3.1.2 |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | | Sec. 6.3 |
| Open-TransMind: A New Baseline and Benchmark for 1st Foundation Model Challenge of Intelligent Transportation | | Sec. 4.4.1, Sec. 4.4.3 |
| On the Adversarial Robustness of Multi-Modal Foundation Models | | Sec. 4.3.1 |
| What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models | | Sec. 4.2.1, Sec. 5.1, Sec. 5.3 |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | | Sec. 3.1.1 |
| PaLM-E: An Embodied Multimodal Language Model | | Sec. 3.1.3 |
| Multimodal Data Curation via Object Detection and Filter Ensembles | | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.3.3 |
| Sieve: Multimodal Dataset Pruning Using Image Captioning Models | | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.3.1, Sec. 8.3.3 |
| Towards a statistical theory of data selection under weak supervision | | Sec. 3.2.1, Sec. 5.3 |
| D2 Pruning: Message Passing for Balancing Diversity & Difficulty in Data Pruning | | Sec. 3.3 |
| UIClip: A Data-driven Model for Assessing User Interface Design | | Sec. 3.1.1 |
| CapsFusion: Rethinking Image-Text Data at Scale | | Sec. 3.1.2 |
| Improving CLIP Training with Language Rewrites | | Sec. 1, Sec. 3.1.2, Sec. 5.2 |
| OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation | | Sec. 4.4.2 |
| A Decade's Battle on Dataset Bias: Are We There Yet? | | Sec. 3.2.2 |
| Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets | | Sec. 3.2.4 |
| Data Filtering Networks | | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.3.3 |
| T-MARS: Improving Visual Representations by Circumventing Text Feature Learning | | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.3.3 |
| InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4 | | Sec. 3.2.1 |
| Align and Attend: Multimodal Summarization with Dual Contrastive Losses | | Table 2 |
| MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? | | Table 2 |
| Text-centric Alignment for Multi-Modality Learning | | Sec. 3.2.4 |
| Noisy Correspondence Learning with Meta Similarity Correction | | Sec. 3.2.4 |
| Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos | | Sec. 4.2.2 |
| Language-Image Models with 3D Understanding | | Sec. 4.2.2 |
| Scaling Laws for Generative Mixed-Modal Language Models | | Sec. 1 |
| BLINK: Multimodal Large Language Models Can See but Not Perceive | | Sec. 4.4.1, Table 2 |
| Visual Hallucinations of Multi-modal Large Language Models | | Sec. 4.4.2, Sec. 5.3 |
| DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | | Sec. 4.2.2 |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | | Sec. 3.1.1, Sec. 4.2.2, Sec. 5.1 |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | | Sec. 3.1.1, Sec. 4.2.2, Table 2 |
| Visual Instruction Tuning | | Sec. 3.1.1 |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | | Sec. 2.1, Sec. 3.1.1, Sec. 3.2.4, Sec. 4.1, Sec. 4.1.1, Sec. 4.1.3, Sec. 8.3.1, Table 2 |
| Time-LLM: Time Series Forecasting by Reprogramming Large Language Models | | Sec. 4.1.1 |
| On the De-duplication of LAION-2B | | Sec. 3.2.1 |
| Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | | Sec. 3.1.1, Sec. 3.2.2 |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | Sec. 4.1.3, Sec. 4.4.1, Table 2 |
| LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | | Sec. 6.2 |
| Data Augmentation for Text-based Person Retrieval Using Large Language Models | | Sec. 3.1.2, Sec. 5.2 |
| Aligning Actions and Walking to LLM-Generated Textual Descriptions | | Sec. 3.1.2, Sec. 5.2 |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | | Sec. 3.1.2 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | | Sec. 3.1.3 |
| AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | | Sec. 3.2.4 |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | | Sec. 5.1 |
| Probing Multimodal LLMs as World Models for Driving | | Sec. 3.1.1, Sec. 4.4.4 |
| Unified Hallucination Detection for Multimodal Large Language Models | | Sec. 4.4.2, Sec. 5.2, Sec. 6.2, Table 2 |
| SemDeDup: Data-efficient learning at web-scale through semantic deduplication | | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.3.3 |
| Automated Multi-level Preference for MLLMs | | Sec. 4.1.3 |
| Silkie: Preference distillation for large visual language models | | Sec. 4.1.3 |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | | Sec. 4.1.3, Table 2 |
| M3IT: A large-scale dataset towards multi-modal multilingual instruction tuning | | Table 2 |
| Aligning Large Multimodal Models with Factually Augmented RLHF | | Sec. 4.1.3 |
| DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback | | Sec. 4.1.3 |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | | Sec. 4.1.3 |
| MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark | | Sec. 4.4.2, Sec. 5.4, Sec. 8.3.3, Sec. 8.3.4, Table 2 |
| MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI | | Sec. 4.4.3, Table 2 |
| M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought | | Sec. 4.4.4, Table 2 |
| ImgTrojan: Jailbreaking Vision-Language Models with ONE Image | | Sec. 4.3.1, Sec. 5.4 |
| VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models | | Sec. 4.3.1 |
| Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | | Sec. 4.3.1 |
| Improving Multimodal Datasets with Image Captioning | | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.2.2, Sec. 8.3.3 |
| Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System | | Sec. 6.3 |
| PDFChatAnnotator: A Human-LLM Collaborative Multi-Modal Data Annotation Tool for PDF-Format Catalogs | | Sec. 5.2, Sec. 6.2 |
| CiT: Curation in Training for Effective Vision-Language Data | | Sec. 2.1, Sec. 8.3.3 |
| InstructPix2Pix: Learning to Follow Image Editing Instructions | | Sec. 5.1 |
| Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study | | Sec. 6.4 |
| ModelGo: A Practical Tool for Machine Learning License Analysis | | Sec. 4.3.2, Sec. 8.2.1 |
| Scaling Laws of Synthetic Images for Model Training ... for Now | | Sec. 4.1.1 |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | | Sec. 3.1.3 |
| Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V | | Sec. 4.1.1 |
| Segment Anything | | Sec. 1, Sec. 8.3.1 |
| AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning | | Sec. 4.1.2 |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | | Sec. 4.1.2 |
| All in an Aggregated Image for In-Image Learning | | Sec. 4.1.2 |
| Panda-70m: Captioning 70m videos with multiple cross-modality teachers | | Table 2 |
| Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text | | Table 2 |
| ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | | Table 2 |

## Contribution to This Survey

As noted above, this repository and our paper are continuously updated and synchronized with each other due to the rapid development of the field. Please feel free to make pull requests or open issues to contribute to this list and add more related resources! You can add the titles of relevant papers to the table above and (optionally) provide suggested tags along with the corresponding sections if possible.
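For instance, assuming the three-column layout of the paper list above, a new entry added in a pull request might look like the following sketch; the paper title, tag, and section numbers here are purely illustrative placeholders rather than an actual recommendation:

```markdown
| An Illustrative Paper on Multimodal Data Curation | Data Condensation | Sec. 3.2.1, Table 2 |
```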

## References

If you find our work useful for your research or development, please kindly cite the following paper.

@article{qin2024synergy,
  title={The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective},
  author={Qin, Zhen and Chen, Daoyuan and Zhang, Wenhao and Yao, Liuyi and Huang, Yilun and Ding, Bolin and Li, Yaliang and Deng, Shuiguang},
  journal={arXiv preprint arXiv:2407.08583},
  year={2024}
}