# Awesome Data-Model Co-Development of MLLMs

Welcome to the "Awesome List" for data-model co-development of Multi-Modal Large Language Models (MLLMs), a continually updated resource tailored for the open-source community. This compilation features cutting-edge research and insightful articles on improving MLLMs through data-model co-development, with each entry tagged according to the taxonomy proposed in our data-model co-development survey, as illustrated below.

*Overview of Our Taxonomy*

Soon we will provide a dynamic table of contents, with search, filter, and sort features, to help readers navigate the materials more easily.

Due to the rapid development in the field, this repository and our paper are continuously being updated and synchronized with each other. Please feel free to make pull requests or open issues to contribute to this list and add more related resources!

## Candidate Co-Development Tags

These tags correspond to the taxonomy in our paper, and each work may be assigned more than one tag.

### Data4Model: Scaling

#### For Scaling Up of MLLMs: Larger Datasets

| Section Title | Tag |
| --- | --- |
| Data Acquisition | |
| Data Augmentation | |
| Data Diversity | |

#### For Scaling Effectiveness of MLLMs: Better Subsets

| Section Title | Tag |
| --- | --- |
| Data Condensation | |
| Data Mixture | |
| Data Packing | |
| Cross-Modal Alignment | |

### Data4Model: Usability

#### For Instruction Responsiveness of MLLMs

| Section Title | Tag |
| --- | --- |
| Prompt Design | |
| ICL Data | |
| Human-Behavior Alignment Data | |

#### For Reasoning Ability of MLLMs

| Section Title | Tag |
| --- | --- |
| Data for Single-Hop Reasoning | |
| Data for Multi-Hop Reasoning | |

#### For Ethics of MLLMs

| Section Title | Tag |
| --- | --- |
| Data Toxicity | |
| Data Privacy and Intellectual Property | |

#### For Evaluation of MLLMs

| Section Title | Tag |
| --- | --- |
| Benchmarks for Multi-Modal Understanding | |
| Benchmarks for Multi-Modal Generation | |
| Benchmarks for Multi-Modal Retrieval | |
| Benchmarks for Multi-Modal Reasoning | |

### Model4Data: Synthesis

| Section Title | Tag |
| --- | --- |
| Model as a Data Creator | |
| Model as a Data Mapper | |
| Model as a Data Filter | |
| Model as a Data Evaluator | |

### Model4Data: Insights

| Section Title | Tag |
| --- | --- |
| Model as a Data Navigator | |
| Model as a Data Extractor | |
| Model as a Data Analyzer | |
| Model as a Data Visualizer | |

## Paper List

| Title | Tags | Back Reference (In Paper) |
| --- | --- | --- |
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance Sec. 1, Sec. 3.1, Sec. 3.1.1, Sec. 3.1.3, Sec. 3.2, Sec. 3.2.4, Sec. 6.2, Sec. 8.2.1, Table 2
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning Sec. 5.1
Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain Sec. 4.3.1
Probing Heterogeneous Pretraining Datasets with Small Curated Datasets
ChartLlama: A Multimodal LLM for Chart Understanding and Generation Sec. 5.1, Sec. 6.3, Sec. 6.4
VideoChat: Chat-Centric Video Understanding Sec. 5.1, Sec. 5.2
Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex Sec. 5.2
3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding Sec. 5.1
GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting Sec. 3.1.1, Sec. 5.2
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation Sec. 3.1.1
Audio Retrieval with WavText5K and CLAP Training Sec. 3.1.1, Sec. 3.1.3, Sec. 4.4.3
The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering Sec. 3.2.1, Sec. 5.3, Sec. 8.3.3
Demystifying CLIP Data 3.2.2
Learning Transferable Visual Models From Natural Language Supervision Sec. 2.1, Sec. 3.1.1, Sec. 3.2.2
DataComp: In search of the next generation of multimodal datasets Sec. 1, Sec. 3.1.1, Sec. 3.1.3, Sec. 3.2, Sec. 3.2.1, Sec. 3.2.4, Sec. 4.4.1, Sec. 5.3, Sec. 8.1, Sec. 8.3.3, Table 2
Beyond neural scaling laws: beating power law scaling via data pruning Sec. 3.2.1
Flamingo: a visual language model for few-shot learning Sec. 3.1.3, Sec. 3.2.2
Quality not quantity: On the interaction between dataset design and robustness of clip Sec. 3.2.2
VBench: Comprehensive Benchmark Suite for Video Generative Models Sec. 4.4.2
EvalCraftr: Benchmarking and Evaluating Large Video Generation Models Sec. 4.4.2
Training Compute-Optimal Large Language Models Sec. 3.1
NExT-GPT: Any-to-Any Multimodal LLM Sec. 1, Sec. 2.1, Sec. 3.1.1
ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization Sec. 3.1.1, Sec. 3.2.4
ChartReformer: Natural Language-Driven Chart Image Editing Sec. 3.1.1, Sec. 6.4
GroundingGPT: Language Enhanced Multi-modal Grounding Model Sec. 4.1.2
Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic Sec. 4.1.1
Kosmos-2: Grounding Multimodal Large Language Models to the World Sec. 4.1.1
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters Sec. 3.2.1, Sec. 5.1, Sec. 5.3, Sec. 8.3.3
Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training Sec. 3.2.1, Sec. 8.3.3
Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation Sec. 3.1.1, Sec. 3.1.3, Sec. 4.1.3, Sec. 5.1, Sec. 5.4, Sec. 8.2.3, Sec. 8.3.3, Sec. 8.3.4
3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset Sec. 4.4.1
Structured Packing in LLM Training Improves Long Context Utilization Sec. 3.2.3
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models Sec. 3.2.3
MoDE: CLIP Data Experts via Clustering Sec. 3.2.3
Efficient Multimodal Learning from Data-centric Perspective Sec. 1, Sec. 2.1, Sec. 3.2.1
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs Sec. 3.1.2
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | | Sec. 4.4.1 |
| SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | | Sec. 4.4.1 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | | Sec. 3.1.1 |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | | Sec. 4.4.2 |
| FunQA: Towards Surprising Video Comprehension | | Sec. 4.2.1, Sec. 4.4.4 |
| OneChart: Purify the Chart Structural Extraction via One Auxiliary Token | | Sec. 4.4.1, Sec. 5.1, Sec. 6.3 |
| ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning | | Sec. 4.4.4, Sec. 6.3 |
| StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding | | Sec. 3.1.1, Sec. 4.2.1, Sec. 6.3 |
| MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning | | Sec. 3.1.1, Sec. 4.4.1 |
| ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning | | Sec. 3.1.3, Sec. 4.4.4, Sec. 5.1, Sec. 6.3 |
| WorldGPT: Empowering LLM as Multimodal World Model | | Sec. 4.4.2 |
| List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | | Sec. 3.1.1, Sec. 3.2.2, Sec. 4.1.2 |
| TextSquare: Scaling up Text-Centric Visual Instruction Tuning | | Sec. 3.1.1, Sec. 5.1, Sec. 5.3, Sec. 5.4, Sec. 8.3.3, Table 2 |
| ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction | | Sec. 3.1.1, Sec. 4.4.1 |
| How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning? | | Sec. 6.1 |
| Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want | | Sec. 4.1.1 |
| Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution | | Sec. 3.2.3 |
| Fewer Truncations Improve Language Modeling | | Sec. 3.2.3 |
| MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale | | Sec. 4.2.2, Sec. 5.2 |
| AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception | | Sec. 5.2 |
| UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark | Data Augmentation | Sec. 4.4.1, Sec. 5.1 |
| Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives | | Sec. 3.1.2, Sec. 5.1 |
| Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | | Sec. 4.1.1, Sec. 4.3.1, Sec. 5.4 |
| TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models | | Sec. 3.1.1 |
| The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative | | Sec. 4.3.1 |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | | Sec. 3.1.1, Sec. 5.2 |
| MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria | | Sec. 4.1.3, Sec. 4.4.2, Sec. 5.4, Sec. 8.2.3, Table 2 |
| MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models | | Sec. 4.3.1, Sec. 4.4.2 |
| Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models | | Sec. 3.1.3, Sec. 4.1.2, Sec. 4.2.2 |
| M3DBench: Let’s Instruct Large Models with Multi-modal 3D Prompts | | Sec. 4.4.1 |
| MoqaGPT: Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model | | Sec. 5.2 |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | | Sec. 3.1.2, Sec. 6.3 |
| mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding | | Sec. 6.3 |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | | Sec. 3.1.2 |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | | Sec. 6.3 |
| Open-TransMind: A New Baseline and Benchmark for 1st Foundation Model Challenge of Intelligent Transportation | | Sec. 4.4.1, Sec. 4.4.3 |
| On the Adversarial Robustness of Multi-Modal Foundation Models | | Sec. 4.3.1 |
| What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models | | Sec. 4.2.1, Sec. 5.1, Sec. 5.3 |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | | Sec. 3.1.1 |
| PaLM-E: An Embodied Multimodal Language Model | | Sec. 3.1.3 |
| Multimodal Data Curation via Object Detection and Filter Ensembles | | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.3.3 |
| Sieve: Multimodal Dataset Pruning Using Image Captioning Models | | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.3.1, Sec. 8.3.3 |
| Towards a statistical theory of data selection under weak supervision | | Sec. 3.2.1, Sec. 5.3 |
| D2 Pruning: Message Passing for Balancing Diversity & Difficulty in Data Pruning | | Sec. 3.3 |
| UIClip: A Data-driven Model for Assessing User Interface Design | | Sec. 3.1.1 |
| CapsFusion: Rethinking Image-Text Data at Scale | | Sec. 3.1.2 |
| Improving CLIP Training with Language Rewrites | | Sec. 1, Sec. 3.1.2, Sec. 5.2 |
| OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation | | Sec. 4.4.2 |
| A Decade's Battle on Dataset Bias: Are We There Yet? | | Sec. 3.2.2 |
| Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets | | Sec. 3.2.4 |
| Data Filtering Networks | | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.3.3 |
| T-MARS: Improving Visual Representations by Circumventing Text Feature Learning | | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.3.3 |
| InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4 | | Sec. 3.2.1 |
| Align and Attend: Multimodal Summarization with Dual Contrastive Losses | | Table 2 |
| MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? | | Table 2 |
| Text-centric Alignment for Multi-Modality Learning | | Sec. 3.2.4 |
| Noisy Correspondence Learning with Meta Similarity Correction | | Sec. 3.2.4 |
| Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos | | Sec. 4.2.2 |
| Language-Image Models with 3D Understanding | | Sec. 4.2.2 |
| Scaling Laws for Generative Mixed-Modal Language Models | | Sec. 1 |
| BLINK: Multimodal Large Language Models Can See but Not Perceive | | Sec. 4.4.1, Table 2 |
| Visual Hallucinations of Multi-modal Large Language Models | | Sec. 4.4.2, Sec. 5.3 |
| DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | | Sec. 4.2.2 |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | | Sec. 3.1.1, Sec. 4.2.2, Sec. 5.1 |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | | Sec. 3.1.1, Sec. 4.2.2, Table 2 |
| Visual Instruction Tuning | | Sec. 3.1.1 |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | | Sec. 2.1, Sec. 3.1.1, Sec. 3.2.4, Sec. 4.1, Sec. 4.1.1, Sec. 4.1.3, Sec. 8.3.1, Table 2 |
| Time-LLM: Time Series Forecasting by Reprogramming Large Language Models | | Sec. 4.1.1 |
| On the De-duplication of LAION-2B | | Sec. 3.2.1 |
| Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | | Sec. 3.1.1, Sec. 3.2.2 |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | | Sec. 4.1.3, Sec. 4.4.1, Table 2 |
| LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | | Sec. 6.2 |
| Data Augmentation for Text-based Person Retrieval Using Large Language Models | | Sec. 3.1.2, Sec. 5.2 |
| Aligning Actions and Walking to LLM-Generated Textual Descriptions | | Sec. 3.1.2, Sec. 5.2 |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | | Sec. 3.1.2 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | | Sec. 3.1.3 |
| AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | | Sec. 3.2.4 |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | | Sec. 5.1 |
| Probing Multimodal LLMs as World Models for Driving | | Sec. 3.1.1, Sec. 4.4.4 |
| Unified Hallucination Detection for Multimodal Large Language Models | | Sec. 4.4.2, Sec. 5.2, Sec. 6.2, Table 2 |
| SemDeDup: Data-efficient learning at web-scale through semantic deduplication | | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.3.3 |
| Automated Multi-level Preference for MLLMs | | Sec. 4.1.3 |
| Silkie: Preference distillation for large visual language models | | Sec. 4.1.3 |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | | Sec. 4.1.3, Table 2 |
| M3IT: A large-scale dataset towards multi-modal multilingual instruction tuning | | Table 2 |
| Aligning Large Multimodal Models with Factually Augmented RLHF | | Sec. 4.1.3 |
| DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback | | Sec. 4.1.3 |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | | Sec. 4.1.3 |
| MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark | | Sec. 4.4.2, Sec. 5.4, Sec. 8.3.3, Sec. 8.3.4, Table 2 |
| MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI | | Sec. 4.4.3, Table 2 |
| M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought | | Sec. 4.4.4, Table 2 |
| ImgTrojan: Jailbreaking Vision-Language Models with ONE Image | | Sec. 4.3.1, Sec. 5.4 |
| VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models | | Sec. 4.3.1 |
| Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | | Sec. 4.3.1 |
| Improving Multimodal Datasets with Image Captioning | | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.2.2, Sec. 8.3.3 |
| Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System | | Sec. 6.3 |
| PDFChatAnnotator: A Human-LLM Collaborative Multi-Modal Data Annotation Tool for PDF-Format Catalogs | | Sec. 5.2, Sec. 6.2 |
| CiT: Curation in Training for Effective Vision-Language Data | | Sec. 2.1, Sec. 8.3.3 |
| InstructPix2Pix: Learning to Follow Image Editing Instructions | | Sec. 5.1 |
| Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study | | Sec. 6.4 |
| ModelGo: A Practical Tool for Machine Learning License Analysis | | Sec. 4.3.2, Sec. 8.2.1 |
| Scaling Laws of Synthetic Images for Model Training ... for Now | | Sec. 4.1.1 |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | | Sec. 3.1.3 |
| Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V | | Sec. 4.1.1 |
| Segment Anything | | Sec. 1, Sec. 8.3.1 |
| AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning | | Sec. 4.1.2 |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | | Sec. 4.1.2 |
| All in an Aggregated Image for In-Image Learning | | Sec. 4.1.2 |
| Panda-70m: Captioning 70m videos with multiple cross-modality teachers | | Table 2 |
| Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text | | Table 2 |
| ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | | Table 2 |

## Contribution to This Survey

As noted above, this repository and our paper are continuously updated and synchronized with each other due to the rapid development of the field. Please feel free to make pull requests or open issues to contribute to this list and add more related resources! You can add the titles of relevant papers to the table above and (optionally) provide suggested tags along with the corresponding sections if possible.
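For instance, assuming the three-column layout of the paper list above, a new entry added in a pull request might look like the following sketch; the paper title, tag, and section numbers here are purely illustrative placeholders rather than an actual recommendation:

```markdown
| An Illustrative Paper on Multimodal Data Curation | Data Condensation | Sec. 3.2.1, Table 2 |
```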

## References

If you find our work useful for your research or development, please kindly cite the following paper.

@article{qin2024synergy,
  title={The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective},
  author={Qin, Zhen and Chen, Daoyuan and Zhang, Wenhao and Yao, Liuyi and Huang, Yilun and Ding, Bolin and Li, Yaliang and Deng, Shuiguang},
  journal={arXiv preprint arXiv:2407.08583},
  year={2024}
}