Awesome-Multimodal-Papers

A curated list of awesome Multimodal studies.

Awesome-Multimodal-Papers
- Multimodal Papers
- Paper Notes

Multimodal Papers

Large Multimodal Model

Title	Venue	Date	Code	Supplement
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs	arXiv	2024-10-21	-
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (by deepseek)	arXiv	2024-10-17		-
✨ Video Instruction Tuning With Synthetic Data (LLaVA-Video, LLaVA-NeXT Series)	arXiv	2024-10-03
✨ Emu3: Next-Token Prediction is All You Need	arXiv	2024-09-27
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions	arXiv	2024-09-26	-
Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model (MGLMM, Alibaba)	arXiv	2024-09-20
✨ Show-o: One Single Transformer to Unify Multimodal Understanding and Generation	arXiv	2024-08-22
✨ xGen-MM (BLIP-3): A Family of Open Large Multimodal Models	arXiv	2024-08-16
✨ LLaVA-OneVision: Easy Visual Task Transfer (LLaVA-NeXT Series)	arXiv	2024-08-06
Tarsier: Recipes for Training and Evaluating Large Video Description Models (Tarsier, Dream1k, by ByteDance)	arXiv	2024-07-30
✨ InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	arXiv	2024-07-03		-
TokenPacker: Efficient Visual Projector for Multimodal LLM	arXiv	2024-07-02		-
✨ Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (Cambrian, Data Rationing)	arXiv	2024-06-24
✨ Long Context Transfer from Language to Vision (LongVA, by Ziwei Liu, Chunyuan Li)	arXiv	2024-06-24
Generative Visual Instruction Tuning	arXiv	2024-06-17		-
✨ VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	arXiv	2024-06-13
✨ 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (Apple)	arXiv	2024-06-13
An Image is Worth 32 Tokens for Reconstruction and Generation (TiTok, by ByteDance)	arXiv	2024-06-11
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	arXiv	2024-06-11		-
Wings: Learning Multimodal LLMs without Text-only Forgetting	arXiv	2024-06-05	-	-
Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment (MIVPG)	arXiv	2024-06-05	-	-
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM	arXiv	2024-06-05		-
OLIVE: Object Level In-Context Visual Embeddings	ACL 2024	2024-06-02		-
X-VILA: Cross-Modality Alignment for Large Language Model (by NVIDIA)	arXiv	2024-05-29	-
✨ Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models	arXiv	2024-05-27
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models	arXiv	2024-05-24		-
Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models	arXiv	2024-05-24	-	-
LOVA3: Learning to Visual Question Answering, Asking and Assessment	arXiv	2024-05-23		-
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability	arXiv	2024-05-23
Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta)	arXiv	2024-05-16
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts	arXiv	2024-05-09
✨ Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (Lumina-T2X, Flag-DiT) (Text2Any)	arXiv	2024-05-09
ImageInWords: Unlocking Hyper-Detailed Image Descriptions (Google)	arXiv	2024-05-05
✨ What matters when building vision-language models? (Idefics2)	arXiv	2024-05-03
MANTIS: Interleaved Multi-Image Instruction Tuning	arXiv	2024-05-02
Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs	CVPR 2024 Workshop	2024-04-23	-
VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing	-	2024-04-25
✨ SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation	arXiv	2024-04-22		-
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models	arXiv	2024-04-19
MoVA: Adapting Mixture of Vision Experts to Multimodal Context	arXiv	2024-04-19		-
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models	arXiv	2024-04-18	-
AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (AesExpert, AesMMIT Dataset)	arXiv	2024-04-15		-
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (Ferret-v2)	arXiv	2024-04-11	-	-
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (MiniCPM series)	arXiv	2024-04-09
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Ferret-UI)	arXiv	2024-04-08	-	-
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding	CVPR 2024	2024-04-08
Koala: Key frame-conditioned long video-LLM	CVPR 2024	2024-04-05
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens	arXiv	2024-04-04
LongVLM: Efficient Long Video Understanding via Large Language Models	arXiv	2024-04-04		-
VideoAgent: Long-form Video Understanding with Large Language Model as Agent (key frame)	arXiv	2024-03-15	-	-
✨ MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (Apple)	arXiv	2024-03-14	-	-
UniCode: Learning a Unified Codebook for Multimodal Large Language Models	arXiv	2024-03-14	-	-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	arXiv	2024-03-08	-
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models	arXiv	2024-03-05		-
RegionGPT: Towards Region Understanding Vision Language Model	CVPR 2024	2024-03-04	-
All in an Aggregated Image for In-Image Learning	arXiv	2024-02-28		-
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners	CVPR 2024	2024-02-27
TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages	arXiv	2024-02-25	-	-
LLMBind: A Unified Modality-Task Integration Framework	arXiv	2024-02-22	-	-
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling	arXiv	2024-02-19
✨ ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model (ALLaVA)	arXiv	2024-02-18
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model	arXiv	2024-02-06		-
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization	arXiv	2024-02-05
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action	arXiv	2023-12-28
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices	arXiv	2023-12-28		-
Generative Multimodal Models are In-Context Learners (Emu2)	CVPR 2024	2023-12-20
Gemini: A Family of Highly Capable Multimodal Models	arXiv	2023-12-19	-
✨ Osprey: Pixel Understanding with Visual Instruction Tuning	CVPR 2024	2023-12-15		-
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation	arXiv	2023-12-14		-
✨ VILA: On Pre-training for Visual Language Models (NVIDIA, MIT)	CVPR 2024	2023-12-12		-
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models	arXiv	2023-12-11
Prompt Highlighter: Interactive Control for Multi-Modal LLMs	CVPR 2024	2023-12-07
PixelLM: Pixel Reasoning with Large Multimodal Model	CVPR 2024	2023-12-04
APoLLo : Unified Adapter and Prompt Learning for Vision Language Models	EMNLP 2023	2023-12-04
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation	arXiv	2023-11-30
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	arXiv	2023-11-28
LLMGA: Multimodal Large Language Model based Generation Assistant	arXiv	2023-11-27
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models	arXiv	2023-11-22		-
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions	arXiv	2023-11-21
LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge	CVPR 2024	2023-11-20
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	arXiv	2023-11-16		-
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration	arXiv	2023-11-07		-
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning	arXiv	2023-10-14
EasyGen: Easing Multimodal Generation with a Bidirectional Conditional Diffusion Model and LLMs	arXiv	2023-10-13		-
Ferret: Refer and Ground Anything Anywhere at Any Granularity (Ferret)	ICLR 2024	2023-10-11		-
Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models	arXiv	2023-10-11
Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)	arXiv	2023-10-05
Kosmos-G: Generating Images in Context with Multimodal Large Language Models	ICLR 2024	2023-10-04
✨ MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens	arXiv	2023-10-03
Aligning Large Multimodal Models with Factually Augmented RLHF (LLaVA-RLHF, MMHal-Bench (hallucination))	arXiv	2023-09-25
DreamLLM: Synergistic Multimodal Comprehension and Creation	ICLR 2024	2023-09-20
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning	ICLR 2024	2023-09-14		-
NExT-GPT: Any-to-Any Multimodal LLM	arXiv	2023-09-11
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization (LaVIT)	ICLR 2024	2023-09-09		-
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	arXiv	2023-08-24
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages (VisCPM-Chat/Paint)	ICLR 2024	2023-08-23		-
Planting a SEED of Vision in Large Language Model	ICLR 2024	2023-07-16
Generative Pretraining in Multimodality (Emu1)	ICLR 2024	2023-07-11		-
SVIT: Scaling up Visual Instruction Tuning	arXiv	2023-07-09
Kosmos-2: Grounding Multimodal Large Language Models to the World (Kosmos-2, GrIT Dataset)	arXiv	2023-06-26
M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning	arXiv	2023-06-07	-
Generating Images with Multimodal Language Models (GILL)	NeurIPS 2023	2023-05-26
Any-to-Any Generation via Composable Diffusion (CoDi-1)	NeurIPS 2023	2023-05-19
SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities	EMNLP 2023 (Findings)	2023-05-18
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning	NeurIPS 2023	2023-05-11		-
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans	arXiv	2023-05-08		-
VPGTrans: Transfer Visual Prompt Generator across LLMs	NeurIPS 2023	2023-05-02
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality	arXiv	2023-04-27		-
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models	ICLR 2024	2023-04-20
Visual Instruction Tuning (LLaVA)	NeurIPS 2023	2023-04-17
Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)	NeurIPS 2023	2023-02-27		-
Multimodal Chain-of-Thought Reasoning in Language Models	arXiv	2023-02-02		-
Grounding Language Models to Images for Multimodal Inputs and Outputs (FROMAGe)	ICML 2023	2023-01-31
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	ICML 2023	2023-01-30		-
Flamingo: a Visual Language Model for Few-Shot Learning	NeurIPS 2022	2022-04-29		-

LMM Benchmark

Title	Venue	Date	Code	Supplement
Tarsier: Recipes for Training and Evaluating Large Video Description Models (Tarsier, Dream1k, by ByteDance)	arXiv	2024-07-30
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning	arXiv	2024-06-18	-
LOVA3: Learning to Visual Question Answering, Asking and Assessment	arXiv	2024-05-23		-
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI	arXiv	2024-04-24
BLINK: Multimodal Large Language Models Can See but Not Perceive	arXiv	2024-04-18
Ferret: Refer and Ground Anything Anywhere at Any Granularity (Ferret-Bench)	ICLR 2024	2023-10-11		-
Aligning Large Multimodal Models with Factually Augmented RLHF (LLaVA-RLHF, MMHal-Bench (hallucination))	arXiv	2023-09-25
Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations (AffectVisDial)	ECCV 2024	2023-08-30
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension	CVPR 2024	2023-07-30		-

Video LMM Benchmark

Title	Venue	Date
VideoVista: A Versatile Benchmark for Video Understanding and Reasoning (VideoVista)	arXiv	2024-06-17
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (Video-MME)	arXiv	2024-05-31
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark (MVBench)	CVPR 2024 Highlight	2023-11-28
Perception Test: A Diagnostic Benchmark for Multimodal Video Models (Perception Test, by Google DeepMind)	NeurIPS 2023	2023-05-23

Multimodal Dialogue

Title	Venue	Date	Code	Supplement
DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation	arXiv	2024-03-13		-
STICKERCONV: Generating Multimodal Empathetic Responses from Scratch	ACL 2024 Main	2024-01-20
VDialogUE: A Unified Evaluation Benchmark for Visually-grounded Dialogue	arXiv	2023-09-14	-	-
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts	ACL 2023	2023-05-24
TikTalk: A Multi-Modal Dialogue Dataset for Real-World Chitchat	ACM MM 2023	2023-01-14		Dataset
MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation	ACL 2023	2022-11-10		Dataset
Multimodal Dialogue Response Generation (Divter)	ACL 2022	2021-10-16	-	-
Maria: A Visual Experience Powered Conversational Agent	ACL 2021	2021-05-27		-
Multi-Modal Open-Domain Dialogue	EMNLP 2021	2020-10-02	-	-
Open Domain Dialogue Generation with Latent Images	AAAI 2021	2020-04-04	-	-
Learning to Respond with Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog	WWW 2020	2020-03-10		-

Multimodal Learning

Title	Venue	Date	Code	Supplement
Video as the New Language for Real-World Decision Making	arXiv	2024-02-27	-	-
Tokenize Anything via Prompting	arXiv	2023-12-14		-
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	ICLR 2024	2023-10-03		-
ImageBind: One Embedding Space To Bind Them All	CVPR 2023	2023-05-09
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks	CVPR 2023	2022-11-17		-
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks (BEiT-3)	CVPR 2023	2022-08-22		-
BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers	arXiv	2022-08-12		-
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning	AAAI 2023	2022-06-17
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework	ICML 2022	2022-02-07		-
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation	ICML 2022	2022-01-28		-
Uni-Perceiver: Pre-Training Unified Architecture for Generic Perception for Zero-Shot and Few-Shot Tasks	CVPR 2022	2021-12-02		-
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation (ALBEF)	NeurIPS 2021	2021-07-16
BEiT: BERT Pre-Training of Image Transformers	ICLR 2022	2021-06-15		-
Learning Transferable Visual Models From Natural Language Supervision	ICML 2021	2021-02-26
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision	ICML 2021	2021-02-05		-

Image-to-Text Generation

Title	Venue	Date	Code	Supplement
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? (LaDiC)	NAACL 2024	2024-04-16		-

Text/Image-to-Image Generation

Title	Venue	Date	Code	Supplement
✨ OmniGen: Unified Image Generation	arXiv	2024-09-17
✨ Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens (by Kaiming He, DeepMind, MIT)	arXiv	2024-10-17	-	-
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (Lumina-T2X, Flag-DiT) (Text2Any)	arXiv	2024-05-09
FreeU: Free Lunch in Diffusion U-Net (FreeU, by Ziwei Liu)	CVPR 2024 Oral	2023-09-20
Lazy Diffusion Transformer for Interactive Image Editing	arXiv	2024-04-18	-
Salient Object-Aware Background Generation using Text-Guided Diffusion Models	CVPR 2024 Workshop	2024-04-15		-
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing	arXiv	2024-04-15
UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark (UNIAA-LLaVA, UNIAA-Bench)	arXiv	2024-04-15	-	-
PMG: Personalized Multimodal Generation with Large Language Models	WWW 2024	2024-04-07	-	-
Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models	arXiv	2024-04-05
Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models	CVPR 2024	2024-04-05	-	-
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (VAR)	arXiv	2024-04-03
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation (Huawei, Enze Xie)	arXiv	2024-03-07
Multi-LoRA Composition for Image Generation	arXiv	2024-02-26
PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models (Huawei, Enze Xie)	arXiv	2024-01-10
Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model	AAAI 2024	2023-12-19		-
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models (Tencent Xintao Wang)	arXiv	2023-12-11
InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following	arXiv	2023-12-11
Emu Edit: Precise Image Editing via Recognition and Generation Tasks	arXiv	2023-11-16	-
BeautifulPrompt: Towards Automatic Prompt Engineering for Text-to-Image Synthesis	EMNLP 2023	2023-11-12
AnyText: Multilingual Visual Text Generation And Editing	ICLR 2024	2023-11-06		-
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis (Huawei, Enze Xie)	ICLR 2024 Spotlight	2023-09-30
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models	arXiv	2023-08-13
Kosmos-G: Generating Images in Context with Multimodal Large Language Models	ICLR 2024	2023-10-04
Improving Image Generation with Better Captions (DALL-E 3)	OpenAI	2023	-	-
Scaling up GANs for Text-to-Image Synthesis (GigaGAN)	CVPR 2023	2023-05-09
Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet)	ICCV 2023	2023-02-10		-
Scalable Diffusion Models with Transformers (DiT)	ICCV 2023	2022-12-19
InstructPix2Pix: Learning to Follow Image Editing Instructions	CVPR 2023	2022-11-17
All are Worth Words: A ViT Backbone for Diffusion Models (U-ViT, first Diffusion Transformer) (RUC, Chongxuan Li)	CVPR 2023	2022-09-25		-
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation	CVPR 2023	2022-08-25
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen)	NeurIPS 2022	2022-05-23
Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)	OpenAI	2022-04-13		-
High-Resolution Image Synthesis with Latent Diffusion Models (LDM, Stable Diffusion)	CVPR 2022	2021-12-20		-
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models	ICML 2022	2021-12-20		-
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion	ECCV 2022	2021-11-24		-
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations	ICLR 2022	2021-08-02
CogView: Mastering Text-to-Image Generation via Transformers	NeurIPS 2021	2021-05-26		-
Zero-Shot Text-to-Image Generation (DALL-E 1)	ICML 2021	2021-02-24
Taming Transformers for High-Resolution Image Synthesis (VQ-GAN)	CVPR 2021	2020-12-17

Video Generation

Title	Venue	Date	Code	Supplement
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions (Mira)	arXiv	2024-07-08		-
VIMI: Grounding Video Generation through Multi-modal Instruction	arXiv	2024-07-08
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation	arXiv	2024-07-02	-	-
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (Lumina-T2X, Flag-DiT) (Text2Any)	arXiv	2024-05-09
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text (Long Video Generation)	arXiv	2024-03-21
AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks	arXiv	2024-03-21
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation (FRESCO) (NTU, Ziwei Liu)	CVPR 2024	2024-03-19
Latte: Latent Diffusion Transformer for Video Generation (Latte) (NTU, Ziwei Liu)	arXiv	2024-01-05
FreeInit: Bridging Initialization Gap in Video Diffusion Models (FreeInit) (NTU, Ziwei Liu)	arXiv	2023-12-12
VideoBooth: Diffusion-based Video Generation with Image Prompts (VideoBooth) (NTU, Ziwei Liu)	arXiv	2023-12-01
VBench: Comprehensive Benchmark Suite for Video Generative Models [Benchmark] (VBench) (NTU, Ziwei Liu)	CVPR 2024	2023-11-29
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets (SVD)	arXiv	2023-11-25
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction (NTU, Ziwei Liu)	ICLR 2024	2023-10-31
FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling (FreeNoise) (NTU, Ziwei Liu)	ICLR 2024	2023-10-23
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models (LaVie) (NTU, Ziwei Liu)		2023-09-26

Multimodal Dataset

Title	Venue	Date	Annotation	Source
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data	arXiv	2024-10-24	-
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions	NeurIPS 2024	2024-10-14
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens	arXiv	2024-06-17
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions (Mira)	arXiv	2024-07-08	Video Generation
GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension	IJCAI 2024	2024-06-26	-
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation	arXiv	2024-06-15		-
TextSquare: Scaling up Text-Centric Visual Instruction Tuning	arXiv	2024-04-19	Visual Instruction Tuning	-
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing	arXiv	2024-04-15	Instruction Image Editing
AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (AesExpert, AesMMIT Dataset)	arXiv	2024-04-15	Aesthetic Multi-Modality Instruction Tuning
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers	CVPR 2024	2024-02-29	video-caption
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model	arXiv	2024-02-18	GPT4V-synthesized Data
STICKERCONV: Generating Multimodal Empathetic Responses from Scratch	ACL 2024 Main	2024-01-20	Multimodal Empathetic Dialogue
SVIT: Scaling up Visual Instruction Tuning	arXiv	2023-07-09	Instruction Tuning
Kosmos-2: Grounding Multimodal Large Language Models to the World (Kosmos-2, GrIT Dataset)	arXiv	2023-06-26	Grounded image-text pairs
M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning	arXiv	2023-06-07	Instruction Tuning
Visual Instruction Tuning (LLaVA)	NeurIPS 2023	2023-04-17	Instruction Tuning
Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text	NeurIPS D&B 2023	2023-04-14	Interleaved Image-Text
TikTalk: A Multi-Modal Dialogue Dataset for Real-World Chitchat	ACM MM 2023	2023-01-14	Multimodal Dialogue
MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation	ACL 2023	2022-11-10	Multimodal Dialogue
LAION-5B: An open large-scale dataset for training next generation image-text models	NeurIPS 2022	2022-10-16	Image-Text Pairs
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs	NeurIPS Workshop 2021	2021-11-03	Image-Text Pairs
MMConv: An Environment for Multimodal Conversational Search across Multiple Domains	ACM SIGIR 2021	2021-07	Multimodal Dialogue
PhotoChat: A Human-Human Dialogue Dataset With Photo Sharing Behavior For Joint Image-Text Modeling	ACL 2021	2021-07-06	Open-domain Multimodal Dialogue
Image-Chat: Engaging Grounded Conversations	ACL 2020	2018-11-02	Multimodal Dialogue

Multimodal Summary

Title	Venue	Date	Latest Update
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding (TikTok)	arXiv	2024-09-27	-
Video Diffusion Models: A Survey	arXiv	2024-05-06	-
Theoretical research on generative diffusion models: an overview	arXiv	2024-04-13	-
A Review of Multi-Modal Large Language and Vision Models	arXiv	2024-03-28	-
The (R)Evolution of Multimodal Large Language Models: A Survey	arXiv	2024-02-19	-
MM-LLMs: Recent Advances in MultiModal Large Language Models	arXiv	2024-01-24	2024-02-20
Multimodal Large Language Models: A Survey	IEEE BigData 2023	2023-11-22	-
Multimodal Foundation Models: From Specialists to General-Purpose Assistants	CVPR 2023	2023-09-18	-
Understanding Deep Learning	-	2023	-
Large Multimodal Models: Notes on CVPR 2023 Tutorial	CVPR 2023	2023-06-26	-
A Survey on Multimodal Large Language Models	arXiv	2023-06-23	2024-04-01
Multimodal Deep Learning	arXiv	2023-01-12	-
Diffusion Models: A Comprehensive Survey of Methods and Applications	ACM Computing Surveys	2022-09-02	2024-02-06
Multimodal Learning with Transformers: A Survey	IEEE TPAMI 2023	2022-01-13	2023-05-10
Multimodal Machine Learning: A Survey and Taxonomy	IEEE TPAMI 2019	2017-05-26	2017-08-01

Paper Notes

here

friedrichor/Awesome-Multimodal-Papers
