This is a repository for organizing papers, code, and other resources related to avatars (talking-face and talking-body synthesis).
If you have any suggestions (missing or new papers, key researchers, or typos), please feel free to open a pull request.
- 2024.09.07: add ASR and TTS tools
- 2024.08.24: add backgrounds for image/video generation
- 2024.08.24: re-organize the paper list with table formatting
- 2024.08.24: add works on full-body avatar synthesis
- Main paper list
- Researchers list
- Toolbox for avatar
- Add paper link
- Add paper notes
- Add code if available
- Add project page if available
- Datasets and metrics
- Related links
- NVIDIA Research
- Neural rendering models for human generation: vid2vid NeurIPS'18, fs-vid2vid NeurIPS'19, EG3D CVPR'22;
- Talking-face synthesis: face-vid2vid CVPR'21, Implicit NeurIPS'22, SPACE ICCV'23, One-shot Neural Head Avatar arXiv'23;
- Talking-body synthesis: DreamPose ICCV'23;
- Face enhancement (relighting, restoration, etc.): Lumos SIGGRAPH Asia 2022, RANA ICCV'23;
- Authorized use of synthetic videos: Avatar Fingerprinting arXiv'23;
- Aliaksandr Siarohin @ Snap Research
- Neural rendering models for human generation (focus on flow-based generative models): Unsupervised-Volumetric-Animation CVPR'23, 3DAvatarGAN CVPR'23, 3D-SGAN ECCV'22, Articulated-Animation CVPR'21, Monkey-Net CVPR'19, FOMM NeurIPS'19;
- Ziwei Liu @ Nanyang Technological University
- Talking-face synthesis: StyleSync CVPR'23, AV-CAT SIGGRAPH Asia 2022, StyleGANEX ICCV'23, StyleSwap ECCV'22, PC-AVS CVPR'21, Speech2Talking-Face IJCAI'21, VToonify SIGGRAPH Asia 2022;
- Talking-body synthesis: MotionDiffuse arXiv'22;
- Face enhancement (relighting, restoration, etc.): Relighting4D ECCV'22;
- Xiaodong Cun @ Tencent AI Lab:
- Talking-face synthesis: StyleHEAT ECCV'22, VideoReTalking SIGGRAPH Asia 2022, ToonTalker ICCV'23, DPE CVPR'23, CodeTalker CVPR'23, SadTalker CVPR'23;
- Talking-body synthesis: LivelySpeaker ICCV'23;
- Max Planck Institute for Informatics:
- 3D face models (e.g., 3DMM): FLAME SIGGRAPH Asia 2017;
Conference | Paper | Affiliation | Codebase | Notes |
---|---|---|---|---|
CVPR 2021 | Function4D: Real-time Human Volumetric Capture from Very Sparse Consumer RGBD Sensors | Tsinghua University | Dataset | |
ECCV 2022 | HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling | Shanghai Artificial Intelligence Laboratory | Dataset | |
SIGGRAPH 2023 | AvatarReX: Real-time Expressive Full-body Avatars | Tsinghua University | Dataset | |
arXiv 2024 | A Survey on 3D Human Avatar Modeling - From Reconstruction to Generation | The University of Hong Kong | ||
arXiv 2024 | From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations | Meta Reality Labs Research | Code | conversational avatar |
CVPR 2024 | Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling | Tsinghua University | Code | |
CVPR 2024 | 4K4D: Real-Time 4D View Synthesis at 4K Resolution | Zhejiang University | Code | real-time synthesis with 3DGS |
Conference | Paper | Affiliation | Codebase | Notes |
---|---|---|---|---|
ICCV 2021 | AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis | University of Science and Technology of China | Code | |
ECCV 2022 | Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis | Tsinghua University | Code | |
ICLR 2023 | GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis | Zhejiang University | Code | |
ICCV 2023 | Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis | Beihang University | Code | |
arXiv 2023 | GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation | Zhejiang University | Code | |
CVPR 2024 | SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis | Renmin University of China | Code | |
ECCV 2024 | TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting | Beihang University | Code |
Audio-Visual Datasets for English Speakers | ||||||
---|---|---|---|---|---|---|
Dataset name | Environment | Year | Resolution | Subjects | Duration | Sentences
VoxCeleb1 | Wild | 2017 | 360p~720p | 1251 | 352 hours | 100k |
VoxCeleb2 | Wild | 2018 | 360p~720p | 6112 | 2442 hours | 1128k |
HDTF | Wild | 2020 | 720p~1080p | 300+ | 15.8 hours | |
LSP | Wild | 2021 | 720p~1080p | 4 | 18 minutes | 100k |
Audio-Visual Datasets for Chinese Speakers | ||||||
---|---|---|---|---|---|---|
Dataset name | Environment | Year | Resolution | Subjects | Duration | Sentences
CMLR | Lab | 2019 | | 11 | | 102k
MAVD | Lab | 2023 | 1920x1080 | 64 | 24 hours | 12k |
CN-Celeb | Wild | 2020 | | 3000 | 1200 hours |
CN-Celeb-AV | Wild | 2023 | | 1136 | 660 hours |
CN-CVS | Wild | 2023 | | 2500+ | 300+ hours |
Lip-Sync | ||
---|---|---|
Metric name | Description | Code/Paper |
LMD↓ | Mouth landmark distance (see the sketch after this table) |
MA↑ | The Intersection-over-Union (IoU) for the overlap between the predicted mouth area and the ground-truth area |
Sync↑ | The confidence score from SyncNet | wav2lip
LSE-C↑ | Lip Sync Error - Confidence | wav2lip |
LSE-D↓ | Lip Sync Error - Distance | wav2lip |
Image Quality (identity preserving) | ||
Metric name | Description | Code/Paper |
MAE↓ | Mean Absolute Error metric for image | mmagic |
MSE↓ | Mean Squared Error metric for image | mmagic |
PSNR↑ | Peak Signal-to-Noise Ratio | mmagic |
SSIM↑ | Structural similarity for image | mmagic |
FID↓ | Fréchet Inception Distance | mmagic
IS↑ | Inception score | mmagic |
NIQE↓ | Natural Image Quality Evaluator metric | mmagic |
CSIM↑ | The cosine similarity of identity embeddings (see the sketch after this table) | InsightFace
CPBD↑ | The cumulative probability blur detection | python-cpbd |
Diversity | ||
Metric name | Description | Code/Paper |
Diversity of head motions↑ | Standard deviation of the head-motion feature embeddings extracted from the generated frames with Hopenet (Ruiz et al., 2018) | SadTalker
Beat Align Score↑ | Alignment between the audio and the generated head motions, computed as in Bailando (Siyao et al., 2022) | SadTalker
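Most of the metrics above reduce to a few lines of array math once landmarks or embeddings have been extracted. Below is a minimal Python sketch of LMD and CSIM, assuming mouth landmarks and identity embeddings are already available (e.g., from OpenFace and InsightFace); the array shapes, landmark count, and embedding size are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def lmd(pred_landmarks: np.ndarray, gt_landmarks: np.ndarray) -> float:
    """Mouth landmark distance (LMD, lower is better).

    Both inputs are assumed to have shape (frames, mouth_points, 2),
    holding 2D mouth landmarks from the generated and ground-truth videos.
    """
    # Euclidean distance per landmark, averaged over points and frames.
    return float(np.linalg.norm(pred_landmarks - gt_landmarks, axis=-1).mean())

def csim(pred_emb: np.ndarray, gt_emb: np.ndarray) -> float:
    """Cosine similarity of identity embeddings (CSIM, higher is better).

    Embeddings are assumed to come from a face-recognition model such as
    an InsightFace backbone, one vector per generated/ground-truth pair.
    """
    return float(np.dot(pred_emb, gt_emb) /
                 (np.linalg.norm(pred_emb) * np.linalg.norm(gt_emb)))

# Toy usage with random data, just to show the expected shapes.
pred = np.random.rand(100, 20, 2)                # 100 frames, 20 mouth points
gt = pred + 0.01 * np.random.randn(*pred.shape)  # perturbed "ground truth"
print(f"LMD:  {lmd(pred, gt):.4f}")
print(f"CSIM: {csim(np.random.rand(512), np.random.rand(512)):.4f}")
```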
- A general toolbox for AIGC, including common metrics and models https://github.com/open-mmlab/mmagic
- face3d: Python tools for processing 3D face https://github.com/yfeng95/face3d
- 3DMM model fitting using Pytorch https://github.com/ascust/3DMM-Fitting-Pytorch
- OpenFace: a facial behavior analysis toolkit https://github.com/TadasBaltrusaitis/OpenFace
- autocrop: Automatically detects and crops faces from batches of pictures https://github.com/leblancfg/autocrop
- OpenPose: Real-time multi-person keypoint detection library for body, face, hands, and foot estimation https://github.com/CMU-Perceptual-Computing-Lab/openpose
- GFPGAN: Practical Algorithm for Real-world Face Restoration https://github.com/TencentARC/GFPGAN
- CodeFormer: Robust Blind Face Restoration https://github.com/sczhou/CodeFormer
- metahuman-stream: Real-time interactive streaming digital human https://github.com/lipku/metahuman-stream
- EasyVolcap: a PyTorch library for accelerating neural volumetric video research https://github.com/zju3dv/EasyVolcap
- 3D Model component in Gradio https://www.gradio.app/guides/how-to-use-3D-model-component
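As a quick illustration of the Gradio guide linked above, the built-in `gr.Model3D` component displays common mesh formats (.obj, .glb, .gltf) directly in the browser. A minimal sketch of a viewer app, with the identity function standing in for a real mesh-processing step:

```python
import gradio as gr

def load_mesh(mesh_file_name):
    # Placeholder: a real app might run reconstruction or avatar fitting here.
    return mesh_file_name

# Upload a mesh on the left, render it on the right.
demo = gr.Interface(
    fn=load_mesh,
    inputs=gr.Model3D(label="Input mesh"),
    outputs=gr.Model3D(label="Rendered model"),
    title="Minimal 3D model viewer",
)

if __name__ == "__main__":
    demo.launch()
```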
- BELLE-2/Belle-whisper-large-v3-zh https://huggingface.co/BELLE-2/Belle-whisper-large-v3-zh
- SenseVoice (multilingual) https://github.com/FunAudioLLM/SenseVoice 👍👍
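For a quick start with the Belle-whisper checkpoint above, here is a minimal sketch using the Hugging Face `transformers` ASR pipeline; the audio path is a placeholder, and `ffmpeg` is assumed to be installed for audio decoding.

```python
from transformers import pipeline

# Whisper model fine-tuned for Chinese; see the BELLE-2 model card above.
asr = pipeline(
    "automatic-speech-recognition",
    model="BELLE-2/Belle-whisper-large-v3-zh",
)

# "speech.wav" is a placeholder path to a short mono recording.
result = asr("speech.wav")
print(result["text"])
```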
- CosyVoice, Alibaba Tongyi SpeechTeam https://github.com/FunAudioLLM/CosyVoice 👍👍
- FireRedTTS, FireRedTeam https://github.com/FireRedTeam/FireRedTTS
- GPT-SoVITS https://github.com/RVC-Boss/GPT-SoVITS
- Mini-Omni, Tsinghua University https://github.com/gpt-omni/mini-omni
- Speech To Speech, HuggingFace https://github.com/huggingface/speech-to-speech
If you are interested in avatars and digital humans, we also recommend checking out these related collections:
- awesome digital human https://github.com/weihaox/awesome-digital-human