[ACM MM'23] UMMAFormer: A Universal Multimodal-adaptive Transformer Framework For Temporal Forgery Localization
Temporal Video Inpainting Localization (TVIL) dataset and PyTorch training/validation code for UMMAFormer. This is the official repository of our work accepted to ACM MM'23. If you have any questions, please contact zhangrui1997[at]stu.scu.edu.cn. The paper can be found on arXiv or in the ACM Digital Library.
The emergence of artificial intelligence-generated content (AIGC) has raised concerns about the authenticity of multimedia content in various fields. Existing research is of limited use in industrial settings because it focuses only on binary classification of complete videos. We propose UMMAFormer, a novel universal transformer framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation. We also propose a Temporal Feature Abnormal Attention (TFAA) module based on temporal feature reconstruction to enhance the detection of temporal differences. In addition, we introduce a Parallel Cross-Attention Feature Pyramid Network (PCA-FPN) to optimize the Feature Pyramid Network (FPN) for subtle feature enhancement. To address the lack of available datasets, we introduce a novel Temporal Video Inpainting Localization (TVIL) dataset specifically tailored for video inpainting scenes. Our experiments demonstrate that the proposed method achieves state-of-the-art performance on the benchmark datasets Lav-DF, TVIL, and Psynd, surpassing the previous best results significantly.
If you need the TVIL dataset for academic purposes, please download the full data from BaiduYun Disk (8tj1) or OneDrive.
The raw data comes from YouTube-VOS 2018.
We used four different video inpainting methods to create new videos: E2FGVI, FGT, FuseFormer, and STTN. We used XMem to generate the inpainting masks.
We also provide the TSN features (code:8tj1) used in the paper, extracted with mmaction2==0.24.1.
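For a quick look at the downloaded features, the snippet below shows one way to inspect them. The layout (one .npy file per video, shaped [num_snippets, feature_dim]) and the file path are assumptions for illustration, not guarantees about the released archives.

```python
# Hedged sketch: inspect a downloaded TSN feature file.
# Assumptions (not stated in this README): one .npy file per video with
# shape [num_snippets, feature_dim]; the path below is hypothetical.
import numpy as np

feat = np.load("data/tvil/feats/tsn/example_video.npy")  # hypothetical path
print(feat.shape, feat.dtype)  # e.g. (num_snippets, feature_dim), float32
```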
- Linux
- Python 3.5+
- PyTorch 1.11
- TensorBoard
- CUDA 11.0+
- GCC 4.9+
- NumPy 1.11+
- PyYaml
- Pandas
- h5py
- joblib
- einops
Part of the NMS is implemented in C++. The code can be compiled by running:
cd ./libs/utils
python setup.py install --user
cd ../..
The code should be recompiled every time you update PyTorch.
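As a rough check that the extension built correctly, you can try importing it. The module name `nms_1d_cpu` is an assumption carried over from ActionFormer, on which this code is based, so adjust it if your build names it differently.

```python
# Hedged sanity check: verify the compiled NMS extension is importable.
# The module name "nms_1d_cpu" is assumed from ActionFormer's convention.
import importlib

try:
    importlib.import_module("nms_1d_cpu")
    print("NMS extension found.")
except ImportError as err:
    print(f"NMS extension missing; recompile under ./libs/utils ({err})")
```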
- Download Features and Annotations
We provide the following features and annotations for download:
annotations and features of Lav-DF from BaiduYun (code:k6jq) or OneDrive
annotations and features of Psynd from BaiduYun (code:m6iq) or OneDrive
annotations and features of TVIL from BaiduYun (code:8tj1) or OneDrive
These features are the same as those used in our paper and were extracted with the BYOL-A and TSN models; they can be used directly for training and testing. The labels have been converted from their original formats to the format expected by our code, but the ground-truth values themselves are unchanged.
Optional: you can also extract the features yourself using mmaction2==0.24.1 and BYOL-A. First, apply to the official sources for the original Lav-DF and Psynd datasets. Then download mmaction2==0.24.1 and BYOL-A and set up their environments following the official instructions. Next, extract frames and optical flow from the videos; you can use mmaction2 for this purpose. For Lav-DF, you also need to separate the corresponding audio from the original videos. The pre-trained models also need to be downloaded from tsn_rgb and tsn_flow. You can use the following commands to generate a video list txt file for Lav-DF and extract the visual features.
python tools/gen_lavdf_filelist.py
bash tools/gen_tsn_features_lavdf.sh
For audio features, please put tools/byola_extract_lavdf.py in the BYOL-A directory and use the following command.
python byol-a/byola_extract_lavdf.py
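If you still need to separate the audio tracks mentioned above, a hypothetical helper like the one below can do it with ffmpeg before running BYOL-A. The directory layout, the 16 kHz mono output, and the function name are assumptions for illustration, not part of the repository.

```python
# Hypothetical helper (not part of the repo): dump mono WAV audio from the
# Lav-DF videos with ffmpeg before BYOL-A feature extraction.
import subprocess
from pathlib import Path

def extract_audio(video_dir: str, audio_dir: str, sample_rate: int = 16000) -> None:
    """Write <audio_dir>/<video_stem>.wav for every .mp4 under video_dir."""
    out_root = Path(audio_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    for video in sorted(Path(video_dir).rglob("*.mp4")):
        wav_path = out_root / (video.stem + ".wav")
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(video), "-vn",
             "-ac", "1", "-ar", str(sample_rate), str(wav_path)],
            check=True,
        )

# Example call (paths are assumptions):
# extract_audio("data/lavdf/videos/train", "data/lavdf/audio/train")
```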
- Unpack Features and Annotations
- Unpack the file under ./data (or elsewhere and link to ./data).
- The folder structure should look like the following (a small layout check is sketched after the tree):
This folder
│ README.md
│ ...
│
└───data/
│ └───lavdf/
│ │ └───annotations
│ │ └───feats
│ │ └───byola
│ │ └───train
│ │ └───dev
│ │ └───test
│ │ └───tsn
│ │ └───flow
│ │ └───train
│ │ └───dev
│ │ └───test
│ │ └───rgb
│ │ └───train
│ │ └───dev
│ │ └───test
│ └───...
|
└───libs
│
│ ...
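After unpacking, a small script like the one below can confirm that the expected directories exist. The subset of paths checked is illustrative and mirrors the tree above; adjust it if you unpacked the data elsewhere.

```python
# Hedged sketch: check that the unpacked Lav-DF layout matches the tree above.
from pathlib import Path

expected = [
    "data/lavdf/annotations",
    "data/lavdf/feats/byola/train",
    "data/lavdf/feats/tsn/flow/train",
    "data/lavdf/feats/tsn/rgb/train",
]
for rel in expected:
    print(rel, "OK" if Path(rel).is_dir() else "MISSING")
```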
- Training and Evaluation
Train our UMMAFormer with TSN and BYOL-A features. This will create an experiment folder ./paper_results that stores the training config, logs, and checkpoints.
python ./train.py ./configs/UMMAFormer/dataset.yaml
Then you can run evaluation with the trained model on the evaluation dataset:
python ./eval.py ./configs/UMMAFormer/dataset.yaml ./paper_results/dataset/model_best.pth.tar
For Psynd, modify the configuration file by changing the value of "test_split" to the corresponding subset name, such as "test_cellular" or "test_landline". You can then calculate the IoU for each subset with the following command, after modifying the 'split' variable and the paths of the labels and results in the script:
python tools/test_miou.py
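For reference, the per-subset IoU boils down to the 1-D intersection-over-union sketched below. This is illustrative only; tools/test_miou.py remains the authoritative implementation.

```python
# Illustrative sketch of 1-D temporal IoU between two segments (start, end).
def temporal_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((1.0, 3.0), (2.0, 4.0)))  # 0.333...
```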
- Evaluating Our Pre-trained Models
We also provide pre-trained models. The links below point to Baidu cloud drive; since some users may not be able to access it, we additionally provide OneDrive links. A minimal checkpoint-inspection sketch follows the results table.
Dataset | Modal | Config | Pretrained | AP@0.5 | AP@0.75 | AP@0.95 | AR@10 | AR@20 | AR@50 | AR@100 |
---|---|---|---|---|---|---|---|---|---|---|
Lav-DF | V | Yaml | Ckpt | 97.30 | 92.96 | 25.68 | 90.19 | 90.85 | 91.14 | 91.18 |
Lav-DF | V+A | Yaml | Ckpt | 98.83 | 95.54 | 37.61 | 92.10 | 92.42 | 92.47 | 92.48 |
Lav-DF Subset | V | Yaml | Ckpt | 98.83 | 95.95 | 30.11 | 92.32 | 92.65 | 92.74 | 92.75 |
Lav-DF Subset | V+A | Yaml | Ckpt | 98.54 | 94.30 | 37.52 | 91.61 | 91.97 | 92.06 | 92.06 |
TVIL | V | Yaml | Ckpt | 88.68 | 84.70 | 62.43 | 87.09 | 88.21 | 90.43 | 91.16 |
Psynd-Test | A | Yaml | Ckpt | 100.00 | 100.00 | 79.87 | 97.60 | 97.60 | 97.60 | 97.60 |
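The sketch below shows one way to peek inside a downloaded checkpoint before passing it to eval.py. The top-level keys (e.g. "state_dict") are assumptions based on common PyTorch checkpoint conventions, not a documented format.

```python
# Hedged sketch: inspect a downloaded checkpoint before evaluation.
# Key names are assumed from common PyTorch convention and may differ.
import torch

ckpt = torch.load("./paper_results/dataset/model_best.pth.tar", map_location="cpu")
print(list(ckpt.keys()))  # e.g. ['epoch', 'state_dict', ...]
# Evaluation itself goes through eval.py, e.g.:
#   python ./eval.py ./configs/UMMAFormer/dataset.yaml ./paper_results/dataset/model_best.pth.tar
```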
- Release full code.
- Release TVIL datasets and TSN features.
- Release TSN features and BYOL-A features for Lav-DF and Psynd.
- Release our pre-trained models.
@inproceedings{DBLP:conf/mm/ZhangWDLZZ23,
author = {Rui Zhang and
Hongxia Wang and
Mingshan Du and
Hanqing Liu and
Yang Zhou and
Qiang Zeng},
title = {UMMAFormer: {A} Universal Multimodal-adaptive Transformer Framework
for Temporal Forgery Localization},
booktitle = {Proceedings of the 31st {ACM} International Conference on Multimedia,
{MM} 2023, Ottawa, ON, Canada, 29 October 2023- 3 November 2023},
pages = {8749--8759},
publisher = {{ACM}},
year = {2023},
url = {https://doi.org/10.1145/3581783.3613767},
doi = {10.1145/3581783.3613767},
}
Thanks to the authors of ActionFormer; our code is based on their implementation.