Official PyTorch implementation of Egoinstructor (CVPR 2024)
Retrieval-Augmented Egocentric Video Captioning
Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Given an egocentric video, Egoinstructor automatically retrieves semantically relevant instructional videos (e.g., from HowTo100M) via a pretrained cross-view retrieval model, and leverages their visual and textual information to generate a caption for the egocentric video. A minimal sketch of this pipeline is given below.
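The sketch below outlines the retrieve-then-caption flow described above. All names here (`retrieval_model`, `captioner`, `exo_index`, and their methods) are hypothetical placeholders for illustration, not the actual API of this repository.

```python
import torch

@torch.no_grad()
def caption_egocentric_video(ego_video, exo_index, retrieval_model, captioner, k=4):
    """Retrieve the top-k exocentric videos for an ego clip, then caption it."""
    # Embed the egocentric clip with the pretrained cross-view retrieval model.
    ego_feat = retrieval_model.encode_ego(ego_video)        # (1, D), L2-normalised
    # Nearest-neighbour search over precomputed exocentric (HowTo100M) features.
    sims = ego_feat @ exo_index["features"].T               # (1, N)
    topk = sims.topk(k, dim=-1).indices.squeeze(0)          # (k,)
    retrieved_texts = [exo_index["texts"][i] for i in topk]
    # Generate the egocentric caption conditioned on the retrieved narrations.
    return captioner.generate(ego_video, context=retrieved_texts)
```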
- Retrieval code and data released
- Captioning code and data released
- Online Demo
- Pre-trained retrieval checkpoints
- Pre-trained captioning checkpoints
For environment setup, please refer to env.md.
To train the ego-exo cross-view retrieval module, please refer to retrieval; a sketch of a typical training objective follows.
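For reference, cross-view retrieval models of this kind are commonly trained with a symmetric InfoNCE objective over paired ego/exo clips. The sketch below illustrates that standard loss under this assumption; it is not necessarily the exact objective used in retrieval.

```python
import torch
import torch.nn.functional as F

def infonce_loss(ego_emb, exo_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    ego_emb, exo_emb: (B, D) L2-normalised features from the two view
    encoders; matching ego/exo clips share the same batch index.
    """
    logits = ego_emb @ exo_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(ego_emb.size(0), device=ego_emb.device)
    # Pull matched ego/exo pairs together and push mismatched pairs apart,
    # in both retrieval directions (ego-to-exo and exo-to-ego).
    loss_e2x = F.cross_entropy(logits, targets)
    loss_x2e = F.cross_entropy(logits.T, targets)
    return (loss_e2x + loss_x2e) / 2
```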
To train a retrieval-augmented egocentric video captioning model, please refer to captioning.
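As a rough illustration of how the captioner can be trained with retrieved context, the sketch below shows teacher-forced next-token prediction conditioned on the ego video and the retrieved narrations. The `model` and `tokenizer` interfaces are hypothetical (HuggingFace-style) stand-ins; see captioning for the actual implementation.

```python
import torch.nn.functional as F

def captioning_step(model, tokenizer, ego_video, retrieved_texts, gt_caption):
    """One hypothetical training step: teacher-forced next-token prediction
    conditioned on the ego video and retrieved exocentric narrations."""
    # Prepend the retrieved narrations as textual context for the caption.
    prompt = " ".join(retrieved_texts)
    inputs = tokenizer(prompt + " " + gt_caption, return_tensors="pt")
    # Assumed interface: the model fuses ego-video features with the text
    # (e.g. via cross-attention) and returns per-token logits of shape (B, T, V).
    logits = model(video=ego_video, input_ids=inputs["input_ids"])
    # Standard shifted cross-entropy over the token sequence.
    loss = F.cross_entropy(
        logits[:, :-1].flatten(0, 1),          # predictions for tokens 1..T-1
        inputs["input_ids"][:, 1:].flatten(),  # ground-truth next tokens
    )
    return loss
```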
If this work is helpful for your research, please consider citing our work:
@article{xu2024retrieval,
  title={Retrieval-augmented egocentric video captioning},
  author={Xu, Jilan and Huang, Yifei and Hou, Junlin and Chen, Guo and Zhang, Yuejie and Feng, Rui and Xie, Weidi},
  journal={arXiv preprint arXiv:2401.00789},
  year={2024}
}
This project is released under the MIT License.
This project is built upon LaViLA and Otter. Thanks to the contributors of these great codebases.