Egoinstructor

Official PyTorch implementation of Egoinstructor, presented at CVPR 2024.

Retrieval-Augmented Egocentric Video Captioning
Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

Paper | Project Page

Given an egocentric video, Egoinstructor automatically retrieves semantically relevant instructional videos (e.g., from HowTo100M) via a pretrained cross-view retrieval model, then leverages their visual and textual information to generate a caption for the egocentric video.
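The retrieval step above can be sketched as nearest-neighbor search over video embeddings by cosine similarity. This is a minimal illustration only, not the repository's actual API: the function name `retrieve_topk` and the toy random embeddings are hypothetical stand-ins for the pretrained cross-view retrieval model's outputs.

```python
import numpy as np

def retrieve_topk(ego_emb, exo_embs, k=3):
    """Return indices of the k exocentric (instructional) videos whose
    embeddings are most similar to the egocentric query, by cosine similarity."""
    ego = ego_emb / np.linalg.norm(ego_emb)
    exo = exo_embs / np.linalg.norm(exo_embs, axis=1, keepdims=True)
    sims = exo @ ego                      # cosine similarity of each candidate
    return np.argsort(-sims)[:k]          # indices sorted by descending similarity

# Toy example: 100 random 256-d embeddings stand in for a HowTo100M index.
rng = np.random.default_rng(0)
exo_embs = rng.standard_normal((100, 256))
# The query is a slightly perturbed copy of candidate 42, so 42 should rank first.
ego_emb = exo_embs[42] + 0.01 * rng.standard_normal(256)
top = retrieve_topk(ego_emb, exo_embs, k=3)
print(top[0])  # → 42
```

In the actual pipeline, the captions of the retrieved videos would then be passed as additional context to the captioning model.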

Roadmap

- Retrieval code and data released
- Captioning code and data released
- Online Demo
- Pre-trained retrieval checkpoints
- Pre-trained captioning checkpoints

Prepare environment

Please refer to env.md

Cross-view Retrieval Module

To train an ego-exo cross-view retrieval module, please refer to retrieval.

Retrieval-augmented Captioning

To train a retrieval-augmented egocentric video captioning model, please refer to captioning.

Citation

If this work is helpful for your research, please consider citing us.

```bibtex
@article{xu2024retrieval,
  title={Retrieval-augmented egocentric video captioning},
  author={Xu, Jilan and Huang, Yifei and Hou, Junlin and Chen, Guo and Zhang, Yuejie and Feng, Rui and Xie, Weidi},
  journal={arXiv preprint arXiv:2401.00789},
  year={2024}
}
```

License

This project is released under the MIT License.

Acknowledgements

This project is built upon LaViLA and Otter. Thanks to the contributors of these great codebases.