Egoinstructor

Official PyTorch implementation of Egoinstructor, presented at CVPR 2024.

Retrieval-Augmented Egocentric Video Captioning
Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

Paper | Project Page

Given an egocentric video, Egoinstructor automatically retrieves semantically relevant instructional videos (e.g., from HowTo100M) via a pretrained cross-view retrieval model, then leverages their visual and textual information to generate a caption for the egocentric video.
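The retrieval step above can be sketched as nearest-neighbor search over video embeddings by cosine similarity. This is a minimal illustration only, not the repository's actual API: the function name `retrieve_topk` and the toy random embeddings are hypothetical stand-ins for the pretrained cross-view retrieval model's outputs.

```python
import numpy as np

def retrieve_topk(ego_emb, exo_embs, k=3):
    """Return indices of the k exocentric (instructional) videos whose
    embeddings are most similar to the egocentric query, by cosine similarity."""
    ego = ego_emb / np.linalg.norm(ego_emb)
    exo = exo_embs / np.linalg.norm(exo_embs, axis=1, keepdims=True)
    sims = exo @ ego                      # cosine similarity of each candidate
    return np.argsort(-sims)[:k]          # indices sorted by descending similarity

# Toy example: 100 random 256-d embeddings stand in for a HowTo100M index.
rng = np.random.default_rng(0)
exo_embs = rng.standard_normal((100, 256))
# The query is a slightly perturbed copy of candidate 42, so 42 should rank first.
ego_emb = exo_embs[42] + 0.01 * rng.standard_normal(256)
top = retrieve_topk(ego_emb, exo_embs, k=3)
print(top[0])  # → 42
```

In the actual pipeline, the captions of the retrieved videos would then be passed as additional context to the captioning model.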

Roadmap

- Retrieval code and data released
- Captioning code and data released
- Online Demo
- Pre-trained retrieval checkpoints
- Pre-trained captioning checkpoints

Prepare environment

Please refer to env.md

Cross-view Retrieval Module

To train an ego-exo cross-view retrieval module, please refer to retrieval.

Retrieval-augmented Captioning

To train a retrieval-augmented egocentric video captioning model, please refer to captioning.

Citation

If this work is helpful for your research, please consider citing us.

```bibtex
@article{xu2024retrieval,
  title={Retrieval-augmented egocentric video captioning},
  author={Xu, Jilan and Huang, Yifei and Hou, Junlin and Chen, Guo and Zhang, Yuejie and Feng, Rui and Xie, Weidi},
  journal={arXiv preprint arXiv:2401.00789},
  year={2024}
}
```

License

This project is released under the MIT License.

Acknowledgements

This project is built upon LaViLA and Otter. Thanks to the contributors of these great codebases.