We introduce the Egocentric Video Understanding Dataset (EVUD), an instruction-tuning dataset for training VLMs on video captioning and question answering tasks specific to egocentric videos.
- The AlanaVLM paper is now on arXiv!
- All the checkpoints developed for this project are available on Hugging Face
- The EVUD dataset is available on Hugging Face
Create and activate a virtual environment, then install the requirements:

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
Together with our generated data released on Hugging Face, we also release all the scripts needed to reproduce our data generation pipeline:
The generated data follows the LLaVA JSON format.
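As a rough illustration, a single LLaVA-style instruction record pairs a media file with a list of human/assistant conversation turns. The sketch below is only indicative: the exact field names and media token used in the released EVUD files (e.g. `"video"` vs `"image"`, `<video>` vs `<image>`) may differ, and the file path and text shown are made up for the example.

```python
import json

# Illustrative sketch of one LLaVA-style record; field names and the
# media placeholder token are assumptions, not the exact EVUD schema.
example = {
    "id": "evud_000001",                     # unique sample identifier
    "video": "videos/clip_000001.mp4",       # path to the source egocentric clip
    "conversations": [
        {"from": "human", "value": "<video>\nWhat is the camera wearer doing?"},
        {"from": "gpt", "value": "They are slicing vegetables on a cutting board."},
    ],
}

# LLaVA-style training files are typically a JSON list of such records.
with open("example.json", "w") as f:
    json.dump([example], f, indent=2)
```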