This is an official implementation of the vision-language navigation (VLN) task in InternVideo.
We currently provide evaluation of our pretrained model.
- Please follow https://github.com/jacobkrantz/VLN-CE to install Habitat Simulator and Habitat-lab. We use Python 3.6 in our experiments.
- Follow https://github.com/openai/CLIP to install CLIP.
- Follow https://github.com/jacobkrantz/VLN-CE to download the Matterport3D environment to `data/scene_datasets`. Data should have the form `data/scene_datasets/mp3d/{scene}/{scene}.glb`.
- Download the preprocessed VLN-CE dataset from https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/vln/dataset.zip to `data/datasets`. Data should have the form `data/datasets/R2R_VLNCE_v1-2_preprocessed_BERTidx/{split}` and `data/datasets/R2R_VLNCE_v1-2_preprocessed/{split}`.
- Download pretrained models from https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/vln/pretrained.zip to `pretrained`. It should contain 6 folders, including `pretrained/pretrained_models`, `pretrained/VideoMAE`, `pretrained/wp_pred`, `pretrained/ddppo-models`, and `pretrained/Prevalent` (a quick way to verify this layout is sketched below).
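
Before launching anything, it can help to confirm the downloads above landed where the scripts expect them. The following is a minimal shell sketch (not part of the provided scripts) that only checks the top-level directories listed above; the `{scene}` and `{split}` subfolders are not enumerated.

```bash
# Optional sanity check: confirm the expected data/pretrained directories exist.
# Paths are taken from the setup list above; adjust if your layout differs.
for d in \
    data/scene_datasets/mp3d \
    data/datasets/R2R_VLNCE_v1-2_preprocessed_BERTidx \
    data/datasets/R2R_VLNCE_v1-2_preprocessed \
    pretrained/pretrained_models \
    pretrained/VideoMAE \
    pretrained/wp_pred \
    pretrained/ddppo-models \
    pretrained/Prevalent
do
    [ -d "$d" ] && echo "found:   $d" || echo "missing: $d"
done
```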
Simply run `bash eval_**.sh` to start evaluating the agent. Run `bash train.bash` to start training (6 GPUs).
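
If you need to pin the training run to specific devices, a minimal sketch is below. It assumes `train.bash` simply launches on whatever GPUs are visible and therefore respects the standard `CUDA_VISIBLE_DEVICES` variable; this is an assumption about the script, not documented behavior.

```bash
# Assumption: train.bash uses all visible GPUs, so restricting
# CUDA_VISIBLE_DEVICES selects which 6 devices are used for training.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5
bash train.bash
```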