We provide off-the-shelf scripts in the `scripts` folder.
| Cache of pretrained weight | Baidu Yun | Google Cloud | Peking University Yun |
|---|---|---|---|
| Large | Link | Link | Link |
| Huge | Link | - | Link |
For example, to train LanguageBind on Depth-Language with 8 GPUs (1 node x 8 GPUs):
- First, download the cache of pretrained weights above and specify `CACHE_DIR=path/to/LanguageBind`.
- Second, set the paths to `ANNOTATION` and `DATA` here according to the dataset preparation.
- Then you can run:
```bash
CACHE_DIR="/path/to/LanguageBind"
ANNOTATION="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nnodes=1 --nproc_per_node 8 \
-m main \
--train-data ${ANNOTATION} \
--train-num-samples 3020000 \
--clip-type "dl" --max-depth 10 \
--do_train \
--lock-text --lock-image --text-type "polish_mplug" \
--init-temp 0.07 --learn-temp \
--model "ViT-L-14" --cache-dir ${CACHE_DIR} \
--convert_to_lora --lora_r 2 \
--lr 5e-4 --coef-lr 1e-3 \
--beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
--num-frames 1 --force-patch-dropout 0.5 \
--epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
--precision "amp" --workers 10 --video-decode-backend "imgs" \
--save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume "latest" \
--do_eval \
--val_d_cls_data "NYUV2"
```
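The command above is a single-node launch (`--nnodes=1`). If you want to run the same job across several machines, torchrun's standard multi-node flags can be used. The sketch below is only an illustration, not a configuration shipped with this repo: `MASTER_ADDR`, `MASTER_PORT`, and `NODE_RANK` are assumptions you must replace with your own cluster's values, and the training arguments are simply the ones from the command above collected into a bash array.

```bash
# Sketch of a 2-node x 8-GPU launch; run once per node with the matching NODE_RANK.
MASTER_ADDR="10.0.0.1"   # assumption: address of the rank-0 node
MASTER_PORT=29500        # assumption: any free TCP port on the rank-0 node
NODE_RANK=0              # 0 on the first node, 1 on the second

# Same training arguments as the single-node command above.
ARGS=(
  --train-data "${ANNOTATION}" --train-num-samples 3020000
  --clip-type "dl" --max-depth 10 --do_train
  --lock-text --lock-image --text-type "polish_mplug"
  --init-temp 0.07 --learn-temp
  --model "ViT-L-14" --cache-dir "${CACHE_DIR}"
  --convert_to_lora --lora_r 2
  --lr 5e-4 --coef-lr 1e-3
  --beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6
  --num-frames 1 --force-patch-dropout 0.5
  --epochs 1 --batch-size 128 --accum-freq 1 --warmup 200
  --precision "amp" --workers 10 --video-decode-backend "imgs"
  --save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume "latest"
  --do_eval --val_d_cls_data "NYUV2"
)

TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
torchrun --nnodes=2 --nproc_per_node 8 \
  --node_rank "${NODE_RANK}" --master_addr "${MASTER_ADDR}" --master_port "${MASTER_PORT}" \
  -m main "${ARGS[@]}"
```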
For example, to validate LanguageBind on Depth-Language with 1 GPU:
- First, specify `RESUME`.
- Second, prepare the downstream dataset.
- Then you can run:
```bash
CACHE_DIR="/path/to/LanguageBind"
RESUME="thermal_language.pt"
ANNOTATION="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nproc_per_node 1 \
-m main \
--train-data ${ANNOTATION} \
--train-num-samples 3020000 \
--clip-type "dl" --max-depth 10 \
--lock-text --lock-image --text-type "polish_mplug" \
--init-temp 0.07 --learn-temp \
--model "ViT-L-14" --cache-dir ${CACHE_DIR} \
--convert_to_lora --lora_r 2 \
--lr 5e-4 --coef-lr 1e-3 \
--beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
--num-frames 1 --force-patch-dropout 0.5 \
--epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
--precision "amp" --workers 10 --video-decode-backend "imgs" \
--save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume ${RESUME} \
--do_eval \
--val_d_cls_data "NYUV2"
```
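Before launching, it can save a failed run to confirm that the paths above actually exist. The snippet below is just a convenience sketch using the same variables as the command above; it is not part of the repo's scripts.

```bash
# Pre-flight check: abort early if any required path is missing.
for p in "${CACHE_DIR}" "${ANNOTATION}" "${RESUME}"; do
  if [ ! -e "${p}" ]; then
    echo "Missing path: ${p}" >&2
    exit 1
  fi
done
echo "All paths found."
```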
The NYU V2 dataset is downloaded from this repo, and we reformat it to conform to the standard ImageNet format. We also provide the processed data as follows. Change the `data_root` here.
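Here "standard ImageNet format" means one sub-folder per class containing that class's images, as in the `nyuv2/data/val` part of the folder structure shown further below. A quick way to sanity-check a reformatted copy is to count images per class folder; the sketch below assumes `data_root` points at your local `nyuv2` directory (the path itself is a placeholder).

```bash
# Count images per class under the ImageNet-style validation split of NYU V2.
data_root="/path/to/downstream_datasets/Depth/nyuv2"   # assumption: your local path
for cls in "${data_root}"/data/val/*/; do
  printf "%-15s %s\n" "$(basename "${cls}")" "$(find "${cls}" -type f | wc -l)"
done
```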
Video datasets are downloaded from this repo, and we show the folder structure below. Change the `data_root` here.
Audio datasets are downloaded from this repo, and AudioSet from here. We reformat them to conform to the standard ImageNet format. Change the `data_root` here and here.
We download LLVIP from its official website and FLIR from here. We reformat them to conform to the standard ImageNet format (a reformatting sketch follows the table below). Change the `data_root` here. We also provide the processed data as follows.
| Datasets | Baidu Yun | Google Cloud | Peking University Yun |
|---|---|---|---|
| LLVIP | Link | Link | Link |
| FLIR V1 | Link | Link | Link |
| FLIR V2 | Link | Link | Link |
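If you reformat LLVIP/FLIR yourself instead of using the processed archives, the target layout is the same class-per-folder scheme shown in the folder structure below. The exact steps depend on each dataset's annotation format, so the sketch below only illustrates the final copy step; the `labels.txt` file (one `filename class` pair per line) and both paths are hypothetical placeholders, not files shipped with the datasets.

```bash
# Illustration only: build an ImageNet-style split from a flat image folder plus a
# hypothetical labels.txt ("image.jpg person" per line).
src_dir="/path/to/flat_images"                              # assumption
out_dir="/path/to/downstream_datasets/Thermal/llvip/val"    # assumption
while read -r img cls; do
  mkdir -p "${out_dir}/${cls}"
  cp "${src_dir}/${img}" "${out_dir}/${cls}/"
done < labels.txt
```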
```
downstream_datasets
├── Audio
│   ├── audiocaps
│   │   └── audio
│   │       ├── test
│   │       ├── train
│   │       └── val
│   ├── audioset
│   │   ├── balanced_train_segments
│   │   ├── eval_segments
│   │   └── unbalanced_train_segments
│   │       ├── unbalanced_train_segments_part00
│   │       ├── unbalanced_train_segments_part01
│   │       ├── ...
│   │       └── unbalanced_train_segments_part40
│   ├── clotho
│   │   ├── CLOTHO_retrieval_dataset
│   │   └── evaluation
│   ├── esc50
│   │   └── test
│   │       ├── airplane
│   │       ├── breathing
│   │       ├── ...
│   │       └── wind
│   ├── laionaudio
│   │   ├── audios
│   │   │   └── freesound_no_overlap
│   │   └── jsons
│   └── vggsound
│       └── test
│           ├── air\ conditioning\ noise
│           ├── air\ horn
│           ├── ...
│           └── zebra\ braying
├── Depth
│   └── nyuv2
│       └── data
│           └── val
│               ├── bathroom
│               ├── bedroom
│               ├── bookstore
│               ├── classroom
│               ├── dining_room
│               ├── home_office
│               ├── kitchen
│               ├── living_room
│               ├── office
│               └── others
├── Thermal
│   ├── flirv1
│   │   └── val
│   │       ├── bicycle
│   │       ├── car
│   │       ├── dog
│   │       └── person
│   ├── flirv2
│   │   └── val
│   │       ├── bike
│   │       ├── bus
│   │       ├── car
│   │       ├── hydrant
│   │       ├── light
│   │       ├── motor
│   │       ├── other\ vehicle
│   │       ├── person
│   │       ├── sign
│   │       ├── skateboard
│   │       ├── stroller
│   │       └── truck
│   └── llvip
│       ├── train
│       │   ├── background
│       │   └── person
│       └── val
│           ├── background
│           └── person
└── VideoTextRetrieval
    └── vtRetdata
        ├── ActivityNet
        │   └── Videos
        │       └── Activity_Videos
        ├── Didemo
        │   └── videos
        ├── MSRVTT
        │   └── MSRVTT_Videos
        └── MSVD
            └── MSVD_Videos
```
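After downloading and reformatting, a quick structural check can catch path mistakes before evaluation. The sketch below simply verifies that the top-level modality folders from the tree above are present; the dataset root path is a placeholder for wherever you keep `downstream_datasets`.

```bash
# Verify the expected top-level folders of downstream_datasets.
root="/path/to/downstream_datasets"   # assumption: your local dataset root
for d in Audio Depth Thermal VideoTextRetrieval; do
  if [ -d "${root}/${d}" ]; then
    echo "ok:      ${d}"
  else
    echo "missing: ${d}"
  fi
done
```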