The Naturalistic Driving Action Recognition track of the AI City Challenge aims to temporally localize driver actions given multi-view video streams. Our system, Stargazer, achieves second place on the public leaderboard and third place in the final test. It is built on the improved Multiscale Vision Transformer (MViTv2) with large-scale pretraining on the Kinetics-700 dataset. Our CVPR workshop paper detailing the design is here.
If you find this code useful in your research, please cite:
```
@inproceedings{liang2022stargazer,
  title={Stargazer: A transformer-based driver action detection system for intelligent transportation},
  author={Liang, Junwei and Zhu, He and Zhang, Enwei and Zhang, Jun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={3160--3167},
  year={2022}
}
```
- ffmpeg >= 3.4 for cutting the videos into clips for training.
- python 3.8, tqdm, decord, opencv, pyav, pytorch>=1.9.0, fairscale
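For reference, a minimal environment-setup sketch matching the requirement list above, assuming a conda environment and the usual PyPI package names (`opencv-python` for opencv and `av` for pyav); exact versions may need adjusting:

```bash
# Hypothetical setup; adjust package versions to your CUDA/driver setup.
conda create -n stargazer python=3.8 -y
conda activate stargazer
pip install "torch>=1.9.0" tqdm decord opencv-python av fairscale
```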
- Put all videos into a single path under `data/A1_A2_videos/`. There should be 60 ".MP4" files under this directory.
- Download our processed annotations from here. This annotation simply re-formats the original annotation. Put the file under `data/annotations/`.
- Generate files for training on A1 videos:
  - Get the processed annotations and the video-cutting commands:
    ```
    $ python scripts/aicity_convert_anno.py data/annotations/annotation_A1.edited.csv \
        data/A1_A2_videos/ data/annotations/processed_anno_original.csv \
        A1_cut.sh data/A1_clips/ --resolution=-2:540
    ```
    The `processed_anno_original.csv` should have 1115 lines.
  - Cut the videos with `parallel` (you can also run the script directly with `bash A1_cut.sh`):
    ```
    $ mkdir data/A1_clips
    $ parallel -j 4 < A1_cut.sh
    ```
  - Make annotation splits (without empty segments; see the paper for details):
    ```
    $ python scripts/aicity_split_anno.py data/annotations/processed_anno_original.csv \
        data/annotations/pyslowfast_anno_na0 --method 1
    ```
  - Make annotation splits (with empty segments):
    ```
    $ python scripts/aicity_split_anno.py data/annotations/processed_anno_original.csv \
        data/annotations/pyslowfast_anno_naempty0 --method 2
    ```
  - Make annotation files for training on the whole A1 set:
    ```
    $ mkdir data/annotations/pyslowfast_anno_na0/full
    $ cat data/annotations/pyslowfast_anno_na0/splits_1/train.csv \
        data/annotations/pyslowfast_anno_na0/splits_1/val.csv \
        > data/annotations/pyslowfast_anno_na0/full/train.csv
    $ cp data/annotations/pyslowfast_anno_na0/splits_1/val.csv \
        data/annotations/pyslowfast_anno_na0/full/
    ```
    A quick sanity check of the prepared data is sketched after this list.
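A minimal sanity-check sketch for the prepared data, assuming the layout above (the expected counts come from the steps in this list):

```bash
# Expect 60 source videos and 1115 lines in the processed annotation file.
ls data/A1_A2_videos/*.MP4 | wc -l
wc -l data/annotations/processed_anno_original.csv
# The cut clips and the concatenated full-A1 training list should also exist.
ls data/A1_clips/ | head
wc -l data/annotations/pyslowfast_anno_na0/full/train.csv
```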
Download the pre-trained K700 checkpoint from here. Put `k700_train_mvitV2_full_16x4_fromscratch_e200_448.pyth` under `models/`. This model achieves 71.91 top-1 accuracy on the Kinetics-700 validation set.
Train using the 16x4, 448-crop K700-pretrained model on A1 videos for 200 epochs, as in the paper. Here we test it on a machine with 3 GPUs (11 GB of memory per GPU); the code base also supports multi-machine training.

First, we need to add the code root path to `PYTHONPATH`:

```
$ export PYTHONPATH=$PWD/:$PYTHONPATH;
```

Then remove `Dashboard_User_id_24026_NoAudio_3.24026.533.535.MP4` from `data/annotations/pyslowfast_anno_na0/full/train.csv`, for example with `grep -v` as sketched below.
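One way to drop that entry (a sketch; adjust if your paths differ):

```bash
# Filter the problematic video out of the full-A1 training list.
grep -v "Dashboard_User_id_24026_NoAudio_3.24026.533.535.MP4" \
    data/annotations/pyslowfast_anno_na0/full/train.csv > train.csv.tmp \
  && mv train.csv.tmp data/annotations/pyslowfast_anno_na0/full/train.csv
```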
Train:

```
$ mkdir -p exps/aicity_train
$ cd exps/aicity_train
$ python ../../tools/run_net.py --cfg ../../configs/MVITV2_FULL_B_16x4_CONV_448.yaml \
    TRAIN.CHECKPOINT_FILE_PATH ../../models/k700_train_mvitV2_full_16x4_fromscratch_e200_448.pyth \
    DATA.PATH_PREFIX ../../data/A1_clips \
    DATA.PATH_TO_DATA_DIR ../../data/annotations/pyslowfast_anno_na0/full \
    TRAIN.ENABLE True TRAIN.BATCH_SIZE 3 NUM_GPUS 3 TEST.BATCH_SIZE 3 TEST.ENABLE False \
    DATA_LOADER.NUM_WORKERS 8 SOLVER.BASE_LR 0.000005 SOLVER.WARMUP_START_LR 1e-7 \
    SOLVER.WARMUP_EPOCHS 30.0 SOLVER.COSINE_END_LR 1e-7 SOLVER.MAX_EPOCH 200 LOG_PERIOD 1000 \
    TRAIN.CHECKPOINT_PERIOD 100 TRAIN.EVAL_PERIOD 200 USE_TQDM True \
    DATA.DECODING_BACKEND decord DATA.TRAIN_CROP_SIZE 448 DATA.TEST_CROP_SIZE 448 \
    TRAIN.AUTO_RESUME True TRAIN.CHECKPOINT_EPOCH_RESET True \
    TRAIN.MIXED_PRECISION False MODEL.ACT_CHECKPOINT True \
    TENSORBOARD.ENABLE False TENSORBOARD.LOG_DIR tb_log \
    MIXUP.ENABLE False MODEL.LOSS_FUNC cross_entropy \
    MODEL.DROPOUT_RATE 0.5 MVIT.DROPPATH_RATE 0.4 \
    SOLVER.OPTIMIZING_METHOD adamw
```
The model that ranks No. 2 on the leaderboard was trained on 2x8 A100 GPUs with a global batch size of 64 and a learning rate of 1e-4 (also with gradient checkpointing but no mixed-precision training). So for this 3-GPU run, we use a batch size of 3 and a learning rate of 0.000005, following the linear scaling rule. However, to reproduce our results, a similarly large global batch size is recommended.
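For reference, the linear scaling rule above simply scales the learning rate with the global batch size; a quick sketch of the arithmetic:

```bash
# lr = base_lr * (global batch / base batch); base_lr = 1e-4 at global batch 64, here batch = 3.
python -c "print(1e-4 * 3 / 64)"   # ~4.69e-06, rounded to 5e-6 (0.000005) above
```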
To run this code on multiple machines with PyTorch DDP, add `--init_method "tcp://${MAIN_IP}:${PORT}" --num_shards ${NUM_MACHINE} --shard_id ${INDEX}` to the command. `${MAIN_IP}` is the IP of the root node and `${INDEX}` is the node's index (for example, with two machines, the root node uses `--shard_id 0` and the other node uses `--shard_id 1`).
To get the submission file for a test dataset, we need the model, the threshold file, the videos, and the `video_ids.csv`.
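Before running, it may help to check that the inputs the commands below expect are in place (a sketch; the file names are taken from those commands):

```bash
# Model checkpoint, thresholds, test videos, video name list, and video_ids.csv.
ls models/aicity_train_mvitV2_16x4_fromk700_e200_lr0.0001_yeswarmup_nomixup_dp0.5_dpr0.4_adamw_na0_full_448.pyth
ls thresholds/public_leaderboard_thres.txt
ls data/A1_A2_videos/ A2_videos.lst A2_video_ids.csv
```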
- Get the model. Follow the training process above or download our checkpoint from here. Put the model under `models/`. This is the model that achieves No. 2 on the A2 leaderboard.
- Get the thresholds. Put them under `thresholds/`.
- Run sliding-window classification (single GPU). Given a list of video names and the path to the videos, run the model. The 16x4, 448 model with `batch_size=1` takes 5 GB of GPU memory to run.

  ```
  # cd back to the root path
  $ python scripts/run_action_classification_temporal_inf.py A2_videos.lst data/A1_A2_videos/ \
      models/aicity_train_mvitV2_16x4_fromk700_e200_lr0.0001_yeswarmup_nomixup_dp0.5_dpr0.4_adamw_na0_full_448.pyth \
      test/16x4_s16_448_full_na0_A2test \
      --model_dataset aicity --frame_length 16 --frame_stride 4 --proposal_length 64 \
      --proposal_stride 16 --video_fps 30.0 --frame_size 448 \
      --pyslowfast_cfg configs/Aicity/MVITV2_FULL_B_16x4_CONV_448.yaml \
      --batch_size 1 --num_cpu_workers 4
  ```
- Run post-processing with the given threshold file to get the submission file:

  ```
  $ python scripts/aicity_inf.py test/16x4_s16_448_full_na0_A2test thresholds/public_leaderboard_thres.txt \
      A2_video_ids.csv test/16x4_s16_448_full_na0_A2test.txt --agg_method avg \
      --chunk_sort_base_single_vid score --chunk_sort_base_multi_vid length --use_num_chunk 1
  ```

  The submission file is `test/16x4_s16_448_full_na0_A2test.txt`. This should get F1=0.3295 on the A2 test, matching the public leaderboard.
This code base is heavily based on the PySlowFast code base.