The Naturalistic Driving Action Recognition track of the AI City Challenge aims to temporally localize driver actions given multi-view video streams. Our system, Stargazer, achieves second place on the public leaderboard and third place in the final test. It is built on the improved Multiscale Vision Transformer (MViTv2) with large-scale pretraining on the Kinetics-700 dataset. Our CVPR workshop paper detailing the design is here.
If you find this code useful in your research, please cite:
```
@inproceedings{liang2022stargazer,
  title={Stargazer: A transformer-based driver action detection system for intelligent transportation},
  author={Liang, Junwei and Zhu, He and Zhang, Enwei and Zhang, Jun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={3160--3167},
  year={2022}
}
```
- ffmpeg >= 3.4 for cutting the videos into clips for training.
- python 3.8, tqdm, decord, opencv, pyav, pytorch>=1.9.0, fairscale
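For reference, a minimal environment-setup sketch matching the requirement list above, assuming a conda environment and the usual PyPI package names (`opencv-python` for opencv and `av` for pyav); exact versions may need adjusting:

```bash
# Hypothetical setup; adjust package versions to your CUDA/driver setup.
conda create -n stargazer python=3.8 -y
conda activate stargazer
pip install "torch>=1.9.0" tqdm decord opencv-python av fairscale
```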
- Put all videos into a single path under `data/A1_A2_videos/`. There should be 60 ".MP4" files under this directory.
- Download our processed annotations from here. This annotation simply re-formats the original annotation. Put the file under `data/annotations/`.
- Generate files for training on A1 videos:
  - Get the processed annotations and the video-cutting commands:
    ```
    $ python scripts/aicity_convert_anno.py data/annotations/annotation_A1.edited.csv \
        data/A1_A2_videos/ data/annotations/processed_anno_original.csv \
        A1_cut.sh data/A1_clips/ --resolution=-2:540
    ```
    The `processed_anno_original.csv` should have 1115 lines.
  - Cut the videos with `parallel` (you can also run the script directly with `bash A1_cut.sh`):
    ```
    $ mkdir data/A1_clips
    $ parallel -j 4 < A1_cut.sh
    ```
  - Make annotation splits (without empty segments; see the paper for details):
    ```
    $ python scripts/aicity_split_anno.py data/annotations/processed_anno_original.csv \
        data/annotations/pyslowfast_anno_na0 --method 1
    ```
  - Make annotation splits (with empty segments):
    ```
    $ python scripts/aicity_split_anno.py data/annotations/processed_anno_original.csv \
        data/annotations/pyslowfast_anno_naempty0 --method 2
    ```
  - Make annotation files for training on the whole A1 set:
    ```
    $ mkdir data/annotations/pyslowfast_anno_na0/full
    $ cat data/annotations/pyslowfast_anno_na0/splits_1/train.csv \
        data/annotations/pyslowfast_anno_na0/splits_1/val.csv \
        > data/annotations/pyslowfast_anno_na0/full/train.csv
    $ cp data/annotations/pyslowfast_anno_na0/splits_1/val.csv \
        data/annotations/pyslowfast_anno_na0/full/
    ```
    A quick sanity check of the prepared data is sketched after this list.
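A minimal sanity-check sketch for the prepared data, assuming the layout above (the expected counts come from the steps in this list):

```bash
# Expect 60 source videos and 1115 lines in the processed annotation file.
ls data/A1_A2_videos/*.MP4 | wc -l
wc -l data/annotations/processed_anno_original.csv
# The cut clips and the concatenated full-A1 training list should also exist.
ls data/A1_clips/ | head
wc -l data/annotations/pyslowfast_anno_na0/full/train.csv
```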
Download the pre-trained K700 checkpoint from here. Put `k700_train_mvitV2_full_16x4_fromscratch_e200_448.pyth` under `models/`. This model achieves 71.91 top-1 accuracy on the Kinetics-700 validation set.
Train using the 16x4, 448-crop K700-pretrained model on A1 videos for 200 epochs, as in the paper. Here we test it on a machine with 3 GPUs (11 GB of memory per GPU); the code base also supports multi-machine training.

First, we need to add the code root path to `PYTHONPATH`:

```
$ export PYTHONPATH=$PWD/:$PYTHONPATH;
```

Then remove `Dashboard_User_id_24026_NoAudio_3.24026.533.535.MP4` from `data/annotations/pyslowfast_anno_na0/full/train.csv`, for example with `grep -v` as sketched below.
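One way to drop that entry (a sketch; adjust if your paths differ):

```bash
# Filter the problematic video out of the full-A1 training list.
grep -v "Dashboard_User_id_24026_NoAudio_3.24026.533.535.MP4" \
    data/annotations/pyslowfast_anno_na0/full/train.csv > train.csv.tmp \
  && mv train.csv.tmp data/annotations/pyslowfast_anno_na0/full/train.csv
```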
Train:

```
$ mkdir -p exps/aicity_train
$ cd exps/aicity_train
$ python ../../tools/run_net.py --cfg ../../configs/MVITV2_FULL_B_16x4_CONV_448.yaml \
    TRAIN.CHECKPOINT_FILE_PATH ../../models/k700_train_mvitV2_full_16x4_fromscratch_e200_448.pyth \
    DATA.PATH_PREFIX ../../data/A1_clips \
    DATA.PATH_TO_DATA_DIR ../../data/annotations/pyslowfast_anno_na0/full \
    TRAIN.ENABLE True TRAIN.BATCH_SIZE 3 NUM_GPUS 3 TEST.BATCH_SIZE 3 TEST.ENABLE False \
    DATA_LOADER.NUM_WORKERS 8 SOLVER.BASE_LR 0.000005 SOLVER.WARMUP_START_LR 1e-7 \
    SOLVER.WARMUP_EPOCHS 30.0 SOLVER.COSINE_END_LR 1e-7 SOLVER.MAX_EPOCH 200 LOG_PERIOD 1000 \
    TRAIN.CHECKPOINT_PERIOD 100 TRAIN.EVAL_PERIOD 200 USE_TQDM True \
    DATA.DECODING_BACKEND decord DATA.TRAIN_CROP_SIZE 448 DATA.TEST_CROP_SIZE 448 \
    TRAIN.AUTO_RESUME True TRAIN.CHECKPOINT_EPOCH_RESET True \
    TRAIN.MIXED_PRECISION False MODEL.ACT_CHECKPOINT True \
    TENSORBOARD.ENABLE False TENSORBOARD.LOG_DIR tb_log \
    MIXUP.ENABLE False MODEL.LOSS_FUNC cross_entropy \
    MODEL.DROPOUT_RATE 0.5 MVIT.DROPPATH_RATE 0.4 \
    SOLVER.OPTIMIZING_METHOD adamw
```
The model that ranks No. 2 on the leaderboard was trained on 2x8 A100 GPUs with a global batch size of 64 and a learning rate of 1e-4 (also with gradient checkpointing but no mixed-precision training). So for this 3-GPU run, we use a batch size of 3 and a learning rate of 0.000005, following the linear scaling rule. However, to reproduce our results, a similarly large global batch size is recommended.
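For reference, the linear scaling rule above simply scales the learning rate with the global batch size; a quick sketch of the arithmetic:

```bash
# lr = base_lr * (global batch / base batch); base_lr = 1e-4 at global batch 64, here batch = 3.
python -c "print(1e-4 * 3 / 64)"   # ~4.69e-06, rounded to 5e-6 (0.000005) above
```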
To run this code on multiple machines with PyTorch DDP, add `--init_method "tcp://${MAIN_IP}:${PORT}" --num_shards ${NUM_MACHINE} --shard_id ${INDEX}` to the command. `${MAIN_IP}` is the IP of the root node and `${INDEX}` is the node's index (for example, with two machines, the root node uses `--shard_id 0` and the other node uses `--shard_id 1`).
To get the submission file for a test dataset, we need the model, the threshold file, the videos, and the `video_ids.csv`.
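Before running, it may help to check that the inputs the commands below expect are in place (a sketch; the file names are taken from those commands):

```bash
# Model checkpoint, thresholds, test videos, video name list, and video_ids.csv.
ls models/aicity_train_mvitV2_16x4_fromk700_e200_lr0.0001_yeswarmup_nomixup_dp0.5_dpr0.4_adamw_na0_full_448.pyth
ls thresholds/public_leaderboard_thres.txt
ls data/A1_A2_videos/ A2_videos.lst A2_video_ids.csv
```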
- Get the model. Follow the training process above or download our checkpoint from here. Put the model under `models/`. This is the model that achieves No. 2 on the A2 leaderboard.
- Get the thresholds. Put them under `thresholds/`.
- Run sliding-window classification (single GPU). Given a list of video names and the path to the videos, run the model. The 16x4, 448 model with `batch_size=1` takes 5 GB of GPU memory to run.

  ```
  # cd back to the root path
  $ python scripts/run_action_classification_temporal_inf.py A2_videos.lst data/A1_A2_videos/ \
      models/aicity_train_mvitV2_16x4_fromk700_e200_lr0.0001_yeswarmup_nomixup_dp0.5_dpr0.4_adamw_na0_full_448.pyth \
      test/16x4_s16_448_full_na0_A2test \
      --model_dataset aicity --frame_length 16 --frame_stride 4 --proposal_length 64 \
      --proposal_stride 16 --video_fps 30.0 --frame_size 448 \
      --pyslowfast_cfg configs/Aicity/MVITV2_FULL_B_16x4_CONV_448.yaml \
      --batch_size 1 --num_cpu_workers 4
  ```
- Run post-processing with the given threshold file to get the submission file:

  ```
  $ python scripts/aicity_inf.py test/16x4_s16_448_full_na0_A2test thresholds/public_leaderboard_thres.txt \
      A2_video_ids.csv test/16x4_s16_448_full_na0_A2test.txt --agg_method avg \
      --chunk_sort_base_single_vid score --chunk_sort_base_multi_vid length --use_num_chunk 1
  ```

  The submission file is `test/16x4_s16_448_full_na0_A2test.txt`. This should get F1=0.3295 on the A2 test, matching the public leaderboard.
This code base is heavily based on the PySlowFast code base.