Modified PySlowFast (with MViT v2) for AI City Challenge

Introduction

The AI City Challenge (CVPR 2022) Track 3, Naturalistic Driving Action Recognition, requires temporally localizing driver actions given multi-view video streams. Our system, Stargazer, achieves second place on the public leaderboard and third place in the final test. It is built on an improved multi-scale vision transformer (MViT v2) with large-scale pretraining on the Kinetics-700 dataset. Our CVPR workshop paper detailing the design is here.

Citations

If you find this code useful in your research, please cite:

@inproceedings{liang2022stargazer,
  title={Stargazer: A transformer-based driver action detection system for intelligent transportation},
  author={Liang, Junwei and Zhu, He and Zhang, Enwei and Zhang, Jun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={3160--3167},
  year={2022}
}

Requirements

  • ffmpeg >= 3.4 for cutting the videos into clips for training.
  • Python 3.8, tqdm, decord, opencv, pyav, pytorch>=1.9.0, fairscale
  • GNU parallel (optional, for running the generated video-cutting script in parallel)
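
A minimal environment setup might look like the following (a sketch; the exact PyPI package names and the CUDA build of PyTorch are assumptions that depend on your machine):

  # assuming a Python 3.8 environment is already active
  $ pip install "torch>=1.9.0" tqdm decord opencv-python av fairscale
  $ ffmpeg -version   # should report 3.4 or newer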

Data Preparation

  1. Put all videos into a single directory, data/A1_A2_videos/. There should be 60 ".MP4" files under this directory.

  2. Download our processed annotations from here. This file simply re-formats the original annotations. Put it under data/annotations/.

  3. Generate files for training on A1 videos.

    • Generate the processed annotation file and the video-cutting commands

      $ python scripts/aicity_convert_anno.py data/annotations/annotation_A1.edited.csv \
      data/A1_A2_videos/ data/annotations/processed_anno_original.csv \
      A1_cut.sh data/A1_clips/ --resolution=-2:540
      

      The processed_anno_original.csv should have 1115 lines (see the sanity-check sketch after this list).

    • Cut the videos (you can also run A1_cut.sh directly with bash instead of GNU parallel).

      $ mkdir data/A1_clips
      $ parallel -j 4 < A1_cut.sh
      
    • Make annotation splits (without empty segments, see paper for details)

      $ python scripts/aicity_split_anno.py data/annotations/processed_anno_original.csv \
      data/annotations/pyslowfast_anno_na0 --method 1
      
    • Make annotation splits (with empty segments)

      $ python scripts/aicity_split_anno.py data/annotations/processed_anno_original.csv \
      data/annotations/pyslowfast_anno_naempty0 --method 2
      
    • Make annotation files for training on the whole A1 set

      $ mkdir data/annotations/pyslowfast_anno_na0/full
      $ cat data/annotations/pyslowfast_anno_na0/splits_1/train.csv \
      data/annotations/pyslowfast_anno_na0/splits_1/val.csv \
      > data/annotations/pyslowfast_anno_na0/full/train.csv
      $ cp data/annotations/pyslowfast_anno_na0/splits_1/val.csv \
      data/annotations/pyslowfast_anno_na0/full/
      
    • Download the pre-trained K700 checkpoint from here. Put k700_train_mvitV2_full_16x4_fromscratch_e200_448.pyth under models/. This model achieves 71.91% top-1 accuracy on the Kinetics-700 validation set.
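
Before moving on to training, a quick sanity check on the outputs of the steps above can catch problems early (a minimal sketch; paths assume the default locations used in this section):

  $ ls data/A1_A2_videos/*.MP4 | wc -l                          # expect 60
  $ wc -l data/annotations/processed_anno_original.csv          # expect 1115
  $ ls data/A1_clips/ | wc -l                                   # should be non-zero after cutting
  $ wc -l data/annotations/pyslowfast_anno_na0/full/train.csv   # concatenated train + val annotations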

Training

Train using the 16x4, 448-crop, K700-pretrained model on the A1 videos for 200 epochs, as in the paper. Here we test it on a machine with 3 GPUs (11 GB memory per GPU). The code base supports multi-machine training as well.

First, add the repository root path to PYTHONPATH:

  $ export PYTHONPATH=$PWD/:$PYTHONPATH;

Remove Dashboard_User_id_24026_NoAudio_3.24026.533.535.MP4 from data/annotations/pyslowfast_anno_na0/full/train.csv.
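
One way to do this (a sketch; grep -vF simply drops every line that mentions that clip):

  $ grep -vF "Dashboard_User_id_24026_NoAudio_3.24026.533.535.MP4" \
    data/annotations/pyslowfast_anno_na0/full/train.csv > /tmp/train_filtered.csv \
    && mv /tmp/train_filtered.csv data/annotations/pyslowfast_anno_na0/full/train.csv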

Train:

  $ mkdir -p exps/aicity_train
  $ cd exps/aicity_train
  $ python ../../tools/run_net.py --cfg ../../configs/MVITV2_FULL_B_16x4_CONV_448.yaml \
  TRAIN.CHECKPOINT_FILE_PATH ../../models/k700_train_mvitV2_full_16x4_fromscratch_e200_448.pyth \
  DATA.PATH_PREFIX ../../data/A1_clips \
  DATA.PATH_TO_DATA_DIR ../../data/annotations/pyslowfast_anno_na0/full \
  TRAIN.ENABLE True TRAIN.BATCH_SIZE 3 NUM_GPUS 3 TEST.BATCH_SIZE 3 TEST.ENABLE False \
  DATA_LOADER.NUM_WORKERS 8 SOLVER.BASE_LR 0.000005 SOLVER.WARMUP_START_LR 1e-7 \
  SOLVER.WARMUP_EPOCHS 30.0 SOLVER.COSINE_END_LR 1e-7 SOLVER.MAX_EPOCH 200 LOG_PERIOD 1000 \
  TRAIN.CHECKPOINT_PERIOD 100 TRAIN.EVAL_PERIOD 200 USE_TQDM True \
  DATA.DECODING_BACKEND decord DATA.TRAIN_CROP_SIZE 448 DATA.TEST_CROP_SIZE 448 \
  TRAIN.AUTO_RESUME True TRAIN.CHECKPOINT_EPOCH_RESET True \
  TRAIN.MIXED_PRECISION False MODEL.ACT_CHECKPOINT True \
  TENSORBOARD.ENABLE False TENSORBOARD.LOG_DIR tb_log \
  MIXUP.ENABLE False MODEL.LOSS_FUNC cross_entropy \
  MODEL.DROPOUT_RATE 0.5 MVIT.DROPPATH_RATE 0.4 \
  SOLVER.OPTIMIZING_METHOD adamw

The model that ranks No. 2 on the leaderboard was trained on 2x8 A100 GPUs with a global batch size of 64 and a learning rate of 1e-4 (also with gradient checkpointing but without mixed-precision training). For the 3-GPU run above, we therefore use a batch size of 3 and a learning rate of 0.000005, following the linear scaling rule. To reproduce our results, however, a similarly large batch size is recommended.
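
The learning rate above follows the linear scaling rule, lr_new = lr_base * batch_new / batch_base; for a different number of GPUs or per-GPU batch size you can compute it the same way (a minimal sketch):

  # linear scaling rule: 1e-4 * 3 / 64 = 4.6875e-06, rounded to 5e-6 above
  $ python -c "print(1e-4 * 3 / 64)"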

To run this code on multiple machines with PyTorch DDP, add --init_method "tcp://${MAIN_IP}:${PORT}" --num_shards ${NUM_MACHINE} --shard_id ${INDEX} to the command. ${MAIN_IP} is the IP of the root node and ${INDEX} is the index of the current node.
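
For example, a two-machine run would launch the same training command on each node, changing only the shard index (a sketch; the IP, port, and node count are placeholders, and "..." stands for the rest of the options shown in the training command above):

  # on the root node (shard 0)
  $ python ../../tools/run_net.py --init_method "tcp://10.0.0.1:9999" --num_shards 2 --shard_id 0 \
    --cfg ../../configs/MVITV2_FULL_B_16x4_CONV_448.yaml ...
  # on the second node (shard 1)
  $ python ../../tools/run_net.py --init_method "tcp://10.0.0.1:9999" --num_shards 2 --shard_id 1 \
    --cfg ../../configs/MVITV2_FULL_B_16x4_CONV_448.yaml ...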

Inference

To generate a submission file for a test set, we need the model, a threshold file, the videos, and the video_ids.csv file.

  1. Get the model

    Follow the Training process above or download our checkpoint from here. Put the model under models/. This is the model that ranks No. 2 on the A2 leaderboard.

  2. Get the thresholds. Put them under thresholds/.

    • Best public leaderboard threshold from here. (Empirically searched)

    • Best general leaderboard threshold from here. (Grid searched)

    • A1 pyslowfast_anno_naempty0/splits_1 trained and empirically searched from here.

  3. Run sliding-window classification (single GPU).

    Given a list of video names (A2_videos.lst; a sketch for producing it is given after this list) and the path to the videos, run the model. The 16x4, 448 model with batch_size=1 takes about 5 GB of GPU memory.

     # cd back to the root path
     $ python scripts/run_action_classification_temporal_inf.py A2_videos.lst data/A1_A2_videos/ \
     models/aicity_train_mvitV2_16x4_fromk700_e200_lr0.0001_yeswarmup_nomixup_dp0.5_dpr0.4_adamw_na0_full_448.pyth \
     test/16x4_s16_448_full_na0_A2test \
     --model_dataset aicity --frame_length 16 --frame_stride 4 --proposal_length 64 \
     --proposal_stride 16 --video_fps 30.0  --frame_size 448 \
     --pyslowfast_cfg configs/Aicity/MVITV2_FULL_B_16x4_CONV_448.yaml \
     --batch_size 1 --num_cpu_workers 4
    
  4. Run post-processing with the given threshold file to get the submission file.

    $ python scripts/aicity_inf.py test/16x4_s16_448_full_na0_A2test thresholds/public_leaderboard_thres.txt \
    A2_video_ids.csv test/16x4_s16_448_full_na0_A2test.txt --agg_method avg \
    --chunk_sort_base_single_vid score --chunk_sort_base_multi_vid length --use_num_chunk 1
    

    The submission file is test/16x4_s16_448_full_na0_A2test.txt. It should reach F1=0.3295 on the A2 test set, matching the leaderboard.
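
A2_videos.lst in step 3 is a plain list of test video file names. Assuming it is one file name per line (an assumption; check scripts/run_action_classification_temporal_inf.py for the exact format it expects), it could be produced with something like:

  # assumption: one ".MP4" file name per line, A2 test videos only
  $ ls data/A1_A2_videos/ | grep "\.MP4$" > A2_videos.lst
  # if the A1 training videos share this directory, remove their entries from the list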

Acknowledgement

This code base is heavily adapted from the PySlowFast code base.
