# Experiments on Inference Speed

## Experiment 1: All runs produce almost exactly the same outputs

- Machine 1: 2x GTX 1080 Ti, i7, NVMe
- Machine 2: 3x GTX 1080 Ti + 1x TITAN X, E5, NVMe
- Machine 3: 4x RTX 2080 Ti, i9-9900X, SSD

Each run is conducted with no other programs running, except for the runs marked with * (see below).

### Machine 1

| RunType | # Images | Image Size | # GPUs Used | Runtime (s) | GPU Average Utilization | Per-GPU FPS |
|---------|----------|------------|-------------|-------------|-------------------------|-------------|
| 1 | 206268 | 1920x1080 | 2 | 41190.9 | 65.3% | 2.50 |
| 2 | 206268 | 1920x1080 | 2 | 37920.9 | 62.0% | 2.72 |
| 3 | 206268 | 1920x1080 | 2 | 31494.2 | 54.8% | 3.27 |
| 2 | 206268 | 1920x1080 | 1 | 75652.2 | 70.1% | 2.73 |
| 3 | 206268 | 1920x1080 | 1 | 57484.9 | 68.8% | 3.59 |

### Machine 2

| RunType | # Images | Image Size | # GPUs Used | Runtime (s) | GPU Average Utilization | Per-GPU FPS |
|---------|----------|------------|-------------|-------------|-------------------------|-------------|
| 2 | 206268 | 1920x1080 | 4 | 29354.4 | 33.8% | 1.76 |
| 3 | 206268 | 1920x1080 | 4 | 26454.0 | 23.5% | 1.95 |
| 3 | 206268 | 1920x1080 | 2 | 35975.7 | 38.0% | 2.87 |
| 2 | 206268 | 1920x1080 | 1 | 74065.0 | 41.4% | 2.78 |
| 3 | 206268 | 1920x1080 | 1 | 57880.8 | 54.8% | 3.56 |
| 2 | 206268 | 1920x1080 | 4 / 1* | 17556.3 | 46.2% | 2.94 |
| 3 | 206268 | 1920x1080 | 4 / 1* | 14552.5 | 52.3% | 3.54 |
| 5 | 206268 | 1920x1080 | 4 / 1* | 17027.5 | 53.2% | 3.03 |
| 6 | 206268 | 1920x1080 | 4 / 1* | 13433.3 | 61.7% | 3.84 |

### Machine 3

| RunType | # Images | Image Size | # GPUs Used | Runtime (s) | GPU Average Utilization | Per-GPU FPS |
|---------|----------|------------|-------------|-------------|-------------------------|-------------|
| 2 | 206268 | 1920x1080 | 1 | 57834.6 | 61.2% | 3.57 |
| 3 | 206268 | 1920x1080 | 1 | 43510.9 | 61.2% | 4.74 |
| 2 | 206268 | 1920x1080 | 4 / 1* | 14143.3 | 62.6% | 3.65 |
| 3 | 206268 | 1920x1080 | 4 / 1* | 10676.3 | 65.2% | 4.83 |

TODO: Add an input queue mechanism to improve GPU utilization.
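The missing input queue is essentially a producer-consumer setup: a background thread keeps decoding and preprocessing images on the CPU while the GPU runs the network. Below is a minimal sketch of that idea; `load_and_preprocess` and `run_inference` are hypothetical stand-ins, not functions from this repo.

```python
# Sketch only: a bounded queue decouples CPU-side image loading from GPU-side
# inference. `load_and_preprocess` and `run_inference` are hypothetical helpers.
import queue
import threading

def producer(image_paths, q, batch_size):
    batch = []
    for path in image_paths:
        batch.append(load_and_preprocess(path))  # CPU work (decode, resize)
        if len(batch) == batch_size:
            q.put(batch)  # blocks when the queue is full, providing backpressure
            batch = []
    if batch:
        q.put(batch)
    q.put(None)  # sentinel: no more data

def run_with_prefetch(image_paths, batch_size=8, prefetch=4):
    q = queue.Queue(maxsize=prefetch)
    t = threading.Thread(target=producer, args=(image_paths, q, batch_size))
    t.start()
    while True:
        batch = q.get()
        if batch is None:
            break
        run_inference(batch)  # GPU work, e.g. a sess.run(...) call
    t.join()
```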

| RunType | Setup |
|---------|-------|
| 1 | TF 1.10 (CUDA 9, cuDNN 7.1), variable model |
| 2 | TF 1.13 (CUDA 10.0, cuDNN 7.4), variable model |
| 3 | TF 1.13 (CUDA 10.0, cuDNN 7.4), frozen graph (.pb) |
| 4 | TF 1.13 (CUDA 10.0, cuDNN 7.4), frozen graph (.pb) -> TensorRT optimized |
| 5 | TF 1.14.0 (CUDA 10.0, cuDNN 7.4), variable model |
| 6 | TF 1.14.0 (CUDA 10.0, cuDNN 7.4), frozen graph (.pb) |

`4 / 1*` means that I ran 4 single-GPU jobs in parallel, just as you would in the full system.

Conclusions:

- TF v1.10 -> v1.13 (CUDA 9 & cuDNN v7.1 -> CUDA 10 & cuDNN v7.4): ~9% faster
- Using a frozen graph: ~30% faster
- GTX 1080 Ti -> RTX 2080 Ti: ~30% faster

Note that I didn't have time to run these experiments repeatedly, so I expect the numbers to have large variance.

## Freezing the model

To freeze the model into a .pb file:

```
$ python main.py nothing nothing --mode pack --pack_model_path obj_v3.pb \
  --load_from obj_v3_model/ --num_class 15 \
  --diva_class3 --rpn_batch_size 256 --frcnn_batch_size 512 --rpn_test_post_nms_topk 1000 --is_fpn \
  --use_dilation --max_size 1920 --short_edge_size 1080
```
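For reference, freezing in TF 1.x generally comes down to restoring the checkpoint and converting variables to constants. The sketch below shows the generic recipe, not the repo's exact `--mode pack` code; the checkpoint directory and output node names are placeholders.

```python
# Generic TF 1.x freezing recipe (not the exact --mode pack implementation).
import tensorflow as tf

def freeze(checkpoint_dir, output_pb, output_node_names):
    with tf.Session() as sess:
        ckpt = tf.train.latest_checkpoint(checkpoint_dir)
        saver = tf.train.import_meta_graph(ckpt + ".meta")
        saver.restore(sess, ckpt)
        # Bake the variable values into the graph so it is self-contained.
        frozen_graph_def = tf.graph_util.convert_variables_to_constants(
            sess, tf.get_default_graph().as_graph_def(), output_node_names)
    with tf.gfile.GFile(output_pb, "wb") as f:
        f.write(frozen_graph_def.SerializeToString())

# Placeholder node names; the real output tensor names depend on the model.
# freeze("obj_v3_model/", "obj_v3.pb", ["final_boxes", "final_labels", "final_probs"])
```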

Run testing on the v1-val set:

```
$ python main.py nothing v1-validate_frames.lst --mode forward --outbasepath \
  obj_v3_val_output --num_class 15 --diva_class3 --max_size 1920 --short_edge_size \
  1080 --gpu 2 --im_batch_size 2 --load_from obj_v3.pb --is_load_from_pb --log
```

This assumes v1-validate_frames.lst contains the absolute paths of all images. The run outputs one JSON file in COCO detection format per image into obj_v3_val_output/. The --log flag runs nvidia-smi every couple of seconds in a separate thread to record GPU utilization.
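The sketch below shows roughly what such a logging thread could look like; it is an illustration, not the repo's actual `--log` implementation. It polls `nvidia-smi` from a daemon thread and appends the readings to a file.

```python
# Illustration of a background GPU-utilization logger (not the repo's code).
import subprocess
import threading
import time

def log_gpu_util(logfile, stop_event, interval=2.0):
    with open(logfile, "a") as f:
        while not stop_event.is_set():
            out = subprocess.check_output(
                ["nvidia-smi",
                 "--query-gpu=index,utilization.gpu,memory.used",
                 "--format=csv,noheader"])
            f.write(out.decode("utf-8"))
            f.flush()
            time.sleep(interval)

stop = threading.Event()
logger = threading.Thread(target=log_gpu_util, args=("gpu_util.log", stop), daemon=True)
logger.start()
# ... run inference here ...
stop.set()
logger.join()
```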

## Experiment 2: TensorRT Optimization (TF v1.14.0)

Use TensorRT to optimize the frozen graph (I tried FP32 and FP16):

```
$ python tensorrt_optimize.py obj_v5_tfv1.14.0.pb obj_v5_tfv1.14.0.tensorRT.fp32.pb --precision_mode FP32
```
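For context, a TF-TRT conversion in TF 1.14 typically looks like the sketch below. This is not necessarily what `tensorrt_optimize.py` does internally, and the output node names are placeholders.

```python
# Sketch of a TF-TRT (TF 1.14) conversion; output node names are placeholders.
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

def optimize(frozen_pb, output_pb, precision_mode="FP32"):
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(frozen_pb, "rb") as f:
        graph_def.ParseFromString(f.read())
    converter = trt.TrtGraphConverter(
        input_graph_def=graph_def,
        nodes_blacklist=["final_boxes", "final_labels", "final_probs"],  # placeholders
        precision_mode=precision_mode,  # "FP32", "FP16", or "INT8"
        is_dynamic_op=True,             # build engines at runtime for variable shapes
        maximum_cached_engines=10)      # the knob mentioned below; 10 is just an example
    trt_graph_def = converter.convert()
    with tf.gfile.GFile(output_pb, "wb") as f:
        f.write(trt_graph_def.SerializeToString())
```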

Inference (running 4 programs in parallel):

```
$ python main.py nothing v1-validate_frames.1.lst --mode forward --outbasepath \
  obj_v5_val_output_TRT_FP32 --num_class 15 --diva_class3 --max_size 1920 --short_edge_size \
  1080 --gpu 1 --im_batch_size 1 --gpuid_start 0 --load_from obj_v5_tfv1.14.0.tensorRT.fp32.pb --is_load_from_pb --log
```

I haven't explored `--maximum_cached_engines` or INT8 mode yet, and ideally these experiments should be repeated a few times.

Experiments:

| RunType | Model: obj_v5 |
|---------|---------------|
| 1 | TF 1.14.0 (CUDA 10.0, cuDNN 7.4), frozen graph (.pb) |
| 2 | TF 1.14.0 (CUDA 10.0, cuDNN 7.4), frozen graph (.pb) -> TRT FP32 |
| 3 | TF 1.14.0 (CUDA 10.0, cuDNN 7.4), frozen graph (.pb) -> TRT FP16 |

### Machine 2

| RunType | # Images | Image Size | # GPUs Used | Runtime (s) | GPU Average Utilization | Per-GPU FPS |
|---------|----------|------------|-------------|-------------|-------------------------|-------------|
| 1 | 206268 | 1920x1080 | 4 / 1* | 13195.3 | 62.97% | 3.91 |
| 2 | 206268 | 1920x1080 | 4 / 1* | 13125.5 | 61.02% | 3.93 |
| 3 | 206268 | 1920x1080 | 4 / 1* | 13261.9 | 52.63% | 3.89 |

## 12/2020: Multiple-image batch processing

We test multiple-image batch processing with the FPN-ResNet50 model. The machine has a GTX 1070 Ti, an i7-8700K, and an SSD. The input is a single 5-minute video, and we run detection and tracking at 1280x720 resolution with frame_gap=8.
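A minimal sketch of the frame sampling and batching loop described above, assuming OpenCV for decoding; `detect_batch` is a hypothetical stand-in for the batched model call, not this repo's API.

```python
# Sketch: sample one frame every `frame_gap` frames and group them into batches.
import cv2  # OpenCV for video decoding

def batched_frames(video_path, batch_size=8, frame_gap=8):
    cap = cv2.VideoCapture(video_path)
    batch, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_gap == 0:
            batch.append(frame)
            if len(batch) == batch_size:
                yield batch
                batch = []
        idx += 1
    if batch:
        yield batch  # flush the last, possibly smaller batch
    cap.release()

# for frames in batched_frames("test_video.mp4", batch_size=8, frame_gap=8):
#     detections = detect_batch(frames)  # hypothetical batched detector call
```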

| RunType | Time (mm:ss) | GPU Median Utilization | GPU Average Utilization |
|---------|--------------|------------------------|--------------------------|
| b=1 var | 06:21 | 53.00% | 54.24% |
| b=1 frozen | 05:06 | 34.50% | 36.30% |
| b=1 frozen,partial | 03:43 | 57.00% | 49.55% |
| b=4 var | 04:35 | 50.00% | 52.48% |
| b=4 frozen | 03:18 | 64.00% | 63.32% |
| b=4 frozen,partial | 03:15 | 42.00% | 48.05% |
| b=8 var | 04:27 | 00.00% | 2.35% |
| b=8 frozen | 03:12 | 62.00% | 53.37% |
| b=8 frozen,partial | 03:07 | 75.50% | 62.11% |

Now with multi-threading:

| RunType | Time (mm:ss) | GPU Median Utilization | GPU Average Utilization |
|---------|--------------|------------------------|--------------------------|
| b=4 var,m | 03:41 | 100.00% | 83.24% |
| b=4 frozen | 02:22 | 70.50% | 71.50% |
| b=4 frozen,partial,m | 02:14 | 76.00% | 75.46% |
| b=8 var,m | 03:29 | 100.00% | 73.45% |
| b=8 frozen,m | 02:14 | 67.00% | 63.69% |
| b=8 frozen,partial,m | 02:08 | 99.00% | 83.83% |

Here is the same test on a machine with a better GPU but a worse CPU: TITAN Xp, i5-4460, SSD. A 0% median GPU utilization means the bottleneck is CPU processing.

| RunType | Time (mm:ss) | GPU Median Utilization | GPU Average Utilization |
|---------|--------------|------------------------|--------------------------|
| b=1 var | 05:51 | 50.00% | 44.15% |
| b=1 frozen | 04:58 | 34.00% | 29.55% |
| b=1 frozen,partial | 03:17 | 42.00% | 40.58% |
| b=4 var | 04:33 | 26.00% | 37.00% |
| b=4 frozen | 03:17 | 12.00% | 25.26% |
| b=4 frozen,partial | 03:09 | 00.00% | 17.78% |
| b=8 var | 04:29 | 8.00% | 40.88% |
| b=8 frozen | 03:15 | 00.00% | 34.32% |
| b=8 frozen,partial | 03:07 | 2.50% | 29.11% |

Now with multi-threading:

| RunType | Time (mm:ss) | GPU Median Utilization | GPU Average Utilization |
|---------|--------------|------------------------|--------------------------|
| b=4 var,m | 03:08 | 38.00% | 46.83% |
| b=4 frozen | 01:53 | 90.00% | 76.00% |
| b=4 frozen,partial,m | 01:40 | 66.00% | 64.56% |
| b=8 var,m | 03:08 | 31.00% | 46.78% |
| b=8 frozen,m | 01:51 | 46.00% | 45.90% |
| b=8 frozen,partial,m | 01:45 | 94.00% | 81.50% |

Here we compare the machine-wide utilization graphs for "b=1 frozen,partial" and "b=8 frozen,partial,m" (multi-threading) on the GTX 1070 Ti machine:

(Figure: utilization graph for "b=1 frozen,partial")

(Figure: utilization graph for "b=8 frozen,partial,m")