resnext , a case study demonstrating the importance of library tuning

MIGraphX allows one to take pre-trained models in ONNX format and deploy them on the AMD ROCm software platform.

As capabilities are added to MIGraphX, additional models are enabled. The ResNext model was one example recently enabled to work with MIGraphX. Our ONNX file example was created using PyTorch 0.4.0 from Remi Cadene's pretrained model collection.

We weren't certain what to expect, but initially the performance of ResNext appeared as it might be slow. This wiki page describes how the analysis was done and how performance of ResNext significantly improved. Along the way, this also provides a case study of the importance of MIOpen library tuning for particular models.

Initial analysis

On initial runs on our test system showed ResNext processing 6 images per second. Actual performance is going to depend on hardware architecture, clock speeds as well as the model itself - however, this was also a much lower rate than other models. We weren't quite certain if this meant the model was more complex and involved, or if something wasn't quite tuned for the hardware.

The first tool to use was the HCC_PROFILE mode of the ROCm platform together with the "rpt" tool in the "hcc" package. When one runs a ROCm program with environment variable HCC_PROFILE=2, a detailed activity log is emitted. A short exerpt is included below:

profile:  kernel;	_ZN12_GLOBAL__N_110hip_fill_nILj256EPjmjEEvT0_T1_T2_;	 184.3 us;	1291568592072;	1291568776381;	#0.0.14;
profile: barrier;	                depcnt=0,acq=acc,rel=acc;	   2.1 us;	1291569091218;	1291569093292;	#0.0.17;
profile: barrier;	                depcnt=0,acq=acc,rel=acc;	   1.6 us;	1291568779493;	1291568781122;	#0.0.15;
profile:  kernel;	                          sp3AsmConvRxSU;	1194.2 us;	1291569612568;	1291570806725;	#0.0.18;
profile:  kernel;	                          gcnAsmConv1x1U;	 770.4 us;	1291574646888;	1291575417312;	#0.0.19;
profile:  kernel;	                           MIOpenConv1x1;	 209.6 us;	1291576427278;	1291576636922;	#0.0.20;
profile: barrier;	                depcnt=0,acq=sys,rel=sys;	   7.7 us;	1291576712952;	1291576720656;	#0.0.21;
profile: copyslo;	                  HostToDevice_sync_slow;	9007.9 us;	1291633572234;	1291642580120;	#0.0.0;	51380224 bytes;	49.
0 MB;	5.4 GB/s;
profile: copyslo;	                  HostToDevice_sync_slow;	  37.9 us;	1291645338014;	1291645375925;	#0.0.0;	36864 bytes;	0.0
 MB;	0.9 GB/s;
profile: barrier;	                depcnt=0,acq=acc,rel=acc;	   4.1 us;	1291653503154;	1291653507303;	#0.0.22;
profile:  kernel;	                                  Im2Col;	  93.0 us;	1291660242434;	1291660335478;	#0.0.23;

This detailed log can be summarized in a more understandable form using

rpt logfile > report.txt

An example of the report produced

ROI_START: DATA         0.000000:      +0.00 copyslo #0.0.0        3: HostToDevice_sync_slow_9633792_bytes 
ROI_STOP : GPU0    280057.649620:      +0.00 barrier #0.0.50205    50624: depcnt=0,acq=sys,rel=sys       
ROI_TIME= 280.058 secs

Resource=GPU0 Showing 20/41 records   97.04% busy
      Total(%)    Time(us)    Calls  Avg(us)  Min(us)  Max(us)  Name                          
        93.44% 261694931.5      979 267308.4    218.1 726547.5  gcnAsmConv1x1U                
         1.94%  5440843.5     4740   1147.9    356.2   1601.4  Cijk_Ailk_Bljk_SB_MT128x064x08_APM01_AF0EM01_AF1EM01_ASEM01_BL1_DTL0_EPS1_FL00_GRVW04_GSU01_ISA906_IU
01_K1_KLA_LPA00_LPB00_LDL01_MGWVW01_NLCA01_NLCB01_PGR1_PLR1_SNLL1_TT08_04_USFGRO00_VAW01_VW04_WG16_16_01_WGM01
         1.65%  4618636.5      162  28510.1  10394.3  85382.0  gap 10000us-100000us          
         0.61%  1695046.1        4 423761.5 128938.8 875528.8  gap >=100000us                
         0.55%  1550620.5     3366    460.7    248.8   2617.9  MIOpenGroupConvUni            
...

The first few lines starting after "GPU0" give several cues that something may be amiss:

In total the GPU0 resource was marked "busy" some 98% of the time. When run with other models, it was more typically in the 50-60% range. (A common reason for a GPU that is not-busy is when intermediate results are copied from GPU memory between kernel execution, MIGraphX can reduce those copies through fusion optimizations, but 98% busy is unusual).
One single type of routine "gcnAsmConv1x1U" dominates the time accounting for 94% of the total. There seems to be a wide range of times of those routines.

Using the tuning mode

With the clues above, one of the first things tried was to update the MIOpen performance database. The MIOpen library comes pre-installed with a database with pretuned values for many models, but tuning was to check possibility that particular kernels used in ResNext might be missing.

There is both a global and a local version of the tuning database. Before tuning, the last lines of the local database (in ~/.config/miopen/gfx906_64.cd.updb.txt) looked like this:

2048-8-8-1x1-384-8-8-32-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:4,8,4,16,1,4,4,2;ConvOclDirectFwd1x1:1,64,1,1,1,4,16,2048,0
2048-8-8-1x1-448-8-8-32-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:2,8,2,32,2,4,1,4;ConvOclDirectFwd1x1:1,64,1,1,1,1,16,256,0
2048-8-8-1x1-192-8-8-32-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:1,8,1,64,2,4,2,2;ConvOclDirectFwd1x1:1,64,1,1,1,0,16,256,0
3-224-224-3x3-64-224-224-16-1x1-1x1-1x1-0-NCHW-FP32-F=ConvOclDirectFwd:8,32,32,32,4,1,4,1,2;ConvOclDirectFwdFused:8,32,32,32,4,1,4,1,2

Multiple options for tuning are described at MIOpen performance database. For our exercise, I ran MIGraphX on the model using the environment variable

MIOPEN_FIND_ENFORCE=3

With this variable, MIOpen searches for an updates the performance database with the fastest kernels found. This can take a while, but MIOpen provides progress updates of the search that give indications the database is updated. These look a bit like this

MIOpen(HIP): Warning [Monitor] 1080/0/9375 0.815999, best within recent 1081: 0.815999 #913 2,16,2,64,3,2,1,1, ETA:23.0684 sec.
MIOpen(HIP): Warning [Monitor] 2151/0/9375 0.815999, best within recent 1071: 0.855104 #2013 2,16,2,64,2,2,2,1, ETA:20.1665 sec.
MIOpen(HIP): Warning [Monitor] 3152/0/9375 0.815999, best within recent 1001: 0.87024 #3085 2,16,4,64,1,2,4,1, ETA:17.7788 sec.
MIOpen(HIP): Warning [Monitor] 4190/0/9375 0.815999, best within recent 1038: 0.830816 #3936 3,16,3,64,2,1,1,2, ETA:14.8569 sec.
MIOpen(HIP): Warning [Monitor] 5274/0/9375 0.747264, best within recent 1084: 0.747264 #4419 3,16,3,64,2,2,1,2, ETA:11.6694 sec.
MIOpen(HIP): Warning [Monitor] 6295/0/9375 0.721632, best within recent 1021: 0.721632 #5777 1,16,1,64,4,2,2,2, ETA:8.81193 sec.
MIOpen(HIP): Warning [Monitor] 7291/0/9375 0.721632, best within recent 996: 0.78912 #7163 3,8,3,64,4,1,1,4, ETA:6.00544 sec.
MIOpen(HIP): Warning [Monitor] 8245/0/9375 0.637984, best within recent 954: 0.637984 #7592 1,16,1,64,4,2,1,4, ETA:3.2907 sec.
MIOpen(HIP): Warning [Monitor] 9107/0/9375 0.637984, best within recent 862: 0.72544 #9060 2,16,2,64,2,2,1,8, ETA:0.794937 sec.
MIOpen(HIP): Warning [GenericSearch] Done: 9375/0/9375, best #7592 0.637984 1,16,1,64,4,2,1,4

A side effect of this exercise was that six additional entries had been added to local performance database.

256-56-56-1x1-256-56-56-16-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:3,16,3,64,2,2,1,1;ConvOclDirectFwd1x1:1,64,1,1,1,1,64,2048,0;ConvActivAsm1x1U:3,16,5,64,2,1
256-56-56-1x1-512-56-56-16-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:2,16,2,32,3,2,1,4;ConvOclDirectFwd1x1:1,64,1,1,1,4,64,2048,0;ConvActivAsm1x1U:3,16,3,64,4,1
512-28-28-1x1-512-28-28-16-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:3,16,5,32,2,2,2,1;ConvOclDirectFwd1x1:1,64,1,1,1,4,16,2048,0;ConvActivAsm1x1U:4,16,5,32,2,2
512-28-28-1x1-1024-28-28-16-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:3,16,5,16,1,2,2,2;ConvOclDirectFwd1x1:1,64,1,1,1,4,16,2048,0;ConvActivAsm1x1U:3,16,5,32,2,1
1024-14-14-1x1-1024-14-14-16-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:2,32,7,32,1,1,1,1;ConvOclDirectFwd1x1:1,64,1,1,1,4,64,2048,0
1024-14-14-1x1-2048-14-14-16-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:4,16,7,16,1,2,2,1;ConvOclDirectFwd1x1:1,64,1,1,1,4,16,2048,0
2048-7-7-1x1-2048-7-7-16-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:4,16,4,16,1,2,1,2;ConvOclDirectFwd1x1:1,64,1,1,1,4,64,2048,0

Improved performance

Once the performance database was updated, it was time to try the experiment again. The performance now showed 148 images per second, a factor of almost 25x improvement.

In addition, if one looks at the HCC_PROFILE=2 output using rpt, one also sees some overall differences:

ROI_START: DATA         0.000000:      +0.00 copyslo #0.0.0        3: HostToDevice_sync_slow_9633792_bytes 
ROI_STOP : GPU0     17744.207163:      +0.00 barrier #0.0.40105    40524: depcnt=0,acq=sys,rel=sys       
ROI_TIME=  17.744 secs

Resource=GPU0 Showing 20/41 records   60.84% busy
      Total(%)    Time(us)    Calls  Avg(us)  Min(us)  Max(us)  Name                          
        39.11%  6938999.5     6029   1150.9    218.1   2969.3  gcnAsmConv1x1U                
        19.75%  3504600.8      106  33062.3  11560.1  83246.2  gap 10000us-100000us          
         9.47%  1679962.4     3366    499.1    248.8   2599.8  MIOpenGroupConvUni            
         9.14%  1621510.2        4 405377.5 140392.1 710962.0  gap >=100000us                
         3.00%   531562.3     7776     68.4     50.0    100.0  gap 50us-100us                
...

These changes include:

GPU0 is now only 61% busy.
The gcnAsmConv1x1U routines still take the most time, but now account for only 39% of the total. In addition, the maximum time taken by one of those calls has gone down to 2969 us.

Summary

This example highlights the importance tuning MIOpen can have on performance. It highlights tools such as HCC_PROFILE mode that can help in investigation and describes symptoms to look for in a mis-tuned application.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

resnext , a case study demonstrating the importance of library tuning

Initial analysis

Using the tuning mode

Improved performance

Summary

Clone this wiki locally