Optimizing RetinaFace inference using TVM.
I have been exploring TVM for a while. After using ANSOR (the auto-scheduler) to generate the CUDA kernel here, I wanted to use it to deploy a face detection model that I had previously deployed with another framework. After several days of trial and error, I finally made it work.
First, I compared the performance between the default schedule and the schedule generated by ANSOR.
See schedule/run.py for exporting the two versions of the runtime, and the cpp folder for all the C++ inference benchmark details. In a nutshell, on my machine (an 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz CPU), the default-schedule runtime takes 5324 ms for 1000 loops and the optimized runtime takes 4853 ms for 1000 loops, almost 9% faster.
For comparison, the network forward time for 1000 loops using the MNN inference engine is 4485 ms, which is 7.6% better than the optimized schedule. I would say the performance of TVM is amazing: it is on par with a highly hand-optimized neural network engine while saving AI engineers a lot of time on operator tuning. How well this speedup carries over to other, newer hardware remains to be seen.
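The percentages quoted above can be reproduced from the raw timings. The snippet below is only a sketch of that arithmetic: it assumes "speedup" means the relative reduction in total runtime, and the helper name `runtime_reduction_percent` is made up for illustration.

```python
def runtime_reduction_percent(baseline_ms: float, improved_ms: float) -> float:
    """Relative reduction in runtime, as a percentage of the baseline."""
    return (baseline_ms - improved_ms) / baseline_ms * 100.0


LOOPS = 1000  # number of forward passes per measurement, as reported above

# Total wall-clock times for 1000 loops, taken from the text (milliseconds).
default_ms = 5324.0  # TVM, default schedule
ansor_ms = 4853.0    # TVM, ANSOR-generated schedule
mnn_ms = 4485.0      # MNN inference engine

ansor_vs_default = runtime_reduction_percent(default_ms, ansor_ms)
mnn_vs_ansor = runtime_reduction_percent(ansor_ms, mnn_ms)

print(f"ANSOR vs default schedule: {ansor_vs_default:.2f}% less time")
print(f"MNN vs ANSOR schedule:     {mnn_vs_ansor:.2f}% less time")

# Average latency of a single forward pass for each configuration.
for name, total in [("default", default_ms), ("ANSOR", ansor_ms), ("MNN", mnn_ms)]:
    print(f"{name}: {total / LOOPS:.3f} ms per inference")
```

Under this definition the two figures come out at roughly 8.8% and 7.6%, matching the numbers in the text.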
- Demonstrate how to use the TVM runtime library from C++.
- Benchmark TVM scheduler performance.
- Show how to use TVM for an object detection task.
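
For readers who want a rough idea of how two such runtimes (default schedule vs. ANSOR schedule) could be exported, here is a minimal sketch. It does not reproduce schedule/run.py. It assumes that `mod` and `params` come from a Relay frontend import of the model, that `log_file` is an ANSOR tuning log produced beforehand, and that the installed TVM version still ships the Relay and `auto_scheduler` APIs. The function names `export_default` and `export_tuned` are hypothetical.

```python
from pathlib import Path


def export_default(mod, params, target, out_path):
    """Build with TVM's default schedules and export a shared library.

    `mod`/`params` are assumed to be a Relay module and its weights.
    """
    # TVM is imported lazily so this file can be loaded without TVM installed.
    import tvm
    from tvm import relay

    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
    lib.export_library(str(out_path))
    return Path(out_path)


def export_tuned(mod, params, target, log_file, out_path):
    """Build with the best schedules found in an ANSOR log and export.

    `log_file` is assumed to exist already; the tuning step is not shown.
    """
    import tvm
    from tvm import auto_scheduler, relay

    # Apply the best records from the tuning log during compilation.
    with auto_scheduler.ApplyHistoryBest(str(log_file)):
        with tvm.transform.PassContext(
            opt_level=3,
            config={"relay.backend.use_auto_scheduler": True},
        ):
            lib = relay.build(mod, target=target, params=params)
    lib.export_library(str(out_path))
    return Path(out_path)
```

A call such as `export_tuned(mod, params, "llvm", "tuning.json", "retinaface_tuned.so")` would then produce a library that the C++ side can load; the file names here are placeholders, not the ones used in this repository.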