DLOP-Bench is an open-source benchmark suite for deep learning operators. It has the following three major features:
- Operators at the deep learning framework level
We focus on operators at the deep learning framework level (such as torch.convolution) and do not dive into the implementation details of each operator (for example, an implicit GEMM or Winograd implementation and the related algorithm selection). One can easily benchmark the operators on a given AI accelerator as long as the adaptation to a deep learning framework is complete.
- Basic operators and domain-specific long-tail operators
Besides basic operators like convolution, pooling, and normalization, we also collect many representative domain-specific operators, mainly from object detection, instance segmentation, and other computer vision areas in OpenMMLab. These operators have no dedicated implementation on deep learning accelerators and have to fall back on the Python interpreter. As a result, they are broken down into large numbers of basic operators, which incurs many function calls as well as data transfer and context switching costs. We call them long-tail operators.
- Benchmarking deep learning accelerators, frameworks, and compilers
At the operator level, this benchmark suite provides a finer-grained assessment from multiple aspects, including accelerator hardware specifications, deep learning frameworks, and deep learning compilers.
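To illustrate why long-tail operators are costly, here is a minimal sketch in plain Python of a box-encoding operator in the style of bbox2delta. It is not the DLOP-Bench or OpenMMLab implementation: the encoding formula is an assumed, simplified version, and the helpers (basic_op, add, sub, scale, div, log) are hypothetical stand-ins for framework-level basic operators. The counter shows how a single long-tail operator fans out into many basic-operator calls.

```python
import math
from collections import Counter

op_calls = Counter()  # how often each stand-in basic operator is invoked


def basic_op(fn):
    """Hypothetical decorator: count every call to a stand-in basic operator."""
    def wrapper(*args):
        op_calls[fn.__name__] += 1
        return fn(*args)
    return wrapper


@basic_op
def add(a, b):
    return [x + y for x, y in zip(a, b)]


@basic_op
def sub(a, b):
    return [x - y for x, y in zip(a, b)]


@basic_op
def scale(a, s):
    return [x * s for x in a]


@basic_op
def div(a, b):
    return [x / y for x, y in zip(a, b)]


@basic_op
def log(a):
    return [math.log(x) for x in a]


def box_to_center_size(boxes):
    """Convert (x1, y1, x2, y2) boxes to centers and sizes, column-wise."""
    x1, y1, x2, y2 = (list(col) for col in zip(*boxes))
    cx = scale(add(x1, x2), 0.5)
    cy = scale(add(y1, y2), 0.5)
    return cx, cy, sub(x2, x1), sub(y2, y1)


def bbox2delta_sketch(proposals, gt):
    """Encode ground-truth boxes relative to proposals (assumed formula)."""
    px, py, pw, ph = box_to_center_size(proposals)
    gx, gy, gw, gh = box_to_center_size(gt)
    dx = div(sub(gx, px), pw)
    dy = div(sub(gy, py), ph)
    dw = log(div(gw, pw))
    dh = log(div(gh, ph))
    return list(zip(dx, dy, dw, dh))


deltas = bbox2delta_sketch([(0, 0, 10, 10)], [(0, 0, 10, 10)])
total_basic_ops = sum(op_calls.values())
print(deltas, total_basic_ops)
```

In a real framework each of these calls would be dispatched separately from Python, which is where the function-call and context-switching overhead described above comes from.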
- Execution framework. The main body is an execution engine, compatible with different deep learning frameworks (PyTorch, TensorFlow, JAX, and so on) with different execution modes, such as eager and graph mode.
- 200+ basic operators. We collected the operators from models in OpenMMLab. The input information of each operator consists of two parts: the input tensor shapes and the attribute information. We run the models, record the input configurations of each operator, and save them in CSV format for evaluation.
- 100+ long-tail samples. We collected 100+ long-tail samples with representative syntax features from different deep learning models, mainly from OpenMMLab. See samples for more detail.
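As a rough illustration of what a recorded input configuration could look like, the sketch below parses a small CSV into shapes and attributes. The column names (input_shape, attributes) and the JSON encoding of each cell are assumptions made for this example only; the actual files in the repository may be laid out differently.

```python
import csv
import io
import json

# Hypothetical CSV layout: one row per recorded call of an operator.
SAMPLE_CSV = '''input_shape,attributes
"[1, 3, 224, 224]","{""kernel_size"": 3, ""stride"": 1}"
"[8, 64, 56, 56]","{""kernel_size"": 1, ""stride"": 2}"
'''


def load_input_configs(text):
    """Parse each CSV row into a tensor shape and an attribute dictionary."""
    configs = []
    for row in csv.DictReader(io.StringIO(text)):
        configs.append({
            "shape": tuple(json.loads(row["input_shape"])),
            "attrs": json.loads(row["attributes"]),
        })
    return configs


configs = load_input_configs(SAMPLE_CSV)
for cfg in configs:
    print(cfg["shape"], cfg["attrs"])
```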
First, download the latest source code:
git clone https://github.com/OpenComputeLab/DLOP-Bench.git
To show the structure of source code, we can use the following command:
cd DLOP-Bench
tree -d -L 1 ./bench
The implementations of the basic and long-tail operators are located in ./bench/samples/.
The code is tested under Python 3, with different deep learning frameworks (PyTorch, TensorFlow, JAX, and so on). You can select a specific version of the framework according to the version of CUDA/cuDNN. For more details please refer to their official websites.
Some samples depend on OpenCV (imported as cv2).
pip install opencv-python
pip install opencv-python-headless
Here is a command demo that illustrates how you can use DLOP-Bench to test basic operators.
# config bench PYTHONPATH
cd DLOP-Bench
export PYTHONPATH=./bench:$PYTHONPATH
# To test sample performance with the torch backend, follow the demo below:
# prepare the PyTorch environment (Python 3 with torch 1.10 or 1.12 recommended)
...
# run the operator abs using the torch backend; for more profiling results, see profiler_reulsts, reulsts, and time_reulsts
FRAMEWORK=torch python ./bench/api/api.py -c abs -st 1
# run the operator abs and absBP using torch backend
FRAMEWORK=torch python ./bench/api/api.py -c abs,absBP -st 1
# get more usage information
FRAMEWORK=torch python ./bench/api/api.py --help
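The commands above can also be driven from Python, for example to loop over many operators. The following is a minimal sketch that uses only the options shown in the demo (-c, -st, and the FRAMEWORK and PYTHONPATH environment variables). The helper names and the default repository directory are hypothetical, not part of DLOP-Bench.

```python
import os
import subprocess
import sys

API_SCRIPT = "./bench/api/api.py"  # path taken from the commands above


def build_bench_command(cases, stages):
    """Build the argv list for api.py from operator names and stage numbers."""
    return [
        sys.executable, API_SCRIPT,
        "-c", ",".join(cases),
        "-st", ",".join(str(s) for s in stages),
    ]


def build_bench_env(framework, base_env=None):
    """Copy the environment and set FRAMEWORK and PYTHONPATH as in the demo."""
    env = dict(os.environ if base_env is None else base_env)
    env["FRAMEWORK"] = framework
    existing = env.get("PYTHONPATH", "")
    env["PYTHONPATH"] = "./bench" + (os.pathsep + existing if existing else "")
    return env


def run_bench(cases, stages, framework="torch", repo_dir="DLOP-Bench"):
    """Run the benchmark from the repository root; raises if api.py fails."""
    return subprocess.run(
        build_bench_command(cases, stages),
        env=build_bench_env(framework),
        cwd=repo_dir,
        check=True,
    )


command = build_bench_command(["abs", "absBP"], [1])
print(" ".join(command[1:]))
```

Calling run_bench(["abs", "absBP"], [1]) would then be equivalent to the second command in the demo, assuming the repository was cloned into DLOP-Bench and the framework is installed.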
For long-tail operators, this benchmark suite provides the following stages for testing their performance:
- stage 1 : eager mode.
- stage 2 : graph mode with jit.
This benchmark suite supports the execution of all long-tail operators in stage 1, while some operators fail to run in stage 2 because they are not supported by the given deep learning compiler. Here is a command demo to test long-tail operators.
# run the operator bbox2delta using torch backend in eager mode
FRAMEWORK=torch python ./bench/api/api.py -c bbox2delta -st 1
# run the operator bbox2delta using torch backend in both eager mode and graph mode
FRAMEWORK=torch python ./bench/api/api.py -c bbox2delta -st 1,2
# run the operator bbox2delta and l2_loss using torch backend in both eager mode and graph mode
FRAMEWORK=torch python ./bench/api/api.py -c bbox2delta,l2_loss -st 1,2
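The two-stage idea can be sketched independently of any framework: time the operator as-is (stage 1), then try to compile it and time the compiled version (stage 2), recording a failure when the compiler rejects the operator. This is a hypothetical harness written for illustration, not the DLOP-Bench execution engine. The names run_stages, time_call, and unsupported_compiler are assumptions, and l2_loss here is a plain-Python stand-in.

```python
import time


def time_call(fn, args, iters=10):
    """Return the mean wall-clock seconds per call over `iters` runs."""
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters


def run_stages(eager_fn, compile_fn, args, stages=(1, 2)):
    """Time stage 1 (eager) and stage 2 (compiled); None marks a failed stage."""
    results = {}
    if 1 in stages:
        results[1] = time_call(eager_fn, args)
    if 2 in stages:
        try:
            compiled = compile_fn(eager_fn)  # may reject unsupported syntax
            results[2] = time_call(compiled, args)
        except Exception as exc:
            print(f"stage 2 skipped: {exc}")
            results[2] = None
    return results


def l2_loss(pred, target):
    """Plain-Python stand-in for a long-tail operator."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def unsupported_compiler(fn):
    """Stand-in compiler that rejects every operator."""
    raise NotImplementedError(f"{fn.__name__} is not supported")


inputs = ([1.0, 2.0], [1.0, 0.0])
ok = run_stages(l2_loss, lambda fn: fn, inputs)
failed = run_stages(l2_loss, unsupported_compiler, inputs)
```

With PyTorch, compile_fn could be something like torch.jit.script. Which JIT entry point DLOP-Bench uses for stage 2 is not stated here.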
These APIs also work with the torch, tensorflow, or XLA backend: just set the FRAMEWORK environment variable accordingly. While all the operators can be tested with the torch backend, some operators may raise an AssertionError with other backends if their implementations have not been added yet. You can wait for our update or add the code yourself.
To test sample performance with the TensorFlow or XLA backend, follow the demo below:
# prepare tensorflow environment
...
# run the operator bbox2offset using tf backend in eager mode
FRAMEWORK=tf TF_XLA_FLAGS=--tf_xla_auto_jit=2 XLA_FLAGS=--xla_gpu_cuda_data_dir=.../cuda-10.1 python ./bench/api/api.py -c bbox2offset -st 1
# run the operator bbox2offset using tf backend in both eager mode and graph mode
FRAMEWORK=tf TF_XLA_FLAGS=--tf_xla_auto_jit=2 XLA_FLAGS=--xla_gpu_cuda_data_dir=.../cuda-10.1 python ./bench/api/api.py -c bbox2offset -st 1,2
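The backend selection and the AssertionError for missing implementations can be pictured with a small lookup keyed on the FRAMEWORK environment variable. This is a sketch under assumptions: the IMPLEMENTATIONS table, the get_impl helper, and the default of torch when FRAMEWORK is unset are all hypothetical and do not describe the actual DLOP-Bench internals.

```python
import os

# Hypothetical table: which backends have an implementation for each operator.
IMPLEMENTATIONS = {
    "bbox2delta": {"torch": lambda: "torch bbox2delta"},
    "bbox2offset": {
        "torch": lambda: "torch bbox2offset",
        "tf": lambda: "tf bbox2offset",
    },
}


def get_impl(op_name, environ=None):
    """Pick the implementation matching FRAMEWORK, or fail with AssertionError."""
    environ = os.environ if environ is None else environ
    framework = environ.get("FRAMEWORK", "torch")  # assumed default
    impls = IMPLEMENTATIONS.get(op_name, {})
    assert framework in impls, (
        f"{op_name} has no {framework} implementation yet"
    )
    return impls[framework]


# bbox2offset has a tf entry in the table, so the lookup succeeds.
print(get_impl("bbox2offset", {"FRAMEWORK": "tf"})())

# bbox2delta has only a torch entry, so asking for tf fails.
try:
    get_impl("bbox2delta", {"FRAMEWORK": "tf"})
except AssertionError as err:
    print(f"AssertionError: {err}")
```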
- Create a folder named after the operator in the ./bench/samples/basic directory
- Copy the json file of the operator parameter information table generated by the operator acquisition module into the folder
- Create __init__.py and torch_impl.py files; if you need to test operators of other frameworks, you can refer to torch_impl.py

In __init__.py, you need to implement two functions, get_sample_config and gen_np_args, and then register the two functions using register_sample. In torch_impl.py, you need to implement the function args_adaptor, which performs data preparation, and the definition of the operator you are going to add. Then, the executor_creator function is needed to register the above two functions into the benchmark.
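The steps above can be pictured with the following self-contained mock. The function names mirror those mentioned in the text, but every signature, return value, and the registry itself are assumptions made for illustration. They are local stand-ins rather than the real DLOP-Bench API, so check an existing sample under ./bench/samples/basic for the actual interfaces. Plain Python lists replace NumPy arrays and torch tensors so that the sketch runs without extra dependencies.

```python
import random

# --- stand-ins for the benchmark's registry (hypothetical) ---
SAMPLES = {}
EXECUTORS = {}


def register_sample(name, get_sample_config, gen_np_args):
    """Stand-in: remember how to configure and feed the sample."""
    SAMPLES[name] = (get_sample_config, gen_np_args)


# --- what would go in __init__.py ---
def get_sample_config():
    """Describe the input cases; in the repo these would come from the json file."""
    return {"args_cases": [(4,), (16,)], "performance_iters": 100}


def gen_np_args(size):
    """Generate random input data for one case."""
    return [[random.uniform(-1.0, 1.0) for _ in range(size)]]


register_sample("abs", get_sample_config, gen_np_args)


# --- what would go in torch_impl.py ---
def args_adaptor(np_args):
    """Data preparation; a torch version would build tensors here."""
    return [list(np_args[0])]


def abs_op(values):
    """The operator under test (plain-Python stand-in)."""
    return [abs(v) for v in values]


def executor_creator():
    """Bundle the operator with its adaptor so the benchmark can run it."""
    return abs_op, args_adaptor


EXECUTORS["abs"] = executor_creator


# --- how a driver could tie the pieces together ---
def run_sample(name):
    get_config, gen_args = SAMPLES[name]
    op, adaptor = EXECUTORS[name]()
    cases = get_config()["args_cases"]
    return [op(*adaptor(gen_args(*case))) for case in cases]


outputs = run_sample("abs")
print([len(out) for out in outputs])
```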