- If you train without Tensor Cores (i.e. FP32 training, or FP16 training on Pascal or older GPUs), manually set `algo` to `ConvAlgo.Native` in all convolution/maxpool layers. The default algorithm is `ConvAlgo.MaskImplicitGemm`, which is SLOWER than `ConvAlgo.Native` with float32. This will be fixed in spconv 2.2.
- If your GPU supports Tensor Cores, use FP16 (mixed-precision training) if possible.
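Forcing the native algorithm (first tip above) can be sketched as follows. This assumes spconv 2.x's `spconv.pytorch` module and the `ConvAlgo` enum from `spconv.core`; check the API of your installed version:

```python
import spconv.pytorch as spconv
from spconv.core import ConvAlgo

# Force the native algorithm on every conv/pool layer for FP32 training;
# the default MaskImplicitGemm is tuned for Tensor Cores, not plain float32.
net = spconv.SparseSequential(
    spconv.SparseConv3d(32, 64, kernel_size=3, algo=ConvAlgo.Native),
    spconv.SparseMaxPool3d(kernel_size=2, stride=2, algo=ConvAlgo.Native),
)
```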
- If you train with mixed-precision training (using Tensor Cores), you don't need to set the algorithm manually.
- Currently the fast algorithm only supports a kernel volume (product of kernel sizes) <= 32, so don't use large kernel sizes.
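The 32-element limit is easy to check up front. `uses_fast_algo` here is a hypothetical helper for illustration, not part of spconv:

```python
from math import prod

def uses_fast_algo(kernel_size):
    """Mirror the rule above: spconv's fast path only
    covers kernel volumes (product of kernel sizes) <= 32."""
    return prod(kernel_size) <= 32

# A 3x3x3 kernel (volume 27) stays on the fast path ...
assert uses_fast_algo((3, 3, 3))
# ... while 5x5x5 (volume 125) falls back to the slower algorithm.
assert not uses_fast_algo((5, 5, 5))
```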
- Make sure your channel size is a multiple of 8 when using FP16; a multiple of 32 is better.
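When designing a network, you can round channel counts up rather than eyeballing them. `pad_channels` is a hypothetical helper, not an spconv function:

```python
def pad_channels(channels, multiple=8):
    """Round a channel count up to the next multiple
    (8 for FP16 Tensor Cores; 32 is better). Hypothetical helper."""
    return -(-channels // multiple) * multiple

assert pad_channels(13) == 16       # multiple of 8 for FP16
assert pad_channels(13, 32) == 32   # multiple of 32 is better
assert pad_channels(64) == 64       # already aligned, unchanged
```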
- spconv 2.x on Windows 10 is 1.5x~2x slower than on Linux. Use Linux if possible.
- If you train with float32 on Ampere or later GPUs, you can set `spconv.constants.SPCONV_ALLOW_TF32` to enable faster FP32 training. See the benchmark for performance details of the different algorithms.
- Different CUDA builds of spconv may have different performance. Use the newest CUDA version if possible. For example, spconv-cu117 is faster than spconv-cu114, and spconv-cu114 is faster than spconv-cu111.
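The TF32 toggle above is a module-level flag; a minimal config sketch, assuming it should be set before any spconv layers are constructed (check your spconv version):

```python
import spconv.constants

# Allow TF32 Tensor Core math for float32 convolutions on Ampere+ GPUs.
# Set this before building any spconv layers.
spconv.constants.SPCONV_ALLOW_TF32 = True
```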
- If your kernel volume is larger than 32, spconv falls back to a slower (and, in FP16, less accurate) algorithm. To use a faster algorithm for large kernel sizes (it needs time to compile at runtime), set `large_kernel_fast_algo=True`.
- Use `SparseGlobalMaxPool` instead of a large kernel size when you need global pooling.
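A sketch of the global-pooling tip, assuming spconv 2.x's `spconv.pytorch` API and that `SparseGlobalMaxPool` takes no constructor arguments (verify against your version):

```python
import spconv.pytorch as spconv

# Use a dedicated global pool instead of a huge-kernel SparseMaxPool3d,
# which would blow past the kernel-volume-32 fast-path limit.
net = spconv.SparseSequential(
    spconv.SparseConv3d(32, 64, kernel_size=3),
    spconv.SparseGlobalMaxPool(),  # pools over all active sites at once
)
```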