ConvTranspose using cuDNN Frontend with NHWC support (#21752)
### Description
Added the cuDNN Frontend and used it for the NHWC ConvTranspose op,
including an option for bias fusion. Similar to this [Conv
PR](#19470).

### Backward compatibility
If ORT is built with cuDNN 8, the cuDNN frontend will not be built into
the binary; the old kernels (using cuDNN backend APIs) are used instead.

### Major Changes
For cuDNN 9, the cuDNN frontend is enabled to fuse the data gradient
convolution and bias when the provider option `fuse_conv_bias=1` is set.
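Assuming the ONNX Runtime Python API, such provider options are passed as a dict when creating a session; a sketch (the model path is hypothetical, and session creation is left as a comment since it needs a GPU build):

```python
# Sketch: enabling the ConvTranspose + bias fusion via CUDA EP options.
# Provider option values are passed as strings.
cuda_options = {
    "fuse_conv_bias": "1",  # fuse data gradient convolution + bias (this PR)
    "prefer_nhwc": "1",     # prefer the NHWC kernels
}
providers = [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]

# import onnxruntime as ort
# sess = ort.InferenceSession("model.onnx", providers=providers)
print(providers[0][0])  # CUDAExecutionProvider
```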

### Potential Issues
The cuDNN frontend uses TF32 by default. It can be disabled with the
`use_tf32` CUDA provider option, but if the cuDNN frontend encounters
issues building an operation graph, it will fall back to using TF32.
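To illustrate what TF32 trades away, a numpy sketch (not ORT code): TF32 keeps float32's 8 exponent bits but only 10 explicit mantissa bits, so this simulates it by truncating the low 13 mantissa bits (real hardware rounds rather than truncates).

```python
import numpy as np

def tf32_round(x):
    """Simulate TF32 precision: zero the low 13 of float32's 23 mantissa bits."""
    a = np.asarray(x, dtype=np.float32).copy()
    bits = a.view(np.uint32)
    bits &= np.uint32(0xFFFFE000)  # keep sign, exponent, top 10 mantissa bits
    return a

print(tf32_round(1.0 + 2**-10))  # 1.0009765625 -- representable in TF32
print(tf32_round(1.0 + 2**-11))  # 1.0          -- increment truncated away
```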

### Follow ups
This is one of a series of PRs targeting enabling NHWC (here for the
ConvTranspose operation) in the CUDA EP by default if the device supports it.
Further changes will follow to make this possible:
(1) Enable prefer_nhwc by default for devices with sm >= 70.
(2) Make fuse_conv_bias=1 the default after more testing.
(3) Add other NHWC operators (like Resize or UpSample).
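For context on the layout itself, NHWC moves the channel dim last; a minimal numpy sketch of the NCHW-to-NHWC permutation (shapes are hypothetical):

```python
import numpy as np

# NCHW (framework-default) vs NHWC (channels-last) layout of the same tensor.
x_nchw = np.zeros((2, 16, 32, 32), dtype=np.float32)  # (N, C, H, W)
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))           # (N, H, W, C)
print(x_nhwc.shape)  # (2, 32, 32, 16)
```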

### Motivation and Context
The new cuDNN Frontend library provides the functionality to fuse
operations as well as new heuristics for kernel selection. Here it
fuses the convolution data gradient operation (ConvTranspose) with the
pointwise bias operation.
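For reference, the two operations being fused are the data gradient of a convolution (each input element scatters a scaled copy of the kernel into the output) followed by a pointwise bias add; a minimal single-channel numpy sketch (stride 1, no padding, hypothetical helper name):

```python
import numpy as np

def conv_transpose2d(x, w, bias=0.0):
    """Single-channel ConvTranspose, stride 1, no padding: scatter-add
    the kernel per input element, then add a pointwise bias -- the two
    steps this PR fuses into one cuDNN frontend graph."""
    H, W = x.shape
    KH, KW = w.shape
    out = np.zeros((H + KH - 1, W + KW - 1), dtype=x.dtype)
    for i in range(H):
        for j in range(W):
            out[i:i + KH, j:j + KW] += x[i, j] * w
    return out + bias

x = np.ones((2, 2), dtype=np.float32)
w = np.ones((2, 2), dtype=np.float32)
print(conv_transpose2d(x, w, bias=1.0))  # [[2 3 2] [3 5 3] [2 3 2]]
```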

### Minor Change
Fixed a small bug in the CUDA convolution operation that occurred when
`GetCudnnConv1dPadToNc1d` was enabled.
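For context, that option handles a 1-D convolution as a degenerate 2-D one by inserting a dummy spatial dimension (NCL becomes NC1L), which is the dims-insertion code the fix touches; a numpy sketch of the reshape (shapes are hypothetical):

```python
import numpy as np

# Sketch: view a (N, C, L) 1-D conv input as (N, C, 1, L) by inserting
# a dummy height of 1 at axis 2, mirroring the dims.insert(begin() + 2, 1)
# calls in the CUDA conv kernel.
x_ncl = np.zeros((4, 8, 128), dtype=np.float32)  # (N, C, L)
x_nc1l = np.expand_dims(x_ncl, axis=2)           # (N, C, 1, L)
print(x_nc1l.shape)  # (4, 8, 1, 128)
```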
JTischbein authored Sep 10, 2024
1 parent f633caa commit 20d9464
Showing 5 changed files with 702 additions and 224 deletions.
3 changes: 2 additions & 1 deletion onnxruntime/core/providers/cuda/cuda_execution_provider.cc

```diff
@@ -2473,7 +2473,8 @@ static bool RNNNeedFallbackToCPU(const onnxruntime::Node& node,
   return false;
 }
 
-static bool ConvTransposeNeedFallbackToCPU(const onnxruntime::Node& node, const logging::Logger& logger,
+static bool ConvTransposeNeedFallbackToCPU([[maybe_unused]] const onnxruntime::Node& node,
+                                           [[maybe_unused]] const logging::Logger& logger,
                                            [[maybe_unused]] const GraphViewer& graph_viewer,
                                            [[maybe_unused]] const bool prefer_nhwc) {
   const auto& node_attributes = node.GetAttributes();
```
2 changes: 1 addition & 1 deletion onnxruntime/core/providers/cuda/nn/conv.cc

```diff
@@ -385,7 +385,7 @@ Status Conv<T, Layout>::UpdateState(OpKernelContext* context, bool bias_expected
   if (cuda_ep->GetCudnnConv1dPadToNc1d()) {
     x_dims_cudnn.insert(x_dims_cudnn.begin() + 2, 1);
     y_dims_cudnn.insert(y_dims_cudnn.begin() + 2, 1);
-    w_dims_cudnn.insert(w_dims.begin() + 2, 1);
+    w_dims_cudnn.insert(w_dims_cudnn.begin() + 2, 1);
     pads.insert(pads.begin() + kernel_rank, 0);
     pads.insert(pads.begin(), 0);
     kernel_shape.insert(kernel_shape.begin(), 1);
```
