Start enabling the "generic" OpenCL vendor #2019

Draft · wants to merge 9 commits into base: main

11 changes: 9 additions & 2 deletions cmake/options.cmake
@@ -146,6 +146,7 @@ set(DNNL_ENABLE_PRIMITIVE_GPU_ISA "ALL" CACHE STRING
at build time. Regardless of value chosen, reference OpenCL-based
implementations will always be available. Valid values:
- ALL (the default). Includes all ISA to be enabled.
- NONE. Includes no ISAs, just the generic kernels.
@densamoilov (Contributor) · Aug 1, 2024

(random spot)

Thank you for taking the effort to enable a generic vendor for the OpenCL runtime. When it comes to external contributions that impact some aspects of the library's architecture, a formal proposal (RFC) is required. Please make yourself familiar with the contribution guidelines and the RFC process.

While going through the changes in this pull request I found a few things that have to be addressed. Your proposal is expected to cover them:

  • According to the GPU code organization structure, the generic OpenCL kernels should reside in gpu/generic/ocl.
  • The OpenCL primitive implementations that you marked as generic may be generic at this point, but there is absolutely no guarantee that they will stay that way. The right way to make them generic is to move them to gpu/generic/ocl and enable them for the Intel and Generic vendors (the sketch after this list illustrates the vendor gating involved).
  • The library architecture requires that one engine per runtime (CPU and GPU) must be enabled. Currently, the only OpenCL engine is Intel-specific. When DNNL_GPU_VENDOR is GENERIC and DNNL_GPU_RUNTIME is OCL, there are no engines that could be used; therefore, a new engine for the generic vendor and OpenCL runtime has to be introduced.
  • There shouldn't be any vendor-specific compile-time checks in a vendor-specific space. All code that resides in gpu/intel assumes that DNNL_GPU_VENDOR is INTEL.
  • When DNNL_GPU_VENDOR is INTEL and DNNL_GPU_RUNTIME is SYCL, the xpu/ocl code must be enabled (the new condition that you introduced seems to break that rule).
  • I think the DNNL_ENABLE_PRIMITIVE_GPU_ISA option should be enabled (and defined) only for the Intel vendor and should be ignored otherwise.
  • USM is not part of the OpenCL standard, and therefore we cannot assume that all OpenCL vendors support it. The USM utility functions must be conditionally enabled for the vendors that support it.
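
To make the vendor-gating point concrete, here is a minimal sketch of how such instance macros could expand. The macro bodies are assumptions for illustration; only the macro names (GPU_INSTANCE_INTEL, GPU_INSTANCE_GENERIC, and friends) appear in this pull request.

    // Hypothetical: wrap an implementation type into an impl-list entry.
    #define GPU_INSTANCE(...) impl_list_item_t::make<__VA_ARGS__>(),

    #if DNNL_GPU_VENDOR == DNNL_VENDOR_INTEL
    #define GPU_INSTANCE_INTEL(...) GPU_INSTANCE(__VA_ARGS__)
    #else
    #define GPU_INSTANCE_INTEL(...) // compiles away for other vendors
    #endif

    // Entries shared by every OpenCL vendor (Intel and Generic alike).
    #if DNNL_GPU_VENDOR == DNNL_VENDOR_INTEL \
            || DNNL_GPU_VENDOR == DNNL_VENDOR_GENERIC
    #define GPU_INSTANCE_GENERIC(...) GPU_INSTANCE(__VA_ARGS__)
    #else
    #define GPU_INSTANCE_GENERIC(...)
    #endif

With gating of this shape, the vendor decision lives in one shared header, and the per-primitive list files below only name the macros.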

As for enabling OpenCL kernels for CPU, we don't have any plans to do that, and the library doesn't have an architecture for it. If you are interested in that, you may come up with an architecture design and publish a proposal.

And the last thing: we plan to enable the generic vendor for the SYCL runtime in the coming months. As an option, you could wait to see what that design looks like to get a better understanding of what it should look like for OpenCL.

- <ISA_NAME>;<ISA_NAME>;... Includes only selected ISA to be enabled.
Possible values are: GEN9, GEN11, XELP, XEHP, XEHPG, XEHPC, XE2.")

@@ -281,13 +282,13 @@ endif()

set(DNNL_GPU_VENDOR "NONE" CACHE STRING
"When DNNL_GPU_RUNTIME is not NONE DNNL_GPU_VENDOR specifies target GPU
vendor for GPU engines. Can be INTEL (default), NVIDIA or AMD.")
vendor for GPU engines. Can be INTEL (default), GENERIC, NVIDIA or AMD.")

if(NOT DNNL_GPU_RUNTIME STREQUAL "NONE" AND DNNL_GPU_VENDOR STREQUAL "NONE")
@rjoursler (Contributor) · Aug 29, 2024

Random spot to continue the discussion from #2044:

> One thing I noted was that the list of potential kernels for a primitive is fixed at compilation time, based purely on the supported engines.

This isn't quite accurate: the list is of potential primitive implementations. As things currently stand, each implementation contains its own dispatcher, which can implement more logic, including calling different OpenCL kernels or even different primitive implementations.
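
A minimal sketch of what such an in-implementation dispatcher can look like; all the types, names, and checks below are hypothetical stand-ins, not library code:

    #include <string>

    struct problem_t { bool is_nhwc; int channels; };
    enum class status_t { success, unimplemented };

    struct bnorm_impl_t {
        std::string kernel_name;
        // The implementation's own dispatcher: pick one of its OpenCL
        // kernels based on the problem shape.
        status_t init(const problem_t &prb) {
            if (prb.channels <= 0) return status_t::unimplemented;
            if (prb.is_nhwc && prb.channels % 16 == 0)
                kernel_name = "bnorm_nhwc_vec16"; // vectorized variant
            else
                kernel_name = "bnorm_ref"; // fallback
            return status_t::success;
        }
    };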

> I could easily imagine that iGPU and dGPU would want different sort orders entirely

We have not seen that need in practice. Generally, performance differences like this end up being handled by the dispatching logic within the primitive implementation.

> I admit I don't love depending on a particular CLC compiler.

In general, most of the attributes we rely on enable generic programming in C: things like function overloading and automatic type inference. This was chosen to provide better type safety, as the alternative is using macros everywhere.

> What tensions are you finding, and how are you trying to address them?

There are a lot of details to this question, and I will attempt to give a good summary here. To begin with, we currently use two dispatching models within primitive implementations. The first is based on hard-coded heuristics (example). The second method uses a performance model and an implementation list, where the best-scoring implementation under the model is used. All of the OpenCL kernels currently rely on the hard-coded heuristics.

The biggest issue is that these dispatchers are fragile: any change (be it kernel, compiler, runtime, etc.) can cause the heuristics/model to become inaccurate. We have considered adding runtime tuning to avoid this, for example #1764, but concluded it is either too computationally expensive or requires a new API that cannot be used by customers. For the model-based implementations, we generally need a tool to fit the model to observed performance, as quickly and accurately predicting performance from the hardware SKU is virtually impossible. As such, updates are relatively easy to handle: we just need to retrain the model (although this does induce code churn on the kernel/model database). On the other hand, hard-coded heuristics do not scale and require developer intervention.
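
A minimal sketch of the model-based selection path described above, with a hypothetical linear model standing in for a fitted one:

    #include <algorithm>
    #include <vector>

    // Score candidate kernel configurations under a performance model and
    // dispatch to the best. All types and coefficients are stand-ins.
    struct problem_t { double flops, bytes; };
    struct config_t { int simd, unroll; };

    // A real model would be fitted offline to measured timings.
    double predicted_time(const config_t &c, const problem_t &p) {
        return p.flops / (c.simd * c.unroll) + 0.5 * p.bytes;
    }

    config_t select_best(const std::vector<config_t> &candidates,
            const problem_t &p) {
        return *std::min_element(candidates.begin(), candidates.end(),
                [&](const config_t &a, const config_t &b) {
                    return predicted_time(a, p) < predicted_time(b, p);
                });
    }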

Coming back to your goal of sharing OpenCL implementations: in order to make such sharing feasible, we would need dispatching models to be easily generated after implementation changes. As such, this requires one of the following: switching the current OpenCL kernels to performance-based models, inventing a method to fit the (currently hard-coded) heuristics (at which point we are effectively using machine learning to optimize machine learning), or introducing a new methodology. On the other hand, all known methods likely require performance information from the target devices. At the same time, model information is unavailable in the generic context, so a method to handle this is required, be it using a default model or somehow training a custom model. I expect this to be a lot of work; just forking the existing implementations would be easier. So I guess this reduces to the question of whether the improved code sharing is worth the effort.

Reply from the pull request author (Contributor):

Thank you for the background; that's very helpful and certainly gives me some things to ponder. But one brief thing I do want to touch on:

> so I guess this reduces to the question of whether the improved code sharing is worth the effort.

My sense is that there are likely multiple layers at which sharing might make sense. To draw an analogy from experience in Mesa: some classes of hardware commonly lack the same API features, so (for example) once one driver emulates geometry shaders via compute shaders, that emulation tends to get reused on other hardware with the same limitation. I think there's already some level of reusability from the parameterization of the existing CL kernels (e.g. for GRF size) that would map reasonably to other hardware. I think there is likely some level of reuse that makes sense for automatic dispatch tuning, even if that would presently be an open-ended research project.
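
A minimal sketch of that kind of parameterization from the host side; the GRF_SIZE build option is a hypothetical example, not the library's actual flag:

    #include <CL/cl.h>
    #include <string>

    // Specialize a single kernel source for different hardware by feeding
    // device-specific constants through standard OpenCL build options.
    cl_int build_for_device(cl_program prog, cl_device_id dev, int grf_size) {
        const std::string opts = "-DGRF_SIZE=" + std::to_string(grf_size);
        return clBuildProgram(prog, 1, &dev, opts.c_str(), nullptr, nullptr);
    }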

But at a more basic level, the acts of enumerating devices, creating contexts and queues, fencing resources, and dispatching work do not fundamentally change across OpenCL device types. And, more importantly, OpenCL already provides interop APIs for managing those operations across devices and platforms. I feel like at least that much code sharing is self-evidently worth the effort: even if different sets of CL devices end up with widely divergent kernels and schedulers, they will at least be able to communicate with each other.
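
And a minimal sketch of that shared bootstrap path using only standard OpenCL host APIs, with error handling mostly elided for brevity:

    #include <CL/cl.h>
    #include <vector>

    // Enumerate every platform/device pair and create a context and queue
    // for each; nothing here depends on the device vendor.
    int main() {
        cl_uint nplat = 0;
        clGetPlatformIDs(0, nullptr, &nplat);
        std::vector<cl_platform_id> plats(nplat);
        clGetPlatformIDs(nplat, plats.data(), nullptr);
        for (cl_platform_id p : plats) {
            cl_uint ndev = 0;
            if (clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &ndev)
                    != CL_SUCCESS)
                continue;
            std::vector<cl_device_id> devs(ndev);
            clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, ndev, devs.data(), nullptr);
            for (cl_device_id d : devs) {
                cl_int err = CL_SUCCESS;
                cl_context ctx = clCreateContext(
                        nullptr, 1, &d, nullptr, nullptr, &err);
                cl_command_queue q = clCreateCommandQueueWithProperties(
                        ctx, d, nullptr, &err);
                clReleaseCommandQueue(q);
                clReleaseContext(ctx);
            }
        }
        return 0;
    }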

set(DNNL_GPU_VENDOR "INTEL")
endif()

if(NOT "${DNNL_GPU_VENDOR}" MATCHES "^(NONE|INTEL|NVIDIA|AMD)$")
if(NOT "${DNNL_GPU_VENDOR}" MATCHES "^(NONE|GENERIC|INTEL|NVIDIA|AMD)$")
message(FATAL_ERROR "Unsupported GPU vendor: ${DNNL_GPU_VENDOR}")
endif()

@@ -327,6 +328,12 @@ else()
set(DNNL_WITH_SYCL false)
endif()

if(DNNL_GPU_RUNTIME STREQUAL "OCL") # ... OR DNNL_CPU_RUNTIME STREQUAL "OCL")
set(DNNL_WITH_OCL true)
else()
set(DNNL_WITH_OCL false)
endif()

# =============
# Miscellaneous
# =============
8 changes: 4 additions & 4 deletions src/CMakeLists.txt
@@ -125,6 +125,10 @@ endif()

add_subdirectory(common)

if(DNNL_WITH_SYCL OR DNNL_GPU_RUNTIME STREQUAL "OCL")
add_subdirectory(xpu)
endif()

if(NOT DNNL_CPU_RUNTIME STREQUAL "NONE")
add_subdirectory(cpu)
endif()
@@ -133,10 +137,6 @@ if(NOT DNNL_GPU_RUNTIME STREQUAL "NONE")
add_subdirectory(gpu)
endif()

if(DNNL_WITH_SYCL OR DNNL_GPU_RUNTIME STREQUAL "OCL")
add_subdirectory(xpu)
endif()

if(DNNL_WITH_SYCL)
# Enable linking SYCL kernels.
if(DNNL_SYCL_CUDA)
14 changes: 8 additions & 6 deletions src/gpu/CMakeLists.txt
@@ -26,12 +26,14 @@ set_property(GLOBAL APPEND PROPERTY DNNL_LIB_DEPS

add_subdirectory(generic)

if(DNNL_GPU_VENDOR STREQUAL "INTEL")
add_definitions_with_host_compiler(-DNGEN_CPP11)
add_definitions_with_host_compiler(-DNGEN_SAFE)
add_definitions_with_host_compiler(-DNGEN_NEO_INTERFACE)
add_definitions_with_host_compiler(-DNGEN_NO_OP_NAMES)
add_definitions_with_host_compiler(-DNGEN_WINDOWS_COMPAT)
if(DNNL_WITH_OCL)
if(DNNL_GPU_VENDOR STREQUAL "INTEL")
add_definitions_with_host_compiler(-DNGEN_CPP11)
add_definitions_with_host_compiler(-DNGEN_SAFE)
add_definitions_with_host_compiler(-DNGEN_NEO_INTERFACE)
add_definitions_with_host_compiler(-DNGEN_NO_OP_NAMES)
add_definitions_with_host_compiler(-DNGEN_WINDOWS_COMPAT)
endif()
add_subdirectory(intel)
endif()

19 changes: 10 additions & 9 deletions src/gpu/gpu_batch_normalization_list.cpp
@@ -16,13 +16,14 @@

#include "gpu/gpu_impl_list.hpp"

#if DNNL_GPU_VENDOR == DNNL_VENDOR_INTEL
#include "gpu/intel/ocl/bnorm/gen9_batch_normalization.hpp"
#include "gpu/intel/ocl/bnorm/nhwc_batch_normalization.hpp"
#include "gpu/intel/ocl/bnorm/ref_batch_normalization.hpp"
#include "gpu/intel/ocl/bnorm/reusable_bnorm.hpp"
#include "gpu/intel/ocl/bnorm/simple_bnorm.hpp"

#if DNNL_GPU_VENDOR == DNNL_VENDOR_INTEL
#include "gpu/intel/ocl/bnorm/gen9_batch_normalization.hpp"
#include "gpu/intel/ocl/bnorm/nhwc_batch_normalization.hpp"

#ifdef DNNL_DEV_MODE
#include "gpu/intel/ocl/bnorm/nhwc_reusable.hpp"
#endif
@@ -52,9 +53,9 @@ const std::map<pk_impl_key_t, std::vector<impl_list_item_t>>
GPU_INSTANCE_INTEL_DEVMODE(intel::ocl::nhwc_reusable_batch_normalization_fwd_t)
GPU_INSTANCE_INTEL(intel::ocl::nhwc_batch_normalization_fwd_t)
GPU_INSTANCE_INTEL(intel::ocl::gen9_batch_normalization_fwd_t)
GPU_INSTANCE_INTEL(intel::ocl::simple_batch_normalization_fwd_t)
GPU_INSTANCE_INTEL(intel::ocl::reusable_batch_normalization_fwd_t)
GPU_INSTANCE_INTEL(intel::ocl::ref_batch_normalization_fwd_t)
GPU_INSTANCE_GENERIC(intel::ocl::simple_batch_normalization_fwd_t)
GPU_INSTANCE_GENERIC(intel::ocl::reusable_batch_normalization_fwd_t)
GPU_INSTANCE_GENERIC(intel::ocl::ref_batch_normalization_fwd_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_batch_normalization_fwd_t)
GPU_INSTANCE_AMD(amd::miopen_batch_normalization_fwd_t)
GPU_INSTANCE_GENERIC_SYCL(generic::sycl::ref_batch_normalization_fwd_t)
@@ -64,9 +65,9 @@ const std::map<pk_impl_key_t, std::vector<impl_list_item_t>>
GPU_INSTANCE_INTEL_DEVMODE(intel::ocl::nhwc_reusable_batch_normalization_bwd_t)
GPU_INSTANCE_INTEL(intel::ocl::nhwc_batch_normalization_bwd_t)
GPU_INSTANCE_INTEL(intel::ocl::gen9_batch_normalization_bwd_t)
GPU_INSTANCE_INTEL(intel::ocl::simple_batch_normalization_bwd_t)
GPU_INSTANCE_INTEL(intel::ocl::reusable_batch_normalization_bwd_t)
GPU_INSTANCE_INTEL(intel::ocl::ref_batch_normalization_bwd_t)
GPU_INSTANCE_GENERIC(intel::ocl::simple_batch_normalization_bwd_t)
GPU_INSTANCE_GENERIC(intel::ocl::reusable_batch_normalization_bwd_t)
GPU_INSTANCE_GENERIC(intel::ocl::ref_batch_normalization_bwd_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_batch_normalization_bwd_t)
GPU_INSTANCE_AMD(amd::miopen_batch_normalization_bwd_t)
GPU_INSTANCE_GENERIC_SYCL(generic::sycl::ref_batch_normalization_bwd_t)
8 changes: 4 additions & 4 deletions src/gpu/gpu_binary_list.cpp
@@ -15,11 +15,11 @@
*******************************************************************************/

#include "gpu/gpu_impl_list.hpp"
#include "gpu/intel/ocl/multi_po_reorder_binary.hpp"
#include "gpu/intel/ocl/simple_binary.hpp"

#if DNNL_GPU_VENDOR == DNNL_VENDOR_INTEL
#include "gpu/intel/ocl/gen9_binary.hpp"
#include "gpu/intel/ocl/multi_po_reorder_binary.hpp"
#include "gpu/intel/ocl/simple_binary.hpp"
#endif

#if DNNL_GPU_VENDOR == DNNL_VENDOR_NVIDIA
@@ -39,9 +39,9 @@ namespace {

// clang-format off
constexpr impl_list_item_t impl_list[] = REG_BINARY_P({
GPU_INSTANCE_INTEL(intel::ocl::multi_po_reorder_binary)
GPU_INSTANCE_GENERIC(intel::ocl::multi_po_reorder_binary)
GPU_INSTANCE_INTEL(intel::ocl::gen9_binary_t)
GPU_INSTANCE_INTEL(intel::ocl::simple_binary_t)
GPU_INSTANCE_GENERIC(intel::ocl::simple_binary_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_binary_t)
GPU_INSTANCE_AMD(amd::miopen_binary_t)
GPU_INSTANCE_GENERIC_SYCL(generic::sycl::ref_binary_t)
12 changes: 6 additions & 6 deletions src/gpu/gpu_concat_list.cpp
@@ -17,12 +17,12 @@
#include "gpu/gpu_impl_list.hpp"

#include "gpu/generic/ref_concat.hpp"

#if DNNL_GPU_VENDOR == DNNL_VENDOR_INTEL
#include "gpu/intel/ocl/gen9_concat.hpp"
#include "gpu/intel/ocl/multi_concat.hpp"
#include "gpu/intel/ocl/reusable_simple_concat.hpp"
#include "gpu/intel/ocl/simple_concat.hpp"

#if DNNL_GPU_VENDOR == DNNL_VENDOR_INTEL
#include "gpu/intel/ocl/gen9_concat.hpp"
#endif

namespace dnnl {
Expand All @@ -33,10 +33,10 @@ namespace {

// clang-format off
constexpr impl_list_item_t impl_list[] = REG_CONCAT_P({
GPU_CONCAT_INSTANCE_INTEL(intel::ocl::reusable_simple_concat_t)
GPU_CONCAT_INSTANCE_INTEL(intel::ocl::simple_concat_t)
GPU_CONCAT_INSTANCE_GENERIC(intel::ocl::reusable_simple_concat_t)
GPU_CONCAT_INSTANCE_GENERIC(intel::ocl::simple_concat_t)
GPU_CONCAT_INSTANCE_INTEL(intel::ocl::gen9_concat_t)
GPU_CONCAT_INSTANCE_INTEL(intel::ocl::multi_concat_t)
GPU_CONCAT_INSTANCE_GENERIC(intel::ocl::multi_concat_t)
GPU_CONCAT_INSTANCE_GENERIC(generic::ref_concat_t)
nullptr,
});
8 changes: 4 additions & 4 deletions src/gpu/gpu_convolution_list.cpp
@@ -15,12 +15,12 @@
*******************************************************************************/

#include "gpu/gpu_impl_list.hpp"
#include "gpu/intel/ocl/ref_convolution.hpp"

#if DNNL_GPU_VENDOR == DNNL_VENDOR_INTEL
#include "gpu/intel/jit/binary_format.hpp"
#include "gpu/intel/jit/conv/gen_convolution.hpp"
#include "gpu/intel/ocl/gen9_wino_convolution.hpp"
#include "gpu/intel/ocl/ref_convolution.hpp"

#ifdef DNNL_DEV_MODE
#include "gpu/intel/jit/v2/conv/gen_convolution.hpp"
@@ -50,23 +50,23 @@ const std::map<pk_impl_key_t, std::vector<impl_list_item_t>>
GPU_INSTANCE_INTEL_DEVMODE(intel::jit::v2::conv::gen_convolution_fwd_t)
GPU_INSTANCE_INTEL(intel::jit::gen_convolution_fwd_t)
GPU_INSTANCE_INTEL(intel::ocl::gen9_wino_convolution_fwd_t)
GPU_INSTANCE_INTEL_REF(intel::ocl::ref_convolution_fwd_t)
GPU_INSTANCE_REF(intel::ocl::ref_convolution_fwd_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_convolution_fwd_t)
GPU_INSTANCE_AMD(amd::miopen_convolution_fwd_t)
nullptr,
}},
{{backward_data}, REG_BWD_D_PK({
GPU_INSTANCE_INTEL_DEVMODE(intel::jit::v2::conv::gen_convolution_bwd_data_t)
GPU_INSTANCE_INTEL(intel::jit::gen_convolution_bwd_data_t)
GPU_INSTANCE_INTEL_REF(intel::ocl::ref_convolution_bwd_data_t)
GPU_INSTANCE_REF(intel::ocl::ref_convolution_bwd_data_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_convolution_bwd_data_t)
GPU_INSTANCE_AMD(amd::miopen_convolution_bwd_data_t)
nullptr,
})},
{{backward_weights}, REG_BWD_PK({
GPU_INSTANCE_INTEL_DEVMODE(intel::jit::v2::conv::gen_convolution_bwd_weights_t)
GPU_INSTANCE_INTEL(intel::jit::gen_convolution_bwd_weights_t)
GPU_INSTANCE_INTEL_REF(intel::ocl::ref_convolution_bwd_weights_t)
GPU_INSTANCE_REF(intel::ocl::ref_convolution_bwd_weights_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_convolution_bwd_weights_t)
GPU_INSTANCE_AMD(amd::miopen_convolution_bwd_weights_t)
nullptr,
9 changes: 3 additions & 6 deletions src/gpu/gpu_deconvolution_list.cpp
@@ -15,10 +15,7 @@
*******************************************************************************/

#include "gpu/gpu_impl_list.hpp"

#if DNNL_GPU_VENDOR == DNNL_VENDOR_INTEL
#include "gpu/intel/ocl/convolution_deconvolution.hpp"
#endif

#if DNNL_GPU_VENDOR == DNNL_VENDOR_NVIDIA
#include "gpu/nvidia/cudnn_deconvolution.hpp"
@@ -39,14 +36,14 @@ using namespace dnnl::impl::prop_kind;
const std::map<pk_impl_key_t, std::vector<impl_list_item_t>>
impl_list_map REG_DECONV_P({
{{forward}, {
GPU_INSTANCE_INTEL(intel::ocl::convolution_deconvolution_fwd_t)
GPU_INSTANCE_GENERIC(intel::ocl::convolution_deconvolution_fwd_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_deconvolution_fwd_t)
GPU_INSTANCE_AMD(amd::miopen_deconvolution_fwd_t)
nullptr,
}},
{{backward}, REG_BWD_PK({
GPU_INSTANCE_INTEL(intel::ocl::convolution_deconvolution_bwd_data_t)
GPU_INSTANCE_INTEL(intel::ocl::convolution_deconvolution_bwd_weights_t)
GPU_INSTANCE_GENERIC(intel::ocl::convolution_deconvolution_bwd_data_t)
GPU_INSTANCE_GENERIC(intel::ocl::convolution_deconvolution_bwd_weights_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_deconvolution_bwd_data_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_deconvolution_bwd_weights_t)
GPU_INSTANCE_AMD(amd::miopen_deconvolution_bwd_data_t)
6 changes: 3 additions & 3 deletions src/gpu/gpu_eltwise_list.cpp
@@ -15,10 +15,10 @@
*******************************************************************************/

#include "gpu/gpu_impl_list.hpp"
#include "gpu/intel/ocl/ref_eltwise.hpp"

#if DNNL_GPU_VENDOR == DNNL_VENDOR_INTEL
#include "gpu/intel/ocl/gen9_eltwise.hpp"
#include "gpu/intel/ocl/ref_eltwise.hpp"
#endif

#if DNNL_GPU_VENDOR == DNNL_VENDOR_NVIDIA
@@ -42,15 +42,15 @@ const std::map<pk_impl_key_t, std::vector<impl_list_item_t>>
impl_list_map REG_ELTWISE_P({
{{forward}, {
GPU_INSTANCE_INTEL(intel::ocl::gen9_eltwise_fwd_t)
GPU_INSTANCE_INTEL(intel::ocl::ref_eltwise_fwd_t)
GPU_INSTANCE_GENERIC(intel::ocl::ref_eltwise_fwd_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_eltwise_fwd_t)
GPU_INSTANCE_AMD(amd::miopen_eltwise_fwd_t)
GPU_INSTANCE_GENERIC_SYCL(generic::sycl::ref_sycl_eltwise_fwd_t)
nullptr,
}},
{{backward}, REG_BWD_PK({
GPU_INSTANCE_INTEL(intel::ocl::gen9_eltwise_bwd_t)
GPU_INSTANCE_INTEL(intel::ocl::ref_eltwise_bwd_t)
GPU_INSTANCE_GENERIC(intel::ocl::ref_eltwise_bwd_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_eltwise_bwd_t)
GPU_INSTANCE_AMD(amd::miopen_eltwise_bwd_t)
GPU_INSTANCE_GENERIC_SYCL(generic::sycl::ref_sycl_eltwise_bwd_t)
6 changes: 3 additions & 3 deletions src/gpu/gpu_gemm_list.cpp
@@ -17,14 +17,14 @@
#include "common/compiler_workarounds.hpp"

#include "gpu/gpu_impl_list.hpp"
#include "gpu/intel/ocl/gemm/gemm_with_post_ops.hpp"
#include "gpu/intel/ocl/gemm/ref_gemm.hpp"

#if DNNL_GPU_VENDOR == DNNL_VENDOR_INTEL
#include "gpu/intel/jit/binary_format.hpp"

#include "gpu/intel/jit/gemm/gen_gemm.hpp"
#include "gpu/intel/jit/gemm/xe_hp_systolic_gemm.hpp"
#include "gpu/intel/ocl/gemm/gemm_with_post_ops.hpp"
#include "gpu/intel/ocl/gemm/ref_gemm.hpp"

#ifdef DNNL_DEV_MODE
#include "gpu/intel/ocl/gemm/conv_gemm.hpp"
@@ -44,7 +44,7 @@ constexpr impl_list_item_t impl_list[] = {
GPU_INSTANCE_INTEL(intel::jit::xe_hp_systolic_gemm_t)
GPU_INSTANCE_INTEL(intel::ocl::gemm_with_post_ops_t)
GPU_INSTANCE_INTEL(intel::jit::gen_gemm_t)
GPU_INSTANCE_INTEL_REF(intel::ocl::ref_gemm_t)
GPU_INSTANCE_REF(intel::ocl::ref_gemm_t)
nullptr,
};
// clang-format on
7 changes: 2 additions & 5 deletions src/gpu/gpu_group_normalization_list.cpp
@@ -15,10 +15,7 @@
*******************************************************************************/

#include "gpu/gpu_impl_list.hpp"

#if DNNL_GPU_VENDOR == DNNL_VENDOR_INTEL
#include "gpu/intel/ocl/ref_group_normalization.hpp"
#endif

namespace dnnl {
namespace impl {
@@ -31,12 +28,12 @@ using namespace dnnl::impl::prop_kind;
const std::map<pk_impl_key_t, std::vector<impl_list_item_t>>
impl_list_map REG_GNORM_P({
{{forward}, {
GPU_INSTANCE_INTEL(intel::ocl::ref_group_normalization_fwd_t)
GPU_INSTANCE_GENERIC(intel::ocl::ref_group_normalization_fwd_t)
nullptr,
}
},
{{backward}, REG_BWD_PK({
GPU_INSTANCE_INTEL(intel::ocl::ref_group_normalization_bwd_t)
GPU_INSTANCE_GENERIC(intel::ocl::ref_group_normalization_bwd_t)
nullptr,
})
},
5 changes: 2 additions & 3 deletions src/gpu/gpu_impl_list.hpp
@@ -150,10 +150,9 @@ namespace gpu {

// Instance macros that are enabled only when REF is disabled
#ifdef DNNL_DISABLE_GPU_REF_KERNELS
#define GPU_INSTANCE_INTEL_REF(...)
#define GPU_INSTANCE_REF(...)
#else
#define GPU_INSTANCE_INTEL_REF(...) \
DNNL_GPU_INTEL_ONLY(GPU_INSTANCE(__VA_ARGS__))
#define GPU_INSTANCE_REF(...) GPU_INSTANCE(__VA_ARGS__)
#endif

#define DECLARE_IMPL_LIST(kind) \
10 changes: 5 additions & 5 deletions src/gpu/gpu_inner_product_list.cpp
@@ -15,12 +15,12 @@
*******************************************************************************/

#include "gpu/gpu_impl_list.hpp"
#include "gpu/intel/ocl/convolution_inner_product.hpp"
#include "gpu/intel/ocl/ref_inner_product.hpp"

#if DNNL_GPU_VENDOR == DNNL_VENDOR_INTEL
#include "gpu/intel/ocl/convolution_inner_product.hpp"
#include "gpu/intel/ocl/gemm_inner_product.hpp"
#include "gpu/intel/ocl/gemm_post_ops_inner_product.hpp"
#include "gpu/intel/ocl/ref_inner_product.hpp"
#endif

#if DNNL_GPU_VENDOR == DNNL_VENDOR_NVIDIA
@@ -45,7 +45,7 @@ const std::map<pk_impl_key_t, std::vector<impl_list_item_t>>
{{forward}, {
GPU_INSTANCE_INTEL(intel::ocl::gemm_inner_product_fwd_t)
GPU_INSTANCE_INTEL(intel::ocl::convolution_inner_product_fwd_t)
GPU_INSTANCE_INTEL_REF(intel::ocl::ref_inner_product_fwd_t)
GPU_INSTANCE_REF(intel::ocl::ref_inner_product_fwd_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_gemm_inner_product_fwd_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_conv_inner_product_fwd_t)
GPU_INSTANCE_AMD(amd::miopen_gemm_inner_product_fwd_t)
@@ -54,8 +54,8 @@ const std::map<pk_impl_key_t, std::vector<impl_list_item_t>>
{{backward}, REG_BWD_PK({
GPU_INSTANCE_INTEL(intel::ocl::gemm_inner_product_bwd_data_t)
GPU_INSTANCE_INTEL(intel::ocl::gemm_inner_product_bwd_weights_t)
GPU_INSTANCE_INTEL_REF(intel::ocl::ref_inner_product_bwd_data_t)
GPU_INSTANCE_INTEL_REF(intel::ocl::ref_inner_product_bwd_weights_t)
GPU_INSTANCE_REF(intel::ocl::ref_inner_product_bwd_data_t)
GPU_INSTANCE_REF(intel::ocl::ref_inner_product_bwd_weights_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_gemm_inner_product_bwd_data_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_gemm_inner_product_bwd_weights_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_conv_inner_product_bwd_data_t)