Start enabling the "generic" OpenCL vendor #2019
base: main
```diff
@@ -146,6 +146,7 @@ set(DNNL_ENABLE_PRIMITIVE_GPU_ISA "ALL" CACHE STRING
     at build time. Regardless of value chosen, reference OpenCL-based
     implementations will always be available. Valid values:
     - ALL (the default). Includes all ISA to be enabled.
+    - NONE. Includes no ISAs, just the generic kernels.
     - <ISA_NAME>;<ISA_NAME>;... Includes only selected ISA to be enabled.
     Possible values are: GEN9, GEN11, XELP, XEHP, XEHPG, XEHPC, XE2.")
```
```diff
@@ -281,13 +282,13 @@ endif()

 set(DNNL_GPU_VENDOR "NONE" CACHE STRING
     "When DNNL_GPU_RUNTIME is not NONE DNNL_GPU_VENDOR specifies target GPU
-    vendor for GPU engines. Can be INTEL (default), NVIDIA or AMD.")
+    vendor for GPU engines. Can be INTEL (default), GENERIC, NVIDIA or AMD.")

 if(NOT DNNL_GPU_RUNTIME STREQUAL "NONE" AND DNNL_GPU_VENDOR STREQUAL "NONE")
```
Review comment (random spot to continue the discussion from #2044):
This isn't quite accurate: the list is for potential primitive implementations. As things currently stand, each implementation contains its own dispatcher, which can be used to implement more logic, including calling different OpenCL kernels or even different primitive implementations.

We have not seen that need in practice. Generally, performance differences like this end up being handled by the dispatching logic within the primitive implementation.

In general, most of the attributes we rely on enable generic programming in C: things like function overloading and automatic type inference. This was chosen for better type safety, as the alternative is using macros everywhere.
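Presumably this refers to mechanisms like the Clang `overloadable` attribute available in OpenCL C. A minimal hypothetical sketch (not taken from oneDNN's kernels; `load_val` is an invented helper) of how such attributes give C-style kernels overloading without macros:

```c
// Hypothetical OpenCL C sketch: the `overloadable` attribute lets the
// compiler pick the right variant from argument types, rather than a
// macro selecting the implementation.
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__attribute__((overloadable)) float load_val(__global const float *p) {
    return *p;
}
__attribute__((overloadable)) float load_val(__global const half *p) {
    return convert_float(*p); // promoted to float on load
}

__kernel void scale_f32(__global const float *src, __global float *dst, float alpha) {
    size_t i = get_global_id(0);
    dst[i] = alpha * load_val(src + i); // resolves to the float overload
}

__kernel void scale_f16(__global const half *src, __global float *dst, float alpha) {
    size_t i = get_global_id(0);
    dst[i] = alpha * load_val(src + i); // resolves to the half overload
}
```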
There are a lot of details to this question, and I will attempt to give a good summary here. To begin with, we currently use two dispatching models within primitive implementations. The first is based on hard-coded heuristics (example). The second uses a performance model and an implementation list, where the best-scoring implementation under the model is used. All of the OpenCL kernels currently rely on the hard-coded heuristics.

The biggest issue is that these dispatchers are fragile: any change (be it kernel, compiler, runtime, etc.) can cause the heuristics/model to become inaccurate. We have considered adding runtime tuning to avoid this, for example #1764, but concluded it is either too computationally expensive or requires a new API that cannot be used by customers. For the model-based implementations, we generally need a tool to fit the model to observed performance, as quickly and accurately predicting performance from the hardware SKU is virtually impossible. As such, updates are relatively easy to handle: we just need to retrain the model (although this does induce code churn on the kernel/model database). Hard-coded heuristics, on the other hand, do not scale and require developer intervention.

Coming back to your goal of sharing OpenCL implementations: to make such sharing feasible, we would need dispatching models that can be easily regenerated after implementation changes. This requires one of the following: switching the current OpenCL kernels to performance-based models, inventing a method to fit the (currently hard-coded) heuristics (at which point we are effectively using machine learning to optimize machine learning), or introducing a new methodology. In addition, all known methods likely require performance information from the target devices, yet model information is unavailable in the generic context, so a way to handle that is also required, be it a default model or somehow training a custom one. I expect this to be a lot of work; just forking the existing implementations would be easier. So I guess this reduces to the question of whether the improved code sharing is worth the effort.
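To make the two models concrete, here is a small illustrative C sketch; all names and types are invented, not oneDNN internals. The heuristic path returns a kernel directly from a hard-coded threshold, while the model-based path scores every candidate in an implementation list and takes the best:

```c
/* Illustrative sketch of the two dispatching models described above. */
#include <stddef.h>

typedef struct {
    size_t m, n, k; /* problem shape */
} desc_t;

typedef void (*kernel_fn)(const desc_t *);

/* Model 1: hard-coded heuristic. Cheap, but any kernel/compiler/runtime
 * change can silently invalidate the threshold. */
kernel_fn dispatch_heuristic(const desc_t *d, kernel_fn small_impl,
        kernel_fn large_impl) {
    return (d->m * d->n < 16384) ? small_impl : large_impl;
}

/* Model 2: score each candidate with a fitted performance model and pick
 * the best; retraining the model replaces hand-tuning when things change. */
typedef struct {
    kernel_fn impl;
    double (*predicted_perf)(const desc_t *); /* higher is better */
} candidate_t;

kernel_fn dispatch_model(const desc_t *d, const candidate_t *list, size_t count) {
    kernel_fn best = NULL;
    double best_score = -1.0;
    for (size_t i = 0; i < count; ++i) {
        double s = list[i].predicted_perf(d);
        if (s > best_score) {
            best_score = s;
            best = list[i].impl;
        }
    }
    return best;
}
```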
Reply:

Thank you for the background, that's very helpful and certainly gives me some things to ponder. But one brief thing I do want to touch on:

My sense is that there are likely multiple layers at which sharing might make sense. To draw an analogy from experience in Mesa, some classes of hardware commonly lack the same API features, so (for example) once one driver emulates geometry shaders via compute shaders, that emulation tends to get reused on other hardware with the same limitation. I think there is already some reusability in the parameterization of the existing CL kernels for, e.g., GRF size, which would map reasonably to other hardware, and there is likely some level of reuse that makes sense for automatic dispatch tuning, even if that is presently an open-ended research project.

But at a more basic level, the acts of enumerating devices, creating contexts and queues, fencing resources, and dispatching work do not fundamentally change across OpenCL device types. And, more importantly, OpenCL already provides interop APIs for managing those operations across devices and platforms. At least that much code sharing seems self-evidently worth the effort: even if different sets of CL devices end up with widely divergent kernels and schedulers, they will at least be able to communicate with each other.
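As a sketch of that vendor-agnostic baseline, using only the standard OpenCL host API (error handling abbreviated), platform/device enumeration and context/queue creation are the same calls no matter whose implementation is underneath:

```c
#include <CL/cl.h>
#include <stdio.h>

int main(void) {
    cl_platform_id plats[8];
    cl_uint nplat = 0;
    clGetPlatformIDs(8, plats, &nplat);
    if (nplat > 8) nplat = 8; /* nplat reports the total available */

    for (cl_uint i = 0; i < nplat; ++i) {
        cl_device_id dev;
        cl_uint ndev = 0;
        if (clGetDeviceIDs(plats[i], CL_DEVICE_TYPE_GPU, 1, &dev, &ndev)
                        != CL_SUCCESS
                || ndev == 0)
            continue;

        char name[256] = {0};
        clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("GPU device: %s\n", name);

        cl_int err = CL_SUCCESS;
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
        cl_command_queue queue =
                clCreateCommandQueueWithProperties(ctx, dev, NULL, &err);

        /* ... build programs and enqueue kernels here ... */

        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
    }
    return 0;
}
```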
```diff
     set(DNNL_GPU_VENDOR "INTEL")
 endif()

-if(NOT "${DNNL_GPU_VENDOR}" MATCHES "^(NONE|INTEL|NVIDIA|AMD)$")
+if(NOT "${DNNL_GPU_VENDOR}" MATCHES "^(NONE|GENERIC|INTEL|NVIDIA|AMD)$")
     message(FATAL_ERROR "Unsupported GPU vendor: ${DNNL_GPU_VENDOR}")
 endif()
```
```diff
@@ -327,6 +328,12 @@ else()
     set(DNNL_WITH_SYCL false)
 endif()

+if(DNNL_GPU_RUNTIME STREQUAL "OCL") # ... OR DNNL_CPU_RUNTIME STREQUAL "OCL")
+    set(DNNL_WITH_OCL true)
+else()
+    set(DNNL_WITH_OCL false)
+endif()
+
 # =============
 # Miscellaneous
 # =============
```
Review comment (random spot):
Thank you for taking the effort to enable a generic vendor for the OpenCL runtime. When it comes to external contributions that impact some aspects of the library's architecture, a formal proposal (RFC) is required. Please make yourself familiar with the contribution guidelines and the RFC process.

While going through the changes in this pull request I found a few things that have to be addressed; your proposal is expected to cover them:
- The generic kernels should reside in `gpu/generic/ocl`: move them there and enable them for the Intel and Generic vendors.
- When `DNNL_GPU_VENDOR` is `GENERIC` and `DNNL_GPU_RUNTIME` is `OCL` there are no engines that could be used, therefore a new engine for the generic vendor and OpenCL runtime has to be introduced (see the sketch after this list).
- `gpu/intel` assumes that `DNNL_GPU_VENDOR` is `INTEL`. When `DNNL_GPU_VENDOR` is `INTEL` and `DNNL_GPU_RUNTIME` is `SYCL` the `xpu/ocl` code must be enabled (the new condition that you introduced seems to break the rule).
- The `DNNL_ENABLE_PRIMITIVE_GPU_ISA` option should be enabled (and defined) only for the Intel vendor and should be ignored otherwise.

As for enabling OpenCL kernels for CPU, we don't have any plans to do that and the library doesn't have any architecture to do that. If you are interested in that you may come up with an architecture design and publish a proposal.
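For context on the engine bullet above, a sketch of how user code ultimately reaches such an engine through oneDNN's existing public C API; the review's point is that a GENERIC+OCL build currently has no engine implementation behind this call:

```c
#include <dnnl.h>

int main(void) {
    dnnl_engine_t engine;
    /* Existing public entry point; a GENERIC+OCL build needs a GPU
     * engine implementation for this to succeed. */
    if (dnnl_engine_create(&engine, dnnl_gpu, /*index=*/0) != dnnl_success)
        return 1; /* no usable GPU engine in this build */

    dnnl_stream_t stream;
    dnnl_stream_create(&stream, engine, dnnl_stream_default_flags);

    /* ... create and execute primitives on (engine, stream) ... */

    dnnl_stream_destroy(stream);
    dnnl_engine_destroy(engine);
    return 0;
}
```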
And the last thing: we plan to enable the generic vendor for the SYCL runtime in the coming months. As an option, you could wait to see what that design looks like to get a better understanding of what it should look like for OpenCL.