
Start enabling the "generic" OpenCL vendor #2019

Draft · wants to merge 9 commits into main

Conversation

@nwnk (Contributor) commented Jul 31, 2024

This adds a GENERIC vendor to the set of choices for OpenCL, and treats it as the Intel OpenCL driver minus the gen-specific jit kernels. It still effectively requires the NEO driver at the moment, but the goal is to run on any OpenCL device that supports the necessary OpenCL version and extensions. It's not working yet but it's not far off:

91% tests passed, 8 tests failed out of 86

Total Test time (real) = 13492.32 sec

The following tests FAILED:
	 34 - test_iface_attr_quantization_gpu (NUMERICAL)
	 35 - test_iface_attr_quantization_buffer_gpu (NUMERICAL)
	 50 - test_iface_wino_convolution_gpu (Failed)
	 51 - test_iface_wino_convolution_buffer_gpu (Failed)
	 56 - test_inner_product_forward_gpu (Failed)
	 57 - test_inner_product_forward_buffer_gpu (Failed)
	 62 - test_matmul_gpu (SEGFAULT)
	 63 - test_matmul_buffer_gpu (SEGFAULT)

The segfaults are where I killed the tests because they ran for over an hour without completing. Some random observations and thoughts about where I'd like to go with this:

  • Ideally I think I would like to build the OpenCL support for both Intel and "generic", in the sense that the Intel-specific jit stuff would still get built and just disable itself on platforms where it doesn't work. I'm not sure how invasive to get as far as changing how dispatch works.
  • I think most of the OpenCL driver probably belongs up in src/xpu
  • We should be able to use OpenCL for the CPU device too

Feedback welcomed.
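For context, the capability check implied above (running only on OpenCL devices that report the required version and extensions) might look roughly like the following minimal sketch using the standard OpenCL API; it is not code from this PR, and the specific version and extension checked are just placeholders.

```cpp
// Sketch only: query an OpenCL device's version and extension string so a
// backend can disable itself on devices it cannot support. Not from this PR.
#include <CL/cl.h>
#include <string>
#include <vector>

static std::string device_info_str(cl_device_id dev, cl_device_info what) {
    size_t sz = 0;
    if (clGetDeviceInfo(dev, what, 0, nullptr, &sz) != CL_SUCCESS) return {};
    std::vector<char> buf(sz);
    if (clGetDeviceInfo(dev, what, sz, buf.data(), nullptr) != CL_SUCCESS) return {};
    return std::string(buf.data());
}

static bool device_has_extension(cl_device_id dev, const std::string &ext) {
    // CL_DEVICE_EXTENSIONS is a space-separated list of extension names.
    return device_info_str(dev, CL_DEVICE_EXTENSIONS).find(ext) != std::string::npos;
}

static bool generic_backend_supported(cl_device_id dev) {
    // Placeholder requirements: anything newer than OpenCL 1.x plus cl_khr_fp16.
    // CL_DEVICE_VERSION strings look like "OpenCL 3.0 <vendor info>".
    const std::string ver = device_info_str(dev, CL_DEVICE_VERSION);
    return ver.rfind("OpenCL 1.", 0) == std::string::npos
            && device_has_extension(dev, "cl_khr_fp16");
}
```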

@nwnk nwnk requested review from a team as code owners July 31, 2024 16:18
@nwnk nwnk changed the title RFC: Start enabling the "generic" OpenCL vendor rfc: Start enabling the "generic" OpenCL vendor Jul 31, 2024
@@ -146,6 +146,7 @@ set(DNNL_ENABLE_PRIMITIVE_GPU_ISA "ALL" CACHE STRING
at build time. Regardless of value chosen, reference OpenCL-based
implementations will always be available. Valid values:
- ALL (the default). Includes all ISA to be enabled.
- NONE. Includes no ISAs, just the generic kernels.
@densamoilov (Contributor) commented Aug 1, 2024

(random spot)

Thank you for taking the effort to enable a generic vendor for the OpenCL runtime. When it comes to external contributions that impact some aspects of the library's architecture, a formal proposal (RFC) is required. Please familiarize yourself with the contribution guidelines and the RFC process.

While going through the changes in this pull request I found a few things that have to be addressed. Your proposal is expected to cover them.

  • According to the GPU code organization structure the generic OpenCL kernels should reside in gpu/generic/ocl.
  • The OpenCL primitive implementations that you marked as generic may be generic at this point, but there is absolutely no guarantee that they will stay that way. The right way to make them generic is to move them to gpu/generic/ocl and enable them for both the Intel and Generic vendors.
  • The library architecture requires that one engine per runtime (CPU and GPU) must be enabled. Currently, there is only one OpenCL engine, and it is Intel specific. When DNNL_GPU_VENDOR is GENERIC and DNNL_GPU_RUNTIME is OCL there are no engines that could be used; therefore, a new engine for the generic vendor and OpenCL runtime has to be introduced.
  • There shouldn't be any vendor-specific compile-time checks in a vendor-specific space. All code that resides in gpu/intel assumes that DNNL_GPU_VENDOR is INTEL.
  • When DNNL_GPU_VENDOR is INTEL and DNNL_GPU_RUNTIME is SYCL the xpu/ocl code must be enabled (the new condition that you introduced seems to break the rule).
  • I think the DNNL_ENABLE_PRIMITIVE_GPU_ISA option should be enabled (and defined) only for the Intel vendor and should be ignored otherwise
  • USM is not part of the OpenCL standard and therefore we cannot assume that all OpenCL vendors support it. The USM utility functions must be conditionally enabled for the vendors that have the support.
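As a rough illustration of the last point, USM utilities could be gated on both a vendor-level build switch and a runtime extension query. The DNNL_VENDOR_SUPPORTS_USM macro below is invented for this sketch and is not oneDNN's actual mechanism; the runtime check relies on the fact that USM is exposed through the cl_intel_unified_shared_memory extension.

```cpp
// Hypothetical sketch of gating USM utilities per vendor; not oneDNN's layout.
#include <CL/cl.h>
#include <string>
#include <vector>

#if defined(DNNL_VENDOR_SUPPORTS_USM) // invented compile-time switch
bool usm_supported(cl_device_id dev) {
    // Check the device extension string for the Intel USM extension.
    size_t sz = 0;
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, 0, nullptr, &sz);
    std::vector<char> exts(sz);
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, sz, exts.data(), nullptr);
    return std::string(exts.data()).find("cl_intel_unified_shared_memory")
            != std::string::npos;
}
#else
// Generic vendors fall back to cl_mem buffers; USM helpers are not built.
bool usm_supported(cl_device_id) { return false; }
#endif
```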

As for enabling OpenCL kernels for CPU, we don't have any plans to do that, and the library doesn't have an architecture for it. If you are interested, you may come up with an architecture design and publish a proposal.

And the last thing: we plan to enable the generic vendor for the SYCL runtime in the coming months. As an option, you could wait to see what that design looks like to get a better understanding of what it should look like for OpenCL.

@vpirogov vpirogov added the platform:gpu-generic Codeowner: @oneapi-src/onednn-gpu-generic label Aug 1, 2024
@vpirogov vpirogov changed the title rfc: Start enabling the "generic" OpenCL vendor Start enabling the "generic" OpenCL vendor Aug 2, 2024
@vpirogov vpirogov marked this pull request as draft August 2, 2024 17:51
@@ -281,13 +282,13 @@ endif()

set(DNNL_GPU_VENDOR "NONE" CACHE STRING
"When DNNL_GPU_RUNTIME is not NONE DNNL_GPU_VENDOR specifies target GPU
vendor for GPU engines. Can be INTEL (default), NVIDIA or AMD.")
vendor for GPU engines. Can be INTEL (default), GENERIC, NVIDIA or AMD.")

if(NOT DNNL_GPU_RUNTIME STREQUAL "NONE" AND DNNL_GPU_VENDOR STREQUAL "NONE")
@rjoursler (Contributor) commented Aug 29, 2024

Random spot to continue the discussion from #2044:

> One thing I noted was that the list of potential kernels for a primitive is fixed at compilation time, based purely on the supported engines.

This isn't quite accurate: the list is of potential primitive implementations. As things currently stand, each implementation contains its own dispatcher, which can implement additional logic, including calling different OpenCL kernels or even different primitive implementations.
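As a caricature of that per-implementation dispatching (all names below are invented; this is not the actual oneDNN interface), a hard-coded heuristic might look like:

```cpp
// Hypothetical illustration of an implementation-local dispatcher choosing
// among kernel variants with a hard-coded heuristic; names are invented.
enum class kernel_variant { reference, blocked, vectorized };

struct conv_config {
    int ic, oc, spatial; // problem sizes
    int simd_width;      // queried from the device
    bool has_subgroups;  // device capability
};

kernel_variant select_kernel(const conv_config &cfg) {
    // Hard-coded heuristic: prefer the vectorized kernel when the device
    // exposes subgroups and the channel counts divide the SIMD width.
    if (cfg.has_subgroups && cfg.ic % cfg.simd_width == 0
            && cfg.oc % cfg.simd_width == 0)
        return kernel_variant::vectorized;
    if (cfg.spatial >= 64) return kernel_variant::blocked;
    return kernel_variant::reference;
}
```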

> I could easily imagine that iGPU and dGPU would want different sort orders entirely

We have not seen that need in practice. Generally, performance differences like this end up being handled by the dispatching logic within the primitive implementation.

> I admit I don't love depending on a particular CLC compiler.

In general, most of the attributes we rely on enable generic programming in C, things like function overloading and automatic type inference. This was chosen to enable better type safety, as the alternative is using macros everywhere.

> What tensions are you finding, and how are you trying to address them?

There are a lot of details to this question, and I will attempt to give a good summary here. To begin with, we currently use two dispatching models within primitive implementations. The first is based on hard-coded heuristics (example). The second method uses a performance model and an implementation list where the best-scoring implementation under the model is used. All of the OpenCL kernels currently rely on the hard-coded heuristics. The biggest issue is that these dispatchers are fragile: any change (be it kernel, compiler, runtime, etc.) can cause the heuristics/model to be inaccurate. We have considered adding runtime tuning to avoid this, for example #1764, but concluded it is either too computationally expensive or requires a new API that cannot be used by customers. For the model-based implementations, we generally need a tool to fit the model to observed performance, as quickly and accurately predicting performance from the hardware SKU is virtually impossible. As such, updates are relatively easy to handle: we just need to retrain the model (although this does induce code churn on the kernel/model database). On the other hand, hard-coded heuristics do not scale and require developer intervention.
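A minimal sketch of the model-based selection described above, with invented names and an invented model form, would be something like:

```cpp
// Hypothetical sketch: each candidate kernel carries a fitted cost model, and
// the lowest predicted cost wins. Names and the model form are invented.
#include <functional>
#include <limits>
#include <string>
#include <vector>

struct problem_desc { long flops; long bytes_moved; };

struct candidate {
    std::string name;
    // A fitted model mapping problem features to predicted runtime (ns).
    std::function<double(const problem_desc &)> predict_ns;
};

const candidate *pick_best(const std::vector<candidate> &list,
        const problem_desc &prb) {
    const candidate *best = nullptr;
    double best_ns = std::numeric_limits<double>::max();
    for (const auto &c : list) {
        const double ns = c.predict_ns(prb); // evaluate the performance model
        if (ns < best_ns) { best_ns = ns; best = &c; }
    }
    return best; // retraining the models updates the scores, not this code
}
```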

Coming back to your goal of sharing OpenCL implementations: to make such sharing feasible, we would need dispatching models to be easily regenerated after implementation changes. As such, this requires one of the following: switching the current OpenCL kernels to performance-based models, inventing a method to fit the (currently hard-coded) heuristics (at which point we are effectively using machine learning to optimize machine learning), or introducing a new methodology. On the other hand, all known methods likely require performance information from the target devices. At the same time, model information is unavailable in the generic context, so a method to handle this is required, be it using a default model or somehow training a custom one. I expect this to be a lot of work; just forking the existing implementations would be easier, so I guess this reduces to the question of whether the improved code sharing is worth the effort.

@nwnk (Contributor, Author) commented

Thank you for the background, that's very helpful and certainly gives me some things to ponder. But one brief thing I do want to touch on:

> so I guess this reduces to the question of whether the improved code sharing is worth the effort.

My sense is that there are likely multiple layers at which sharing might make sense. To draw an analogy from experience in Mesa: some classes of hardware commonly lack the same API features, so (for example) once one driver emulates geometry shaders via compute shaders, that emulation tends to get reused on other hardware with the same limitation. I think there's already some level of reusability from the parameterization of the existing CL kernels for, e.g., GRF size that would map reasonably to other hardware. I think there is likely some level of reuse that makes sense for automatic dispatch tuning, even if that would presently be an open-ended research project.

But at a more basic level, the acts of enumerating devices, creating contexts and queues, fencing resources, and dispatching work do not fundamentally change across OpenCL device types. And, more importantly, OpenCL already provides interop APIs for managing those operations across devices and platforms. I feel like at least that much code sharing is self-evidently worth the effort; even if different sets of CL devices end up with widely divergent kernels and schedulers, they will at least be able to communicate with each other.
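For reference, the vendor-agnostic bring-up described here is essentially the standard OpenCL sequence. A minimal sketch (error handling elided, not tied to any oneDNN code):

```cpp
// Minimal sketch of vendor-agnostic OpenCL bring-up: enumerate platforms and
// devices, then create a context and an in-order queue. Error handling elided.
#include <CL/cl.h>
#include <vector>

cl_command_queue make_queue(cl_device_id &dev_out, cl_context &ctx_out) {
    cl_uint nplat = 0;
    clGetPlatformIDs(0, nullptr, &nplat);
    std::vector<cl_platform_id> plats(nplat);
    clGetPlatformIDs(nplat, plats.data(), nullptr);

    // Take the first GPU device found; a real engine would let the user pick.
    cl_device_id dev = nullptr;
    for (cl_platform_id p : plats) {
        cl_uint ndev = 0;
        if (clGetDeviceIDs(p, CL_DEVICE_TYPE_GPU, 1, &dev, &ndev) == CL_SUCCESS
                && ndev > 0)
            break;
    }

    cl_int err = CL_SUCCESS;
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, &err);
    cl_command_queue q
            = clCreateCommandQueueWithProperties(ctx, dev, nullptr, &err);
    dev_out = dev;
    ctx_out = ctx;
    return q;
}
```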

Labels
platform:gpu-generic Codeowner: @oneapi-src/onednn-gpu-generic

4 participants