
Start enabling the "generic" OpenCL vendor #2019

Draft · wants to merge 9 commits into main

Conversation

@nwnk (Contributor) commented Jul 31, 2024

This adds a GENERIC vendor to the set of choices for OpenCL, and treats it as the Intel OpenCL driver minus the gen-specific jit kernels. It still effectively requires the NEO driver at the moment, but the goal is to run on any OpenCL device that supports the necessary OpenCL version and extensions. It's not working yet but it's not far off:

91% tests passed, 8 tests failed out of 86

Total Test time (real) = 13492.32 sec

The following tests FAILED:
	 34 - test_iface_attr_quantization_gpu (NUMERICAL)
	 35 - test_iface_attr_quantization_buffer_gpu (NUMERICAL)
	 50 - test_iface_wino_convolution_gpu (Failed)
	 51 - test_iface_wino_convolution_buffer_gpu (Failed)
	 56 - test_inner_product_forward_gpu (Failed)
	 57 - test_inner_product_forward_buffer_gpu (Failed)
	 62 - test_matmul_gpu (SEGFAULT)
	 63 - test_matmul_buffer_gpu (SEGFAULT)

The segfaults are where I killed the tests because they ran for over an hour without completing. Some random observations and thoughts about where I'd like to go with this:

  • Ideally I think I would like to build the OpenCL support for both Intel and "generic", in the sense that the Intel-specific jit stuff would still get built and just disable itself on platforms where it doesn't work. I'm not sure how invasive to get as far as changing how dispatch works.
  • I think most of the OpenCL driver probably belongs up in src/xpu
  • We should be able to use OpenCL for the CPU device too

Feedback welcomed.
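For context, the capability check implied above (running only on OpenCL devices that report the required version and extensions) might look roughly like the following minimal sketch using the standard OpenCL API; it is not code from this PR, and the specific version and extension checked are just placeholders.

```cpp
// Sketch only: query an OpenCL device's version and extension string so a
// backend can disable itself on devices it cannot support. Not from this PR.
#include <CL/cl.h>
#include <string>
#include <vector>

static std::string device_info_str(cl_device_id dev, cl_device_info what) {
    size_t sz = 0;
    if (clGetDeviceInfo(dev, what, 0, nullptr, &sz) != CL_SUCCESS) return {};
    std::vector<char> buf(sz);
    if (clGetDeviceInfo(dev, what, sz, buf.data(), nullptr) != CL_SUCCESS) return {};
    return std::string(buf.data());
}

static bool device_has_extension(cl_device_id dev, const std::string &ext) {
    // CL_DEVICE_EXTENSIONS is a space-separated list of extension names.
    return device_info_str(dev, CL_DEVICE_EXTENSIONS).find(ext) != std::string::npos;
}

static bool generic_backend_supported(cl_device_id dev) {
    // Placeholder requirements: anything newer than OpenCL 1.x plus cl_khr_fp16.
    // CL_DEVICE_VERSION strings look like "OpenCL 3.0 <vendor info>".
    const std::string ver = device_info_str(dev, CL_DEVICE_VERSION);
    return ver.rfind("OpenCL 1.", 0) == std::string::npos
            && device_has_extension(dev, "cl_khr_fp16");
}
```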

@nwnk nwnk requested review from a team as code owners July 31, 2024 16:18
@nwnk nwnk changed the title RFC: Start enabling the "generic" OpenCL vendor rfc: Start enabling the "generic" OpenCL vendor Jul 31, 2024
@@ -146,6 +146,7 @@ set(DNNL_ENABLE_PRIMITIVE_GPU_ISA "ALL" CACHE STRING
at build time. Regardless of value chosen, reference OpenCL-based
implementations will always be available. Valid values:
- ALL (the default). Includes all ISA to be enabled.
- NONE. Includes no ISAs, just the generic kernels.
@densamoilov (Contributor) commented Aug 1, 2024

(random spot)

Thank you for taking the effort to enable a generic vendor for the OpenCL runtime. When it comes to external contributions that impact some aspects of the library's architecture, a formal proposal (RFC) is required. Please familiarize yourself with the contribution guidelines and the RFC process.

While going through the changes in this pull request I found a few things that have to be addressed. Your proposal is expected to cover them.

  • According to the GPU code organization structure the generic OpenCL kernels should reside in gpu/generic/ocl.
  • The OpenCL primitive implementations that you marked as generic may be generic at this point, but there is absolutely no guarantee that they will stay that way. The right way to make them generic is to move them to gpu/generic/ocl and enable them for both the Intel and Generic vendors.
  • The library architecture requires that one engine per runtime (CPU and GPU) must be enabled. Currently, there is only one OpenCL engine, and it is Intel specific. When DNNL_GPU_VENDOR is GENERIC and DNNL_GPU_RUNTIME is OCL there are no engines that could be used; therefore, a new engine for the generic vendor and OpenCL runtime has to be introduced.
  • There shouldn't be any vendor-specific compile-time checks in a vendor-specific space. All code that resides in gpu/intel assumes that DNNL_GPU_VENDOR is INTEL.
  • When DNNL_GPU_VENDOR is INTEL and DNNL_GPU_RUNTIME is SYCL the xpu/ocl code must be enabled (the new condition that you introduced seems to break the rule).
  • I think the DNNL_ENABLE_PRIMITIVE_GPU_ISA option should be enabled (and defined) only for the Intel vendor and should be ignored otherwise
  • USM is not part of the OpenCL standard and therefore we cannot assume that all OpenCL vendors support it. The USM utility functions must be conditionally enabled for the vendors that have the support.
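As a rough illustration of the last point, USM utilities could be gated on both a vendor-level build switch and a runtime extension query. The DNNL_VENDOR_SUPPORTS_USM macro below is invented for this sketch and is not oneDNN's actual mechanism; the runtime check relies on the fact that USM is exposed through the cl_intel_unified_shared_memory extension.

```cpp
// Hypothetical sketch of gating USM utilities per vendor; not oneDNN's layout.
#include <CL/cl.h>
#include <string>
#include <vector>

#if defined(DNNL_VENDOR_SUPPORTS_USM) // invented compile-time switch
bool usm_supported(cl_device_id dev) {
    // Check the device extension string for the Intel USM extension.
    size_t sz = 0;
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, 0, nullptr, &sz);
    std::vector<char> exts(sz);
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, sz, exts.data(), nullptr);
    return std::string(exts.data()).find("cl_intel_unified_shared_memory")
            != std::string::npos;
}
#else
// Generic vendors fall back to cl_mem buffers; USM helpers are not built.
bool usm_supported(cl_device_id) { return false; }
#endif
```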

As for enabling OpenCL kernels for CPU, we don't have any plans to do that, and the library doesn't have an architecture for it. If you are interested, you may come up with an architecture design and publish a proposal.

And the last thing: we plan to enable the generic vendor for the SYCL runtime in the coming months. As an option, you could wait to see what that design looks like to get a better understanding of what it should look like for OpenCL.

@vpirogov vpirogov added the platform:gpu-generic Codeowner: @oneapi-src/onednn-gpu-generic label Aug 1, 2024
@vpirogov vpirogov changed the title rfc: Start enabling the "generic" OpenCL vendor Start enabling the "generic" OpenCL vendor Aug 2, 2024
@vpirogov vpirogov marked this pull request as draft August 2, 2024 17:51
@@ -281,13 +282,13 @@ endif()

set(DNNL_GPU_VENDOR "NONE" CACHE STRING
"When DNNL_GPU_RUNTIME is not NONE DNNL_GPU_VENDOR specifies target GPU
vendor for GPU engines. Can be INTEL (default), NVIDIA or AMD.")
vendor for GPU engines. Can be INTEL (default), GENERIC, NVIDIA or AMD.")

if(NOT DNNL_GPU_RUNTIME STREQUAL "NONE" AND DNNL_GPU_VENDOR STREQUAL "NONE")
@rjoursler (Contributor) commented Aug 29, 2024

Random spot to continue the discussion from #2044:

> One thing I noted was that the list of potential kernels for a primitive is fixed at compilation time, based purely on the supported engines.

This isn't quite accurate: the list is of potential primitive implementations. As things currently stand, each implementation contains its own dispatcher, which can implement additional logic, including calling different OpenCL kernels or even different primitive implementations.
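As a caricature of that per-implementation dispatching (all names below are invented; this is not the actual oneDNN interface), a hard-coded heuristic might look like:

```cpp
// Hypothetical illustration of an implementation-local dispatcher choosing
// among kernel variants with a hard-coded heuristic; names are invented.
enum class kernel_variant { reference, blocked, vectorized };

struct conv_config {
    int ic, oc, spatial; // problem sizes
    int simd_width;      // queried from the device
    bool has_subgroups;  // device capability
};

kernel_variant select_kernel(const conv_config &cfg) {
    // Hard-coded heuristic: prefer the vectorized kernel when the device
    // exposes subgroups and the channel counts divide the SIMD width.
    if (cfg.has_subgroups && cfg.ic % cfg.simd_width == 0
            && cfg.oc % cfg.simd_width == 0)
        return kernel_variant::vectorized;
    if (cfg.spatial >= 64) return kernel_variant::blocked;
    return kernel_variant::reference;
}
```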

> I could easily imagine that iGPU and dGPU would want different sort orders entirely

We have not seen that need in practice. Generally, performance differences like this end up being handled by the dispatching logic within the primitive implementation.

> I admit I don't love depending on a particular CLC compiler.

In general, most of the attributes we rely on enable generic programming in C, things like function overloading and automatic type inference. This was chosen to enable better type safety, as the alternative is using macros everywhere.

> What tensions are you finding, and how are you trying to address them?

There are a lot of details to this question, and I will attempt to give a good summary here. To begin with, we currently use two dispatching models within primitive implementations. The first is based on hard-coded heuristics (example). The second method uses a performance model and an implementation list where the best-scoring implementation under the model is used. All of the OpenCL kernels currently rely on the hard-coded heuristics. The biggest issue is that these dispatchers are fragile: any change (be it kernel, compiler, runtime, etc.) can cause the heuristics/model to be inaccurate. We have considered adding runtime tuning to avoid this, for example #1764, but concluded it is either too computationally expensive or requires a new API that cannot be used by customers. For the model-based implementations, we generally need a tool to fit the model to observed performance, as quickly and accurately predicting performance from the hardware SKU is virtually impossible. As such, updates are relatively easy to handle: we just need to retrain the model (although this does induce code churn on the kernel/model database). On the other hand, hard-coded heuristics do not scale and require developer intervention.
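A minimal sketch of the model-based selection described above, with invented names and an invented model form, would be something like:

```cpp
// Hypothetical sketch: each candidate kernel carries a fitted cost model, and
// the lowest predicted cost wins. Names and the model form are invented.
#include <functional>
#include <limits>
#include <string>
#include <vector>

struct problem_desc { long flops; long bytes_moved; };

struct candidate {
    std::string name;
    // A fitted model mapping problem features to predicted runtime (ns).
    std::function<double(const problem_desc &)> predict_ns;
};

const candidate *pick_best(const std::vector<candidate> &list,
        const problem_desc &prb) {
    const candidate *best = nullptr;
    double best_ns = std::numeric_limits<double>::max();
    for (const auto &c : list) {
        const double ns = c.predict_ns(prb); // evaluate the performance model
        if (ns < best_ns) { best_ns = ns; best = &c; }
    }
    return best; // retraining the models updates the scores, not this code
}
```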

Coming back to your goal of sharing OpenCL implementations: to make such sharing feasible, we would need dispatching models to be easily regenerated after implementation changes. As such, this requires one of the following: switching the current OpenCL kernels to performance-based models, inventing a method to fit the (currently hard-coded) heuristics (at which point we are effectively using machine learning to optimize machine learning), or introducing a new methodology. On the other hand, all known methods likely require performance information from the target devices. At the same time, model information is unavailable in the generic context, so a method to handle this is required, be it using a default model or somehow training a custom one. I expect this to be a lot of work; just forking the existing implementations would be easier, so I guess this reduces to the question of whether the improved code sharing is worth the effort.

@nwnk (Contributor, Author) commented

Thank you for the background, that's very helpful and certainly gives me some things to ponder. But one brief thing I do want to touch on:

> so I guess this reduces to the question of whether the improved code sharing is worth the effort.

My sense is that there are likely multiple layers at which sharing might make sense. To draw an analogy from experience in Mesa: some classes of hardware commonly lack the same API features, so (for example) once one driver emulates geometry shaders via compute shaders, that emulation tends to get reused on other hardware with the same limitation. I think there's already some level of reusability from the parameterization of the existing CL kernels for, e.g., GRF size that would map reasonably to other hardware. I think there is likely some level of reuse that makes sense for automatic dispatch tuning, even if that would presently be an open-ended research project.

But at a more basic level, the acts of enumerating devices, creating contexts and queues, fencing resources, and dispatching work do not fundamentally change across OpenCL device types. And, more importantly, OpenCL already provides interop APIs for managing those operations across devices and platforms. I feel like at least that much code sharing is self-evidently worth the effort; even if different sets of CL devices end up with widely divergent kernels and schedulers, they will at least be able to communicate with each other.
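For reference, the vendor-agnostic bring-up described here is essentially the standard OpenCL sequence. A minimal sketch (error handling elided, not tied to any oneDNN code):

```cpp
// Minimal sketch of vendor-agnostic OpenCL bring-up: enumerate platforms and
// devices, then create a context and an in-order queue. Error handling elided.
#include <CL/cl.h>
#include <vector>

cl_command_queue make_queue(cl_device_id &dev_out, cl_context &ctx_out) {
    cl_uint nplat = 0;
    clGetPlatformIDs(0, nullptr, &nplat);
    std::vector<cl_platform_id> plats(nplat);
    clGetPlatformIDs(nplat, plats.data(), nullptr);

    // Take the first GPU device found; a real engine would let the user pick.
    cl_device_id dev = nullptr;
    for (cl_platform_id p : plats) {
        cl_uint ndev = 0;
        if (clGetDeviceIDs(p, CL_DEVICE_TYPE_GPU, 1, &dev, &ndev) == CL_SUCCESS
                && ndev > 0)
            break;
    }

    cl_int err = CL_SUCCESS;
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, &err);
    cl_command_queue q
            = clCreateCommandQueueWithProperties(ctx, dev, nullptr, &err);
    dev_out = dev;
    ctx_out = ctx;
    return q;
}
```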

Labels
platform:gpu-generic Codeowner: @oneapi-src/onednn-gpu-generic

4 participants