Proposal on enabling generic GPU vendor #2044

Merged

Conversation


@densamoilov densamoilov added the RFC A design document label Aug 20, 2024
@vpirogov vpirogov added this to the v3.6 milestone Aug 20, 2024
@nwnk
Contributor

nwnk commented Aug 20, 2024

I think I'd prefer to think of accelerator support not in terms of vendors, but interfaces or backends. You're bridging SYCL to either CUDA, or ROCm, or OpenCL, or Vulkan, or Level Zero, or... Each of those backends potentially supports N devices from M vendors. It happens that the OpenCL bit of oneDNN is positioned to support more than one vendor sooner than the other ones, but at the moment there is really just one vendor per backend, and that makes it tempting to use "vendor" when I think I wish we'd say "API".

SYCL devices have to be created specifically for particular types.

oneDNN supports two engine kinds: CPU and GPU. According to the UXL Foundation strategy, the generic kernels are expected to run on different GPUs and accelerators.
Contributor

Do you know if there are kernel support differences between GPUs and accelerators in SYCL (e.g. memory model, functionality that is not mandated, ...)?

Contributor Author

Generic SYCL is meant to be a common denominator for all devices, so I don't think there should be any differences in the SYCL programming model. However, different devices may have different capabilities, and therefore some SYCL features may or may not be available on them. For example, the USM memory model may not be supported by some devices, so users have to check the corresponding device aspect to query the capability prior to allocating USM memory.
We should also be careful with those cases inside the library. For example, we always allocate a SYCL buffer for the scratchpad, which is always supported, but if we decide to use USM memory for it we will need to do that properly.
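
A minimal sketch of the kind of aspect check I mean (the helper name and fallback policy here are illustrative, not actual oneDNN internals):

```cpp
#include <sycl/sycl.hpp>

// Hypothetical helper: allocate device USM only when the device reports the
// corresponding aspect; otherwise the caller falls back to a SYCL buffer.
void *alloc_scratchpad_usm(const sycl::device &dev, const sycl::context &ctx,
        size_t bytes) {
    if (!dev.has(sycl::aspect::usm_device_allocations))
        return nullptr; // caller should use a buffer-based scratchpad instead
    return sycl::malloc_device(bytes, dev, ctx);
}
```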

@densamoilov
Contributor Author

densamoilov commented Aug 22, 2024

I think I'd prefer to think of accelerator support not in terms of vendors, but interfaces or backends. You're bridging SYCL to either CUDA, or ROCm, or OpenCL, or Vulkan, or Level Zero, or... Each of those backends potentially supports N devices from M vendors. It happens that the OpenCL bit of oneDNN is positioned to support more than one vendor sooner than the other ones, but at the moment there is really just one vendor per backend, and that makes it tempting to use "vendor" when I think I wish we'd say "API".

I'm not sure I understand what you mean by the "API" here. Maybe you can elaborate on that a bit more?
Meanwhile, let me clarify our vision regarding accelerator (and GPU) support. oneDNN supports two GPU runtimes: OpenCL and SYCL. It also provides an interoperability API for both of them (OpenCL and SYCL). We don't want to provide any vendor-specific API, as this would defeat our goal of generalizing APIs across vendors.

Both of the GPU runtimes (OpenCL and SYCL) may support multiple vendors, and we want to provide the external contributors with a way to enable them. The generic kernels are meant to be a reference and to close the functional gap for the newly enabled vendors. The generic kernels are not meant to provide performance benefits. The reason we add the GENERIC vendor is to give the external contributors something to start with.

@nwnk
Contributor

nwnk commented Aug 23, 2024

I'm not sure I understand what you mean by the "API" here. Maybe you can elaborate on that a bit more? Meanwhile, let me clarify our vision regarding accelerator (and GPU) support. oneDNN supports two GPU runtimes: OpenCL and SYCL. It also provides an interoperability API for both of them (OpenCL and SYCL). We don't want to provide any vendor-specific API, as this would defeat our goal of generalizing APIs across vendors.

I think there are two things here: the interface oneDNN exposes to applications, and the interfaces oneDNN consumes in order to accelerate its own operation. For the former, yes, OpenCL and SYCL are the interop APIs and adding more would be an error, on this we agree.

For the latter...

Both of the GPU runtimes (OpenCL and SYCL) may support multiple vendors, and we want to provide the external contributors with a way to enable them. The generic kernels are meant to be a reference and to close the functional gap for the newly enabled vendors. The generic kernels are not meant to provide performance benefits. The reason we add the GENERIC vendor is to give the external contributors something to start with.

The issue I have is that what we are presently calling the "Intel" OpenCL engine is like 96% generic already. As far as I can tell the only thing that truly requires an Intel GPU is when ngen is set to emit assembly instead of clc. At heart the requirement is not that it's an Intel GPU, it's that the OpenCL device supports cl_intel_subgroups and cl_intel_unified_shared_memory and maybe something else I'm forgetting. Admittedly, the set of OpenCL implementations that meet those criteria is quite small. But Mesa isn't far off, and has drivers for the vast majority of contemporary GPUs. Mesa is central to my OS's GPU support already, so a little effort spent on Mesa to enable one engine in oneDNN for N vendors is more attractive to me than an unknown amount of effort for N vendor paths through oneDNN.

As a potential external contributor, who is considering how to integrate oneDNN into my operating system, I like the idea of having an engine that generically supports many GPUs well. The existing OpenCL engine seems to me to be the best path to that breadth of support, as virtually all of it is written to APIs that already have broad support (or seem easy to add to rusticl); so much so that I would almost suggest renaming src/gpu/intel to src/gpu/ocl. Maybe those kernels are "just a reference", but if the OCL engine wasn't artificially limited to Intel GPUs, those kernels would very likely run faster on any GPU we might encounter than on the CPU it'd be next to, so, that's where I would like to start.

@rjoursler
Contributor

rjoursler commented Aug 24, 2024

@nwnk, just to add feedback as someone helping develop OpenCL kernels, I agree with you on many points, but I don't think we can expose the implementations en masse in quite the way you are proposing. The key issue is that performance is a requirement within oneDNN. In particular, the more targets an implementation supports, the harder it is to modify and validate performance. This induces a contention between code maintainability, where fewer implementations are better, and performance maintainability, where a smaller validation space is better. Splitting on device vendors provides a natural mechanism to narrow scope, as we expect significant divergence in performance characteristics between hardware from different vendors and, in some cases, we can rely on vendor libraries such as cuDNN. I expect shared OpenCL infrastructure will be helpful (and I am looking forward to your upcoming RFC), but we need to be careful in how we balance the code/performance contention across a broad suite of developers, hardware, goals, requirements, etc. It is already a challenge to maintain this balance within only Intel's existing GPUs.

In case it helps, here are some pointers for the current implementation:

The issue I have is that what we are presently calling the "Intel" OpenCL engine is like 96% generic already.

You are right that a lot of the OpenCL code should be semi-portable. One of the key issues is that the kernel implementations are better viewed as code generators. At a high level, we have an implementation dispatcher which takes a problem as input and outputs a configuration. The OpenCL code generators then JIT-compile the configuration before it is eventually executed. I expect there is a lot of OpenCL code generator infrastructure that can be shared. The dispatcher is a problem though. By necessity, it can be dependent on the target device, the target compiler, and the OpenCL code generator. Any design with shared OpenCL infrastructure needs to take this into account so that shared infrastructure changes do not adversely interact with dispatchers.
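
To make that flow concrete, here is a purely illustrative sketch; none of these names are oneDNN's actual types or functions:

```cpp
#include <iostream>
#include <string>

// Hypothetical types, just to illustrate the dispatcher -> generator -> JIT split.
struct problem_t { int m, n, k; };               // the operation to execute
struct config_t { int tile_m, tile_n, simd; };   // what the dispatcher decides

// Device/compiler dependent: the same problem may map to a different
// configuration on a different target.
config_t dispatch(const problem_t &prb) {
    if (prb.m < 32) return {8, 32, 8};  // e.g. small shapes take another path
    return {32, 32, 16};
}

// Largely shareable: only sees the configuration and emits OpenCL C source,
// which would then be JIT-compiled (e.g. via clBuildProgram) and executed.
std::string generate_ocl_source(const config_t &cfg) {
    return "__kernel void gemm_" + std::to_string(cfg.tile_m) + "x"
            + std::to_string(cfg.tile_n) + "(/* args */) { /* body */ }";
}

int main() {
    config_t cfg = dispatch({1024, 1024, 1024});
    std::cout << generate_ocl_source(cfg) << "\n";
}
```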

At heart the requirement is not that it's an Intel GPU, it's that the OpenCL device support cl_intel_subgroups and cl_intel_unified_shared_memory

The Intel-specific features we rely on should be listed in device_info.hpp (if you find any missing, please let us know). For kernel development, the cl_intel_subgroups extension is the most commonly used. As we only validate against the intel-compute-runtime, you may find that we are missing some feature guards that happen to always hold with the intel-compute-runtime. Another portability concern is that many OpenCL kernels rely on clang attributes.
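
For reference, guards of that kind boil down to a standard OpenCL extension query; a minimal sketch (not oneDNN's actual device_info code) could look like:

```cpp
#include <CL/cl.h>
#include <string>
#include <vector>

// Minimal sketch of an extension guard: check the device extension string for
// a given extension name before taking a path that depends on it.
bool device_has_extension(cl_device_id dev, const std::string &ext) {
    size_t size = 0;
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, 0, nullptr, &size);
    std::vector<char> buf(size);
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, size, buf.data(), nullptr);
    return std::string(buf.data(), buf.size()).find(ext) != std::string::npos;
}

// Usage: only take the subgroup-based path when the device reports support.
// bool has_subgroups = device_has_extension(dev, "cl_intel_subgroups");
```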

@densamoilov
Contributor Author

densamoilov commented Aug 26, 2024

@nwnk, I think Roy covered the part about managing the kernels and the requirements on them pretty well. I just want to add a few things. I understand the desire to reuse the existing kernels to avoid investing additional time and effort in development and maintenance. However, oneDNN has parts that are developed and maintained by external contributors more or less independently, so we have to provide them with a certain level of autonomy. This is why we've decided to prioritize providing an independent space for development over minimizing duplication via the vendor mechanism.

@nwnk
Contributor

nwnk commented Aug 26, 2024

@rjoursler:

Splitting on device vendors provides a natural mechanism to narrow scope, as we expect significant divergence in performance characteristics between hardware from different vendors and, in some cases, we can rely on vendor libraries such as cuDNN. I expect shared OpenCL infrastructure will be helpful (and I am looking forward to your upcoming RFC), but we need to be careful in how we balance the code/performance contention across a broad suite of developers, hardware, goals, requirements, etc. It is already a challenge to maintain this balance within only Intel's existing GPUs.

Maybe those challenges could inform how we progress here. What tensions are you finding, and how are you trying to address them?

One thing I noted was that the list of potential kernels for a primitive is fixed at compilation time, based purely on the supported engines. This does allow you to put fast paths first, but it means the list itself has to be fast-path-sorted for every device it might encounter. I could easily imagine that iGPU and dGPU would want different sort orders entirely, and one way to address that might be to build the list at engine setup time instead, with different lists for UMA vs PCIE, or whatever other differentiator matters most. That kind of flexibility would almost be required (I feel) for supporting non-Intel GPUs in the OCL backend, and if it would help the Intel case as well, so much the better.
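
A rough sketch of what I mean, with entirely made-up names, might look like:

```cpp
#include <functional>
#include <vector>

// Hypothetical sketch: build the implementation list when the engine is created,
// ordering it by what actually matters for that device. Not oneDNN's real API.
struct engine_info_t { bool unified_memory; };  // iGPU vs dGPU, say
using impl_entry_t = std::function<bool(/* try to create a primitive */)>;

std::vector<impl_entry_t> build_impl_list(const engine_info_t &info) {
    std::vector<impl_entry_t> list;
    if (info.unified_memory) {
        // prefer implementations that avoid staging copies on integrated GPUs
        // list.push_back(uma_friendly_impl);
    } else {
        // prefer implementations tuned for discrete-memory devices
        // list.push_back(dgpu_tuned_impl);
    }
    // generic/reference implementations always go last as the fallback
    // list.push_back(reference_impl);
    return list;
}
```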

Another portability concern is that many OpenCL kernels rely on clang attributes.

I admit I don't love depending on a particular CLC compiler. But, in the event, rusticl uses clang as well. So to the extent that those extensions are implemented through to clang's SPIR-V output, it should just be a matter of handling them in rusticl. I've already seen a few we don't entirely handle, and I suspect it'll be a release or two before Mesa is really capable of running oneDNN out of the box, but I'm not seeing clang tell me it doesn't know how to handle an attribute, so I might be safe there. Once I get to enabling the SYCL runtime I'm pretty sure there will be more divergence to handle between Intel's LLVM and upstream, but I'm keeping my sights on just OpenCL for now.

@rjoursler
Contributor

rjoursler commented Aug 27, 2024

@nwnk, to try and keep discussions relevant to the build interface proposed in this RFC, I plan to answer most of your questions about implementation details in #2019; in this comment I want to check that we have addressed your concerns with regard to this RFC.

As a potential external contributor, who is considering how to integrate oneDNN into my operating system, I like the idea of having an engine that generically supports many GPUs well.

From my understanding of your comment, it seems to me like a build with DNNL_GPU_VENDOR=GENERIC and DNNL_GPU_RUNTIME=OCL generally covers this case, and, while there are a lot of implementation considerations for what OpenCL infrastructure should be shared, that discussion is largely orthogonal to the build interface proposed here.

Mesa is central to my OS's GPU support already, so a little effort spent on Mesa to enable one engine in oneDNN for N vendors is more attractive to me than an unknown amount of effort for N vendor paths through oneDNN.

To me, this sounds like we would need to enable multi-vendor builds. This is different from the current model, as most frameworks integrating with oneDNN create a new library per device vendor. There is definitely a discussion to be had here. While a generic vendor does add a question of which implementation to use for a device that fits multiple vendors, it seems like a simple priority list suffices. Because of this, I do not see any issues with the proposed interface, but future work will be required.

But, in the event, rusticl uses

We currently do not have a concept of multiple runtime implementations within oneDNN. This sounds like a new concept of a runtime vendor that I am not sure how best to integrate, and it likely requires another RFC to discuss.

Overall, my understanding is that there is definitely more design/enablement work to support your use case, but that the specific proposal here is not problematic. Does this seem right to you?

@nwnk
Contributor

nwnk commented Aug 28, 2024

Overall, my understanding is that there is definitely more design/enablement work to support your use case, but that the specific proposal here is not problematic. Does this seem right to you?

Yes, I have no objection to this proposal; it seems like good progress toward a goal we all share. Thank you (all) for indulging my discussion here, it's been very helpful.

@densamoilov
Contributor Author

densamoilov commented Aug 28, 2024

Thanks everyone for the discussion. The changes proposed in this RFC have been implemented and are available on main, targeting v3.6.

@densamoilov densamoilov merged commit a913d51 into oneapi-src:rfcs Aug 28, 2024
1 check passed