Add CUDA wrapper capability. #714

Open · bpachev wants to merge 2 commits into main

Conversation

bpachev commented Sep 13, 2024

This pull request adds an optional capability to generate CUDA-wrapped versions of the tabulate tensor functions. The wrappers are called at runtime to obtain the modified source code for each tabulate tensor, which can then be passed to the NVIDIA runtime compiler (NVRTC). This was originally developed by James Trotter (https://www.sciencedirect.com/science/article/pii/S0167819123000571) and then modified to work with the current version of FFCx.
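
As a rough sketch of the intended runtime flow (the wrapper name `tabulate_tensor_cuda_source` below is hypothetical; the real names and signatures are whatever the generated code defines), the CUDA source returned by a wrapper would be handed to NVRTC roughly like this:

```cpp
// Illustrative only: compile a CUDA-wrapped tabulate_tensor kernel at runtime.
// `tabulate_tensor_cuda_source` is a stand-in for a hypothetical generated
// wrapper; the NVRTC calls are the standard NVRTC C API.
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <nvrtc.h>

// Stand-in for the hypothetical generated wrapper; the real one would return
// the CUDA-wrapped tabulate_tensor source produced by FFCx.
extern "C" const char* tabulate_tensor_cuda_source(void)
{
  return "extern \"C\" __global__ void tabulate_tensor(double* A) { A[0] = 0.0; }";
}

int main()
{
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, tabulate_tensor_cuda_source(), "tabulate_tensor.cu",
                     0, nullptr, nullptr);

  const char* opts[] = {"--gpu-architecture=compute_80"}; // example architecture
  nvrtcResult rc = nvrtcCompileProgram(prog, 1, opts);

  // Always fetch the log so compilation errors are visible.
  size_t log_size = 0;
  nvrtcGetProgramLogSize(prog, &log_size);
  std::vector<char> log(log_size);
  nvrtcGetProgramLog(prog, log.data());
  if (rc != NVRTC_SUCCESS)
  {
    std::fprintf(stderr, "NVRTC compilation failed:\n%s\n", log.data());
    return EXIT_FAILURE;
  }

  // Retrieve PTX; the CUDA driver API (cuModuleLoadData) would load and launch it.
  size_t ptx_size = 0;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::vector<char> ptx(ptx_size);
  nvrtcGetPTX(prog, ptx.data());

  nvrtcDestroyProgram(&prog);
  return EXIT_SUCCESS;
}
```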

This capability supports assembly on GPUs, implemented here as an add-on package (https://github.com/bpachev/cuda-dolfinx).

Would love any feedback or suggestions on how to do this in a better way. Thanks!

Resolved review threads: ffcx/codegeneration/ufcx.h (outdated), ffcx/options.py

jhale (Member) commented Sep 14, 2024

We should look at what CI options we have to keep this working long term.

bpachev (Author) commented Sep 16, 2024

The CUDA Toolkit can be installed on a machine without a GPU, so in CI we can test whether the generated CUDA code compiles without needing a GPU.

jhale (Member) commented Sep 17, 2024

In terms of testing infrastructure, we have:

https://github.com/FEniCS/ffcx/blob/main/demo/test_demos.py

but this is aimed at ahead-of-time compilers. Can we do something similar with NVRTC?

bpachev (Author) commented Sep 17, 2024

We should be able to do something similar with NVRTC; we'll just need an extra step where we compile with the regular C++ compiler and then call the generated wrapper functions to get the input for NVRTC.
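
A sketch of that extra step, under assumptions (the wrapper symbol name and the plain-shared-library loading are placeholders, not the PR's actual mechanism): the demo is first compiled into a shared library by the host compiler, then the wrapper is called via dlopen/dlsym and its output is compile-checked with NVRTC, so no GPU is required:

```cpp
// Illustrative CI check: load an ahead-of-time compiled demo module, call a
// hypothetical generated wrapper to get the CUDA source, and verify that the
// source compiles with NVRTC. A successful nvrtcCompileProgram is the pass
// criterion, so this runs on machines without a GPU.
#include <cstdio>
#include <dlfcn.h>
#include <nvrtc.h>

int main(int argc, char** argv)
{
  if (argc < 2)
  {
    std::fprintf(stderr, "usage: %s <compiled_demo.so>\n", argv[0]);
    return 1;
  }

  // 1. Load the module produced by the regular host compiler (as test_demos.py-style tooling would).
  void* handle = dlopen(argv[1], RTLD_NOW);
  if (!handle)
  {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }

  // 2. Look up the hypothetical generated wrapper and fetch the CUDA source.
  using source_fn = const char* (*)(void);
  auto get_source = reinterpret_cast<source_fn>(
      dlsym(handle, "tabulate_tensor_cuda_source")); // hypothetical symbol name
  if (!get_source)
  {
    std::fprintf(stderr, "wrapper symbol not found\n");
    return 1;
  }

  // 3. Compile-check with NVRTC.
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, get_source(), "kernel.cu", 0, nullptr, nullptr);
  nvrtcResult rc = nvrtcCompileProgram(prog, 0, nullptr);
  nvrtcDestroyProgram(&prog);
  dlclose(handle);
  return rc == NVRTC_SUCCESS ? 0 : 1;
}
```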

Commit: …ts clarifying the use of NVRTC and the need for typedefs in generated CUDA source code.
bpachev (Author) commented Sep 19, 2024

I've updated the PR with the requested renaming and explanatory comments. How should I go about adding CI for this? A separate PR after this is merged, or as part of this one?

drew-parsons commented

A more general question here, perhaps to motivate another PR later: CUDA tends to be NVIDIA-only and closed source, unless that's changed recently. How much work would it take to enable other GPUs with open-source tools (SYCL, ROCm)?

bpachev (Author) commented Sep 20, 2024

On the FFCx side, with the current wrapper approach, honestly not too much. Downstream, in the GPU assembly routines, more refactoring effort will be required. That is my next development priority, along with support for multi-GPU assembly.

garth-wells (Member) commented

I have reservations about this change. We know from our own work that the standard FFCx tabulate kernels are not performant on GPUs. I don't want a change that encourages slow/sub-optimal code.

We should look at how FFCx can be made extensible to allow user-modified generated code.

bpachev (Author) commented Sep 23, 2024

Garth, thanks for weighing in on this. I would love to see FFCx have easier options for customization. I agree that assembly with high-order FFCx tabulate kernels is not efficient on GPUs. However, in my testing, it's quite performant for order-1 and even order-2 elements, which cover a large percentage of use cases. The rationale behind this change is to start getting people to use FEniCS with GPU acceleration, which will hopefully spur development and result in delivering a mature and optimized GPU capability.

That said, in the short term, all I really need is a way to map the generated body of an FFCx tabulate tensor to the corresponding ufcx_form object in C++. Any thoughts on how best to implement that?
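
One possible shape for such a mapping, purely as an illustration (none of these names exist in ufcx.h): a side table keyed by the tabulate tensor function pointer taken from the form, with generated code registering each kernel's CUDA source at load time and the GPU assembler looking it up before invoking NVRTC:

```cpp
// Illustrative only: a registry from a kernel function pointer (as stored in
// the generated form data) to its CUDA-wrapped source. The registration calls
// would be emitted by hypothetical generated code; the lookup would be done by
// the GPU assembler before handing the source to NVRTC.
#include <string>
#include <unordered_map>

using kernel_ptr = const void*;

// Process-wide registry, created on first use.
inline std::unordered_map<kernel_ptr, std::string>& cuda_source_registry()
{
  static std::unordered_map<kernel_ptr, std::string> registry;
  return registry;
}

// Called once per kernel by (hypothetical) generated registration code.
inline void register_cuda_source(kernel_ptr kernel, const char* src)
{
  cuda_source_registry().emplace(kernel, src);
}

// Called by assembly code that already holds the kernel pointer from the form.
inline const std::string* lookup_cuda_source(kernel_ptr kernel)
{
  auto it = cuda_source_registry().find(kernel);
  return it != cuda_source_registry().end() ? &it->second : nullptr;
}
```

An alternative would be to add an optional source-string field to the generated integral data itself; the side-table approach sketched here would keep ufcx.h unchanged, at the cost of an extra registration step.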

bpachev (Author) commented Oct 7, 2024

To give some specific performance numbers: with Lagrange elements, assembly of a Poisson mass matrix reaches 70% memory throughput on an NVIDIA GH200 with a cubic tetrahedral mesh (6M elements). If the form has enough terms, it can also achieve good compute throughput; this is the case when applying the Streamline Upwind Petrov-Galerkin (SUPG) method to the shallow water equations, which reached 60% compute throughput for Jacobian matrix assembly on an NVIDIA A100 (also Lagrange, but triangular elements).

These represent pretty high levels of GPU resource utilization. The speedups relative to parallel CPU assembly are correspondingly high.
