Add CUDA wrapper capability. #714

Open · bpachev wants to merge 2 commits into main

Conversation

bpachev commented Sep 13, 2024

This pull request adds an optional capability to generate CUDA-wrapped versions of the tabulate tensor functions. The wrappers are called at runtime to obtain the modified source code for each tabulate tensor, which can then be passed to the NVIDIA runtime compiler (NVRTC). This was originally developed by James Trotter (https://www.sciencedirect.com/science/article/pii/S0167819123000571) and then modified to work with the current version of FFCx.
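
As a rough sketch of the intended runtime flow (the wrapper name `tabulate_tensor_cuda_source` below is hypothetical; the real names and signatures are whatever the generated code defines), the CUDA source returned by a wrapper would be handed to NVRTC roughly like this:

```cpp
// Illustrative only: compile a CUDA-wrapped tabulate_tensor kernel at runtime.
// `tabulate_tensor_cuda_source` is a stand-in for a hypothetical generated
// wrapper; the NVRTC calls are the standard NVRTC C API.
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <nvrtc.h>

// Stand-in for the hypothetical generated wrapper; the real one would return
// the CUDA-wrapped tabulate_tensor source produced by FFCx.
extern "C" const char* tabulate_tensor_cuda_source(void)
{
  return "extern \"C\" __global__ void tabulate_tensor(double* A) { A[0] = 0.0; }";
}

int main()
{
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, tabulate_tensor_cuda_source(), "tabulate_tensor.cu",
                     0, nullptr, nullptr);

  const char* opts[] = {"--gpu-architecture=compute_80"}; // example architecture
  nvrtcResult rc = nvrtcCompileProgram(prog, 1, opts);

  // Always fetch the log so compilation errors are visible.
  size_t log_size = 0;
  nvrtcGetProgramLogSize(prog, &log_size);
  std::vector<char> log(log_size);
  nvrtcGetProgramLog(prog, log.data());
  if (rc != NVRTC_SUCCESS)
  {
    std::fprintf(stderr, "NVRTC compilation failed:\n%s\n", log.data());
    return EXIT_FAILURE;
  }

  // Retrieve PTX; the CUDA driver API (cuModuleLoadData) would load and launch it.
  size_t ptx_size = 0;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::vector<char> ptx(ptx_size);
  nvrtcGetPTX(prog, ptx.data());

  nvrtcDestroyProgram(&prog);
  return EXIT_SUCCESS;
}
```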

This capability supports assembly on GPUs, implemented here as an add-on package (https://github.com/bpachev/cuda-dolfinx).

Would love any feedback or suggestions on how to do this in a better way. Thanks!

Resolved review threads: ffcx/codegeneration/ufcx.h (outdated), ffcx/options.py

jhale (Member) commented Sep 14, 2024

We should look at what CI options we have to keep this working long term.

bpachev (Author) commented Sep 16, 2024

The CUDA Toolkit can be installed on a machine without a GPU, so in CI we can test whether the generated CUDA code compiles without needing a GPU.

jhale (Member) commented Sep 17, 2024

In terms of testing infrastructure, we have:

https://github.com/FEniCS/ffcx/blob/main/demo/test_demos.py

but this is aimed at ahead-of-time compilers. Can we do something similar with NVRTC?

bpachev (Author) commented Sep 17, 2024

We should be able to do something similar with NVRTC; we'll just need an extra step where we compile with the regular C++ compiler and then call the generated wrapper functions to get the input for NVRTC.
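
A sketch of that extra step, under assumptions (the wrapper symbol name and the plain-shared-library loading are placeholders, not the PR's actual mechanism): the demo is first compiled into a shared library by the host compiler, then the wrapper is called via dlopen/dlsym and its output is compile-checked with NVRTC, so no GPU is required:

```cpp
// Illustrative CI check: load an ahead-of-time compiled demo module, call a
// hypothetical generated wrapper to get the CUDA source, and verify that the
// source compiles with NVRTC. A successful nvrtcCompileProgram is the pass
// criterion, so this runs on machines without a GPU.
#include <cstdio>
#include <dlfcn.h>
#include <nvrtc.h>

int main(int argc, char** argv)
{
  if (argc < 2)
  {
    std::fprintf(stderr, "usage: %s <compiled_demo.so>\n", argv[0]);
    return 1;
  }

  // 1. Load the module produced by the regular host compiler (as test_demos.py-style tooling would).
  void* handle = dlopen(argv[1], RTLD_NOW);
  if (!handle)
  {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }

  // 2. Look up the hypothetical generated wrapper and fetch the CUDA source.
  using source_fn = const char* (*)(void);
  auto get_source = reinterpret_cast<source_fn>(
      dlsym(handle, "tabulate_tensor_cuda_source")); // hypothetical symbol name
  if (!get_source)
  {
    std::fprintf(stderr, "wrapper symbol not found\n");
    return 1;
  }

  // 3. Compile-check with NVRTC.
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, get_source(), "kernel.cu", 0, nullptr, nullptr);
  nvrtcResult rc = nvrtcCompileProgram(prog, 0, nullptr);
  nvrtcDestroyProgram(&prog);
  dlclose(handle);
  return rc == NVRTC_SUCCESS ? 0 : 1;
}
```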

Commit: …ts clarifying the use of NVRTC and the need for typedefs in generated CUDA source code.
bpachev (Author) commented Sep 19, 2024

I've updated the PR with the requested renaming and explanatory comments. How should I go about adding CI for this? A separate PR after this is merged, or as part of this one?

drew-parsons commented

A more general question here, perhaps to motivate another PR later: CUDA tends to be NVIDIA-only and closed source, unless that's changed recently. How much work would it take to enable other GPUs with open-source tools (SYCL, ROCm)?

bpachev (Author) commented Sep 20, 2024

On the FFCx side, with the current wrapper approach, honestly not too much. Downstream, in the GPU assembly routines, more refactoring effort will be required. That is my next development priority, along with support for multi-GPU assembly.

garth-wells (Member) commented

I have reservations about this change. We know from our own work that the standard FFCx tabulate kernels are not performant on GPUs. I don't want a change that encourages slow/sub-optimal code.

We should look at how FFCx can be made extensible to allow user-modified generated code.

bpachev (Author) commented Sep 23, 2024

Garth, thanks for weighing in on this. I would love to see FFCx have easier options for customization. I agree that assembly with high-order FFCx tabulate kernels is not efficient on GPUs. However, in my testing, it's quite performant for order-1 and even order-2 elements, which cover a large percentage of use cases. The rationale behind this change is to start getting people to use FEniCS with GPU acceleration, which will hopefully spur development and result in delivering a mature and optimized GPU capability.

That said, in the short term, all I really need is a way to map the generated body of an FFCx tabulate tensor to the corresponding ufcx_form object in C++. Any thoughts on how best to implement that?
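
One possible shape for such a mapping, purely as an illustration (none of these names exist in ufcx.h): a side table keyed by the tabulate tensor function pointer taken from the form, with generated code registering each kernel's CUDA source at load time and the GPU assembler looking it up before invoking NVRTC:

```cpp
// Illustrative only: a registry from a kernel function pointer (as stored in
// the generated form data) to its CUDA-wrapped source. The registration calls
// would be emitted by hypothetical generated code; the lookup would be done by
// the GPU assembler before handing the source to NVRTC.
#include <string>
#include <unordered_map>

using kernel_ptr = const void*;

// Process-wide registry, created on first use.
inline std::unordered_map<kernel_ptr, std::string>& cuda_source_registry()
{
  static std::unordered_map<kernel_ptr, std::string> registry;
  return registry;
}

// Called once per kernel by (hypothetical) generated registration code.
inline void register_cuda_source(kernel_ptr kernel, const char* src)
{
  cuda_source_registry().emplace(kernel, src);
}

// Called by assembly code that already holds the kernel pointer from the form.
inline const std::string* lookup_cuda_source(kernel_ptr kernel)
{
  auto it = cuda_source_registry().find(kernel);
  return it != cuda_source_registry().end() ? &it->second : nullptr;
}
```

An alternative would be to add an optional source-string field to the generated integral data itself; the side-table approach sketched here would keep ufcx.h unchanged, at the cost of an extra registration step.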

bpachev (Author) commented Oct 7, 2024

To give some specific performance numbers: with Lagrange elements, assembly of a Poisson mass matrix reaches 70% memory throughput on an NVIDIA GH200 with a cubic tetrahedral mesh (6M elements). If the form has enough terms, it can also achieve good compute throughput; this is the case when applying the Streamline Upwind Petrov-Galerkin (SUPG) method to the shallow water equations, which reached 60% compute throughput for Jacobian matrix assembly on an NVIDIA A100 (also Lagrange, but triangular elements).

These represent pretty high levels of GPU resource utilization. The speedups relative to parallel CPU assembly are correspondingly high.
