
WIP: MPI support for OpenMP and GPU. #109

Open
wants to merge 13 commits into main

Conversation

bevanwsjones
Collaborator

Main tasks:

  1. Make sure multiple OpenMP threads don't each start a communication, i.e. typically only one thread should initiate the cross-rank communication.
  2. For Kokkos execution spaces, we need buffers that can live in different memory spaces. Comm buffers will therefore need to be created on both host and device, and this has to happen seamlessly.
  3. Some kind of MPI init which checks that the configuration is correct, i.e. that GPU-aware MPI support is available.

@MarcelKoch might need some help on point 3 (and the rest 😉 ).
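For point 1, a minimal sketch of the kind of guard that could work (purely illustrative; the function and buffer names are made up, and it assumes MPI was initialised with at least MPI_THREAD_SERIALIZED since the initiating thread is not necessarily the master thread):

```cpp
#include <mpi.h>
#include <omp.h>

#include <cstddef>
#include <vector>

// Illustrative only: every thread packs its slice of the send buffer, then a
// single thread initiates the cross-rank communication.
void packAndSend(std::vector<double>& sendBuffer, int peerRank, MPI_Comm comm)
{
    MPI_Request request = MPI_REQUEST_NULL;

#pragma omp parallel
    {
        const int nThreads = omp_get_num_threads();
        const int tid = omp_get_thread_num();
        const std::size_t chunk = sendBuffer.size() / nThreads;
        const std::size_t begin = static_cast<std::size_t>(tid) * chunk;
        const std::size_t end =
            (tid == nThreads - 1) ? sendBuffer.size() : begin + chunk;

        for (std::size_t i = begin; i < end; ++i)
        {
            sendBuffer[i] = static_cast<double>(i); // stand-in for real packing
        }

// all threads have finished writing to the buffer
#pragma omp barrier

// only one thread talks to MPI
#pragma omp single
        {
            MPI_Isend(sendBuffer.data(), static_cast<int>(sendBuffer.size()),
                      MPI_DOUBLE, peerRank, /*tag=*/0, comm, &request);
        }
    } // implicit barrier: 'request' is visible to the calling thread here

    MPI_Wait(&request, MPI_STATUS_IGNORE);
}
```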

@bevanwsjones linked an issue on Jul 14, 2024 that may be closed by this pull request
Comment on lines 221 to 223
Kokkos::View<char*, MemorySpace> rankBufferKokkos_; // duplication for now - will replace above
Kokkos::View<std::size_t*, MemorySpace>
rankOffsetKokkos_; // duplication for now - will replace above
Collaborator


For MPI these arrays need to stay on the host. Only the send or recv buffer may be on the device; everything else has to be on the host.

Collaborator Author


But wouldn't this be a problem then, because you would have to copy from host to device to make the MPI call? Or does MPI realise that the buffer is on the device and the rest is on the host?

Collaborator


You can't make MPI calls on the device (as in during kernels), if that is what you are asking. Other than that, the ranks, sizes, offsets, etc., are always on the host, regardless of where the buffer memory is located. MPI can automatically determine if a buffer is on the device or not.
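To illustrate where things live (a sketch, not existing NeoFOAM code; it assumes Kokkos and a GPU-aware MPI, and all names are hypothetical): the payload pointer may refer to device memory, while the counts, offsets, and neighbour ranks are ordinary host data.

```cpp
#include <Kokkos_Core.hpp>
#include <mpi.h>

#include <cstddef>
#include <vector>

// Sketch: the send payload lives in (device) memory owned by a Kokkos::View,
// while the offsets and neighbour ranks are plain host-side containers.
// rankOffset has neighbourRanks.size() + 1 entries (CSR-style offsets).
void postSends(Kokkos::View<char*> rankBuffer,             // device memory
               const std::vector<std::size_t>& rankOffset, // host
               const std::vector<int>& neighbourRanks,     // host
               std::vector<MPI_Request>& requests, MPI_Comm comm)
{
    requests.assign(neighbourRanks.size(), MPI_REQUEST_NULL);
    for (std::size_t i = 0; i < neighbourRanks.size(); ++i)
    {
        // A GPU-aware MPI recognises that this pointer refers to device
        // memory; all other arguments are ordinary host values.
        char* sendPtr = rankBuffer.data() + rankOffset[i];
        const int count = static_cast<int>(rankOffset[i + 1] - rankOffset[i]);
        MPI_Isend(sendPtr, count, MPI_CHAR, neighbourRanks[i], /*tag=*/0, comm,
                  &requests[i]);
    }
}
```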

@MarcelKoch
Collaborator

  1. Make sure multiple OpenMP threads don't each start a communication, i.e. typically only one thread should initiate the cross-rank communication.

I think this would not work for the tests, due to the additional communication thread on rank 0.

  2. For Kokkos execution spaces, we need buffers that can live in different memory spaces. Comm buffers will therefore need to be created on both host and device, and this has to happen seamlessly.

Shouldn't this be done just through the executors? You can communicate from a buffer on the host to a buffer on the device without issues, if MPI supports device buffers. So the general question is rather whether device buffers are supported at all.

  3. Some kind of MPI init which checks that the configuration is correct, i.e. that GPU-aware MPI support is available.

This is a very hard problem to tackle. I think the most robust approach would be to do what PETSc does and have a test that just checks if it can send device buffers or not. But you can't easily query that through some environment variables or similar.
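A rough sketch of what such a check could look like (an illustration of the idea, not how PETSc actually implements it; note that some MPI builds abort rather than return an error when handed an unusable device pointer, so running this in a throw-away subprocess is more robust in practice):

```cpp
#include <Kokkos_Core.hpp>
#include <mpi.h>

// Sketch: try a self send/receive of a small device buffer and report whether
// the MPI implementation accepted it. Requires MPI_ERRORS_RETURN so a failure
// comes back as an error code; assumes Kokkos has been initialised.
bool canSendDeviceBuffers(MPI_Comm comm)
{
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    int rank = 0;
    MPI_Comm_rank(comm, &rank);

    Kokkos::View<int*> sendBuf("deviceSendProbe", 1); // default (device) space
    Kokkos::View<int*> recvBuf("deviceRecvProbe", 1);

    const int err = MPI_Sendrecv(sendBuf.data(), 1, MPI_INT, rank, /*sendtag=*/0,
                                 recvBuf.data(), 1, MPI_INT, rank, /*recvtag=*/0,
                                 comm, MPI_STATUS_IGNORE);
    return err == MPI_SUCCESS;
}
```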

@bevanwsjones
Collaborator Author

I think this would not work for the tests, due to the additional communication thread on rank 0.

Yeah, I did not think about that. But, correct me if I am wrong, we will need a way to ensure that 'equal' numbers of calls are made between ranks, otherwise we risk a thread deadlocking? I would think, though, that limiting communication to a single 'thread' should be done locally and not globally. So the buffer class, not the wrapped MPI functions contained in mpi/operators.hpp, will ensure MPI operations are only called from a single thread. I.e., when the Communicator calls for synchronization, there is some sort of OpenMP reduce operation across threads to ensure all threads have written to the buffer. Once all give the 'ok' signal, the 'last thread' initiates the communication. 'Thread 0' can always post the receive.
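Very roughly, the "all threads give the ok, the last thread initiates, thread 0 posts the receive" idea could look like this (an illustrative sketch only, not the proposed buffer class; it assumes MPI_THREAD_MULTIPLE because the receive and send may be posted from different threads):

```cpp
#include <mpi.h>
#include <omp.h>

#include <atomic>
#include <cstddef>
#include <vector>

// Illustrative only: thread 0 posts the receive up front, every thread packs
// its slice, and whichever thread finishes last gives the 'ok' and posts the
// send.
void exchange(std::vector<double>& sendBuf, std::vector<double>& recvBuf,
              int peerRank, MPI_Comm comm)
{
    MPI_Request sendReq = MPI_REQUEST_NULL;
    MPI_Request recvReq = MPI_REQUEST_NULL;
    std::atomic<int> threadsDone{0};

#pragma omp parallel
    {
        const int nThreads = omp_get_num_threads();
        const int tid = omp_get_thread_num();

        if (tid == 0) // 'thread 0' can always post the receive
        {
            MPI_Irecv(recvBuf.data(), static_cast<int>(recvBuf.size()),
                      MPI_DOUBLE, peerRank, /*tag=*/0, comm, &recvReq);
        }

        const std::size_t chunk = sendBuf.size() / nThreads;
        const std::size_t begin = static_cast<std::size_t>(tid) * chunk;
        const std::size_t end =
            (tid == nThreads - 1) ? sendBuf.size() : begin + chunk;
        for (std::size_t i = begin; i < end; ++i)
        {
            sendBuf[i] = static_cast<double>(i); // stand-in for real packing
        }

        // The last thread to finish packing initiates the communication.
        if (threadsDone.fetch_add(1) + 1 == nThreads)
        {
            MPI_Isend(sendBuf.data(), static_cast<int>(sendBuf.size()),
                      MPI_DOUBLE, peerRank, /*tag=*/0, comm, &sendReq);
        }
    } // implicit barrier makes sendReq/recvReq visible here

    MPI_Wait(&sendReq, MPI_STATUS_IGNORE);
    MPI_Wait(&recvReq, MPI_STATUS_IGNORE);
}
```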

Shouldn't this be done just through the executors? You can communicate from a buffer on the host to a buffer on the device without issues, if MPI supports device buffers. So the general question is rather whether device buffers are supported at all.

So I would think you need both a device and a host buffer, to avoid copying device Fields to the host and then into a host buffer for MPI communication? A single buffer should be either host or device (just to limit creating redundant memory everywhere, since the buffers never shrink in size). Considering this, a single buffer will service a 'memory space'. When a Field is passed for synchronization, the Communicator class will look at the Field's executor, find its memory space, and assign it to the correct communication buffer (i.e., one with the same memory space). This also means that, for example, a CPU executor and an OpenMP executor can share a buffer since they typically use the same memory space. Since buffers are created on demand, a newly requested buffer will take the memory space of the Field that needs to be synchronized.
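A sketch of that dispatch (class, member, and key names are all hypothetical, not the actual Communicator/Field interfaces): one buffer per memory space, created on demand and shared by executors that map to the same space.

```cpp
#include <Kokkos_Core.hpp>

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: one communication buffer per memory space, created on
// demand and looked up from the memory space associated with a Field's
// executor. A serial CPU executor and an OpenMP executor would both map to the
// host key and therefore share one buffer.
class CommBufferRegistry
{
public:
    struct Buffer
    {
        std::vector<char> hostData;                            // host-space storage
        Kokkos::View<char*> deviceData{"deviceCommBuffer", 0}; // device-space storage
    };

    Buffer& getOrCreate(const std::string& memorySpaceName, std::size_t bytes)
    {
        Buffer& buffer = buffers_[memorySpaceName]; // created on first request
        if (memorySpaceName == "HostSpace")
        {
            if (buffer.hostData.size() < bytes) buffer.hostData.resize(bytes);
        }
        else
        {
            if (buffer.deviceData.extent(0) < bytes)
            {
                Kokkos::resize(buffer.deviceData, bytes); // grows, never shrinks
            }
        }
        return buffer;
    }

private:
    std::unordered_map<std::string, Buffer> buffers_;
};
```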

This is a very hard problem to tackle. I think the most robust approach would be to do what PETSc does and have a test that just checks if it can send device buffers or not. But you can't easily query that through some environment variables or similar.

Ok yeah - I have never tried it, but then perhaps it makes sense to follow the PETSc approach? Or we can start with a 'host only' MPI approach and always copy from the device, then later try to expand to direct GPU communication.
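The 'host only' fallback would roughly amount to staging through a host mirror (a sketch, assuming Kokkos and a field exchanged in place with a single peer rank):

```cpp
#include <Kokkos_Core.hpp>
#include <mpi.h>

// Sketch of the non-GPU-aware fallback: copy the device data to a host
// mirror, communicate through the host buffer, then copy the result back.
void hostStagedExchange(Kokkos::View<double*> deviceField, int peerRank, MPI_Comm comm)
{
    auto hostMirror = Kokkos::create_mirror_view(deviceField);
    Kokkos::deep_copy(hostMirror, deviceField); // device -> host

    MPI_Sendrecv_replace(hostMirror.data(),
                         static_cast<int>(hostMirror.extent(0)), MPI_DOUBLE,
                         peerRank, /*sendtag=*/0, peerRank, /*recvtag=*/0,
                         comm, MPI_STATUS_IGNORE);

    Kokkos::deep_copy(deviceField, hostMirror); // host -> device
}
```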

@bevanwsjones
Collaborator Author

I changed the approach. The code is still at the 'prototyping' stage; the memory spaces would follow a similar approach to the executors.

Successfully merging this pull request may close these issues: MPI support for OpenMP and GPUs.