
WIP: MPI support for OpenMP and GPU. #109

Open
wants to merge 13 commits into main

Conversation

bevanwsjones
Collaborator

Main tasks:

  1. Make sure multiple OpenMP threads don't each start a communication, i.e. typically only one thread should initiate the cross-rank communication.
  2. For Kokkos execution spaces, we need buffers that can live in different memory spaces. Comm buffers will therefore need to be created on both host and device, and this has to happen seamlessly.
  3. Some kind of MPI init which checks that the configuration is correct, i.e. that GPU-aware MPI support is available.

@MarcelKoch might need some help on point 3 (and the rest 😉 ).
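For point 1, a minimal sketch of the kind of guard that could work (purely illustrative; the function and buffer names are made up, and it assumes MPI was initialised with at least MPI_THREAD_SERIALIZED since the initiating thread is not necessarily the master thread):

```cpp
#include <mpi.h>
#include <omp.h>

#include <cstddef>
#include <vector>

// Illustrative only: every thread packs its slice of the send buffer, then a
// single thread initiates the cross-rank communication.
void packAndSend(std::vector<double>& sendBuffer, int peerRank, MPI_Comm comm)
{
    MPI_Request request = MPI_REQUEST_NULL;

#pragma omp parallel
    {
        const int nThreads = omp_get_num_threads();
        const int tid = omp_get_thread_num();
        const std::size_t chunk = sendBuffer.size() / nThreads;
        const std::size_t begin = static_cast<std::size_t>(tid) * chunk;
        const std::size_t end =
            (tid == nThreads - 1) ? sendBuffer.size() : begin + chunk;

        for (std::size_t i = begin; i < end; ++i)
        {
            sendBuffer[i] = static_cast<double>(i); // stand-in for real packing
        }

// all threads have finished writing to the buffer
#pragma omp barrier

// only one thread talks to MPI
#pragma omp single
        {
            MPI_Isend(sendBuffer.data(), static_cast<int>(sendBuffer.size()),
                      MPI_DOUBLE, peerRank, /*tag=*/0, comm, &request);
        }
    } // implicit barrier: 'request' is visible to the calling thread here

    MPI_Wait(&request, MPI_STATUS_IGNORE);
}
```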

@bevanwsjones linked an issue on Jul 14, 2024 that may be closed by this pull request
Comment on lines 221 to 223
Kokkos::View<char*, MemorySpace> rankBufferKokkos_; // duplication for now - will replace above
Kokkos::View<std::size_t*, MemorySpace>
rankOffsetKokkos_; // duplication for now - will replace above
Collaborator


For MPI these arrays need to stay on the host. Only the send or recv buffer may be on the device; everything else has to be on the host.

Collaborator Author


But wouldn't this be a problem then, because you would have to copy from host to device to make the MPI call? Or does MPI realise that the buffer is on the device and the rest is on the host?

Collaborator


You can't make MPI calls on the device (as in during kernels), if that is what you are asking. Other than that, the ranks, sizes, offsets, etc., are always on the host, regardless of where the buffer memory is located. MPI can automatically determine if a buffer is on the device or not.
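To illustrate where things live (a sketch, not existing NeoFOAM code; it assumes Kokkos and a GPU-aware MPI, and all names are hypothetical): the payload pointer may refer to device memory, while the counts, offsets, and neighbour ranks are ordinary host data.

```cpp
#include <Kokkos_Core.hpp>
#include <mpi.h>

#include <cstddef>
#include <vector>

// Sketch: the send payload lives in (device) memory owned by a Kokkos::View,
// while the offsets and neighbour ranks are plain host-side containers.
// rankOffset has neighbourRanks.size() + 1 entries (CSR-style offsets).
void postSends(Kokkos::View<char*> rankBuffer,             // device memory
               const std::vector<std::size_t>& rankOffset, // host
               const std::vector<int>& neighbourRanks,     // host
               std::vector<MPI_Request>& requests, MPI_Comm comm)
{
    requests.assign(neighbourRanks.size(), MPI_REQUEST_NULL);
    for (std::size_t i = 0; i < neighbourRanks.size(); ++i)
    {
        // A GPU-aware MPI recognises that this pointer refers to device
        // memory; all other arguments are ordinary host values.
        char* sendPtr = rankBuffer.data() + rankOffset[i];
        const int count = static_cast<int>(rankOffset[i + 1] - rankOffset[i]);
        MPI_Isend(sendPtr, count, MPI_CHAR, neighbourRanks[i], /*tag=*/0, comm,
                  &requests[i]);
    }
}
```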

@MarcelKoch
Collaborator

  1. Make sure multiple OpenMP threads don't each start a communication, i.e. typically only one thread should initiate the cross-rank communication.

I think this would not work for the tests, due to the additional communication thread on rank 0.

  2. For Kokkos execution spaces, we need buffers that can live in different memory spaces. Comm buffers will therefore need to be created on both host and device, and this has to happen seamlessly.

Shouldn't this be done just through the executors? You can communicate from a buffer on the host to a buffer on the device without issues, if MPI supports device buffers. So the general question is rather whether device buffers are supported at all.

  3. Some kind of MPI init which checks that the configuration is correct, i.e. that GPU-aware MPI support is available.

This is a very hard problem to tackle. I think the most robust approach would be to do what PETSc does and have a test that just checks if it can send device buffers or not. But you can't easily query that through some environment variables or similar.
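A rough sketch of what such a check could look like (an illustration of the idea, not how PETSc actually implements it; note that some MPI builds abort rather than return an error when handed an unusable device pointer, so running this in a throw-away subprocess is more robust in practice):

```cpp
#include <Kokkos_Core.hpp>
#include <mpi.h>

// Sketch: try a self send/receive of a small device buffer and report whether
// the MPI implementation accepted it. Requires MPI_ERRORS_RETURN so a failure
// comes back as an error code; assumes Kokkos has been initialised.
bool canSendDeviceBuffers(MPI_Comm comm)
{
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    int rank = 0;
    MPI_Comm_rank(comm, &rank);

    Kokkos::View<int*> sendBuf("deviceSendProbe", 1); // default (device) space
    Kokkos::View<int*> recvBuf("deviceRecvProbe", 1);

    const int err = MPI_Sendrecv(sendBuf.data(), 1, MPI_INT, rank, /*sendtag=*/0,
                                 recvBuf.data(), 1, MPI_INT, rank, /*recvtag=*/0,
                                 comm, MPI_STATUS_IGNORE);
    return err == MPI_SUCCESS;
}
```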

@bevanwsjones
Collaborator Author

I think this would not work for the tests, due to the additional communication thread on rank 0.

Yeah, I did not think about that. But, correct me if I am wrong, we will need a way to ensure that 'equal' numbers of calls are made between ranks, otherwise we risk a thread deadlocking? I would think, though, that limiting communication to a single 'thread' should be done locally and not globally. So the buffer class, not the wrapped MPI functions contained in mpi/operators.hpp, will ensure MPI operations are only called from a single thread. I.e., when the Communicator calls for synchronization, there is some sort of OpenMP reduce operation across threads to ensure all threads have written to the buffer. Once all give the 'ok' signal, the 'last thread' initiates the communication. 'Thread 0' can always post the receive.
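Very roughly, the "all threads give the ok, the last thread initiates, thread 0 posts the receive" idea could look like this (an illustrative sketch only, not the proposed buffer class; it assumes MPI_THREAD_MULTIPLE because the receive and send may be posted from different threads):

```cpp
#include <mpi.h>
#include <omp.h>

#include <atomic>
#include <cstddef>
#include <vector>

// Illustrative only: thread 0 posts the receive up front, every thread packs
// its slice, and whichever thread finishes last gives the 'ok' and posts the
// send.
void exchange(std::vector<double>& sendBuf, std::vector<double>& recvBuf,
              int peerRank, MPI_Comm comm)
{
    MPI_Request sendReq = MPI_REQUEST_NULL;
    MPI_Request recvReq = MPI_REQUEST_NULL;
    std::atomic<int> threadsDone{0};

#pragma omp parallel
    {
        const int nThreads = omp_get_num_threads();
        const int tid = omp_get_thread_num();

        if (tid == 0) // 'thread 0' can always post the receive
        {
            MPI_Irecv(recvBuf.data(), static_cast<int>(recvBuf.size()),
                      MPI_DOUBLE, peerRank, /*tag=*/0, comm, &recvReq);
        }

        const std::size_t chunk = sendBuf.size() / nThreads;
        const std::size_t begin = static_cast<std::size_t>(tid) * chunk;
        const std::size_t end =
            (tid == nThreads - 1) ? sendBuf.size() : begin + chunk;
        for (std::size_t i = begin; i < end; ++i)
        {
            sendBuf[i] = static_cast<double>(i); // stand-in for real packing
        }

        // The last thread to finish packing initiates the communication.
        if (threadsDone.fetch_add(1) + 1 == nThreads)
        {
            MPI_Isend(sendBuf.data(), static_cast<int>(sendBuf.size()),
                      MPI_DOUBLE, peerRank, /*tag=*/0, comm, &sendReq);
        }
    } // implicit barrier makes sendReq/recvReq visible here

    MPI_Wait(&sendReq, MPI_STATUS_IGNORE);
    MPI_Wait(&recvReq, MPI_STATUS_IGNORE);
}
```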

Shouldn't this be done just through the executors? You can communicate from a buffer on the host to a buffer on the device without issues, if MPI supports device buffers. So the general question is rather whether device buffers are supported at all.

So I would think you need both a device and a host buffer, to avoid copying device Fields to the host and then into a host buffer for MPI communication? A single buffer should be either host or device (just to limit creating redundant memory everywhere, since the buffers never shrink in size). Considering this, a single buffer will service a 'memory space'. When a Field is passed for synchronization, the Communicator class will look at the Field's executor, find its memory space, and assign it to the correct communication buffer (i.e., one with the same memory space). This also means that, for example, a CPU executor and an OpenMP executor can share a buffer since they typically use the same memory space. Since buffers are created on demand, a newly requested buffer will take the memory space of the Field that needs to be synchronized.
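A sketch of that dispatch (class, member, and key names are all hypothetical, not the actual Communicator/Field interfaces): one buffer per memory space, created on demand and shared by executors that map to the same space.

```cpp
#include <Kokkos_Core.hpp>

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: one communication buffer per memory space, created on
// demand and looked up from the memory space associated with a Field's
// executor. A serial CPU executor and an OpenMP executor would both map to the
// host key and therefore share one buffer.
class CommBufferRegistry
{
public:
    struct Buffer
    {
        std::vector<char> hostData;                            // host-space storage
        Kokkos::View<char*> deviceData{"deviceCommBuffer", 0}; // device-space storage
    };

    Buffer& getOrCreate(const std::string& memorySpaceName, std::size_t bytes)
    {
        Buffer& buffer = buffers_[memorySpaceName]; // created on first request
        if (memorySpaceName == "HostSpace")
        {
            if (buffer.hostData.size() < bytes) buffer.hostData.resize(bytes);
        }
        else
        {
            if (buffer.deviceData.extent(0) < bytes)
            {
                Kokkos::resize(buffer.deviceData, bytes); // grows, never shrinks
            }
        }
        return buffer;
    }

private:
    std::unordered_map<std::string, Buffer> buffers_;
};
```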

This is a very hard problem to tackle. I think the most robust approach would be to do what PETSc does and have a test that just checks if it can send device buffers or not. But you can't easily query that through some environment variables or similar.

Ok yeah - I have never tried it, but then perhaps it makes sense to follow the PETSc approach? Or we can start with a 'host only' MPI approach and always copy from the device, then later try to expand to direct GPU communication.
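The 'host only' fallback would roughly amount to staging through a host mirror (a sketch, assuming Kokkos and a field exchanged in place with a single peer rank):

```cpp
#include <Kokkos_Core.hpp>
#include <mpi.h>

// Sketch of the non-GPU-aware fallback: copy the device data to a host
// mirror, communicate through the host buffer, then copy the result back.
void hostStagedExchange(Kokkos::View<double*> deviceField, int peerRank, MPI_Comm comm)
{
    auto hostMirror = Kokkos::create_mirror_view(deviceField);
    Kokkos::deep_copy(hostMirror, deviceField); // device -> host

    MPI_Sendrecv_replace(hostMirror.data(),
                         static_cast<int>(hostMirror.extent(0)), MPI_DOUBLE,
                         peerRank, /*sendtag=*/0, peerRank, /*recvtag=*/0,
                         comm, MPI_STATUS_IGNORE);

    Kokkos::deep_copy(deviceField, hostMirror); // host -> device
}
```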

@bevanwsjones
Collaborator Author

I changed the approach. The code is still at the 'prototyping' stage; the memory spaces would follow a similar approach to the executors.

Successfully merging this pull request may close these issues: MPI support for OpenMP and GPUs.