WIP: MPI support for OpenMP and GPU. #109
base: main
Conversation
Kokkos::View<char*, MemorySpace> rankBufferKokkos_;        // duplication for now - will replace above
Kokkos::View<std::size_t*, MemorySpace> rankOffsetKokkos_; // duplication for now - will replace above
For MPI these arrays need to stay on the host. Only the send or receive buffer may be on the device; everything else has to be on the host.
But wouldn't this be a problem then, because you would have to copy from host to device to make the MPI call? Or does MPI realise that the buffer is on the device while the rest is on the host?
You can't make MPI calls on the device (as in during kernels), if that is what you are asking. Other than that, the ranks, sizes, offsets, etc., are always on the host, regardless of where the buffer memory is located. MPI can automatically determine if a buffer is on the device or not.
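To make this concrete, here is a minimal sketch of the layout being described (function and variable names are illustrative, not from this PR): the rank, count, and offset data are ordinary host values, and only the pointer handed to MPI may refer to device memory, assuming a GPU-aware MPI build.

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>
#include <Kokkos_Core.hpp>

// Hypothetical example: metadata stays on the host; only the buffer itself
// may live in device memory (requires a GPU-aware MPI implementation).
template <class MemorySpace>
void postSends(Kokkos::View<char*, MemorySpace> sendBuffer,       // host or device
               const std::vector<std::size_t>& rankOffset,        // host only
               const std::vector<int>& neighbourRanks,            // host only
               std::vector<MPI_Request>& requests,                // sized to neighbourRanks
               MPI_Comm comm)
{
    for (std::size_t i = 0; i < neighbourRanks.size(); ++i)
    {
        const int count = static_cast<int>(rankOffset[i + 1] - rankOffset[i]);
        // MPI inspects the buffer pointer itself; the loop, counts and ranks
        // are plain host-side values.
        MPI_Isend(sendBuffer.data() + rankOffset[i], count, MPI_CHAR,
                  neighbourRanks[i], /*tag=*/0, comm, &requests[i]);
    }
}
```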
I think this would not work for the tests, due to the additional communication thread on rank 0.
Shouldn't this be done just through the executors? You can communicate from a buffer on the host to a buffer on the device without issues, if MPI supports device buffers. So the general question is rather whether device buffers are supported at all.
This is a very hard problem to tackle. I think the most robust approach would be to do what PETSc does and have a test that just checks whether it can send device buffers or not. You can't easily query that through environment variables or similar.
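For reference, a hedged sketch of the kind of capability check being discussed: Open MPI exposes a compile-time macro and a runtime query through `mpi-ext.h`; other MPI implementations would need a different check (or a PETSc-style runtime send test), so treat this only as an illustration of the idea.

```cpp
#include <mpi.h>
#if defined(OPEN_MPI)
#include <mpi-ext.h> // provides MPIX_CUDA_AWARE_SUPPORT on Open MPI builds
#endif

// Best-effort check: returns true only when the MPI library advertises
// CUDA-aware (device buffer) support. Anything unknown is treated as "no".
inline bool mpiSupportsDeviceBuffers()
{
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    return MPIX_Query_cuda_support() == 1; // runtime query, Open MPI only
#else
    return false; // no portable query exists; a self send/recv test with a
                  // device buffer (the PETSc-style approach) would go here
#endif
}
```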
Yeah, I did not think about that. But correct me if I am wrong: we will need a way to ensure that 'equal' numbers of calls are made between ranks, otherwise we risk a thread dead-locking? I would think, though, that limiting communication to a single 'thread' should be done locally and not globally. So the buffer class will ensure it only calls MPI operations from a single thread, not the wrapped MPI functions contained in mpi/operators.hpp. I.e., when the
So I would think you need to have a device and host buffer to prevent copying device
Ok yeah, I have never tried it, but then perhaps it makes sense to do the PETSc approach? Or we can start with a 'host'-only MPI approach and copy from the device all the time, then later try to expand to direct GPU communication.
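A sketch of that 'host'-only fallback, assuming a Kokkos view as the buffer (names are illustrative): stage the data through a host mirror, then hand the host pointer to MPI. This works with any MPI build, at the cost of an extra device-to-host copy per exchange.

```cpp
#include <mpi.h>
#include <Kokkos_Core.hpp>

// Hypothetical host-staging send: copy the (possibly device) buffer to a host
// mirror and pass the host pointer to MPI. On host-only builds the mirror
// aliases the original data and the deep_copy is a no-op.
template <class MemorySpace>
void sendViaHost(const Kokkos::View<char*, MemorySpace>& deviceBuffer,
                 int destRank, int tag, MPI_Comm comm)
{
    auto hostBuffer = Kokkos::create_mirror_view(deviceBuffer);
    Kokkos::deep_copy(hostBuffer, deviceBuffer); // device -> host
    MPI_Send(hostBuffer.data(), static_cast<int>(hostBuffer.size()), MPI_CHAR,
             destRank, tag, comm);
}
```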
I changed the approach. The code is still 'prototyping' presently; the memory spaces would follow a similar approach to the executors.
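Roughly, a memory-space-templated buffer could look like the sketch below (class and member names are illustrative, not the actual PR code): only the data buffer is parameterised on the memory space, mirroring how the executors choose where data lives, while ranks and offsets stay in host containers as required for the MPI calls.

```cpp
#include <cstddef>
#include <vector>
#include <Kokkos_Core.hpp>

// Illustrative only: a communication buffer templated on the memory space.
template <class MemorySpace>
class CommBufferSketch
{
public:
    // Resize from per-rank sizes; offsets are a host-side prefix sum.
    void resize(const std::vector<std::size_t>& rankSizes)
    {
        rankOffset_.assign(rankSizes.size() + 1, 0);
        for (std::size_t i = 0; i < rankSizes.size(); ++i)
        {
            rankOffset_[i + 1] = rankOffset_[i] + rankSizes[i];
        }
        Kokkos::resize(buffer_, rankOffset_.back());
    }

    // Pointer into the (host or device) buffer for a given neighbour rank.
    char* rankData(std::size_t rank) { return buffer_.data() + rankOffset_[rank]; }

private:
    Kokkos::View<char*, MemorySpace> buffer_{"commBuffer", 0}; // host or device
    std::vector<std::size_t> rankOffset_;                      // always host
};
```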
…er_. - FullDuplexCommBuffer is now also a template class.
…te of the class. - some kokkos updates. - Update halfDuplexCommBuffer.hpp class doc comment. - removed the memory.hpp header
Main tasks:
@MarcelKoch might need some help on point 3 (and the rest 😉 ).