Use GPU-GPU transfers directly into ghost buffer space in fields #28

Open
JamieJQuinn opened this issue Jan 22, 2024 · 1 comment
Labels
cuda Related to CUDA backend

Comments

@JamieJQuinn
Collaborator

Currently, ghost cells are stored in a separate buffer from the main field. This leads to complexity in subroutines like der_univ_dist in kernels_dist.f90. We should test:

  1. storing the ghost cell buffer directly inside the field array, eliminating the separate boundary calculations
  2. using CUDA-aware MPI to transfer this buffer region directly between field arrays (currently the buffer is copied into a separate array still on the GPU, then transferred)

At first glance, we definitely need to:

  1. change allocator to output arrays that include the ghost cells
  2. update transpose functions to understand new memory layout
  3. update send/recv ghost cell functions to send/recv directly into arrays
  4. update kernels which use ghost cells
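
To make the proposed layout concrete, here is a minimal sketch in plain Python (not the project's Fortran) of a field that carries its own ghost cells and receives neighbour data directly into them. Everything in it is an assumption for illustration: the helper names `allocate_padded`, `interior` and `exchange_ghosts` are hypothetical, the field is 1D, the ghost width of 4 per side is a guess, and the in-process copy only stands in for a CUDA-aware MPI send/recv.

```python
HALO = 4  # assumed ghost width per side; illustrative only


def allocate_padded(n, halo=HALO):
    """Return a 1D field with `halo` ghost cells on each side of n interior cells."""
    return [0.0] * (n + 2 * halo)


def interior(n, halo=HALO):
    """Zero-based slice of the cells owned by this rank."""
    return slice(halo, halo + n)


def exchange_ghosts(left, right, n, halo=HALO):
    """Copy edge data between two neighbouring padded fields, in place.

    Stands in for a direct send/recv into the ghost region: no staging
    buffer is used, the destination is the field array itself.
    Assumes n >= halo.
    """
    # Last `halo` interior cells of `left` fill the low ghost cells of `right`.
    right[0:halo] = left[n:n + halo]
    # First `halo` interior cells of `right` fill the high ghost cells of `left`.
    left[halo + n:n + 2 * halo] = right[halo:2 * halo]


# Usage: two hypothetical ranks with 6 interior cells each.
n = 6
a = allocate_padded(n)
b = allocate_padded(n)
a[interior(n)] = [float(i) for i in range(n)]          # values 0..5
b[interior(n)] = [float(i) for i in range(n, 2 * n)]   # values 6..11
exchange_ghosts(a, b, n)
```

After the exchange, a kernel can read its neighbour's values by indexing past the interior slice of the same array, which is the simplification items 3 and 4 are aiming for.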
@semi-h
Member

semi-h commented Jan 22, 2024

  1. This will simplify the distributed tridiagonal solver implementations in all backends. The large chunk that handles the first 4 entries in the domain should shrink to roughly a quarter of its current size, wrapped in a simple loop. We also won't need any buffer arrays, or to pass them to the distributed kernels, which simplifies things further. On the performance side, though, this probably won't have an impact.
    Apart from simplifying the distributed kernels (and future Thomas kernels), this will affect other regions of the codebase. For example, the non-ghost region, i.e. the region that actually belongs to the current rank, will look like u(:, 5:n+4, :). We'll need to investigate this further to get a better idea.

  2. I think we'll use strided vector datatypes as explained in [1]. This is certainly supported on CPUs, so I had a quick look at whether CUDA-aware MPI supports it. Luckily, this is something people have looked into [2,3]. But my understanding from [4] is that an MPI library handles this by copying the region defined by the strided vector type into a buffer array and then initiating the communication. [4] also states that a bad implementation may require as much space as the large array the strided vector lives in. However, a good MPI library may improve performance slightly, as we would call one function instead of two separate ones.
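
As a rough illustration of why a strided vector type fits here, the sketch below (plain Python, no MPI calls) shows that a ghost slab such as u(:, 1:4, :) in a column-major array corresponds to the (count, blocklength, stride) triple that MPI_Type_vector takes, with the stride counted in elements. The array shape (sz, n + 2*halo, nblocks), the function names and the sizes used are assumptions for illustration, not the project's actual layout. The `pack` helper mimics the copy into a contiguous buffer that, per [4], a library may perform internally.

```python
def vector_offsets(count, blocklength, stride):
    """Flat element offsets described by an MPI_Type_vector-like triple."""
    return [b * stride + i for b in range(count) for i in range(blocklength)]


def column_major_offset(i, j, k, sz, ny):
    """Zero-based flat offset of (i, j, k) in a Fortran-ordered (sz, ny, nblocks) array."""
    return i + sz * (j + ny * k)


def ghost_slab_offsets(sz, n, nblocks, halo):
    """Offsets of the low ghost slab, u(:, 1:halo, :) in Fortran terms, in memory order."""
    ny = n + 2 * halo
    return [column_major_offset(i, j, k, sz, ny)
            for k in range(nblocks) for j in range(halo) for i in range(sz)]


def pack(flat, offsets):
    """Gather the strided region into a contiguous list (the hidden staging copy)."""
    return [flat[o] for o in offsets]


# Usage with small, made-up sizes.
sz, n, nblocks, halo = 2, 5, 3, 4
ny = n + 2 * halo
slab = ghost_slab_offsets(sz, n, nblocks, halo)
# One block of sz*halo contiguous elements per k, separated by sz*ny elements.
vec = vector_offsets(count=nblocks, blocklength=sz * halo, stride=sz * ny)
flat = list(range(sz * ny * nblocks))   # stand-in for the field's memory
buffer = pack(flat, vec)
```

Under these assumptions the slab is `nblocks` contiguous runs of `sz * halo` elements, so one derived datatype could describe the whole ghost region. Whether a given CUDA-aware MPI library avoids the staging copy is exactly the open question above.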

It's worth looking into this in detail. Please share your thoughts, especially if you have experience with strided data in MPI @Nanoseb @pbartholomew08 @mathrack @rfj82982 @slaizet.
@rfj82982, strided MPI send/recv support could be a good idea for 2DECOMP&FFT as well, what do you think?

[1] https://www.dcs.ed.ac.uk/home/trollius/www.osc.edu/Lam/mpi/mpi_datatypes.html
[2] https://icl.utk.edu/files/publications/2016/icl-utk-877-2016.pdf
[3] https://web.cels.anl.gov/~thakur/papers/jenkins_cluster12.pdf
[4] https://carlpearson.net/pdf/20210420_pearson_phd.pdf

Nanoseb added the cuda (Related to CUDA backend) label Feb 16, 2024