Implement MPI functions for ArborX #53

Open
5 of 7 tasks
cwpearson opened this issue May 10, 2024 · 19 comments
cwpearson (Collaborator) commented May 10, 2024

@aprokop

  • MPI_Alltoall
  • MPI_Barrier
  • MPI_Allgather
    • in-place
  • MPI_Irecv
  • MPI_Isend
  • MPI Derived Datatypes

Probably skip these for a bare MPI wrapper:

  • MPI_Comm_dup / MPI_Comm_free
  • MPI_Init
  • MPI_Finalize
aprokop (Contributor) commented May 10, 2024

This sounds right. It would also be nice to do something about getting the rank and size from a communicator, but that's not a big deal.
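
For reference, the rank and size could be exposed as thin wrappers over the standard MPI calls (a minimal sketch; these helper names are assumptions, not an agreed API):

  #include <mpi.h>

  // Hypothetical convenience helpers around MPI_Comm_rank / MPI_Comm_size.
  inline int rank(MPI_Comm comm) {
    int r;
    MPI_Comm_rank(comm, &r);
    return r;
  }

  inline int size(MPI_Comm comm) {
    int s;
    MPI_Comm_size(comm, &s);
    return s;
  }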

Another pretty annoying thing we have to do is avoid passing --help to non-zero ranks, as otherwise Kokkos would print the help message on every rank. We do it like this:

  // Strip "--help" and "--kokkos-help" from the flags passed to Kokkos if we
  // are not on MPI rank 0 to prevent Kokkos from printing the help message
  // multiply.
  if (comm_rank != 0)
  {
    auto *help_it = std::find_if(argv, argv + argc, [](std::string const &x) {
      return x == "--help" || x == "--kokkos-help";
    });
    if (help_it != argv + argc)
    {
      std::swap(*help_it, *(argv + argc - 1));
      --argc;
    }
  }
  Kokkos::initialize(argc, argv);

aprokop (Contributor) commented May 10, 2024

Another thing that could be worth addressing at some point is being able to run with GPU-aware MPI implementations, rather than doing everything on the host. We have a pathway in ArborX for it (although we never saw it run faster; that was about 4 years ago, so maybe things are better now).

cwpearson (Collaborator, Author) commented:

I expect GPU-aware MPI to be pretty high on the priority list once we get some of these initial functions implemented!

janciesko (Collaborator) commented:

GPU-aware MPI installations should be supported out of the box. Note that the communication itself is still host-triggered.

dutkalex (Collaborator) commented May 13, 2024

GPU-aware MPI installations should be supported out of the box. Note that the communication itself is still host-triggered.

GPU-aware communications do work out of the box indeed, but a nice-to-have in my experience is a fallback mode that deep-copies onto the host, sends, and deep-copies back upon reception, for non-GPU-aware MPI installations (or badly installed, supposedly GPU-aware ones that just crash in practice...). It can easily be implemented with a few #ifdefs, and it lets users run on GPUs with a slight communication overhead instead of crashing with cryptic segfaults.

dssgabriel (Collaborator) commented:

So how do we handle this?

  1. Simply call the MPI function and pass the view's pointer to GPU memory (as we can with a GPU-aware MPI implementation)
  2. Deep copy into a host view and initiate the communication from the host

dssgabriel (Collaborator) commented May 13, 2024

@dutkalex wrote:

It can easily be implemented with a few #ifdefs, and it lets users run on GPUs with a slight communication overhead instead of crashing with cryptic segfaults.

We should implement this with if constexpr and let users define a preprocessor variable that tells us whether (a) they know their MPI implementation is GPU-aware and want the raw MPI call, or (b) they do not know whether it is GPU-aware and want the view deep-copied onto the host, with the communication initiated from the host.

Users will always know what to expect, and it will let us remove the fallback mechanisms from the critical path once the code is compiled.
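
A rough sketch of how such a preprocessor variable could feed an if constexpr branch (KOKKOSCOMM_GPU_AWARE and gpu_aware_mpi are assumed names, not an agreed API):

  #include <mpi.h>

  // Turn the user-provided macro into a constexpr flag usable with if constexpr,
  // so the unwanted path is discarded at compile time.
  #ifdef KOKKOSCOMM_GPU_AWARE
  inline constexpr bool gpu_aware_mpi = true;
  #else
  inline constexpr bool gpu_aware_mpi = false;
  #endif

  template <class View>
  void send(View const &v, int dest, int tag, MPI_Comm comm) {
    if constexpr (gpu_aware_mpi) {
      // pass v.data() (a possibly-device pointer) straight to MPI
    } else {
      // deep-copy v into a host mirror first, then call MPI on the mirror
    }
  }

The two branch bodies would be exactly the two paths sketched in the next comment.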

dutkalex (Collaborator) commented May 13, 2024

So how do we handle this?

To me, it's as simple as:

#include <mpi.h>
#include <Kokkos_Core.hpp>

template <class View>
void send(View const &device_view, int dest, int tag, MPI_Comm comm) {
  const int nbytes =
      static_cast<int>(device_view.span() * sizeof(typename View::value_type));
#ifdef KOKKOSCOMM_GPU_AWARE
  // GPU-aware MPI: hand the device pointer to MPI directly.
  MPI_Send(device_view.data(), nbytes, MPI_BYTE, dest, tag, comm);
#else
  // Fallback: stage through a host mirror, then send from the host.
  auto mirror = Kokkos::create_mirror_view(device_view);
  Kokkos::deep_copy(mirror, device_view);
  MPI_Send(mirror.data(), nbytes, MPI_BYTE, dest, tag, comm);
#endif
}

template <class View>
void recv(View const &device_view, int src, int tag, MPI_Comm comm) {
  const int nbytes =
      static_cast<int>(device_view.span() * sizeof(typename View::value_type));
#ifdef KOKKOSCOMM_GPU_AWARE
  // GPU-aware MPI: receive straight into device memory.
  MPI_Recv(device_view.data(), nbytes, MPI_BYTE, src, tag, comm, MPI_STATUS_IGNORE);
#else
  // Fallback: receive into a host mirror, then copy back to the device.
  auto mirror = Kokkos::create_mirror_view(device_view);
  MPI_Recv(mirror.data(), nbytes, MPI_BYTE, src, tag, comm, MPI_STATUS_IGNORE);
  Kokkos::deep_copy(device_view, mirror);
#endif
}

You can replace the #ifdefs with if constexpr branches if you prefer, but the result will be the same, @dssgabriel.

aprokop (Contributor) commented May 13, 2024

I would argue that the GPU-aware call should be a runtime decision, not a compile-time one.

dutkalex (Collaborator) commented:

I would argue that the GPU-aware call should be a runtime decision, not a compile-time one.

Why would you want both compiled in? When linking against an MPI implementation, GPU-aware communications either work or they don't...
I suppose, however, that making this a runtime switch wouldn't incur any measurable overhead, so in that regard why not, but I don't see a use case for it... Maybe to measure the overhead of the degraded mode? 🤔

aprokop (Contributor) commented May 13, 2024

A user may want to dispatch the kernel on the host or the device depending on the size of the data or the MPI call. I think that if a view is on the host, it makes sense to use host MPI routines; if it's on the device, use GPU-aware routines if available, and transfer to the host if not.

dutkalex (Collaborator) commented May 13, 2024

I agree, but the compile-time solution already enables this: if the view is on the host, you get a regular host MPI call regardless of the branch, because the pointer passed to MPI is a host pointer and the deep copy becomes a no-op.

cedricchevalier19 (Member) commented:

So how do we handle this?

To me, it's as simple as:

...

It is more complicated for asynchronous communications, and I am slightly annoyed by the silent extra allocation.
However, we should handle this case. This scenario is one of the motivating examples for this project.

dutkalex (Collaborator) commented May 27, 2024

It is more complicated for asynchronous communications

Agreed. Async semantics require hooking the recv deep_copy into the wait() call and making sure the host views' lifetimes are managed properly. Nothing infeasible, but it's a little more work (see the sketch below).

and I am slightly annoyed by the silent extra allocation.

Agreed, this is suboptimal. I don't see an obvious solution to this problem, but I bet we can come up with something better. Or at the very least more transparent to the user...
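
One possible shape for hooking the copy-back into wait() on the non-GPU-aware fallback path (a minimal sketch; Req, irecv, and every other name here is hypothetical, not an agreed API):

  #include <mpi.h>
  #include <Kokkos_Core.hpp>

  // Hypothetical request wrapper: it keeps the host staging buffer alive and
  // performs the copy back to the device when the user waits.
  template <class View>
  class Req {
   public:
    Req(MPI_Request req, View dst, typename View::HostMirror staging)
        : req_(req), dst_(dst), staging_(staging) {}

    void wait() {
      MPI_Wait(&req_, MPI_STATUS_IGNORE);
      // Effectively a no-op if dst_ already lives on the host
      // (staging_ aliases it in that case).
      Kokkos::deep_copy(dst_, staging_);
    }

   private:
    MPI_Request req_;
    View dst_;                           // destination (possibly device) view
    typename View::HostMirror staging_;  // host buffer MPI actually writes into
  };

  template <class View>
  Req<View> irecv(View const &v, int src, int tag, MPI_Comm comm) {
    auto staging = Kokkos::create_mirror_view(v);
    MPI_Request req;
    MPI_Irecv(staging.data(),
              static_cast<int>(staging.span() * sizeof(typename View::value_type)),
              MPI_BYTE, src, tag, comm, &req);
    return Req<View>(req, v, staging);
  }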

aprokop (Contributor) commented Jun 11, 2024

I have an ArborX branch for kokkos-comm. Currently:

cwpearson (Collaborator, Author) commented:

How do you want the custom datatypes to look? Is it a Kokkos::View of things that are described by custom datatypes?

aprokop (Contributor) commented Jun 12, 2024

How do you want the custom datatypes to look? Is it a Kokkos::View of things that are described by custom datatypes?

It is a Kokkos::View<Box*>, where Box is a struct of PODs.

cwpearson (Collaborator, Author) commented:

Do you ship the whole Box or just some of the fields?

aprokop (Contributor) commented Jun 12, 2024

The whole thing. Right now, just through n * sizeof(Box) and MPI_BYTE.
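
If the "MPI Derived Datatypes" item from the checklist gets picked up, one alternative to raw MPI_BYTE is a contiguous derived type over Box (a hedged sketch; send_boxes and its parameters are made up for illustration):

  #include <mpi.h>
  #include <Kokkos_Core.hpp>

  // Hypothetical helper: ship a contiguous View<Box*> through a derived
  // datatype instead of n * sizeof(Box) bytes. Assumes Box is trivially
  // copyable (a struct of PODs, as described above).
  template <class Box>
  void send_boxes(Kokkos::View<Box *> boxes, int dest, int tag, MPI_Comm comm) {
    MPI_Datatype box_type;
    MPI_Type_contiguous(static_cast<int>(sizeof(Box)), MPI_BYTE, &box_type);
    MPI_Type_commit(&box_type);
    MPI_Send(boxes.data(), static_cast<int>(boxes.extent(0)), box_type, dest,
             tag, comm);
    MPI_Type_free(&box_type);
  }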
