Reorganize for MPI / NCCL transport layers #109

Merged: 18 commits into kokkos:develop on Sep 9, 2024

Conversation

cwpearson
Collaborator

@cwpearson cwpearson commented Jul 10, 2024

First part of #100

Introduce a CommunicationSpace (transport) concept

This is all modeled after Kokkos Core execution spaces. Each will have its own subdirectory under src (like Kokkos' backends). The interface that each comm space needs to implement is a struct. For example, Irecv, from KokkosComm_fwd.hpp:

template <KokkosView RecvView, KokkosExecutionSpace ExecSpace = Kokkos::DefaultExecutionSpace,
          CommunicationSpace CommSpace = DefaultCommunicationSpace>
struct Irecv;

Why a struct? Because in practice this means the comm space provides a partial specialization (which is not allowed for functions) that looks like this (from mpi/KokkosComm_mpi_irecv.hpp):

template <KokkosExecutionSpace ExecSpace, KokkosView RecvView>
struct Irecv<RecvView, ExecSpace, Mpi> { ... };

Where Mpi is a struct in the KokkosComm namespace, analogous to e.g. Kokkos::Cuda. A Serial transport would have a corresponding

template <KokkosExecutionSpace ExecSpace, KokkosView RecvView>
struct Irecv<RecvView, ExecSpace, Serial> { ... };

NCCL would have

template <KokkosExecutionSpace ExecSpace, KokkosView RecvView>
struct Irecv<RecvView, ExecSpace, NCCL> { ... };

and so on.

To support this, a lot of the include structure needs to be refined and adjusted to more closely match how Kokkos Core does it.

Future work:

  • can we have two enabled transports? If so, do we have a preferred and a fallback? Does choosing between them need to happen at runtime?

Introduce a KokkosComm::Handle

This hides the CommunicationSpace details from the KokkosComm interface. It's specialized on each CommunicationSpace. For MPI, this holds a Kokkos execution space instance and an MPI communicator.
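
A minimal sketch of the idea for MPI; the member names (space(), mpi_comm()) and constructor are illustrative, not necessarily the final API, and the KokkosExecutionSpace / CommunicationSpace concepts are the ones from KokkosComm_fwd.hpp:

#include <Kokkos_Core.hpp>
#include <mpi.h>

namespace KokkosComm {

template <KokkosExecutionSpace ExecSpace = Kokkos::DefaultExecutionSpace,
          CommunicationSpace CommSpace = DefaultCommunicationSpace>
class Handle;

// MPI specialization: bundles the execution space instance the communication is
// ordered against with the MPI communicator it targets.
template <KokkosExecutionSpace ExecSpace>
class Handle<ExecSpace, Mpi> {
 public:
  Handle(const ExecSpace &space, MPI_Comm comm) : space_(space), comm_(comm) {}
  const ExecSpace &space() const { return space_; }
  MPI_Comm mpi_comm() const { return comm_; }

 private:
  ExecSpace space_;
  MPI_Comm comm_;
};

}  // namespace KokkosComm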

Reduce main interface to 3 functions:

KokkosComm::Req KokkosComm::isend(Handle...);
KokkosComm::Req KokkosComm::irecv(Handle...);
void KokkosComm::barrier(Handle...);
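
A hypothetical end-to-end use of the reduced interface (variable names, constructors, and exact argument lists are illustrative; the wait functions are described under "Grab-bag" below):

KokkosComm::Handle<> h;  // default execution space + default communication space (illustrative)
Kokkos::View<double *> xv("xv", 1000), yv("yv", 1000);
const int src = 0, dst = 1;  // peer ranks (illustrative)

auto rreq = KokkosComm::irecv(h, yv, src);
auto sreq = KokkosComm::isend(h, xv, dst);
KokkosComm::wait(rreq);
KokkosComm::wait(sreq);
KokkosComm::barrier(h);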

Move MPI-specific wrappers to mpi namespace

This is so we can still provide useful MPI interop, but it is not part of the "blessed" interface.

Update unit tests and perf tests for new layout

Many unit tests are now MPI-specific. All perf_tests are only built if MPI is enabled (this could be revisited when other Transports are done).

Move CommMode and related interfaces to mpi namespace

This is MPI-specific stuff. I also made the CommMode machinery more canonical C++:

  • switched e.g. ReadyCommMode -> CommModeReady
  • Made the comm mode the last argument, and simplified how it's used in some places

Add some additional support for non-contiguous views along existing lines

A Req<Mpi> knows how to keep track of some lambda functions it can call after MPI_Wait (e.g. to unpack the result of an MPI_Irecv).
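
Roughly, the idea is something like the following sketch (member names such as call_after_wait are hypothetical, and the primary Req template is only forward-declared here):

#include <functional>
#include <utility>
#include <vector>
#include <mpi.h>

template <CommunicationSpace CommSpace>
class Req;

template <>
class Req<Mpi> {
 public:
  // Register work to run after the underlying MPI request completes, e.g. unpacking
  // a contiguous scratch buffer into the user's non-contiguous view.
  void call_after_wait(std::function<void()> f) { post_waits_.push_back(std::move(f)); }

  // Invoked by KokkosComm::wait(req).
  void wait() {
    MPI_Wait(&req_, MPI_STATUS_IGNORE);
    for (auto &f : post_waits_) f();
    post_waits_.clear();
  }

 private:
  MPI_Request req_ = MPI_REQUEST_NULL;
  std::vector<std::function<void()>> post_waits_;
};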

Grab-bag

  • req.wait -> KokkosComm::wait(req);
  • add wait_any and wait_all.
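
Illustrative usage of the free wait functions, continuing the sketch further up (exact signatures and return values may differ):

std::vector<KokkosComm::Req> reqs;            // h, xv, yv as in the earlier sketch
reqs.push_back(KokkosComm::irecv(h, xv, 0));
reqs.push_back(KokkosComm::irecv(h, yv, 1));
KokkosComm::wait_any(reqs);                   // returns once at least one request completes
KokkosComm::wait_all(reqs);                   // returns once every request completes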

@cwpearson cwpearson marked this pull request as ready for review July 10, 2024 21:35
@cwpearson cwpearson force-pushed the feature/transport branch 5 times, most recently from 13405f4 to 8de569a on July 10, 2024 at 21:55
@cwpearson cwpearson mentioned this pull request Jul 10, 2024
@dssgabriel
Collaborator

dssgabriel commented Jul 10, 2024

I haven't looked into the code in detail, but a few notes:

  • I think the mpi namespace belongs in Impl, and we shouldn't expose it to users directly (same for nccl and other transports we may implement). We only provide a "basic" API: send/receive, collectives, and other needed utilities (barriers, waits, etc...), and don't expose the underlying specialization, similarly to how Kokkos' parallel_* functions aren't prefixed by an ExecSpace namespace;
  • We should remove the i in front of isend/irecv to part ways with the MPI naming scheme. It will be clearer, as we won't provide blocking versions anyway, and the non-blocking behavior is clearly defined in the docs. Also matches NCCL better! 🙂
  • As discussed during last week's meeting, we should drop the Serial mode entirely. Users should be able to emulate it using only one (MPI) process. If not, their code is probably broken;
  • A random idea I just had, but what about CommunicationSpace instead of Transport? It sounds similar to Kokkos' ExecutionSpace and is modeled similarly as far as I can see here.

Thank you for the work on this big PR; things are progressing nicely!

@cwpearson
Collaborator Author

cwpearson commented Jul 11, 2024

I think the mpi namespace belongs in Impl, and we shouldn't expose it to users directly (same for nccl and other transports we may implement). We only provide a "basic" API: send/receive, collectives, and other needed utilities (barriers, waits, etc...), and don't expose the underlying specialization, similarly to how Kokkos' parallel_* functions aren't prefixed by an ExecSpace namespace;

I was debating this myself - I was thinking about the MPI interop angle. I think it makes sense for us to actually provide stable-ish MPI wrappers since that is one of the motivations for this library and what our current users want from it. Impl suggests they are just an implementation detail. I wanted to tuck them away in mpi though so it's clear they're not part of the "blessed" "core" API where we might do some new stuff. Does it seem okay?

We should remove the i in front of isend/irecv to part ways with the MPI naming scheme. It will be clearer, as we won't provide blocking versions anyway, and the non-blocking behavior is clearly defined in the docs. Also matches NCCL better! 🙂

Will do

As discussed during last week's meeting, we should drop the Serial mode entirely. Users should be able to emulate it using only one (MPI) process. If not, their code is probably broken;

This doesn't actually implement anything except MPI, so no danger here 😄 . We could implement Serial (or not) in the future as needed.

A random idea I just had, but what about CommunicationSpace instead of Transport? It sounds similar to Kokkos' ExecutionSpace and is modeled similarly as far as I can see here.

Good idea.

@cwpearson cwpearson force-pushed the feature/transport branch 2 times, most recently from d72f966 to 9126092 on July 11, 2024 at 21:26
@cwpearson
Collaborator Author

@dssgabriel I'm proposing some modifications to how the comm mode is used (9126092); this could potentially be a separate PR.

@dssgabriel
Collaborator

I think it makes sense for us to actually provide stable-ish MPI wrappers since that is one of the motivations for this library and what our current users want from it. Impl suggests they are just an implementation detail. I wanted to tuck them away in mpi though so it's clear they're not part of the "blessed" "core" API where we might do some new stuff.

I fear that if we provide direct MPI wrappers people will just start using those instead of our "core" API, as it may be easier for them to adopt. This won't add any value to their code, and they may as well keep the MPI interop they already have in place.
Given that we are trying to design an API that is "transport-agnostic", I think we should push users towards it as much as possible.

Some other questions/remarks:

  • The interface that each comm space needs to implement is a struct.

    Do we really need structs, or could we circumvent that and only provide functions? If Kokkos handles the dispatch of parallel_* patterns on different execution spaces without structs, could we try a similar approach? Genuine question, as I don't know how it is implemented in Kokkos.

  • Can we have two enabled transports? If so, do we have a preferred and a fallback? Does choosing between them need to happen at runtime?

    I think this is mandatory. NCCL does not explicitly require MPI to be initialized, but it is much easier to do with it.
    IMHO, MPI should be our fallback, but perhaps we can make it the default when running on a host exec space, while NCCL is the default on a (CUDA) device?
    Whether we can choose the backend at runtime is still unclear; I think we need to write some code to explore how it should work.

Let's discuss this PR during next week's meeting; we may want to split it up even further. I'll give a better overview of what NCCL offers and how it should interact with this PR. 👍

@cwpearson
Collaborator Author

It's done with structs in Kokkos because you can't do partial function specialization - e.g. you CAN'T do

template <typename View, typename CommunicationSpace>
void send(View v);

// KokkosComm implementation for MPI in here
template <typename View>
void send<View, Mpi>(View v) {...};

but you CAN do

template <typename View, typename CommunicationSpace>
struct send;

// KokkosComm implementation for MPI in here
template <typename View>
struct send<View, Mpi> {...};
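
The user-facing entry point can then stay an ordinary function template that forwards to the struct. A sketch of that dispatch; the Impl namespace and the static execute member are hypothetical names, not the PR's actual layout:

namespace Impl {

template <typename View, typename CommunicationSpace>
struct send;  // primary template, never defined

// MPI partial specialization carries the actual implementation.
template <typename View>
struct send<View, Mpi> {
  static void execute(View v) { /* MPI-specific implementation */ }
};

}  // namespace Impl

// The free function users call; the partial specialization is selected at compile time.
template <typename View, typename CommunicationSpace = Mpi>
void send(View v) {
  Impl::send<View, CommunicationSpace>::execute(v);
}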

@cwpearson
Collaborator Author

cwpearson commented Jul 12, 2024

I fear that if we provide direct MPI wrappers people will just start using those instead of our "core" API, as it may be easier for them to adopt. This won't add any value to their code, and they may as well keep the MPI interop they already have in place.
Given that we are trying to design an API that is "transport-agnostic", I think we should push users towards it as much as possible.

The value of the MPI wrappers for users is in two places:

  1. we maintain the nitpicky non-contiguous-data-handling code rather than them, and perhaps transparently provide better performance.
  2. We make MPI / execution space instance syncing harder to mess up.

There is demonstrated interest in these. They can be realized in the near term, and provide value to the overall effort:

  1. We establish some cred in the community for when we propose new ideas
  2. People integrate the library with their apps, so we have a ready-made set of people to test the new ideas.

Like you (?), I think the core API should be more thoughtful and novel. However, if what we come up with is such a small step forward that people find MPI-Kokkos interop more compelling, then they weren't going to use the new thing anyway.
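
As an illustration of point 2 above, a wrapper can take the execution space instance and fence it before handing the buffer to MPI. A rough sketch, assuming a contiguous view and a hypothetical mpi_type<T>() helper that maps T to an MPI_Datatype:

template <KokkosExecutionSpace ExecSpace, KokkosView SendView>
void isend(const ExecSpace &space, SendView sv, int dest, int tag, MPI_Comm comm, MPI_Request *req) {
  space.fence();  // ensure kernels that produced sv's contents on this instance have finished
  MPI_Isend(sv.data(), static_cast<int>(sv.size()), mpi_type<typename SendView::value_type>(),
            dest, tag, comm, req);
}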

@cwpearson cwpearson force-pushed the feature/transport branch 3 times, most recently from 419f8f8 to c454f51 on July 24, 2024 at 18:15
@cwpearson cwpearson mentioned this pull request Jul 24, 2024
@cwpearson
Collaborator Author

cwpearson commented Jul 24, 2024

@dssgabriel how are we feeling about this? We have an external party asking about writing a backend to support their research, plus our own NCCL work.

@cwpearson cwpearson changed the title from "Reorganize for Serial / MPI / NCCL transport layers" to "Reorganize for MPI / NCCL transport layers" on Jul 24, 2024
@dssgabriel
Collaborator

dssgabriel commented Jul 25, 2024

I think this is on the right track!

A few points of discussion that we probably need to figure out before merging:

  • NCCL does not have the concept of tags: do we consider removing them from our "core/blessed" API? Maybe we can provide an overload that has the tag (so we can just ignore it in the NCCL specialization).
  • NCCL does not have the concept of communication modes: do we drop them entirely? I am not sure they have actual use cases in the applications we target.
    Alternatively, we could keep them as part of the MPI-interop API only (the one in the KokkosComm::mpi namespace).
  • Does the PR in its current form let us have multiple CommunicationSpaces enabled at the same time? I feel like this will be a requirement (e.g., NCCL is far easier to initialize if MPI is available, I don't know about other transport layers).
  • IMO, the OSU perf test should be in a dedicated mpi/ subdirectory (similar to some of the unit tests), so as not to mix tests that use the "core/blessed" API and the MPI-specific stuff.
  • As this is a big PR that moves/renames a lot of files, we might as well really think about our project layout/structure.
    I am quite a fan of system includes that are prefixed with the project name (e.g. boost, fmt, etc...):
    #include <KokkosComm/KokkosComm.hpp>
    #include <KokkosComm/mpi/comm_modes.hpp>
    ...
    Rearranging the layout would also let us remove some of the prefixing in the filenames, since the project structure itself conveys that information (this is only a proposal, to be discussed/reworked!):
    .
    ├── src
    │   ├── KokkosComm
    │   │   ├── mpi
    │   │   │   ├── collectives
    │   │   │   │   ├── allgather.hpp
    │   │   │   │   ├── reduce.hpp
    │   │   │   │   └── ...
    │   │   │   ├── point_to_point
    │   │   │   │   ├── recv.hpp
    │   │   │   │   └── send.hpp
    │   │   │   ├── utilities
    │   │   │   │   ├── test_all.hpp
    │   │   │   │   ├── wait.hpp
    │   │   │   │   └── ...
    │   │   │   ├── interop
    │   │   │   │   └── ...
    │   │   │   ├── comm_modes.hpp
    │   │   │   └── ...
    │   │   ├── nccl
    │   │   │   └── ... # similar structure as `mpi`
    │   │   └── whatever_backend
    │   │       └── ... # same
    │   ├── concepts.hpp
    │   ├── KokkosComm.hpp
    │   └── ...
    ├── unit_tests
    │   ├── core
    │   ├── mpi
    │   └── ...
    └── perf_tests
        ├── core
        ├── mpi
        │   └── osu
        └── ...
    

I will review the code once again tomorrow so that we can discuss it at next Monday's meeting and quickly merge it afterwards. 👍

@cwpearson
Collaborator Author

  • NCCL does not have the concept of tags: do we consider removing them from our "core/blessed" API? Maybe we can provide an overload that has the tag (so we can just ignore it in the NCCL specialization).

I tried reading the NCCL docs but I couldn't figure out what the semantics are if I do two ncclSend/ncclRecv in a row to the same destination. Do you know if they are non-overtaking like MPI? I think there needs to be some way to distinguish multiple in-flight messages with the same source and dest, but keeping messages ordered may be enough. I'll drop the tags for now and we can discuss.

  • NCCL does not have the concept of communication modes: do we drop them entirely? I am not sure they have actual use cases in the applications we target.
    Alternatively, we could keep them as part of the MPI-interop API only (the one in the KokkosComm::mpi namespace).

I think just keep them for MPI since they are basically an MPI thing.

  • Does the PR in its current form let us have multiple CommunicationSpaces enabled at the same time? I feel like this will be a requirement (e.g., NCCL is far easier to initialize if MPI is available, I don't know about other transport layers).
  1. I was thinking we could support Preferred/Fallback: if the Preferred CommunicationSpace implements an API, it will be used, otherwise the fallback would be used (MPI would be the fallback for the foreseeable future). A rough sketch follows after point 2.

  2. It may be okay for a preferred CS to require a specific fallback CS (or a fallback CS at all) to be enabled (maybe NCCL falls into this camp, since it needs a fallback to help initialize itself). Or maybe just have the NCCL CS actually be a hybrid NCCL / MPI CS. I don't think we want to allow people to say "oh, my backend requires X/Y/Z CS also be enabled"; that's too complicated. Something that complicated should just be its own CS that manages / uses X/Y/Z internally.
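
Very roughly, the Preferred/Fallback selection could happen at compile time, something like this sketch (Impl::implements_v, the operation tag types, and the PreferredCommSpace / FallbackCommSpace aliases are all hypothetical):

#include <type_traits>

// Pick the communication space for an operation: use the preferred space if it
// implements the operation, otherwise fall back (e.g. to Mpi).
template <typename Op>
using SpaceFor = std::conditional_t<Impl::implements_v<PreferredCommSpace, Op>,
                                    PreferredCommSpace, FallbackCommSpace>;

// An operation would then dispatch through whichever space was selected, e.g.
// Irecv<RecvView, ExecSpace, SpaceFor<Impl::IrecvOp>>.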

  • IMO, the OSU perf test should be in a dedicated mpi/ subdirectory (similar to some of the unit tests), so as not to mix tests that use the "core/blessed" API and the MPI-specific stuff.

I think you're right about this, let me take another stab at it.

  • As this is a big PR that moves/renames a lot of files, we might as well really think about our project layout/structure.
    I am quite a fan of system includes that are prefixed with the project name (e.g. boost, fmt, etc.):

    #include <KokkosComm/KokkosComm.hpp>
    #include <KokkosComm/mpi/comm_modes.hpp>
    ...

I prefer this too, I was just following Kokkos Core and Kernels. Let me see if I can make this change. Then we'll really be touching every file 😄

@dssgabriel
Collaborator

I tried reading the NCCL docs but I couldn't figure out what the semantics are if I do two ncclSend/ncclRecv in a row to the same destination.

As far as I understand, NCCL enqueues communication operations on a CUDA stream. If two successive NCCL calls are on the same stream, they are non-overtaking. However, I think two communications enqueued on different streams may be able to overtake each other? I believe some folks at CEA know the lead developer of NCCL, so we'll try to set up a meeting with them and get answers to all our semantics-related questions.
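
To make the question concrete (real NCCL point-to-point calls; comm, stream, and the buffers are assumed to be set up already):

// Both sends are enqueued on the same CUDA stream, so they execute in stream order
// and should not overtake each other (as described above). The open question is what
// ordering guarantees exist if the second send were enqueued on a different stream.
ncclSend(buf_a, count, ncclFloat, peer, comm, stream);
ncclSend(buf_b, count, ncclFloat, peer, comm, stream);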

I think just keep [CommModes] for MPI since they are basically an MPI thing.

LGTM 👍

I don't think we want to allow people to say "oh, my backend requires X/Y/Z CS also be enabled", too complicated. Something that complicated should just be its own CS that manages/uses X/Y/Z internally.

I agree. Maybe we can automatically enable MPI if users compiled their code with -DKokkosComm_ENABLE_NCCL=ON?
To be fair, NCCL only needs to broadcast a unique identifier to all participating processes, so it's a rather "lightweight" setup. Broadcasting is doable using UNIX sockets but it's much easier if we have MPI_Bcast available.
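
For reference, that bootstrap is the standard NCCL-with-MPI pattern (error handling omitted):

#include <mpi.h>
#include <nccl.h>

int rank, nranks;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nranks);

ncclUniqueId id;
if (rank == 0) ncclGetUniqueId(&id);                      // one process creates the unique ID
MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);  // MPI broadcasts it to all ranks

ncclComm_t comm;
ncclCommInitRank(&comm, nranks, id, rank);                // every rank joins the NCCL communicator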

I prefer this too, I was just following Kokkos Core and Kernels. Let me see if I can make this change. Then we'll really be touching every file 😄

Great! I'm looking forward to reviewing every file 😁 Thank you for all the work on this PR!

@cwpearson
Collaborator Author

I ended up moving all the perf tests to an mpi subdirectory because the perf test driver uses MPI. We should probably re-write it in our core API as a follow-up.

@cwpearson
Collaborator Author

@dssgabriel I think all future work is waiting behind this PR, how are we feeling?

Member

@cedricchevalier19 cedricchevalier19 left a comment


I am not particularly in favor of adding a public KokkosComm::mpi namespace. Can we put it in an Experimental namespace?

Otherwise, I am in favor of merging this PR, as it really improves the code architecture. Thanks a lot!

@@ -0,0 +1,48 @@
//@HEADER
Member


I am not sure naming this file "mpi.hpp" is a good idea.


#include <type_traits>

#include "../concepts.hpp"
Member


Try to avoid relative paths in "public" includes.

@dssgabriel
Collaborator

Do we want to go ahead and just merge this? We could open small issues and gradually fix the various remarks we have brought up here.

@cwpearson
Collaborator Author

cwpearson commented Sep 9, 2024

Let's merge this, and then I can open up issues for the comments here. I need an approving review though 😄

Collaborator

@dssgabriel dssgabriel left a comment


Let's merge this and open issues for the relevant remarks we've made in the PR.
Main points I have in mind at the moment are:

  • re-organize the files to match ISO C++ library design
  • move the mpi namespace behind Experimental

@cwpearson cwpearson merged commit a968a13 into kokkos:develop Sep 9, 2024
8 checks passed