Reorganize for MPI / NCCL transport layers #109

Merged: 18 commits into kokkos:develop on Sep 9, 2024

Conversation

cwpearson
Collaborator

@cwpearson cwpearson commented Jul 10, 2024

First part of #100

Introduce a CommunicationSpace (transport) concept

This is all modeled after Kokkos Core execution spaces. Each will have its own subdirectory under src (like Kokkos' backends). The interface that each comm space needs to implement is a struct. For example, Irecv, from KokkosComm_fwd.hpp:

template <KokkosView RecvView, KokkosExecutionSpace ExecSpace = Kokkos::DefaultExecutionSpace,
          CommunicationSpace CommSpace = DefaultCommunicationSpace>
struct Irecv;

Why a struct? Because in practice this means the comm space provides a partial specialization (which is not allowed for functions) that looks like this (from mpi/KokkosComm_mpi_irecv.hpp):

template <KokkosExecutionSpace ExecSpace, KokkosView RecvView>
struct Irecv<RecvView, ExecSpace, Mpi> { ... };

Where Mpi is a struct in the KokkosComm namespace, analogous to e.g. Kokkos::Cuda. A Serial transport would have a corresponding

template <KokkosExecutionSpace ExecSpace, KokkosView RecvView>
struct Irecv<RecvView, ExecSpace, Serial> { ... };

NCCL would have

template <KokkosExecutionSpace ExecSpace, KokkosView RecvView>
struct Irecv<RecvView, ExecSpace, NCCL> { ... };

and so on.

To support this, a lot of the include structure needs to be refined and adjusted to more closely match how Kokkos Core does it.

Future work:

  • can we have two enabled transports? If so, do we have a preferred and a fallback? Does choosing between them need to happen at runtime?

Introduce a KokkosComm::Handle

This hides the CommunicationSpace details from the KokkosComm interface. It's specialized on each CommunicationSpace. For MPI, this holds a Kokkos execution space instance and an MPI communicator.
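
A minimal sketch of the idea for MPI; the member names (space(), mpi_comm()) and constructor are illustrative, not necessarily the final API, and the KokkosExecutionSpace / CommunicationSpace concepts are the ones from KokkosComm_fwd.hpp:

#include <Kokkos_Core.hpp>
#include <mpi.h>

namespace KokkosComm {

template <KokkosExecutionSpace ExecSpace = Kokkos::DefaultExecutionSpace,
          CommunicationSpace CommSpace = DefaultCommunicationSpace>
class Handle;

// MPI specialization: bundles the execution space instance the communication is
// ordered against with the MPI communicator it targets.
template <KokkosExecutionSpace ExecSpace>
class Handle<ExecSpace, Mpi> {
 public:
  Handle(const ExecSpace &space, MPI_Comm comm) : space_(space), comm_(comm) {}
  const ExecSpace &space() const { return space_; }
  MPI_Comm mpi_comm() const { return comm_; }

 private:
  ExecSpace space_;
  MPI_Comm comm_;
};

}  // namespace KokkosComm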

Reduce main interface to 3 functions:

KokkosComm::Req KokkosComm::isend(Handle...);
KokkosComm::Req KokkosComm::irecv(Handle...);
void KokkosComm::barrier(Handle...);
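
A hypothetical end-to-end use of the reduced interface (variable names, constructors, and exact argument lists are illustrative; the wait functions are described under "Grab-bag" below):

KokkosComm::Handle<> h;  // default execution space + default communication space (illustrative)
Kokkos::View<double *> xv("xv", 1000), yv("yv", 1000);
const int src = 0, dst = 1;  // peer ranks (illustrative)

auto rreq = KokkosComm::irecv(h, yv, src);
auto sreq = KokkosComm::isend(h, xv, dst);
KokkosComm::wait(rreq);
KokkosComm::wait(sreq);
KokkosComm::barrier(h);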

Move MPI-specific wrappers to mpi namespace

This is so we can still provide useful MPI interop, but it is not part of the "blessed" interface.

Update unit tests and perf tests for new layout

Many unit tests are now MPI-specific. All perf_tests are only built if MPI is enabled (this could be revisited when other Transports are done).

Move CommMode and related interfaces to mpi namespace

This is MPI-specific stuff. I also made the CommMode machinery more canonical C++:

  • switched e.g. ReadyCommMode -> CommModeReady
  • Made the comm mode the last argument, and simplified how it's used in some places

Add some additional support for non-contiguous views along existing lines

A Req<Mpi> knows how to keep track of some lambda functions it can call after MPI_Wait (e.g. to unpack the result of an MPI_Irecv).
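
Roughly, the idea is something like the following sketch (member names such as call_after_wait are hypothetical, and the primary Req template is only forward-declared here):

#include <functional>
#include <utility>
#include <vector>
#include <mpi.h>

template <CommunicationSpace CommSpace>
class Req;

template <>
class Req<Mpi> {
 public:
  // Register work to run after the underlying MPI request completes, e.g. unpacking
  // a contiguous scratch buffer into the user's non-contiguous view.
  void call_after_wait(std::function<void()> f) { post_waits_.push_back(std::move(f)); }

  // Invoked by KokkosComm::wait(req).
  void wait() {
    MPI_Wait(&req_, MPI_STATUS_IGNORE);
    for (auto &f : post_waits_) f();
    post_waits_.clear();
  }

 private:
  MPI_Request req_ = MPI_REQUEST_NULL;
  std::vector<std::function<void()>> post_waits_;
};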

Grab-bag

  • req.wait -> KokkosComm::wait(req);
  • add wait_any and wait_all.
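
Illustrative usage of the free wait functions, continuing the sketch further up (exact signatures and return values may differ):

std::vector<KokkosComm::Req> reqs;            // h, xv, yv as in the earlier sketch
reqs.push_back(KokkosComm::irecv(h, xv, 0));
reqs.push_back(KokkosComm::irecv(h, yv, 1));
KokkosComm::wait_any(reqs);                   // returns once at least one request completes
KokkosComm::wait_all(reqs);                   // returns once every request completes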

@cwpearson cwpearson marked this pull request as ready for review July 10, 2024 21:35
@cwpearson cwpearson force-pushed the feature/transport branch 5 times, most recently from 13405f4 to 8de569a on July 10, 2024 at 21:55
@cwpearson cwpearson mentioned this pull request Jul 10, 2024
@dssgabriel
Collaborator

dssgabriel commented Jul 10, 2024

I haven't looked into the code in detail, but a few notes:

  • I think the mpi namespace belongs in Impl, and we shouldn't expose it to users directly (same for nccl and other transports we may implement). We only provide a "basic" API: send/receive, collectives, and other needed utilities (barriers, waits, etc...), and don't expose the underlying specialization, similarly to how Kokkos' parallel_* functions aren't prefixed by an ExecSpace namespace;
  • We should remove the i in front of isend/irecv to part ways with the MPI naming scheme. It will be clearer, as we won't provide blocking versions anyway, and the non-blocking behavior is clearly defined in the docs. Also matches NCCL better! 🙂
  • As discussed during last week's meeting, we should drop the Serial mode entirely. Users should be able to emulate it using only one (MPI) process. If not, their code is probably broken;
  • A random idea I just had, but what about CommunicationSpace instead of Transport? It sounds similar to Kokkos' ExecutionSpace and is modeled similarly as far as I can see here.

Thank you for the work on this big PR; things are progressing nicely!

@cwpearson
Collaborator Author

cwpearson commented Jul 11, 2024

I think the mpi namespace belongs in Impl, and we shouldn't expose it to users directly (same for nccl and other transports we may implement). We only provide a "basic" API: send/receive, collectives, and other needed utilities (barriers, waits, etc...), and don't expose the underlying specialization, similarly to how Kokkos' parallel_* functions aren't prefixed by an ExecSpace namespace;

I was debating this myself - I was thinking about the MPI interop angle. I think it makes sense for us to actually provide stable-ish MPI wrappers since that is one of the motivations for this library and what our current users want from it. Impl suggests they are just an implementation detail. I wanted to tuck them away in mpi though so it's clear they're not part of the "blessed" "core" API where we might do some new stuff. Does it seem okay?

We should remove the i in front of isend/irecv to part ways with the MPI naming scheme. It will be clearer, as we won't provide blocking versions anyway, and the non-blocking behavior is clearly defined in the docs. Also matches NCCL better! 🙂

Will do

As discussed during last week's meeting, we should drop the Serial mode entirely. Users should be able to emulate it using only one (MPI) process. If not, their code is probably broken;

This doesn't actually implement anything except MPI, so no danger here 😄 . We could implement Serial (or not) in the future as needed.

A random idea I just had, but what about CommunicationSpace instead of Transport? It sounds similar to Kokkos' ExecutionSpace and is modeled similarly as far as I can see here.

Good idea.

@cwpearson cwpearson force-pushed the feature/transport branch 2 times, most recently from d72f966 to 9126092 on July 11, 2024 at 21:26
@cwpearson
Collaborator Author

@dssgabriel I'm proposing some modifications to how the comm mode is used (9126092); this could potentially be a separate PR.

@dssgabriel
Collaborator

I think it makes sense for us to actually provide stable-ish MPI wrappers since that is one of the motivations for this library and what our current users want from it. Impl suggests they are just an implementation detail. I wanted to tuck them away in mpi though so it's clear they're not part of the "blessed" "core" API where we might do some new stuff.

I fear that if we provide direct MPI wrappers people will just start using those instead of our "core" API, as it may be easier for them to adopt. This won't add any value to their code, and they may as well keep the MPI interop they already have in place.
Given that we are trying to design an API that is "transport-agnostic", I think we should push users towards it as much as possible.

Some other questions/remarks:

  • The interface that each comm space needs to implement is a struct.

    Do we really need structs, or could we circumvent that and only provide functions? If Kokkos handles the dispatch of parallel_* patterns on different execution spaces without structs, could we try a similar approach? Genuine question, as I don't know how it is implemented in Kokkos.

  • Can we have two enabled transports? If so, do we have a preferred and a fallback? Does choosing between them need to happen at runtime?

    I think this is mandatory. NCCL does not explicitly require MPI to be initialized, but it is much easier to do with it.
    IMHO, MPI should be our fallback, but perhaps we can make it the default when running on a host exec space, while NCCL is the default on a (CUDA) device?
    Whether we can choose the backend at runtime is still unclear; I think we need to write some code to explore how it should work.

Let's discuss this PR during next week's meeting; we may want to split it up even further. I'll give a better overview of what NCCL offers and how it should interact with this PR. 👍

@cwpearson
Collaborator Author

It's done with structs in Kokkos because you can't do partial function specialization - e.g. you CAN'T do

template <typename View, typename CommunicationSpace>
void send(View v);

// KokkosComm implementation for MPI in here
template <typename View>
void send<View, Mpi>(View v) {...};

but you CAN do

template <typename View, typename CommunicationSpace>
struct send;

// KokkosComm implementation for MPI in here
template <typename View>
struct send<View, Mpi> {...};
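
The user-facing entry point can then stay an ordinary function template that forwards to the struct. A sketch of that dispatch; the Impl namespace and the static execute member are hypothetical names, not the PR's actual layout:

namespace Impl {

template <typename View, typename CommunicationSpace>
struct send;  // primary template, never defined

// MPI partial specialization carries the actual implementation.
template <typename View>
struct send<View, Mpi> {
  static void execute(View v) { /* MPI-specific implementation */ }
};

}  // namespace Impl

// The free function users call; the partial specialization is selected at compile time.
template <typename View, typename CommunicationSpace = Mpi>
void send(View v) {
  Impl::send<View, CommunicationSpace>::execute(v);
}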

@cwpearson
Collaborator Author

cwpearson commented Jul 12, 2024

I fear that if we provide direct MPI wrappers people will just start using those instead of our "core" API, as it may be easier for them to adopt. This won't add any value to their code, and they may as well keep the MPI interop they already have in place.
Given that we are trying to design an API that is "transport-agnostic", I think we should push users towards it as much as possible.

The value of the MPI wrappers for users is in two places:

  1. we maintain the nitpicky non-contiguous-data-handling code rather than them, and perhaps transparently provide better performance.
  2. We make MPI / execution space instance syncing harder to mess up.

There is demonstrated interest in these. They can be realized in the near term, and provide value to the overall effort:

  1. We establish some cred in the community for when we propose new ideas
  2. People integrate the library with their apps, so we have a ready-made set of people to test the new ideas.

Like you (?), I think the core API should be more thoughtful and novel. However, if what we come up with is such a small step forward that people find MPI-Kokkos interop more compelling, then they weren't going to use the new thing anyway.
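
As an illustration of point 2 above, a wrapper can take the execution space instance and fence it before handing the buffer to MPI. A rough sketch, assuming a contiguous view and a hypothetical mpi_type<T>() helper that maps T to an MPI_Datatype:

template <KokkosExecutionSpace ExecSpace, KokkosView SendView>
void isend(const ExecSpace &space, SendView sv, int dest, int tag, MPI_Comm comm, MPI_Request *req) {
  space.fence();  // ensure kernels that produced sv's contents on this instance have finished
  MPI_Isend(sv.data(), static_cast<int>(sv.size()), mpi_type<typename SendView::value_type>(),
            dest, tag, comm, req);
}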

@cwpearson cwpearson force-pushed the feature/transport branch 3 times, most recently from 419f8f8 to c454f51 on July 24, 2024 at 18:15
@cwpearson cwpearson mentioned this pull request Jul 24, 2024
@cwpearson
Collaborator Author

cwpearson commented Jul 24, 2024

@dssgabriel how are we feeling about this? We have an external party asking about writing a backend to support their research, plus our own NCCL work.

@cwpearson cwpearson changed the title from "Reorganize for Serial / MPI / NCCL transport layers" to "Reorganize for MPI / NCCL transport layers" on Jul 24, 2024
@dssgabriel
Collaborator

dssgabriel commented Jul 25, 2024

I think this is on the right track!

A few points of discussion that we probably need to figure out before merging:

  • NCCL does not have the concept of tags: do we consider removing them from our "core/blessed" API? Maybe we can provide an overload that has the tag (so we can just ignore it in the NCCL specialization).
  • NCCL does not have the concept of communication modes: do we drop them entirely? I am not sure they have actual use cases in the applications we target.
    Alternatively, we could keep them as part of the MPI-interop API only (the one in the KokkosComm::mpi namespace).
  • Does the PR in its current form let us have multiple CommunicationSpaces enabled at the same time? I feel like this will be a requirement (e.g., NCCL is far easier to initialize if MPI is available, I don't know about other transport layers).
  • IMO, the OSU perf test should be in a dedicated mpi/ subdirectory (similar to some of the unit tests), so as not to mix tests that use the "core/blessed" API and the MPI-specific stuff.
  • As this is a big PR that moves/renames a lot of files, we might as well really think about our project layout/structure.
    I am quite a fan of system includes that are prefixed with the project name (e.g. boost, fmt, etc...):
    #include <KokkosComm/KokkosComm.hpp>
    #include <KokkosComm/mpi/comm_modes.hpp>
    ...
    Rearranging the layout would also let us remove some of the prefixing in the filenames, since the project structure itself conveys that information (this is only a proposal, to be discussed/reworked!):
    .
    ├── src
    │   ├── KokkosComm
    │   │   ├── mpi
    │   │   │   ├── collectives
    │   │   │   │   ├── allgather.hpp
    │   │   │   │   ├── reduce.hpp
    │   │   │   │   └── ...
    │   │   │   ├── point_to_point
    │   │   │   │   ├── recv.hpp
    │   │   │   │   └── send.hpp
    │   │   │   ├── utilities
    │   │   │   │   ├── test_all.hpp
    │   │   │   │   ├── wait.hpp
    │   │   │   │   └── ...
    │   │   │   ├── interop
    │   │   │   │   └── ...
    │   │   │   ├── comm_modes.hpp
    │   │   │   └── ...
    │   │   ├── nccl
    │   │   │   └── ... # similar structure as `mpi`
    │   │   └── whatever_backend
    │   │       └── ... # same
    │   ├── concepts.hpp
    │   ├── KokkosComm.hpp
    │   └── ...
    ├── unit_tests
    │   ├── core
    │   ├── mpi
    │   └── ...
    └── perf_tests
        ├── core
        ├── mpi
        │   └── osu
        └── ...
    

I will review the code once again tomorrow so that we can discuss it at next Monday's meeting and quickly merge it afterwards. 👍

@cwpearson
Collaborator Author

  • NCCL does not have the concept of tags: do we consider removing them from our "core/blessed" API? Maybe we can provide an overload that has the tag (so we can just ignore it in the NCCL specialization).

I tried reading the NCCL docs but I couldn't figure out what the semantics are if I do two ncclSend/ncclRecv in a row to the same destination. Do you know if they are non-overtaking like MPI? I think there needs to be some way to distinguish multiple in-flight messages with the same source and dest, but keeping messages ordered may be enough. I'll drop the tags for now and we can discuss.

  • NCCL does not have the concept of communication modes: do we drop them entirely? I am not sure they have actual use cases in the applications we target.
    Alternatively, we could keep them as part of the MPI-interop API only (the one in the KokkosComm::mpi namespace).

I think just keep them for MPI since they are basically an MPI thing.

  • Does the PR in its current form let us have multiple CommunicationSpaces enabled at the same time? I feel like this will be a requirement (e.g., NCCL is far easier to initialize if MPI is available, I don't know about other transport layers).
  1. I was thinking we could support Preferred/Fallback: if the Preferred CommunicationSpace implements an API, it will be used, otherwise the fallback would be used (MPI would be the fallback for the foreseeable future). A rough sketch follows after point 2.

  2. It may be okay for a preferred CS to require a specific fallback CS (or a fallback CS at all) to be enabled (maybe NCCL falls into this camp, since it needs a fallback to help initialize itself). Or maybe just have the NCCL CS actually be a hybrid NCCL / MPI CS. I don't think we want to allow people to say "oh, my backend requires X/Y/Z CS also be enabled"; that's too complicated. Something that complicated should just be its own CS that manages / uses X/Y/Z internally.
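
Very roughly, the Preferred/Fallback selection could happen at compile time, something like this sketch (Impl::implements_v, the operation tag types, and the PreferredCommSpace / FallbackCommSpace aliases are all hypothetical):

#include <type_traits>

// Pick the communication space for an operation: use the preferred space if it
// implements the operation, otherwise fall back (e.g. to Mpi).
template <typename Op>
using SpaceFor = std::conditional_t<Impl::implements_v<PreferredCommSpace, Op>,
                                    PreferredCommSpace, FallbackCommSpace>;

// An operation would then dispatch through whichever space was selected, e.g.
// Irecv<RecvView, ExecSpace, SpaceFor<Impl::IrecvOp>>.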

  • IMO, the OSU perf test should be in a dedicated mpi/ subdirectory (similar to some of the unit tests), so as not to mix tests that use the "core/blessed" API and the MPI-specific stuff.

I think you're right about this, let me take another stab at it.

  • As this is a big PR that moves/renames a lot of files, we might as well really think about our project layout/structure.
    I am quite a fan of system includes that are prefixed with the project name (e.g. boost, fmt, etc.):

    #include <KokkosComm/KokkosComm.hpp>
    #include <KokkosComm/mpi/comm_modes.hpp>
    ...

I prefer this too, I was just following Kokkos Core and Kernels. Let me see if I can make this change. Then we'll really be touching every file 😄

@dssgabriel
Collaborator

I tried reading the NCCL docs but I couldn't figure out what the semantics are if I do two ncclSend/ncclRecv in a row to the same destination.

As far as I understand, NCCL enqueues communication operations on a CUDA stream. If two successive NCCL calls are on the same stream, they are non-overtaking. However, I think two communications enqueued on different streams may be able to overtake each other? I believe some folks at CEA know the lead developer of NCCL, so we'll try to set up a meeting with them and get answers to all our semantics-related questions.
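
To make the question concrete (real NCCL point-to-point calls; comm, stream, and the buffers are assumed to be set up already):

// Both sends are enqueued on the same CUDA stream, so they execute in stream order
// and should not overtake each other (as described above). The open question is what
// ordering guarantees exist if the second send were enqueued on a different stream.
ncclSend(buf_a, count, ncclFloat, peer, comm, stream);
ncclSend(buf_b, count, ncclFloat, peer, comm, stream);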

I think just keep [CommModes] for MPI since they are basically an MPI thing.

LGTM 👍

I don't think we want to allow people to say "oh, my backend requires X/Y/Z CS also be enabled", too complicated. Something that complicated should just be its own CS that manages/uses X/Y/Z internally.

I agree. Maybe we can automatically enable MPI if users compiled their code with -DKokkosComm_ENABLE_NCCL=ON?
To be fair, NCCL only needs to broadcast a unique identifier to all participating processes, so it's a rather "lightweight" setup. Broadcasting is doable using UNIX sockets but it's much easier if we have MPI_Bcast available.
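
For reference, that bootstrap is the standard NCCL-with-MPI pattern (error handling omitted):

#include <mpi.h>
#include <nccl.h>

int rank, nranks;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nranks);

ncclUniqueId id;
if (rank == 0) ncclGetUniqueId(&id);                      // one process creates the unique ID
MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);  // MPI broadcasts it to all ranks

ncclComm_t comm;
ncclCommInitRank(&comm, nranks, id, rank);                // every rank joins the NCCL communicator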

I prefer this too, I was just following Kokkos Core and Kernels. Let me see if I can make this change. Then we'll really be touching every file 😄

Great! I'm looking forward to reviewing every file 😁 Thank you for all the work on this PR!

@cwpearson
Collaborator Author

I ended up moving all the perf tests to an mpi subdirectory because the perf test driver uses MPI. We should probably re-write it in our core API as a follow-up.

@cwpearson
Collaborator Author

@dssgabriel I think all future work is waiting behind this PR, how are we feeling?

Member

@cedricchevalier19 cedricchevalier19 left a comment


I am not particularly in favor of adding a public KokkosComm::mpi namespace. Can we put it in an Experimental namespace?

Otherwise, I am in favor of merging this PR, as it really improves the code architecture. Thanks a lot!

@@ -0,0 +1,48 @@
//@HEADER
Member


I am not sure naming this file "mpi.hpp" is a good idea.


#include <type_traits>

#include "../concepts.hpp"
Member


Try to avoid relative paths in "public" includes.

@dssgabriel
Collaborator

Do we want to go ahead and just merge this? We could open small issues and gradually fix the various remarks we have brought up here.

@cwpearson
Collaborator Author

cwpearson commented Sep 9, 2024

Let's merge this, and then I can open up issues for the comments here. I need an approving review though 😄

Collaborator

@dssgabriel dssgabriel left a comment


Let's merge this and open issues for the relevant remarks we've made in the PR.
Main points I have in mind at the moment are:

  • re-organize the files to match ISO C++ library design
  • move the mpi namespace behind Experimental

@cwpearson cwpearson merged commit a968a13 into kokkos:develop Sep 9, 2024
8 checks passed