
Non-contiguous data handling, restructure for multiple transports #100

Draft: cwpearson wants to merge 7 commits into develop from feature/handle-and-noncontiguous

Conversation

@cwpearson (Collaborator) commented Jun 20, 2024

This is a big PR that does a lot of things that I felt were interrelated.

Prepare for non-MPI "transports"

This is all modeled after Kokkos Core. Each transport will have its own subdirectory under src (like Kokkos' backends). The interface that each transport needs to implement is a struct. For example, Irecv, from KokkosComm_fwd.hpp:

template <KokkosView RecvView, KokkosExecutionSpace ExecSpace = Kokkos::DefaultExecutionSpace,
          Transport TRANSPORT = DefaultTransport>
struct Irecv;

Why a struct? Because in practice each transport will provide a partial specialization (partial specialization is not allowed for function templates) that looks like this (from mpi/KokkosComm_mpi_irecv.hpp):

template <KokkosExecutionSpace ExecSpace, KokkosView RecvView>
struct Irecv<RecvView, ExecSpace, Mpi> { ... };

Where Mpi is a struct in the KokkosComm namespace, analogous to e.g. Kokkos::Cuda. A Serial transport would have a corresponding specialization:

template <KokkosExecutionSpace ExecSpace, KokkosView RecvView>
struct Irecv<RecvView, ExecSpace, Serial> { ... };

NCCL would have:

template <KokkosExecutionSpace ExecSpace, KokkosView RecvView>
struct Irecv<RecvView, ExecSpace, NCCL> { ... };

and so on.
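
For illustration, a free function can then dispatch to whichever specialization is enabled. A minimal sketch (the execute member and its signature are assumptions for this sketch, not the PR's actual interface):

template <Transport TRANSPORT = DefaultTransport, KokkosExecutionSpace ExecSpace, KokkosView RecvView>
KokkosComm::Req<> irecv(const ExecSpace &space, const RecvView &rv, int src, int tag) {
  // The struct is partially specialized on TRANSPORT, so this one function
  // template selects the right backend at compile time; a plain function
  // template could not be partially specialized this way.
  return Irecv<RecvView, ExecSpace, TRANSPORT>::execute(space, rv, src, tag);
}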

To support this, a lot of the include structure needs to be refined and adjusted to more closely match how Kokkos Core does it.

Future work:

  • Can we have two enabled transports? If so, do we have a preferred and a fallback? Does choosing between them need to happen at runtime?
  • Convert the rest of our MPI APIs into this style.

Non-contiguous view handling in MPI

Now that we have a place to implement MPI-specific things, this is where we can implement non-contiguous data handling strategies. Originally I thought this was orthogonal to the transport, but it is not: consider, for example, using MPI Derived Datatypes to handle non-contiguous data, which is only meaningful for the MPI transport.
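
As a concrete example, a strided (non-contiguous) view of doubles could be handed to MPI via a derived datatype instead of being packed first. These are standard MPI calls; count, stride, and the other variables stand in for values derived from the view and the communication:

// Describe 'count' doubles separated by 'stride' elements, without packing.
MPI_Datatype strided;
MPI_Type_vector(count, /*blocklength=*/1, stride, MPI_DOUBLE, &strided);
MPI_Type_commit(&strided);
MPI_Isend(view.data(), 1, strided, dest, tag, comm, &req);
MPI_Type_free(&strided); // MPI defers the actual free until the send completes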

The current implementation defines a sequence of seven "phases" to coordinate Kokkos execution space instances, the MPI communicator, and the associated non-contiguous data handling operations:

Five phases to get the communication posted:

  1. init_fence (optional fence before host-side calls)
  2. allocations (optional allocations)
  3. pre_copies (optional data copies)
  4. pre_comm_fence (optional fence before host-side calls)
  5. comm (actually make MPI calls)

Plus two more that happen after the communication has been posted:

  6. wait (optional wait for MPI calls)
  7. post-wait (optional operations after waits)

All MPI operations and non-contiguous data handling must fit into these seven phases.
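
As a rough illustration of how an operation maps onto the phases, a packing-based Isend of a non-contiguous view sv might do the following (a sketch only; the buffer management and the hook mechanism here are assumptions):

// 1. init_fence: not needed here; the pack below is ordered in 'space'.
// 2. allocations: a contiguous staging buffer.
Kokkos::View<double *> packed(Kokkos::view_alloc(space, "packed"), sv.size());
// 3. pre_copies: pack the non-contiguous view into the buffer.
Kokkos::deep_copy(space, packed, sv);
// 4. pre_comm_fence: order the pack before the host-side MPI call.
space.fence();
// 5. comm: post MPI on the contiguous buffer.
MPI_Isend(packed.data(), static_cast<int>(packed.size()), MPI_DOUBLE, dest, tag, comm, &req);
// 6. wait: MPI_Wait(&req, MPI_STATUS_IGNORE) when the user waits.
// 7. post-wait: release the staging buffer (a recv would also unpack here).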

Future work:

  • Are these the right seven phases? In any case, this is an implementation detail that can be changed.

Reduce fencing due to host-side MPI calls

One problem with a higher-level interface as we've started defining it today is that people want to do this:

Kokkos::parallel_for("fill send buffers", space, ...);
space.fence(); // just one fence
MPI_Isend(...);
MPI_Isend(...);
MPI_Isend(...);
...

However, our semantics say that the communication is ordered in the space, which means that for non-host spaces we have to fence internally in, e.g., our isend before we actually call MPI_Isend. So we end up with redundant fences:

Kokkos::parallel_for("fill send buffers", space, ...);
KokkosComm::isend(space, ...); // implicit fence
KokkosComm::isend(space, ...); // implicit fence
KokkosComm::isend(space, ...); // implicit fence
...

This PR introduces a "plan" construct like this:

std::vector<KokkosComm::Req<>> reqs = KokkosComm::plan(space, comm, [=](KokkosComm::Handle<Space> &handle) {
    KokkosComm::isend(handle, xp1_s, get_rank(xp1, ry), 0);
    KokkosComm::isend(handle, xm1_s, get_rank(xm1, ry), 1);
    KokkosComm::isend(handle, yp1_s, get_rank(rx, yp1), 2);
    KokkosComm::isend(handle, ym1_s, get_rank(rx, ym1), 3);
  });

Now our pure KokkosComm APIs would take this handle argument, through which the implementation tells the plan function whether it needs fences and how its operation is implemented in terms of the seven phases. The plan looks at all of those and issues the minimal number of fences. It returns one KokkosComm::Req for each async operation in the lambda.
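
A minimal sketch of how plan could batch the fences (Handle's needs_fence and post_all members are assumptions invented for this sketch):

template <typename Space, typename Fn>
std::vector<KokkosComm::Req<>> plan(const Space &space, MPI_Comm comm, Fn fn) {
  KokkosComm::Handle<Space> handle(space, comm);
  fn(handle);                  // operations register their phases; nothing is posted yet
  if (handle.needs_fence()) {
    space.fence();             // one fence for the whole batch, not one per isend
  }
  return handle.post_all();    // make the MPI calls, collecting one Req per operation
}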

We can still have a lower-level search-and-replace style API for interop as well. It can be implemented in terms of this plan, or more directly.

KokkosComm::wait, wait_any, wait_all

Free-standing functions to wait on a KokkosComm::Req.
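
A sketch of what these could look like if Req wraps an MPI_Request (the mpi_request accessor is an assumption; a real wait_any would presumably gather the requests and call MPI_Waitany once rather than polling):

namespace KokkosComm {
inline void wait(Req<> &req) { MPI_Wait(&req.mpi_request(), MPI_STATUS_IGNORE); }

inline int wait_any(std::vector<Req<>> &reqs) {
  // Poll until one pending request completes and return its index, MPI_Waitany-style:
  // completed requests become MPI_REQUEST_NULL and are skipped on later calls.
  while (true) {
    for (size_t i = 0; i < reqs.size(); ++i) {
      if (reqs[i].mpi_request() == MPI_REQUEST_NULL) continue;
      int done = 0;
      MPI_Test(&reqs[i].mpi_request(), &done, MPI_STATUS_IGNORE);
      if (done) return static_cast<int>(i);
    }
  }
}

inline void wait_all(std::vector<Req<>> &reqs) {
  for (auto &r : reqs) wait(r);
}
}  // namespace KokkosComm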

Replaces #64 and #32

@cwpearson cwpearson self-assigned this Jun 20, 2024
@cwpearson cwpearson force-pushed the feature/handle-and-noncontiguous branch from 01dc842 to 3975bc6 on June 20, 2024 at 22:47
@cwpearson cwpearson force-pushed the feature/handle-and-noncontiguous branch from 3975bc6 to 5b95a4c on June 20, 2024 at 22:51
@aprokop (Contributor) commented Jun 21, 2024

How are you going to implement a loop over (waitany, process) using handles? For example, if receiving halos from multiple processors, wait for any to arrive, process its data, then wait for the next to arrive. To overlap communication and computation.

@cwpearson cwpearson force-pushed the feature/handle-and-noncontiguous branch from 44ef546 to cc3fcfc on June 21, 2024 at 21:46
@cwpearson (Collaborator, Author) commented Jun 24, 2024

> How are you going to implement a loop over (waitany, process) using handles? For example, if receiving halos from multiple processors, wait for any to arrive, process its data, then wait for the next to arrive. To overlap communication and computation.

I made the plan thing return an array of Req instead, implemented wait_any, and I made a big update to the PR comment.
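
With the Req array, the (wait_any, process) loop could then look like this (irecv and the receive views mirror the isend example above; process_halo is a placeholder for the user's unpack/compute):

auto reqs = KokkosComm::plan(space, comm, [=](KokkosComm::Handle<Space> &handle) {
  KokkosComm::irecv(handle, xp1_r, get_rank(xp1, ry), 0);
  KokkosComm::irecv(handle, xm1_r, get_rank(xm1, ry), 1);
});
for (size_t n = 0; n < reqs.size(); ++n) {
  int i = KokkosComm::wait_any(reqs); // index of the next halo to arrive
  process_halo(i);                    // compute on halo i while the others are in flight
}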

@dutkalex (Collaborator) commented

> How are you going to implement a loop over (waitany, process) using handles? For example, if receiving halos from multiple processors, wait for any to arrive, process its data, then wait for the next to arrive. To overlap communication and computation.
>
> I made the plan thing return an array of Req instead, implemented wait_any, and I made a big update to the PR comment.

Do you think it would make sense to allow other kinds of statements inside the plan lambda? Like computations which could be done on the fly upon reception of some data, while waiting for other communications?

@cwpearson cwpearson changed the title Another attempt at comm/space wrapping and noncontig views Non-contiguous data, restructure for multiple transports Jun 26, 2024
@cwpearson cwpearson changed the title Non-contiguous data, restructure for multiple transports Non-contiguous data handling, restructure for multiple transports Jun 26, 2024
@cwpearson (Collaborator, Author) commented Jun 26, 2024

> Do you think it would make sense to allow other kinds of statements inside of the plan lambda? Like computations which could be done on the fly upon reception of some data while waiting for other communications?

I think this is an interesting idea as a follow-on... users may be tempted to do something like this:

std::vector<KokkosComm::Req<>> reqs = KokkosComm::plan(space, comm, [=](KokkosComm::Handle<Space> &handle) {
    KokkosComm::isend(handle, xp1_s, get_rank(xp1, ry), 0);
    Kokkos::parallel_for(space, ...); // roughly...
    KokkosComm::isend(handle, xm1_s, get_rank(xm1, ry), 1);
  });

But this actually won't work properly because the KokkosComm::isend(handle, ...) call doesn't immediately inject the communication into the stream, so we'd probably have to offer some kind of KokkosComm::parallel_for(handle, ...) that could be called inside this region and make it visible to the plan.

@dutkalex (Collaborator) commented Jun 26, 2024

> we'd probably have to offer some kind of KokkosComm::parallel_for(handle, ...) that could be called inside this region and make it visible to the plan

Yeah, I was thinking of something along these lines, but I think we can avoid defining new parallel_for functions by having the isend return a new handle:

auto reqs = KokkosComm::plan( space, comm, [=]( KokkosComm::Handle<Space>& handle ){
    KokkosComm::isend(handle, xp1_s, get_rank(xp1, ry), 0).and_then( 
        [=]{ Kokkos::parallel_for( space, ... ); }
    );
    KokkosComm::isend( handle, xm1_s, get_rank(xm1, ry), 1 );
} );
