
sessions: add support for ucx and more #12723

Merged: 1 commit merged into open-mpi:main from the topic/ucx_excid_support branch on Oct 23, 2024

Conversation

@hppritcha (Member) commented Jul 30, 2024

Greatly simplify support for MPI_Comm_create_from_group and MPI_Intercomm_create_from_groups by removing the need to support the 128-bit excid notion.

Instead, make use of a PMIx capability, PMIX_GROUP_LOCAL_CID, together with the notion of PMIX_GROUP_INFO. This capability was introduced in Open PMIx 4.1.3 and allows us to piggyback a local CID selected for the new communicator on the PMIx_Group_construct operation. Using this approach, many of the complex active-message-style operations implemented in the OB1 PML to support excids can be avoided.
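
As a rough illustration of the idea (a minimal sketch, not the code in this PR; the wrapper name, the uint32_t type for the local CID, and the cleanup order are assumptions), the locally selected CID can be wrapped in a PMIX_GROUP_INFO directive passed to PMIx_Group_construct:

#include <pmix.h>

/* Sketch: attach a locally selected CID to the group-construct call so that
 * peers can later retrieve it via the PMIX_GROUP_LOCAL_CID attribute. */
static pmix_status_t group_construct_with_local_cid(const char *tag,
                                                    const pmix_proc_t *procs,
                                                    size_t nprocs,
                                                    uint32_t my_local_cid,
                                                    pmix_info_t **results,
                                                    size_t *nresults)
{
    pmix_data_array_t darray;
    pmix_info_t *grp_info, pinfo;
    pmix_status_t rc;

    /* group info is expressed as a data array of pmix_info_t */
    PMIX_DATA_ARRAY_CONSTRUCT(&darray, 1, PMIX_INFO);
    grp_info = (pmix_info_t *) darray.array;
    PMIX_INFO_LOAD(&grp_info[0], PMIX_GROUP_LOCAL_CID, &my_local_cid, PMIX_UINT32);

    /* hand the array to the construct call as a PMIX_GROUP_INFO directive */
    PMIX_INFO_LOAD(&pinfo, PMIX_GROUP_INFO, &darray, PMIX_DATA_ARRAY);

    rc = PMIx_Group_construct(tag, procs, nprocs, &pinfo, 1, results, nresults);

    PMIX_DATA_ARRAY_DESTRUCT(&darray);   /* cleanup of pinfo elided for brevity */
    return rc;
}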

Infrastructure for debugging communicator management routines was also introduced, along with a new MCA parameter - mpi_comm_verbose.

Related to #12566

@hppritcha changed the title from "sessions: add support for ucx more" to "sessions: add support for ucx and more" on Jul 30, 2024
@hppritcha (Member, Author) commented:

@rhc54 could you check the PMIx usage here?

@hppritcha requested a review from wenduwan on July 30, 2024 16:50
@wenduwan (Contributor) commented:

Running AWS CI

@hppritcha (Member, Author) commented:

hmmm that coredump is coming from prrte -

Core was generated by `prterun --map-by :OVERSUBSCRIBE -np 4 python3 ./main.py -v -f'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000ffffb0a9c7b8 in prte_grpcomm_base_get_tracker (sig=0xffffa83e1ae0, create=true) at base/grpcomm_base_stubs.c:313
313	        if (coll->dmns[n] == PRTE_PROC_MY_NAME->rank) {
[Current thread is 1 (Thread 0xffffb0c28f40 (LWP 3683323))]
(gdb) bt
#0  0x0000ffffb0a9c7b8 in prte_grpcomm_base_get_tracker (sig=0xffffa83e1ae0, create=true) at base/grpcomm_base_stubs.c:313
#1  0x0000ffffb0a9bac4 in allgather_stub (fd=-1, args=4, cbdata=0xffffa80df400) at base/grpcomm_base_stubs.c:159
#2  0x0000ffffb04ee628 in event_process_active_single_queue () from /lib64/libevent_core-2.1.so.6
#3  0x0000ffffb04ef290 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#4  0x0000000000408fd4 in main (argc=9, argv=0xfffff697ac38) at prte.c:1297
(gdb) list
308	
309	    /* see if I am in the array of participants - note that I may
310	     * be in the rollup tree even though I'm not participating
311	     * in the collective itself */
312	    for (n = 0; n < coll->ndmns; n++) {
313	        if (coll->dmns[n] == PRTE_PROC_MY_NAME->rank) {
314	            coll->nexpected++;
315	            break;
316	        }
317	    }
(gdb) print n
$1 = 0
(gdb) print coll
$2 = (prte_grpcomm_coll_t *) 0x137b0da0
(gdb) print coll->ndmns
$3 = 1
(gdb) print coll->dmns
$4 = (pmix_rank_t *) 0x0
(gdb) 
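
The backtrace shows coll->ndmns == 1 while coll->dmns is NULL, so the loop at grpcomm_base_stubs.c:313 dereferences a NULL array. A hypothetical defensive guard (illustrative only; not the actual PRRTE fix, which is discussed below) could bail out before the loop:

        /* hypothetical guard: dmns should have been populated when the
         * tracker was created; error out rather than dereference NULL */
        if (NULL == coll->dmns && 0 != coll->ndmns) {
            PRTE_ERROR_LOG(PRTE_ERR_BAD_PARAM);
            return NULL;
        }
        /* see if I am in the array of participants */
        for (n = 0; n < coll->ndmns; n++) {
            if (coll->dmns[n] == PRTE_PROC_MY_NAME->rank) {
                coll->nexpected++;
                break;
            }
        }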

Resolved review threads: ompi/runtime/ompi_mpi_params.c, ompi/communicator/comm_cid.c
-    rc = PMIx_Group_construct(tag, procs, proc_count, &pinfo, 1, &results, &nresults);
-    PMIX_INFO_DESTRUCT(&pinfo);
+    rc = PMIx_Group_construct(tag, procs, proc_count, pinfo, ninfo, &results, &nresults);
+    PMIX_DATA_ARRAY_DESTRUCT(&darray);
Member:

There must be some optimizations behind this centralized group creation for it to scale. I would love to hear more about it.

Contributor:

Depends on the RTE, of course, as all PMIx does is construct the request and pass it up. For PRRTE, it is just the tree-based allgather we use for pretty much all collectives. I don't believe anyone has exerted much effort towards optimizing it.

Member:

That's exactly what I'm concerned about as I have seen no performance report on this change.

@rhc54 (Contributor) commented Jul 31, 2024

> hmmm that coredump is coming from prrte -

Not surprising - we know that the group construct code in PRRTE master has problems. I've been working on it and am almost done with the fix, but am getting pulled in multiple directions right now. So it is sitting off to the side for now.

I'll try to look thru the PMIx stuff as time permits.

@hppritcha (Member, Author) commented:

> hmmm that coredump is coming from prrte -
>
> Not surprising - we know that the group construct code in PRRTE master has problems. I've been working on it and am almost done with the fix, but am getting pulled in multiple directions right now. So it is sitting off to the side for now.
>
> I'll try to look thru the PMIx stuff as time permits.

I did some digging into this, and it turns out that Open MPI is feeding a corrupted proc list into PMIx_Group_construct for at least one rank; that ends up causing problems in prte and ultimately the segfault. Not sure how PMIx could validate that kind of input.

@rhc54 (Contributor) commented Jul 31, 2024

You mean one proc in the array has a bad nspace? Yeah, I'm not sure how we check for a bad string, though we do know the max size of the nspace - so we should not be charging off into nowhere land. PMIX_CHECK_NSPACE has the protection for that much - could be we have a place in PRRTE that isn't using it.

@hppritcha (Member, Author) commented:

> You mean one proc in the array has a bad nspace? Yeah, I'm not sure how we check for a bad string, though we do know the max size of the nspace - so we should not be charging off into nowhere land. PMIX_CHECK_NSPACE has the protection for that much - could be we have a place in PRRTE that isn't using it.

A namespace sanity check in the prrte group construct may have caught this, but I'm not sure.
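
For illustration, a check along these lines (hypothetical helper, not code from this PR or from PRRTE) could reject an obviously corrupted proc array before it reaches PMIx_Group_construct:

#include <string.h>
#include <stdbool.h>
#include <pmix.h>

/* Hypothetical sanity check: every nspace string must be non-empty and
 * NUL-terminated within PMIX_MAX_NSLEN characters. */
static bool procs_look_sane(const pmix_proc_t *procs, size_t nprocs)
{
    for (size_t n = 0; n < nprocs; n++) {
        size_t len = strnlen(procs[n].nspace, PMIX_MAX_NSLEN + 1);
        if (0 == len || len > PMIX_MAX_NSLEN) {
            return false;   /* empty or unterminated nspace string */
        }
    }
    return true;
}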

@hppritcha force-pushed the topic/ucx_excid_support branch 2 times, most recently from 7f17065 to 23865b7, on August 1, 2024 15:19
@wenduwan (Contributor) commented Aug 1, 2024

FYI AWS CI passed for this PR

@hppritcha requested a review from janjust on August 1, 2024 15:21
@hppritcha (Member, Author) commented:

@wenduwan could you rerun AWS CI?

Resolved review threads: ompi/mca/mtl/ofi/mtl_ofi.c, ompi/mca/mtl/ofi/mtl_ofi.h, ompi/runtime/params.h, ompi/communicator/comm.c (two threads), ompi/communicator/comm_cid.c
@hppritcha (Member, Author) commented:

@wenduwan when you have a chance could you check changes per your comments?

@hppritcha requested a review from wenduwan on August 7, 2024 18:56
@hppritcha (Member, Author) commented:

@janjust please review - esp. UCX part when you get a chance

@wenduwan (Contributor) left a review:

My comments have been addressed. Also passed AWS CI.

Resolved review thread: ompi/mca/mtl/ofi/mtl_ofi.h
@bosilca (Member) left a review:

It took me a while to understand the change, but overall I think this is a good and necessary move. Two comments:

  • I would move it up from the PML comm into the OMPI comm, and then it will be automatic for all MTL/PML;
  • I would be interested in seeing some performance at scale (we now have a critical component depending on software we have no control over).

@rhc54 (Contributor) commented Aug 8, 2024

> we now have a critical component depending on software we have no control over

The dependency has been true for a very long time, actually - OMPI doesn't run at all without PMIx and hasn't for years. In particular for this PR, PMIx has been in the path for connect/accept for years, and the collective used here is the same as the collective currently in that path. The only difference being gradually introduced is to replace the multiple calls to PMIx functions combined with a bunch of MPI ops with a single call to the PMIx group op.

So you should eventually see a performance improvement, although it may take another PR step to get it as I withdrew the PR that consolidated to a single op. This just takes an initial step in that direction.

As for control - OMPI continues to have (in some opinions, an undue) significant influence over PMIx. As noted elsewhere, the only real "disconnect" has been the lack of code contribution by OMPI members, which means that things like faster "allgather" in PRRTE don't get implemented - not due to refusal, but just due to lack of contribution.

@hppritcha (Member, Author) commented:

Just for clarification - we only use the PMIx_Group_construct method for communicators being created using MPI_Comm_create_from_group and MPI_Intercomm_create_from_groups. Fortunately for us, the mpi4py testsuite has a number of tests which exercise this functionality.

Resolved review threads: ompi/mca/pml/ucx/pml_ucx.c (five threads), ompi/mca/pml/ucx/pml_ucx_request.c, ompi/mca/pml/ucx/pml_ucx.h, ompi/mca/pml/ucx/pml_ucx_component.c
     ((((uint64_t) (_tag) ) << (PML_UCX_RANK_BITS + PML_UCX_CONTEXT_BITS)) | \
      (((uint64_t)(_comm)->c_my_rank ) << PML_UCX_CONTEXT_BITS) | \
-     ((uint64_t)(_comm)->c_index))
+     ((uint64_t)(_c_index)))
Contributor:

Since we only have 20 bits here, shall we check that the cid is always lower than 2^20? Or are we somehow guaranteed that the cid is within the limit?

Member (Author):

who has only 20 bits?

Contributor:

PML_UCX_CONTEXT_BITS is 20 bits, so if _c_index were ever bigger it would spill into the ->c_my_rank field; maybe that's not the case?

Contributor:

Guess I'm curious - OMPI's communicator index has always been 32 bits. So have you just been lucky because the index never grew big enough to violate your 20-bit threshold? If so, then it seems you have a design problem, as there is no guarantee that the index won't be greater than 20 bits.

@yosefe (Contributor) commented Aug 21, 2024:

Yes - the pml_max_contextid field in the UCX module definition should guarantee that the communicator index stays below 20 bits.
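
For reference, a minimal sketch of the packing being discussed (the rank-field width and the helper name are assumptions for illustration; only the 20-bit context field is stated in this thread, and this is not the actual UCX PML code):

#include <assert.h>
#include <stdint.h>

#define PML_UCX_CONTEXT_BITS 20   /* per this thread */
#define PML_UCX_RANK_BITS    24   /* assumed width, for illustration only */

/* Pack (tag, rank, communicator index) into the 64-bit UCX tag; the index
 * must fit in PML_UCX_CONTEXT_BITS or it would spill into the rank field.
 * The PML advertises pml_max_contextid so OMPI never assigns a larger index. */
static inline uint64_t make_ucx_tag(uint64_t mpi_tag, uint64_t my_rank,
                                    uint64_t c_index)
{
    assert(c_index < ((uint64_t) 1 << PML_UCX_CONTEXT_BITS));
    return (mpi_tag << (PML_UCX_RANK_BITS + PML_UCX_CONTEXT_BITS)) |
           (my_rank << PML_UCX_CONTEXT_BITS) |
           c_index;
}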

Contributor:

But in connect/accept and comm_spawn operations, the context ID is being set by PMIx, which provides a 32-bit ID. There are other places in the code that also set the ID, not just in the PML. So how do you handle IDs beyond 20 bits? Or is UCX simply limited to 20-bit IDs and fails if OMPI sets anything above that?

Member (Author):

Actually, the cid in the dpm.c code is generated by the call to ompi_comm_nextcid, using the "OLD" algorithm for allocating cids in the ompi_comm_nextcid_nb function, since the MPI functions supported in dpm.c for communicator construction always require an input communicator.

Resolved review threads: ompi/communicator/communicator.h, ompi/communicator/comm_cid.c (five threads), ompi/mca/pml/ucx/pml_ucx.c
@rhc54 (Contributor) commented Aug 13, 2024

I have seen the question of the size of the context ID raised several times, so let me provide a little perspective on it. In PMIx, the context ID for a group is specified to be size_t. Of course, the actual size of the value being returned is set by the RTE since it is solely responsible for assigning that value. In the case of PRRTE, the variable tracking context IDs has a uint32_t definition.

@hppritcha had previously (a long time ago, now) requested that PRRTE start assigning values at UINT32_MAX and count downwards from there. IIRC, this was to avoid conflict with OMPI's internal CID assignment code that started at 0 (for MPI_COMM_WORLD) and counted up.

I don't really understand all the bit manipulation going on here, so I am probably missing something - but it strikes me that if you are limiting the context ID to something less than 32-bits, and using the CID returned by PMIx, then you have a problem.
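
A toy sketch of the two assignment directions described above (illustrative only; the helper names are made up and the real allocators are more involved):

#include <stdint.h>

/* OMPI's internal CID allocation counts up from 0 (MPI_COMM_WORLD), while
 * PRRTE assigns PMIx group context IDs counting down from UINT32_MAX,
 * keeping the two ranges from colliding in practice. */
static uint32_t ompi_next_cid      = 0;
static uint32_t prrte_next_context = UINT32_MAX;

static uint32_t alloc_ompi_cid(void)      { return ompi_next_cid++; }
static uint32_t alloc_prrte_context(void) { return prrte_next_context--; }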

@hppritcha removed the request for review from tvegas1 on October 3, 2024 23:11
@hppritcha (Member, Author) commented:

Any objections or suggestions for changes?

Resolved review thread: ompi/communicator/comm_cid.c
@hppritcha (Member, Author) commented:

@yosefe could you check now?

@hppritcha (Member, Author) commented:

Not sure why the Read the Docs check failed.

@hppritcha (Member, Author) commented:

@yosefe ping...

@bosilca (Member) commented Oct 8, 2024

I want this to be clearly stated: this functionality fulfills the needs of two newly added MPI group creation functions (MPI_Comm_create_from_group and MPI_Intercomm_create_from_groups) by replacing an existing, AM-based implementation with a more generic, PMIx-based one. However, this new approach loses in scalability what it gains in generality, as we are trading an RTT to retrieve the comm_id for each pair of peers that communicate against a full-scale allgather done not even in the context of OMPI but in PMIx.

Moreover, no performance results were provided to assess the cost of this change. We all heard the argument that someone in the future might possibly try to optimize this particular operation in PMIx. Irrelevant! As soon as this becomes the only way to create such communicators, we will be relying on an unoptimized PMIx capability that is far from the level of optimization we would expect from MPI-level collectives.

@rhc54 (Contributor) commented Oct 8, 2024

Errrr...just to be clear, the collective is in PRRTE and has absolutely nothing to do with PMIx. All PMIx does is locally collect the info and pass it to the local PRRTE daemon.

@hppritcha (Member, Author) commented:

@yosefe @janjust could one of you check the changes I made per @yosefe's request?

@hppritcha requested a review from yosefe on October 9, 2024 15:13
@yosefe (Contributor) left a review:

Looks good except for minor comments.

    rc = ompi_comm_get_remote_cid(comm, dst, &cid);
    if (OPAL_UNLIKELY(OMPI_SUCCESS != rc)) {
        return rc;
    }
Contributor:

minor: add a blank line after this

Comment on lines 558 to 566:

        *remote_cid = comm->c_index;

    } else if (0 != comm->c_index_vec[dest]) {

        *remote_cid = comm->c_index_vec[dest];

    } else {
Contributor:

IMO these blank lines are redundant.

@hppritcha (Member, Author) commented:

Actually, this PR doesn't replace the AM-based approach used in OB1. That path is still there. This PR just enables the UCX PML to work with communicators created using MPI_Comm_create_from_group and MPI_Intercomm_create_from_groups. If at some point someone in the UCX community wants to go to the effort of adding an AM path in the UCX PML, that could be done.

@hppritcha removed the request for review from bosilca on October 9, 2024 16:43
@bosilca (Member) commented Oct 9, 2024

Not from OB1, but it removes it from the OFI MTL.

@hppritcha (Member, Author) commented:

Okay, I'll put the MTL code back in.

Commit message:

Greatly simplify support for MPI_Comm_create_from_group and
MPI_Intercomm_create_from_groups by removing the need to support
the 128-bit excid notion.

Instead, make use of a PMIx capability - PMIX_GROUP_LOCAL_CID and the notion of
PMIX_GROUP_INFO. This capability was introduced in Open PMIx 4.1.3.
This capability allows us to piggy-back a local cid selected
for the new communicator on the PMIx_Group_construct operation.
Using this approach, a lot of the complex active message style operations
implemented in the OB1 PML to support excids can be avoided.

Infrastructure for debugging communicator management routines was also
introduced, along with a new MCA parameter - mpi_comm_verbose.

Related to open-mpi#12566

Signed-off-by: Howard Pritchard <[email protected]>
@hppritcha requested review from bosilca and rhc54 and removed the request for rhc54 on October 18, 2024 19:56
@hppritcha (Member, Author) commented:

OFI MTL code restored.

@hppritcha merged commit ef2cb80 into open-mpi:main on Oct 23, 2024
15 checks passed
7 participants