
3rd-party: bump prrte to latest v3.0 branch #12655

Merged: 1 commit merged into open-mpi:v5.0.x on Jul 8, 2024

Conversation

@wenduwan (Contributor) commented Jul 3, 2024

bot:notacherrypick

Signed-off-by: Wenduo Wang <[email protected]>
@github-actions github-actions bot added this to the v5.0.4 milestone Jul 3, 2024
@wenduwan wenduwan added the mpi4py-all Run the optional mpi4py CI tests label Jul 3, 2024
@rhc54 (Contributor) commented Jul 3, 2024

Hooray! Please let me know how the AWS CI does. I'd like to ensure we get PRRTE updated for your release, if at all possible.

@wenduwan (Contributor, Author) commented Jul 4, 2024

AWS CI passed.

I'm running more tests manually to cover more platforms.

@wenduwan (Contributor, Author) commented Jul 5, 2024

This PR fails on AMD with hwloc 1.11.

[ip-172-31-44-98.us-west-2.compute.internal:14273] [[10462,1],79] selected pml ob1, but peer [[10462,1],0] on ip-172-31-44-98 selected pml cm
[ip-172-31-33-160.us-west-2.compute.internal:134572] [[10462,1],350] selected pml ob1, but peer [[10462,1],0] on ip-172-31-44-98 selected pml cm
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem:

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
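Following the suggestions in the help message, a typical way to inspect plugin availability and selection looks like this (a sketch assuming a standard Open MPI install; `./my_app` is a placeholder application, not part of this PR):

```shell
# List which BTL and MTL components this Open MPI build provides
ompi_info | grep -E 'btl|mtl'

# Re-run with verbose component selection to see which plugins were
# considered and/or discarded on each node
mpirun --mca btl_base_verbose 100 --mca mtl_base_verbose 100 -np 2 ./my_app
```

The verbose output usually makes it clear why one node fell back to ob1 while another chose cm.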

@rhc54 (Contributor) commented Jul 5, 2024

Nothing to do with PMIx or PRRTE or HWLOC - this is a typical issue in the MPI layer. The procs on the two nodes are making a different PML selection.
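As a quick check for this kind of mismatch (a sketch, not part of this PR; `./my_app` is a placeholder), one can pin the PML so every rank makes the same selection:

```shell
# Force all ranks to the ob1 PML (or cm, as appropriate) so the
# selection cannot differ between nodes
mpirun --mca pml ob1 -np 4 ./my_app
```

If the job then runs, the failure was indeed a per-node PML selection divergence rather than a launcher or topology problem.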

@wenduwan (Contributor, Author) commented Jul 8, 2024

The issue is due to an EFA resource limit. It's not related to Open MPI.
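For reference, one way to inspect the EFA provider on an instance (assuming libfabric's `fi_info` utility is installed) is:

```shell
# Show the EFA endpoints and capabilities reported by libfabric;
# missing or truncated output points at a provider/resource problem
fi_info -p efa
```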

@wenduwan wenduwan requested a review from hppritcha July 8, 2024 17:56
@wenduwan (Contributor, Author) commented Jul 8, 2024

@hpcraink Tommy is on vacation, so could you please do a review?

@hppritcha (Member) commented
@wenduwan this solves the old hwloc + ARM issue, correct?

@wenduwan (Contributor, Author) commented Jul 8, 2024

@hppritcha Yes, I verified on Intel, AMD, and ARM with both new and old hwloc. So far no issues.

@wenduwan wenduwan merged commit 580c452 into open-mpi:v5.0.x Jul 8, 2024
26 checks passed
Labels: mpi4py-all (Run the optional mpi4py CI tests), Target: v5.0.x
3 participants