On muller-cpu, with an update to slingshot software, we are seeing a stall/hang in init #6655
Comments
I have found that simply adding a barrier before the above mpi_allreduce also seems to resolve the issue (i.e., we can use the default settings and not see the stall/hang).
While I think the algorithm here could use some help, I don't see anything that is obviously a problem (requiring a barrier). I've seen about 12 cases at 256 nodes that are OK with this "fix", while almost every case at that node count hangs (or stalls) without it.
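For illustration only, a minimal sketch of what this work-around looks like, assuming an MPI Fortran code path similar to the one described above (not the actual E3SM/HOMME source; buffer, size, and communicator names are placeholders):

```fortran
! Hypothetical sketch of the work-around: a barrier inserted immediately
! before the large global allreduce.  Names are illustrative, not from E3SM.
subroutine allreduce_with_barrier(local_vals, global_vals, nelem, comm)
  use mpi
  implicit none
  integer, intent(in)  :: nelem, comm
  integer, intent(in)  :: local_vals(nelem)
  integer, intent(out) :: global_vals(nelem)
  integer :: ierr

  ! Work-around: synchronize all ranks first; this should only change the
  ! timing of when tasks enter the (already blocking) allreduce.
  call MPI_Barrier(comm, ierr)
  call MPI_Allreduce(local_vals, global_vals, nelem, MPI_INTEGER, &
                     MPI_SUM, comm, ierr)
end subroutine allreduce_with_barrier
```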
This looks like some kind of system issue, since the allreduce is itself blocking and the barrier would just change the timing of when MPI tasks enter the allreduce. One thing a little atypical about this allreduce is that it is over the global element array (an array of size 120x120x6 for ne120), so it's much larger than all the allreduces done during timestepping.
So far I'm agreeing with you, Mark; I don't see how a barrier before a synchronizing allreduce would prevent a hang. NERSC is collecting more debug data and will report back. Note that the nelem in the allreduce here for ne120 is 86400.
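For reference, an ne120 cubed-sphere grid has 6 × 120 × 120 = 86,400 elements, so assuming one default 4-byte integer per element this single allreduce moves roughly 86,400 × 4 ≈ 346 KB per rank, which is why it stands out relative to the small reductions done during timestepping.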
We eventually discovered that this was caused by something new with these updates to slingshot, but it only happens with the default setting.
HPE is suggesting we use a different setting instead.
I have not found any hangs or other issues with the suggested setting. I will create a PR to make this change.
With software updates to slingshot that are not yet on pm-cpu but could be there soon, I've been testing a variety of things on an internal test machine. We have a work-around for now that does not seem to show the issue.
I just wanted to open this issue to record some of the info I have learned.
It happens with both E3SM and SCREAM. It does not occur in every job, but happens more frequently as the number of MPI ranks increases.
So far, the fewest nodes on which I've seen the stall is 22, and the largest job I've seen work is 256 nodes.
I've only experienced it with ne120 cases. None with ne30, but I also have not run that many ne30 cases.
When stalled, the stack on one rank is the following:
NERSC staff who are also looking at the issue have said they believe it's here:
Is it possible the integer sum there is actually larger than what an integer can hold?
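To make the overflow question concrete, here is a minimal, hypothetical sketch (names and values are placeholders, not the E3SM code, and MPI_INTEGER8 is assumed to be available in the MPI library) of doing the reduction in 64-bit and checking whether the result still fits in a default 32-bit integer:

```fortran
! Hypothetical sketch of the overflow concern: reduce a per-rank count in
! 64-bit and check whether the global sum still fits in a default integer.
program check_sum_overflow
  use mpi
  implicit none
  integer :: ierr
  integer(kind=8) :: local_count, global_count

  call MPI_Init(ierr)

  local_count = 86400_8        ! e.g. a per-rank element count (illustrative)
  call MPI_Allreduce(local_count, global_count, 1, MPI_INTEGER8, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)

  if (global_count > int(huge(0), kind=8)) then
     ! The same sum done with default integers would silently overflow.
     print *, 'global sum exceeds 32-bit integer range:', global_count
  end if

  call MPI_Finalize(ierr)
end program check_sum_overflow
```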