
On muller-cpu, with an update to slingshot software, we are seeing a stall/hang in init #6655

Open
ndkeen opened this issue Oct 1, 2024 · 4 comments · May be fixed by #6687
Labels
HOMME · pm-cpu (Perlmutter at NERSC, CPU-only nodes) · PotentialBug

Comments

@ndkeen
Contributor

ndkeen commented Oct 1, 2024

With software updates to slingshot (not yet on pm-cpu, but possibly soon), I've been testing a variety of things on an internal test machine. We have a work-around for now that does not seem to show the issue.
I just wanted to open this issue to record some of the info I have learned.

It happens with both E3SM and SCREAM. It does not occur in every job, but it happens more frequently as the number of MPI ranks increases.
So far, the fewest nodes at which I've seen the stall is 22, and the most at which I've seen it work is 256 nodes.
I've only experienced it with ne120 cases. None with ne30, but I also have not run that many ne30 cases.

When stalled, the stack on one rank is the following:

#0  0x000014dbd06f9f22 in MPIR_Wait_impl () from /opt/cray/pe/lib64/libmpi_intel.so.12
#1  0x000014dbd129b926 in MPIC_Wait () from /opt/cray/pe/lib64/libmpi_intel.so.12
#2  0x000014dbd129fe29 in MPIC_Recv () from /opt/cray/pe/lib64/libmpi_intel.so.12
#3  0x000014dbd11cff68 in MPIR_Allreduce_intra_reduce_scatter_allgather () from /opt/cray/pe/lib64/libmpi_intel.so.12
#4  0x000014dbcfb72630 in MPIR_Allreduce_intra_auto () from /opt/cray/pe/lib64/libmpi_intel.so.12
#5  0x000014dbcfb726f1 in MPIR_Allreduce_impl () from /opt/cray/pe/lib64/libmpi_intel.so.12
#6  0x000014dbd141e1a7 in MPIR_CRAY_Allreduce () from /opt/cray/pe/lib64/libmpi_intel.so.12
#7  0x000014dbcfbbe27c in PMPI_Allreduce () from /opt/cray/pe/lib64/libmpi_intel.so.12
#8  0x000014dbd2243856 in pmpi_allreduce__ () from /opt/cray/pe/lib64/libmpifort_intel.so.12
#9  0x000000000160ec9b in dof_mod::setelemoffset (par=<error reading variable: Cannot access memory at address 0x0>, elem=<error reading variable: Location address is not set.>, globaluniquecolsp=0) at /mscratch/sd/n/ndk/repos/ms12-sep20/components/homme/src/share/dof_mod.F90:299
#10 0x00000000016b81ee in prim_driver_base::prim_init1_geometry (elem=<error reading variable: Cannot access memory at address 0x0>, par=<error reading variable: Cannot access memory at address 0x0>, dom_mt=0x0) at /mscratch/sd/n/ndk/repos/ms12-sep20/components/homme/src/share/prim_driver_base.F90:579
#11 0x00000000016b86a3 in prim_driver_base::prim_init1 (elem=<error reading variable: Cannot access memory at address 0x0>, par=<error reading variable: Cannot access memory at address 0x0>, dom_mt=0x0, tl=...) at /mscratch/sd/n/ndk/repos/ms12-sep20/components/homme/src/share/prim_driver_base.F90:109
#12 0x0000000001e2c777 in dyn_comp::dyn_init1 (fh=<error reading variable: Cannot access memory at address 0x0>, nlfilename=..., dyn_in=..., dyn_out=..., .tmp.NLFILENAME.len_V$9dc=64) at /mscratch/sd/n/ndk/repos/ms12-sep20/components/eam/src/dynamics/se/dyn_comp.F90:217
#13 0x00000000015eb3b2 in inital::cam_initial (dyn_in=<error reading variable: Location address is not set.>, dyn_out=<error reading variable: Location address is not set.>, nlfilename=<error reading variable: value requires 118866368 bytes, which is more than max-value-size>, .tmp.NLFILENAME.len_V$589=22934354183216)
    at /mscratch/sd/n/ndk/repos/ms12-sep20/components/eam/src/dynamics/se/inital.F90:39
#14 0x000000000055bd58 in cam_comp::cam_init (cam_out=<error reading variable: Cannot access memory at address 0x0>, cam_in=<error reading variable: Cannot access memory at address 0x0>, mpicom_atm=0, start_ymd=0, start_tod=<error reading variable: Cannot access memory at address 0x40>, 
    ref_ymd=<error reading variable: Cannot access memory at address 0x0>, ref_tod=0, stop_ymd=10106, stop_tod=0, perpetual_run=.FALSE., perpetual_ymd=-999, calendar=..., .tmp.CALENDAR.len_V$3a6e=80) at /mscratch/sd/n/ndk/repos/ms12-sep20/components/eam/src/control/cam_comp.F90:162
#15 0x0000000000552b27 in atm_comp_mct::atm_init_mct (eclock=<error reading variable: Cannot access memory at address 0x0>, cdata_a=<error reading variable: Cannot access memory at address 0x0>, x2a_a=..., a2x_a=..., nlfilename=<error reading variable: value requires 140720308486151 bytes, which is more than max-value-size>, 
    .tmp.NLFILENAME.len_V$5bab=0) at /mscratch/sd/n/ndk/repos/ms12-sep20/components/eam/src/cpl/atm_comp_mct.F90:369
#16 0x0000000000465128 in component_mod::component_init_cc (eclock=<error reading variable: Cannot access memory at address 0x0>, comp=<error reading variable: Location address is not set.>, infodata=..., nlfilename=<error reading variable: value requires 70378676 bytes, which is more than max-value-size>, 
    seq_flds_x2c_fluxes=<error reading variable: value requires 70378676 bytes, which is more than max-value-size>, seq_flds_c2x_fluxes=<error reading variable: value requires 70378676 bytes, which is more than max-value-size>, .tmp.NLFILENAME.len_V$70c7=6, .tmp.SEQ_FLDS_X2C_FLUXES.len_V$70ca=0, .tmp.SEQ_FLDS_C2X_FLUXES.len_V$70cd=0)
    at /mscratch/sd/n/ndk/repos/ms12-sep20/driver-mct/main/component_mod.F90:257
#17 0x00000000004517f6 in cime_comp_mod::cime_init () at /mscratch/sd/n/ndk/repos/ms12-sep20/driver-mct/main/cime_comp_mod.F90:1488
#18 0x0000000000461f82 in cime_driver () at /mscratch/sd/n/ndk/repos/ms12-sep20/driver-mct/main/cime_driver.F90:122

NERSC staff who are also looking at the issue believe it is here:

     call MPI_Allreduce(gOffset,numElem2P,nelem,MPIinteger_t,MPI_SUM,par%comm,ierr)

which is in:

  subroutine SetElemOffset(par,elem,GlobalUniqueColsP)

The file reference is components/homme/src/share/dof_mod.F90, line 299.

Is it possible the integer sum there actually exceeds the integer range (i.e., overflows)?
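
If overflow is the concern, one quick way to test that hypothesis is to repeat the reduction in 64-bit integers and compare against the 32-bit limit. This is only a sketch, assuming gOffset/numElem2P are default-integer arrays of length nelem as in the call above; the 64-bit temporary and the print are illustrative, and the declaration would go with the subroutine's other declarations:

     ! Hypothetical debug check: redo the sum with 64-bit integers and see
     ! whether any summed entry exceeds the default (32-bit) integer range.
     integer(kind=8), allocatable :: gOffset64(:)
     allocate(gOffset64(nelem))
     gOffset64 = int(gOffset(1:nelem), kind=8)
     call MPI_Allreduce(MPI_IN_PLACE, gOffset64, nelem, MPI_INTEGER8, MPI_SUM, par%comm, ierr)
     if (maxval(gOffset64) > int(huge(1), kind=8)) then
        print *, 'SetElemOffset: summed value exceeds 32-bit integer range: ', maxval(gOffset64)
     end if
     deallocate(gOffset64)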

@ndkeen ndkeen added PotentialBug pm-cpu Perlmutter at NERSC (CPU-only nodes) HOMME labels Oct 1, 2024
@ndkeen
Contributor Author

ndkeen commented Oct 2, 2024

I have found that simply adding a barrier before the above MPI_Allreduce also seems to resolve the issue (i.e., with default settings I no longer see the stall/hang).

     call MPI_Barrier(par%comm, ierr)                                                                                                           
     call MPI_Allreduce(gOffset,numElem2P,nelem,MPIinteger_t,MPI_SUM,par%comm,ierr) 

While I think the algorithm here could use some improvement, I don't see anything that is obviously a problem (nothing that should require a barrier).
I've also been adding debug checks to test at scale.

I've seen about 12 cases at 256 nodes that are OK with this "fix", while almost every case at that node count hangs (or stalls) without it.
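
For reference, the kind of debug check I've been adding is a consistency test on the result of that allreduce. This is only a sketch (the 64-bit checksum variables and the message are illustrative, and it assumes numElem2P holds the globally summed values after the call): every rank recomputes a checksum and we verify that all ranks agree.

     ! Hypothetical consistency check after the MPI_Allreduce above: every
     ! rank should now hold identical summed values, so a 64-bit checksum
     ! must have the same min and max across the communicator.
     integer(kind=8) :: chksum, chkmin, chkmax
     chksum = sum(int(numElem2P(1:nelem), kind=8))
     call MPI_Allreduce(chksum, chkmin, 1, MPI_INTEGER8, MPI_MIN, par%comm, ierr)
     call MPI_Allreduce(chksum, chkmax, 1, MPI_INTEGER8, MPI_MAX, par%comm, ierr)
     if (chkmin /= chkmax) then
        print *, 'SetElemOffset: allreduce result differs across ranks'
     end if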

@ndkeen ndkeen changed the title from "On pm-cpu, with an upcoming update to slingshot, we are seeing a stall/hang in init" to "On muller-cpu, with an update to slingshot software, we are seeing a stall/hang in init" Oct 2, 2024
@mt5555
Contributor

mt5555 commented Oct 2, 2024

This looks like some kind of system issue: the allreduce is itself blocking, so it seems the barrier would just change the timing of when MPI tasks enter the allreduce.

One thing a little atypical about this allreduce is that it is over the global element array (an array of size 120x120x6 for ne120), so it's much larger than all the allreduces done during the timestepping.

@ndkeen
Contributor Author

ndkeen commented Oct 2, 2024

So far I'm agreeing with you, Mark -- I don't see how a barrier before a synchronizing allreduce would prevent a hang. NERSC is collecting more debug data and will report back.

Note the nelem in the allreduce here for ne120 is 86400.
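
For a rough sense of scale (assuming MPIinteger_t maps to a default 4-byte integer, which I haven't verified here):

    120 x 120 x 6 = 86,400 entries x 4 bytes ≈ 338 KB of reduce buffer per rank

which lines up with Mark's point that this is much larger than the allreduces done during timestepping.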

@ndkeen
Contributor Author

ndkeen commented Oct 15, 2024

We eventually discovered that this was caused by something new in these slingshot updates, but it only happens with the default

FI_MR_CACHE_MONITOR=userfaultfd

HPE is suggesting we instead use

FI_MR_CACHE_MONITOR=kdreg2

I have not found any hangs or other issues when using kdreg2. It might be about 1% slower for a 256-node CPU job (which uses 128 MPI ranks per node). HPE explains that kdreg2 is the future, so it might become the default one day.

I will create a PR to make this change.
It might be that we only run into the hangs with the default setting in higher node-count cases.

@ndkeen ndkeen linked a pull request Oct 16, 2024 that will close this issue