
On muller-cpu, with an update to slingshot software, we are seeing a stall/hang in init #6655

Open
ndkeen opened this issue Oct 1, 2024 · 4 comments · May be fixed by #6687
Labels
HOMME · pm-cpu (Perlmutter at NERSC, CPU-only nodes) · PotentialBug

Comments

@ndkeen
Contributor

ndkeen commented Oct 1, 2024

With software updates to slingshot (not yet on pm-cpu, but possibly soon), I've been testing a variety of things on an internal test machine. We have a work-around for now that does not seem to show the issue.
I just wanted to open this issue to record some of the info I have learned.

It happens with both E3SM and SCREAM. It does not occur in every job, but it happens more frequently as the number of MPI ranks increases.
So far, the fewest nodes at which I've seen the stall is 22, and the most at which I've seen it work is 256 nodes.
I've only experienced it with ne120 cases. None with ne30, but I also have not run that many ne30 cases.

When stalled, the stack on one rank is the following:

#0  0x000014dbd06f9f22 in MPIR_Wait_impl () from /opt/cray/pe/lib64/libmpi_intel.so.12
#1  0x000014dbd129b926 in MPIC_Wait () from /opt/cray/pe/lib64/libmpi_intel.so.12
#2  0x000014dbd129fe29 in MPIC_Recv () from /opt/cray/pe/lib64/libmpi_intel.so.12
#3  0x000014dbd11cff68 in MPIR_Allreduce_intra_reduce_scatter_allgather () from /opt/cray/pe/lib64/libmpi_intel.so.12
#4  0x000014dbcfb72630 in MPIR_Allreduce_intra_auto () from /opt/cray/pe/lib64/libmpi_intel.so.12
#5  0x000014dbcfb726f1 in MPIR_Allreduce_impl () from /opt/cray/pe/lib64/libmpi_intel.so.12
#6  0x000014dbd141e1a7 in MPIR_CRAY_Allreduce () from /opt/cray/pe/lib64/libmpi_intel.so.12
#7  0x000014dbcfbbe27c in PMPI_Allreduce () from /opt/cray/pe/lib64/libmpi_intel.so.12
#8  0x000014dbd2243856 in pmpi_allreduce__ () from /opt/cray/pe/lib64/libmpifort_intel.so.12
#9  0x000000000160ec9b in dof_mod::setelemoffset (par=<error reading variable: Cannot access memory at address 0x0>, elem=<error reading variable: Location address is not set.>, globaluniquecolsp=0) at /mscratch/sd/n/ndk/repos/ms12-sep20/components/homme/src/share/dof_mod.F90:299
#10 0x00000000016b81ee in prim_driver_base::prim_init1_geometry (elem=<error reading variable: Cannot access memory at address 0x0>, par=<error reading variable: Cannot access memory at address 0x0>, dom_mt=0x0) at /mscratch/sd/n/ndk/repos/ms12-sep20/components/homme/src/share/prim_driver_base.F90:579
#11 0x00000000016b86a3 in prim_driver_base::prim_init1 (elem=<error reading variable: Cannot access memory at address 0x0>, par=<error reading variable: Cannot access memory at address 0x0>, dom_mt=0x0, tl=...) at /mscratch/sd/n/ndk/repos/ms12-sep20/components/homme/src/share/prim_driver_base.F90:109
#12 0x0000000001e2c777 in dyn_comp::dyn_init1 (fh=<error reading variable: Cannot access memory at address 0x0>, nlfilename=..., dyn_in=..., dyn_out=..., .tmp.NLFILENAME.len_V$9dc=64) at /mscratch/sd/n/ndk/repos/ms12-sep20/components/eam/src/dynamics/se/dyn_comp.F90:217
#13 0x00000000015eb3b2 in inital::cam_initial (dyn_in=<error reading variable: Location address is not set.>, dyn_out=<error reading variable: Location address is not set.>, nlfilename=<error reading variable: value requires 118866368 bytes, which is more than max-value-size>, .tmp.NLFILENAME.len_V$589=22934354183216)
    at /mscratch/sd/n/ndk/repos/ms12-sep20/components/eam/src/dynamics/se/inital.F90:39
#14 0x000000000055bd58 in cam_comp::cam_init (cam_out=<error reading variable: Cannot access memory at address 0x0>, cam_in=<error reading variable: Cannot access memory at address 0x0>, mpicom_atm=0, start_ymd=0, start_tod=<error reading variable: Cannot access memory at address 0x40>, 
    ref_ymd=<error reading variable: Cannot access memory at address 0x0>, ref_tod=0, stop_ymd=10106, stop_tod=0, perpetual_run=.FALSE., perpetual_ymd=-999, calendar=..., .tmp.CALENDAR.len_V$3a6e=80) at /mscratch/sd/n/ndk/repos/ms12-sep20/components/eam/src/control/cam_comp.F90:162
#15 0x0000000000552b27 in atm_comp_mct::atm_init_mct (eclock=<error reading variable: Cannot access memory at address 0x0>, cdata_a=<error reading variable: Cannot access memory at address 0x0>, x2a_a=..., a2x_a=..., nlfilename=<error reading variable: value requires 140720308486151 bytes, which is more than max-value-size>, 
    .tmp.NLFILENAME.len_V$5bab=0) at /mscratch/sd/n/ndk/repos/ms12-sep20/components/eam/src/cpl/atm_comp_mct.F90:369
#16 0x0000000000465128 in component_mod::component_init_cc (eclock=<error reading variable: Cannot access memory at address 0x0>, comp=<error reading variable: Location address is not set.>, infodata=..., nlfilename=<error reading variable: value requires 70378676 bytes, which is more than max-value-size>, 
    seq_flds_x2c_fluxes=<error reading variable: value requires 70378676 bytes, which is more than max-value-size>, seq_flds_c2x_fluxes=<error reading variable: value requires 70378676 bytes, which is more than max-value-size>, .tmp.NLFILENAME.len_V$70c7=6, .tmp.SEQ_FLDS_X2C_FLUXES.len_V$70ca=0, .tmp.SEQ_FLDS_C2X_FLUXES.len_V$70cd=0)
    at /mscratch/sd/n/ndk/repos/ms12-sep20/driver-mct/main/component_mod.F90:257
#17 0x00000000004517f6 in cime_comp_mod::cime_init () at /mscratch/sd/n/ndk/repos/ms12-sep20/driver-mct/main/cime_comp_mod.F90:1488
#18 0x0000000000461f82 in cime_driver () at /mscratch/sd/n/ndk/repos/ms12-sep20/driver-mct/main/cime_driver.F90:122

NERSC staff who are also looking at the issue believe it is here:

     call MPI_Allreduce(gOffset,numElem2P,nelem,MPIinteger_t,MPI_SUM,par%comm,ierr)

which is in:

  subroutine SetElemOffset(par,elem,GlobalUniqueColsP)

The file reference is components/homme/src/share/dof_mod.F90, line 299.

Is it possible the integer sum there actually exceeds the integer range (i.e., overflows)?
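
If overflow is the concern, one quick way to test that hypothesis is to repeat the reduction in 64-bit integers and compare against the 32-bit limit. This is only a sketch, assuming gOffset/numElem2P are default-integer arrays of length nelem as in the call above; the 64-bit temporary and the print are illustrative, and the declaration would go with the subroutine's other declarations:

     ! Hypothetical debug check: redo the sum with 64-bit integers and see
     ! whether any summed entry exceeds the default (32-bit) integer range.
     integer(kind=8), allocatable :: gOffset64(:)
     allocate(gOffset64(nelem))
     gOffset64 = int(gOffset(1:nelem), kind=8)
     call MPI_Allreduce(MPI_IN_PLACE, gOffset64, nelem, MPI_INTEGER8, MPI_SUM, par%comm, ierr)
     if (maxval(gOffset64) > int(huge(1), kind=8)) then
        print *, 'SetElemOffset: summed value exceeds 32-bit integer range: ', maxval(gOffset64)
     end if
     deallocate(gOffset64)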

@ndkeen ndkeen added PotentialBug pm-cpu Perlmutter at NERSC (CPU-only nodes) HOMME labels Oct 1, 2024
@ndkeen
Contributor Author

ndkeen commented Oct 2, 2024

I have found that simply adding a barrier before the above MPI_Allreduce also seems to resolve the issue (i.e., with default settings I no longer see the stall/hang).

     call MPI_Barrier(par%comm, ierr)                                                                                                           
     call MPI_Allreduce(gOffset,numElem2P,nelem,MPIinteger_t,MPI_SUM,par%comm,ierr) 

While I think the algorithm here could use some improvement, I don't see anything that is obviously a problem (nothing that should require a barrier).
I've also been adding debug checks to test at scale.

I've seen about 12 cases at 256 nodes that are OK with this "fix", while almost every case at that node count hangs (or stalls) without it.
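
For reference, the kind of debug check I've been adding is a consistency test on the result of that allreduce. This is only a sketch (the 64-bit checksum variables and the message are illustrative, and it assumes numElem2P holds the globally summed values after the call): every rank recomputes a checksum and we verify that all ranks agree.

     ! Hypothetical consistency check after the MPI_Allreduce above: every
     ! rank should now hold identical summed values, so a 64-bit checksum
     ! must have the same min and max across the communicator.
     integer(kind=8) :: chksum, chkmin, chkmax
     chksum = sum(int(numElem2P(1:nelem), kind=8))
     call MPI_Allreduce(chksum, chkmin, 1, MPI_INTEGER8, MPI_MIN, par%comm, ierr)
     call MPI_Allreduce(chksum, chkmax, 1, MPI_INTEGER8, MPI_MAX, par%comm, ierr)
     if (chkmin /= chkmax) then
        print *, 'SetElemOffset: allreduce result differs across ranks'
     end if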

@ndkeen ndkeen changed the title from "On pm-cpu, with an upcoming update to slingshot, we are seeing a stall/hang in init" to "On muller-cpu, with an update to slingshot software, we are seeing a stall/hang in init" Oct 2, 2024
@mt5555
Contributor

mt5555 commented Oct 2, 2024

This looks like some kind of system issue: the allreduce is itself blocking, so it seems the barrier would just change the timing of when MPI tasks enter the allreduce.

One thing a little atypical about this allreduce is that it is over the global element array (an array of size 120x120x6 for ne120), so it's much larger than all the allreduces done during the timestepping.

@ndkeen
Contributor Author

ndkeen commented Oct 2, 2024

So far I'm agreeing with you, Mark -- I don't see how a barrier before a synchronizing allreduce would prevent a hang. NERSC is collecting more debug data and will report back.

Note the nelem in the allreduce here for ne120 is 86400.
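
For a rough sense of scale (assuming MPIinteger_t maps to a default 4-byte integer, which I haven't verified here):

    120 x 120 x 6 = 86,400 entries x 4 bytes ≈ 338 KB of reduce buffer per rank

which lines up with Mark's point that this is much larger than the allreduces done during timestepping.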

@ndkeen
Contributor Author

ndkeen commented Oct 15, 2024

We eventually discovered that this was caused by something new in these slingshot updates, but it only happens with the default

FI_MR_CACHE_MONITOR=userfaultfd

HPE is suggesting we instead use

FI_MR_CACHE_MONITOR=kdreg2

I have not found any hangs or other issues when using kdreg2. It might be about 1% slower for a 256-node CPU job (which uses 128 MPI ranks per node). HPE explains that kdreg2 is the future, so it might become the default one day.

I will create a PR to make this change.
It might be that we only run into the hangs with the default setting in higher node-count cases.

@ndkeen ndkeen linked a pull request Oct 16, 2024 that will close this issue