I've been trying to update several of the module versions on pm-cpu to "ideal" versions. Most tests (with intel, gnu, nvidia) seem OK, but a few are still problematic, and I wanted to save some notes here. For GNU, I want to update to 12.3; NERSC calls the module gcc-native/12.3. I'm also updating these at the same time (some are required as a package update) -- currently trying those in "ideal" below:
Running e3sm_integration, the only tests with issues are DEBUG-built tests, and unfortunately they hang during init. I managed to find two files where I can alter compiler flags and get an FPE instead -- which I consider an improvement, though it may or may not be the same issue causing the hang. For the following 2 files, if I add -O (i.e., disable -O0), I can get the following error with several tests:
561: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
561:
561: Backtrace for this error:
561: #0 0x14c47a423372 in ???
561: #1 0x14c47a422505 in ???
561: #2 0x14c479853dbf in ???
561: #3 0x1a35e91 in __edge_mod_base_MOD_edgevunpack_nlyr
561: at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/components/homme/src/share/edge_mod_base.F90:903
561: #4 0x26be308 in __inidat_MOD_read_inidat
561: at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/components/eam/src/dynamics/se/inidat.F90:643
561: #5 0x1cd9e85 in __startup_initialconds_MOD_initial_conds
561: at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/components/eam/src/control/startup_initialconds.F90:18
561: #6 0x19dbe8e in __inital_MOD_cam_initial
561: at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/components/eam/src/dynamics/se/inital.F90:67
561: #7 0x66335a in __cam_comp_MOD_cam_init
561: at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/components/eam/src/control/cam_comp.F90:162
561: #8 0x651769 in __atm_comp_mct_MOD_atm_init_mct
561: at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/components/eam/src/cpl/atm_comp_mct.F90:371
561: #9 0x49d9bc in __component_mod_MOD_component_init_cc
561: at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/driver-mct/main/component_mod.F90:248
561: #10 0x484b80 in __cime_comp_mod_MOD_cime_init
561: at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/driver-mct/main/cime_comp_mod.F90:1488
561: #11 0x4964d5 in cime_driver
561: at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/driver-mct/main/cime_driver.F90:122
561: #12 0x496611 in main
561: at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/driver-mct/main/cime_driver.F90:23
I can also log in to a compute node (on the similar machine muller-cpu) during a hang and view where one process is sitting:
#0 cxi_eq_peek_event (eq=0x22e12dc8) at /usr/include/cxi_prov_hw.h:1531
#1 cxip_ep_ctrl_eq_progress (ep_obj=0x22e25790, ctrl_evtq=0x22e12dc8, tx_evtq=true, ep_obj_locked=true) at prov/cxi/src/cxip_ctrl.c:318
#2 0x00001503828591dd in cxip_ep_progress (fid=<optimized out>) at prov/cxi/src/cxip_ep.c:186
#3 0x000015038285e969 in cxip_util_cq_progress (util_cq=0x22e15220) at prov/cxi/src/cxip_cq.c:112
#4 0x000015038283a301 in ofi_cq_readfrom (cq_fid=0x22e15220, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:232
#5 0x00001503860fa0f2 in MPIR_Wait_impl () from /opt/cray/pe/lib64/libmpi_intel.so.12
#6 0x0000150386c9b926 in MPIC_Wait () from /opt/cray/pe/lib64/libmpi_intel.so.12
#7 0x0000150386ca7685 in MPIC_Sendrecv () from /opt/cray/pe/lib64/libmpi_intel.so.12
#8 0x0000150386bd232d in MPIR_Alltoall_intra_brucks () from /opt/cray/pe/lib64/libmpi_intel.so.12
#9 0x00001503855bee8a in MPIR_Alltoall_intra_auto.part.0 () from /opt/cray/pe/lib64/libmpi_intel.so.12
#10 0x00001503855bf05c in MPIR_Alltoall_impl () from /opt/cray/pe/lib64/libmpi_intel.so.12
#11 0x00001503855bf83f in PMPI_Alltoall () from /opt/cray/pe/lib64/libmpi_intel.so.12
#12 0x0000150387c4364e in pmpi_alltoall__ () from /opt/cray/pe/lib64/libmpifort_intel.so.12
#13 0x0000000000bcad8f in mpialltoallint (sendbuf=..., sendcnt=1, recvbuf=..., recvcnt=1, comm=-1006632954) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/wrap_mpi.F90:1143
#14 0x0000000002b93c02 in phys_grid::transpose_block_to_chunk (record_size=88, block_buffer=<error reading variable: value requires 2509056 bytes, which is more than max-value-size>, chunk_buffer=<error reading variable: value requires 2452032 bytes, which is more than max-value-size>,
window=<error reading variable: Cannot access memory at address 0x0>) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/physics/cam/phys_grid.F90:4137
#15 0x0000000005304965 in dp_coupling::d_p_coupling (phys_state=..., phys_tend=..., pbuf2d=0x26500aa0, dyn_out=...) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/dp_coupling.F90:242
#16 0x0000000003719020 in stepon::stepon_run1 (dtime_out=1800, phys_state=..., phys_tend=..., pbuf2d=0x26500aa0, dyn_in=..., dyn_out=...) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/stepon.F90:244
#17 0x0000000000948d7c in cam_comp::cam_run1 (cam_in=..., cam_out=...) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/cam_comp.F90:251
#18 0x0000000000905530 in atm_comp_mct::atm_init_mct (eclock=..., cdata_a=..., x2a_a=..., a2x_a=..., nlfilename=..., .tmp.NLFILENAME.len_V$5bab=6) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/cpl/atm_comp_mct.F90:499
#19 0x00000000004a7045 in component_mod::component_init_cc (eclock=..., comp=..., infodata=..., nlfilename=..., seq_flds_x2c_fluxes=..., seq_flds_c2x_fluxes=..., .tmp.NLFILENAME.len_V$7206=6, .tmp.SEQ_FLDS_X2C_FLUXES.len_V$7209=4096, .tmp.SEQ_FLDS_C2X_FLUXES.len_V$720c=4096)
at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/component_mod.F90:257
#20 0x000000000045d9d6 in cime_comp_mod::cime_init () at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_comp_mod.F90:2370
#21 0x000000000049dfc2 in cime_driver () at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_driver.F90:122
where it looks like the problem at frame #14 is on our side (wrap_mpi.F90), just before entering the MPI stack?
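One caveat on reading that backtrace: the `<error reading variable: value requires ... more than max-value-size>` text at frames #14-#15 is just gdb refusing to fetch large arrays for display, not evidence of corruption in those buffers. If the contents matter, the cap can be raised while attached (standard gdb commands):

```
(gdb) set max-value-size unlimited
(gdb) frame 14
(gdb) info args
```

(The `Cannot access memory at address 0x0` for `window` is a separate message: that argument really is a null pointer as far as gdb can see.)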
Noting some tests that complete (all other tests in e3sm_integration):
SMS_D.ne30_oECv3_gis.IGELM_MLI.pm-cpu_gnu.elm-extrasnowlayers
SMS_D.ne30pg2_r05_IcoswISC30E3r5.GPMPAS-JRA.pm-cpu_gnu.mosart-rof_ocn_2way
SMS_D_Ln3.ne4pg2_ne4pg2.FAQP.pm-cpu_gnu
SMS_D.ne30pg2_ne30pg2.IELMTEST.pm-cpu_gnu
It looks like when I simply try the machine defaults, the tests that were failing are OK. That's an easy fix, but it means that to upgrade other compilers (such as intel), we will need different versions of things like mpich depending on the compiler. Which is maybe fine -- I will make a branch/PR.
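For reference, per-compiler module versions can be expressed in CIME's config_machines.xml, where `<modules>` blocks accept a `compiler` attribute. This is a hypothetical sketch only: gcc-native/12.3 is the one module named in these notes, and everything else here is a placeholder rather than the actual pm-cpu settings.

```
<!-- Hypothetical fragment; only gcc-native/12.3 comes from the notes above. -->
<module_system type="module">
  <modules compiler="gnu">
    <command name="load">gcc-native/12.3</command>
    <command name="load">cray-mpich</command>  <!-- version pinned for gnu -->
  </modules>
  <modules compiler="intel">
    <command name="load">cray-mpich</command>  <!-- possibly a different version -->
  </modules>
</module_system>
```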
Another issue is that I already tested and merged a PR to the scream repo that updates pm-gpu with the "ideal" version of the GNU compiler. Nothing stops us from having two different versions on the two different compute clusters, but of course it's best to have them the same.
Well whadyaknow... coming back to this, I verified I still see the same hang with the newer GCC for at least two of the tests above, and then, trying again with kdreg2 (#6687), at least one case does not hang.
So using kdreg2 may help here as well. I will do more testing and can hopefully update the GCC compiler.
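For anyone reproducing this by hand: kdreg2 is normally selected through libfabric's memory-registration cache monitor variable. I'm assuming here that #6687 does the equivalent in the machine config (I haven't quoted its diff); the manual knob is:

```shell
# Select the kdreg2 MR cache monitor for libfabric/cxi (set before srun).
# Assumption: this mirrors what PR #6687 configures; value name per libfabric.
export FI_MR_CACHE_MONITOR=kdreg2
```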