Hang for test using nvidia compiler only for certain smaller MPI counts ERS.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst
#6521
This test has still been failing/hanging. Adding a little more detail: these tests seem OK, so it looks like it's a combination of the multi_inst modifier with newer nvidia compilers. These tests pass with nvidia 23.9 as well as 24.5.
And these tests seem to have the same fail/hang issue:
Where the flow might be during the hang:
OK, there might be an issue with how it launches more tasks/jobs: if I force the test to land on one node only, it passes. That is:
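A hedged sketch of how one might pin the case to a single node for this check, assuming a generated CIME case directory (the case path below is hypothetical, not taken from the issue):

```shell
# Illustrative sketch only: force the test onto one full pm-cpu node.
# The case directory path is hypothetical.
cd /path/to/ERS.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst.caseroot
./xmlchange NTASKS=128        # one pm-cpu node instead of 192 tasks (1.5 nodes)
./case.setup --reset && ./case.build && ./case.submit
```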
I created PR #6581 to use 3 full nodes (384 MPIs) instead of the current odd value of 192 MPIs (1.5 nodes). There must have been a reason why I used 192 here -- and indeed a search reminded me of the reason. I want to keep this issue open, as it's still odd that certain MPI counts cause a hang while others don't.
…next (PR #6581): Currently, the tests for this resolution use 192 MPIs on pm-cpu, which is an odd value (1.5 nodes). Here it's being changed to use -3 (i.e. 3 full nodes, or 384 MPIs). Example of a test that would use this layout: SMS.hcru_hcru.IELM. This change is an effective work-around (but not a fix) for #6521, with #6486 in mind as noted below. [bfb]
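The PR's layout change can be sketched as below, assuming CIME's convention that a negative NTASKS value requests whole nodes rather than individual tasks (the case path is hypothetical, not from the PR):

```shell
# Illustrative sketch of the PR #6581 layout change.
# In CIME, a negative NTASKS is interpreted as a node count.
cd /path/to/SMS.hcru_hcru.IELM.caseroot   # hypothetical case directory
./xmlchange NTASKS=-3                     # 3 full pm-cpu nodes = 384 MPI tasks
./case.setup --reset
```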
Merged #6581, so we should not see the issue on cdash.
ERS.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst
It may be that #6687 will also address this issue.
This looks like a new test -- it is failing on pm-cpu with the nvidia compiler. Based on the dates of the log files, it looks like the test is hanging.
Note the current MPI count used by default for this test is 192, which is 1.5 nodes on pm-cpu (128 tasks per node).
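The node arithmetic behind the "odd value" remark can be checked directly; this sketch assumes 128 cores per pm-cpu node, which follows from 192 tasks being 1.5 nodes:

```shell
# 192 tasks / 128 cores-per-node = 1.5 nodes (second node only half filled);
# 384 tasks / 128 = 3.0 nodes (the whole-node layout from PR #6581).
for tasks in 192 384; do
  awk -v t="$tasks" 'BEGIN { printf "%d tasks -> %.1f pm-cpu nodes\n", t, t/128 }'
done
```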