Coupled weather model forecasts fail after large # of file writes when CICE is compiled using PIO #94

LarissaReames-NOAA opened this issue Oct 8, 2024 · 2 comments


@LarissaReames-NOAA

Description

Using CICE in an S2S configuration in ufs-weather-model causes failures after a large number of CICE file writes (restart and/or history, roughly 500-700) when CICE is compiled with PIO, but not when it is compiled with NetCDF. The failure always occurs on a CICE process. The current workaround for the weather model regression tests has been to set export I_MPI_SHM_HEAP_VSIZE=16384 in the job submission script, but this is not a long-term solution.

To Reproduce:

  1. Compile the weather model with ATM+ICE+OCN on Hera, Gaea, or WCOSS2. Multiple weather model regression test configurations and resolutions (cpld_control_c48, cpld_control_nowave_noaero_p8) and spack-stack versions/Intel compilers (2021 vs. 2023) have been used with similar results.
  2. Either run very long simulations with infrequent output or shorter simulations with high-frequency output (see the namelist sketch after this list).
  3. Experience the failure after 500-700 files have been written.
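
For step 2, the fragment below sketches the kind of high-frequency ice_in output settings that reach the failure threshold within a short forecast. The values are illustrative assumptions rather than the exact regression-test settings; histfreq/histfreq_n and dumpfreq/dumpfreq_n are the standard CICE setup_nml controls for history and restart frequency.

  &setup_nml
    ! Hourly history output: stream 1 writes every 1 hour ('h'), so a
    ! multi-week forecast produces well over 500 history files.
    histfreq   = 'h', 'x', 'x', 'x', 'x'
    histfreq_n = 1, 1, 1, 1, 1
    ! Frequent restarts add to the total file count.
    dumpfreq   = 'h'
    dumpfreq_n = 6
    ! restart_format / history_format are deliberately not set here; the
    ! failure occurs for every available option (see Additional context).
  /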

Additional context

The cause of this issue was first reported in ufs-weather-model issue #2320.

I've also tried all possible options for restart_format/history_format in ice_in, and the failure is always the same.

Output

On Hera the failure looks like:

73: Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x150a7a430bcc]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x150a79e0adf1]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b1eb9) [0x150a79ad9eb9]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x176584) [0x150a7999e584]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x17a9f9) [0x150a799a29f9]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x199a60) [0x150a799c1a60]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x1717ec) [0x150a799997ec]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b4387) [0x150a79adc387]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(PMPI_Allreduce+0x561) [0x150a799376e1]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPI_File_open+0x17d) [0x150a7a4492bd]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallel-netcdf-1.12.2-cwokdeb/lib/libpnetcdf.so.4(ncmpio_create+0x199) [0x150a73e592c9]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallel-netcdf-1.12.2-cwokdeb/lib/libpnetcdf.so.4(ncmpi_create+0x4e7) [0x150a73daf4a7]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpioc.so(PIOc_createfile_int+0x2e6) [0x150a7c436696]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpioc.so(PIOc_createfile+0x41) [0x150a7c432451]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpiof.so(piolib_mod_mp_createfile_+0x25e) [0x150a7c1caabe]

On WCOSS2 and Gaea the error looks like:

17: MPICH ERROR [Rank 17] [job id 135188771.0] [Mon Oct  7 17:37:30 2024] [c5n1294] - Abort(806965007) (rank 17 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
17: PMPI_Comm_split(513)................: MPI_Comm_split(comm=0xc400314e, color=1, key=0, new_comm=0x7ffe5bbb2d74) failed
17: PMPI_Comm_split(494)................:
17: MPIR_Comm_split_impl(268)...........:
17: MPIR_Get_contextid_sparse_group(610): Too many communicators (0/2048 free on this process; ignore_id=0)
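
The "Too many communicators" abort points at exhaustion of the fixed pool of communicator context IDs (2048 per process in this MPICH-based MPI). The Fortran sketch below is not the CICE/PIO code; it is a minimal illustration, assuming the failure mode is that a communicator is created for every file open (MPI_File_open via PnetCDF calls MPI_Comm_split internally, as in the Hera backtrace) and is never freed, so the pool runs out after on the order of a couple of thousand opens.

  program comm_leak_sketch
    ! Minimal illustration (not the CICE/PIO code): each iteration creates
    ! a communicator for one "file write" and never frees it. MPICH-based
    ! MPIs have a fixed pool of context IDs (2048 here), so MPI_Comm_split
    ! eventually aborts with "Too many communicators".
    use mpi
    implicit none
    integer :: ierr, rank, io_comm, i

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

    do i = 1, 5000
       ! Stand-in for the per-file setup done on each history/restart write.
       call MPI_Comm_split(MPI_COMM_WORLD, 0, rank, io_comm, ierr)

       ! ... open, write, and close one file on io_comm ...

       ! Without this, the context-ID pool is exhausted after ~2048 splits;
       ! freeing the communicator once the file is closed avoids the abort.
       ! call MPI_Comm_free(io_comm, ierr)
    end do

    call MPI_Finalize(ierr)
  end program comm_leak_sketch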

DeniseWorthen commented Nov 13, 2024

@LarissaReames-NOAA @junwang-noaa We have a proposed fix for this issue now. I reached out to Tony Craig, and he was able to reproduce the issue in standalone CICE and quickly zeroed in on the problem and solution. He was able to generate 8700 files in standalone testing. I'll make a test branch, and hopefully one of us can try it out and ensure it works.

@DeniseWorthen

I've tested Tony's fix (https://github.com/DeniseWorthen/CICE/tree/bugfix/manyfiles) using the C48-5deg case on Gaea. I was able to create 1906 hourly history files before hitting the wall-clock limit (8 hours). So I think I have a fix, although the exact implementation may change a bit.
