floating point exception in LDAS_DEBUGCONUS/model test when using ESMA_env v4.26.0 #726

Open
gmao-rreichle opened this issue Mar 1, 2024 · 6 comments

@gmao-rreichle
Contributor

The LDAS_DEBUGCONUS/model test crashes with a floating point exception when using ESMA_env v4.26.0. The test runs ok with ESMA_env v4.23.0. All other tests (incl. GNUDEBUGCONUS) are ok.

Note that ESMA_env v4.26.0 uses a new version of HDF5.

The GEOSldas "err" and "log" files from the run that crashed are:
GEOSldas_err_txt.txt
GEOSldas_log_txt.txt
The log file suggests that the floating point exception occurs when opening a GEOS nc file with nf90_open() (Line 5376 of LDAS_Forcing.F90).

I overlooked this problem when testing for #713, where I probably only ran the standard tests and not the debug tests.

I suspect the problem is not within ESMA_env v4.26.0 but rather poor coding in LDAS that is exposed by the DEBUG build.

cc: @mathomp4 @weiyuan-jiang @biljanaorescanin

@gmao-rreichle gmao-rreichle self-assigned this Mar 1, 2024
@mathomp4
Member

mathomp4 commented Mar 1, 2024

Hmm. There definitely was no change in netcdf-fortran in the newer Baselibs. And while HDF5 did update, if that caused it, every netcdf open would fall apart.

@mathomp4
Member

mathomp4 commented Mar 1, 2024

Such an odd traceback:

 3  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(H5T__init_native_float_types+0xfc8) [0x2adbc3395c18]
 4  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(H5T_init+0x98) [0x2adbc32fb3f8]
 5  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(H5VL_init_phase2+0x78) [0x2adbc33b8618]
 6  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(H5_init_library+0x26b) [0x2adbc30f6aeb]
 7  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(H5Eset_auto2+0x205) [0x2adbc3172ec5]
 8  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(+0xcb9f33) [0x2adbc2d72f33]
 9  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(nc4_hdf5_initialize+0x1d) [0x2adbc2d72f58]
10  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(NC_HDF5_initialize+0x2b) [0x2adbc2d705a3]
11  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(nc_initialize+0xdd) [0x2adbc2dd15ef]
12  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(NC_open+0x8e) [0x2adbc2d0feff]
13  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(nc_open+0x5d) [0x2adbc2d0eeae]
14  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.so(nf_open_+0xa2) [0x2adbbdd4c372]
15  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.so(netcdf_mp_nf90_open_+0xf9) [0x2adbbdcfa3a9]

It's almost like the file is odd. I think we need to know which file the run was trying to open and take a look at it. I wonder if it's something that HDF5 1.10 let through but that HDF5 1.14 is a bit more sensitive or exacting about?
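One quick way to take a look at a suspect file is to check whether it carries a valid HDF5 signature and which superblock version it declares. The sketch below is only an illustration: the helper name `hdf5_superblock_version` and the 1 MiB search limit are assumptions made here, not part of GEOSldas, MAPL, or the HDF5 tools. It relies on the HDF5 file format rule that the superblock starts with the 8-byte signature at offset 0, 512, 1024, and so on (doubling), followed by a one-byte superblock version.

```python
# 8-byte HDF5 format signature that begins every superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def hdf5_superblock_version(path, max_offset=1 << 20):
    """Return (offset, superblock_version), or None if no signature is found.

    Hypothetical helper for illustration; max_offset is an arbitrary limit.
    """
    offset = 0
    with open(path, "rb") as f:
        while offset <= max_offset:
            f.seek(offset)
            header = f.read(9)  # signature (8 bytes) + version (1 byte)
            if len(header) < 9:
                return None  # ran past the end of the file
            if header[:8] == HDF5_SIGNATURE:
                return offset, header[8]
            # Candidate offsets are 0, 512, 1024, 2048, ...
            offset = 512 if offset == 0 else offset * 2
    return None
```

A result of `None` for a `.nc4` file would point to a damaged or non-HDF5 file. A valid result would not by itself rule the file out, but since the frames above sit under `H5_init_library`, it would make a library-side cause more plausible.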

@gmao-rreichle
Contributor Author

@mathomp4, I think the more useful part of the backtrace is around line 2567 in GEOSldas_err_txt; see the excerpt below. The run is trying to read a MERRA-2 file. Here's the corresponding log entry from the successful CONUS run with standard optimization:

opening file: ../input/met_forcing/MERRA2_land_forcing//MERRA2_400/diag/Y2013/M12/MERRA2_400.tavg1_2d_rad_Nx.20131231.nc4

Excerpt from GEOSldas_err_txt of failed run:

=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
GEOSldas.x         00000000012F695E  Unknown               Unknown  Unknown
libpthread-2.22.s  00002B3489155CE0  Unknown               Unknown  Unknown
GEOSldas.x         00000000012F6D4F  Unknown               Unknown  Unknown
libpthread-2.22.s  00002B3489155CE0  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B3480991C18  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B34808F73F8  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B34809B4618  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B34806F2AEB  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B348076EEC5  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B348036EF33  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B348036EF58  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B348036C5A3  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B34803CD5EF  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B348030BEFF  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B348030AEAE  Unknown               Unknown  Unknown
libMAPL.so         00002B347B348372  Unknown               Unknown  Unknown
libMAPL.so         00002B347B2F63A9  Unknown               Unknown  Unknown
GEOSldas.x         00000000005CF358  ldas_forcemod_mp_        5376  LDAS_Forcing.F90
GEOSldas.x         000000000058468E  ldas_forcemod_mp_        3741  LDAS_Forcing.F90
GEOSldas.x         00000000004C5415  ldas_forcemod_mp_         332  LDAS_Forcing.F90
GEOSldas.x         0000000000487C7D  geos_metforcegrid         708  GEOS_MetforceGridComp.F90
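With a traceback of roughly 2500 lines, it can help to filter out the frames that carry no source information. The following is a minimal sketch, assuming the column layout shown in the excerpt above (image, 16-digit hex PC, routine, line, source); the names `FRAME_RE` and `resolved_frames` are made up for this illustration and are not part of any GEOS tooling.

```python
import re

# One traceback frame: image, 16-hex-digit PC, routine, line number, source.
# The column layout is assumed from the excerpt above.
FRAME_RE = re.compile(
    r"^(?P<image>\S+)\s+(?P<pc>[0-9A-Fa-f]{16})\s+(?P<routine>\S+)"
    r"\s+(?P<line>\d+|Unknown)\s+(?P<source>\S+)\s*$"
)


def resolved_frames(text):
    """Return (routine, line, source) for frames that have a known line."""
    frames = []
    for raw in text.splitlines():
        match = FRAME_RE.match(raw)
        if match and match.group("line") != "Unknown":
            frames.append(
                (match.group("routine"),
                 int(match.group("line")),
                 match.group("source"))
            )
    return frames
```

Applied to the excerpt, this would keep only the four GEOSldas.x frames in LDAS_Forcing.F90 and GEOS_MetforceGridComp.F90, innermost first, and drop the header row and all `Unknown` frames.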

@mathomp4
Member

@gmao-rreichle This might be a moot issue. We've discovered some other issues with HDF5 1.14 in some of our testing. So I might be moving back our HDF5 to 1.10 for now.

Weirdly, the issues we see in Baselibs with HDF5 1.14 don't seem to be happening with Spack + 1.14, so I'm...perplexed.

@weiyuan-jiang
Contributor

I guess this issue is solved by this

@mathomp4
Member

@weiyuan-jiang No. That was an attempt to work around it. I'm currently trying to build Baselibs 7.20.0 everywhere, and then I'll make a new ESMA_env that reverts to HDF5 1.10.
