segmentation fault during shutdown, sedgem, can't generate output for restart #264

Open
evansmn opened this issue Oct 24, 2024 · 4 comments


evansmn commented Oct 24, 2024

This is with release v0.9.33. At the end of the simulation, shutdown fails during the sedgem shutdown step. Is there any way to recover without rerunning the experiment? This is a spinup simulation, run for 1e6 years without acceleration. I would like to use its output as the restart for another spinup simulation, but attempting to do so produces a similar segmentation fault, reported at the same line number in genie.job, after only a few years of simulation/saving. The error message is below.

        Initialising SEDGEM module shutdown ...

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7f8b700218c2 in ???
#1 0x7f8b70020a55 in ???
#2 0x7f8b6fd6204f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#3 0x563636d2b3db in ???
#4 0x563636d05838 in ???
#5 0x56363696a9ca in ???
#6 0x563636973f5e in ???
#7 0x56363695291e in ???
#8 0x7f8b6fd4d249 in __libc_start_call_main
at ../sysdeps/nptl/libc_start_call_main.h:58
#9 0x7f8b6fd4d304 in __libc_start_main_impl
at ../csu/libc-start.c:360
#10 0x563636952940 in ???
#11 0xffffffffffffffff in ???
./genie.job: line 357: 1263740 Segmentation fault ./genie.exe

real 26121m14.894s
user 26117m14.995s
sys 1m44.597s
cp: cannot stat 'fort.2': No such file or directory
ERROR: !!!!!!!!!! ERROR PROCESSING !!!!!!!!!!

Thanks in advance for suggestions on how to proceed.

As advised, I have put the base-config and user-config files I was using, the entire output of the 1 Myr experiment, and the restart it started from here:

https://umd.box.com/s/qhp196dotupisnd8ufnbkxvjfpvmm7qj

Run command was:

./runmuffin.sh cgenie.eb_go_gs_ac_bg_sg_rg_gl_eg.wolr0570t6.BASES PALEO exp27.CBSGRL.wolr0570t6.OMEN_Prdxo.SPIN4 1000000 exp27.CBSGRL.wolr0570t6.OMEN_Prdxo.SPIN3c

If I copy the user-config and run it with the failed experiment as the restart, the same error crops up within the first few years, e.g.

cp -rp exp27.CBSGRL.wolr0570t6.OMEN_Prdxo.SPIN4 exp27.CBSGRL.wolr0570t6.OMEN_Prdxo.SPIN5

./runmuffin.sh cgenie.eb_go_gs_ac_bg_sg_rg_gl_eg.wolr0570t6.BASES PALEO exp27.CBSGRL.wolr0570t6.OMEN_Prdxo.SPIN5 100 exp27.CBSGRL.wolr0570t6.OMEN_Prdxo.SPIN4

Let me know if you can see it OK?

Thank you!


evansmn commented Oct 27, 2024

Thinking about/investigating this further:

  • I have some non-accelerated simulations very much like this one, that run for 500ky without problem.
  • The problem seems to arise when creating/writing sedcore.nc with timeseries data in the sedgem/ directory. I can read fields_sedgem_2d.nc, so that gets written OK. As a test, I set sedgem to write no sediment cores, i.e. set all entries in the .sedcoremask.dat file to zero.
  • I was wondering if the malloc error was because these files might be very large; but it looks like netcdf-4 has unlimited file size.
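The "no sediment cores" test described above can be sketched as a small shell helper. This is only an illustration: it assumes the mask file is a plain grid of whitespace-separated numbers and that `0.0` is an acceptable way to write zero; the function name `zero_mask` and both file paths are hypothetical.

```shell
# Hypothetical helper: write a copy of a whitespace-separated numeric mask
# with every entry set to 0.0, keeping the same number of rows and columns.
# Assumption: the mask is a plain text grid of numbers, one row per line.
zero_mask() {
    # $1 = input mask file, $2 = output file (both paths are placeholders)
    awk '{
        line = ""
        for (i = 1; i <= NF; i++) line = line (i > 1 ? " " : "") "0.0"
        print line
    }' "$1" > "$2"
}
```

It would be called as `zero_mask original.sedcoremask.dat zeroed.sedcoremask.dat`, writing to a new file so that the original mask is kept for later runs.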


evansmn commented Nov 16, 2024

Following up:

  • The test without writing sediment cores finished successfully. This seems to indicate the problem is in sedgem, writing to sedcore.nc.
  • But a restart from that experiment then failed after two years with another segmentation fault:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7fbf4c6218c2 in ???
#1 0x7fbf4c620a55 in ???
#2 0x7fbf4c43b04f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#3 0x7fbf4c493fbf in unlink_chunk
at ./malloc/malloc.c:1622
#4 0x7fbf4c496ddc in _int_malloc
at ./malloc/malloc.c:4303
#5 0x7fbf4c497a09 in __GI___libc_malloc
at ./malloc/malloc.c:3315
#6 0x561ecc02c024 in ???
#7 0x561ecbee34d6 in ???
#8 0x561ecbc96802 in ???
#9 0x561ecbc9b81c in ???
#10 0x561ecbc7f91e in ???
#11 0x7fbf4c426249 in __libc_start_call_main
at ../sysdeps/nptl/libc_start_call_main.h:58
#12 0x7fbf4c426304 in __libc_start_main_impl
at ../csu/libc-start.c:360
#13 0x561ecbc7f940 in ???
#14 0xffffffffffffffff in ???
./genie.job: line 357: 249448 Segmentation fault ./genie.exe

real 0m7.147s
user 0m5.430s
sys 0m1.443s
cp: cannot stat 'fort.2': No such file or directory
ERROR: !!!!!!!!!! ERROR PROCESSING !!!!!!!!!!

So this is a similar error, with some additional/different information, but reported at the same place: the error catching at line 357 of ./genie.job.

Looking at the output of the successfully completed simulation:

  • the _restart.nc and fields_sedgem_2d.nc files written at the end of the successful run look like valid netCDF files, but empty, i.e. the fields are all full of missing values, as if the files were created but not filled with data.
  • I can see for instance from that global percent CaCO3 is ~ 46%, but there are only missing values in the field 'sedcore_CaCO3' in the netcdf file: fields_sedgem_2d.nc
  • I can see reasonable values in sed_POC from _restart.nc.
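One way to put a number on "only missing values" is to count fill entries in the text that `ncdump -v` prints for a single variable. The sketch below assumes the missing entries are stored as the variable's `_FillValue`, which `ncdump` prints as `_`. If the files mark missing data some other way (for example a `missing_value` attribute holding a large number), those entries would be counted as valid here. The function name `count_filled` is hypothetical.

```shell
# Hypothetical helper: read CDL text (as printed by `ncdump -v VAR FILE`)
# on stdin and print "<non-fill count> <fill count>" for the data section.
# Assumption: fill values appear as "_" in the ncdump output.
count_filled() {
    awk '
        /^data:/ { in_data = 1; next }   # everything before this is header
        !in_data { next }
        {
            sub(/^[^=]*=/, "")           # drop the "name =" prefix if present
            gsub(/[,;}]/, " ")           # separators and closing brace to spaces
            for (i = 1; i <= NF; i++) {
                if ($i == "_") fill++; else valid++
            }
        }
        END { printf "%d %d\n", valid, fill }
    '
}
```

For the field mentioned above this would be run as `ncdump -v sedcore_CaCO3 fields_sedgem_2d.nc | count_filled`; a first number of 0 would confirm that no real data were written to that variable.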

I've uploaded the output of the successful run, the output from the failed restart, and the user-config for the latter to the same Box directory. That user-config was just a copy of the previous one, restarted from the prior output with 500ky requested instead of 1My.

Any ideas? Thanks!


evansmn commented Nov 16, 2024

For what it is worth,

ulimit -a
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1030080
max locked memory (kbytes, -l) 32970048
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1030080
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited


evansmn commented Nov 19, 2024

Further information: the problem seems to originate with this restart simulation:

https://umd.box.com/s/ld8ihq9bu2lnl4uwd32jabq4aad63to7

If I restart from this simulation, the new simulation will run to completion.

But then a restart from the new simulation will fail with a memory allocation/segfault error in the first few years.

For instance,

./runmuffin.sh cgenie.eb_go_gs_ac_bg_sg_rg_gl_eg.wolr0570t6.BASES PALEO exp27.CBSGRL.wolr0570t6.OMEN_Prdxo.SPIN4s 10 exp27.CBSGRL.wolr0570t6.OMEN_Prdxo.SPIN3c

Then make a copy of the experiment *4s as *4sa and run it with restart from *4s:

./runmuffin.sh cgenie.eb_go_gs_ac_bg_sg_rg_gl_eg.wolr0570t6.BASES PALEO exp27.CBSGRL.wolr0570t6.OMEN_Prdxo.SPIN4sa 10 exp27.CBSGRL.wolr0570t6.OMEN_Prdxo.SPIN4s

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

SPIN3c and SPIN4 are identical experiments run with the same base-config; their user-configs differ only in the spacing and trailing comment of one line:

diff exp27.CBSGRL.wolr0570t6.OMEN_Prdxo.SPIN3c exp27.CBSGRL.wolr0570t6.OMEN_Prdxo.SPIN4
373c373
< sg_par_sed_clay_fLi_alpha = 8.4 #Li clay formation sink (mol yr-1) (Li/Ca normalized)
---
> sg_par_sed_clay_fLi_alpha =8.4 # was: 68.25 in CTRL3-6 # was:73.5 #was: 63 # was:42 # was: 84 # was: 8.4 in SPIN3 #Li clay formation sink (mol yr-1) (Li/Ca normalized)

SPIN4, SPIN4s and SPIN4sa are identical.
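Because the only reported difference between the user-configs is spacing and a trailing comment, a comparison that ignores both makes the "identical" claim easier to check. The following is a minimal sketch, assuming that `#` always starts a comment in these files and that whitespace never matters inside a value. The function names `normalize_config` and `configs_equivalent` are hypothetical.

```shell
# Hypothetical helpers for comparing two user-config files while ignoring
# comments and whitespace.
# Assumption: "#" always begins a comment and never appears inside a value.
normalize_config() {
    # Print the file with "#" comments removed, all whitespace removed,
    # and empty lines dropped, so only name=value content remains.
    sed -e 's/#.*$//' -e 's/[[:space:]]//g' -e '/^$/d' "$1"
}

configs_equivalent() {
    # Exit status 0 when both files hold the same settings after
    # normalization; otherwise print the differences and return non-zero.
    cfg_tmp_a=$(mktemp) && cfg_tmp_b=$(mktemp) || return 2
    normalize_config "$1" > "$cfg_tmp_a"
    normalize_config "$2" > "$cfg_tmp_b"
    diff "$cfg_tmp_a" "$cfg_tmp_b"
    cfg_status=$?
    rm -f "$cfg_tmp_a" "$cfg_tmp_b"
    return "$cfg_status"
}
```

Called as `configs_equivalent exp27.CBSGRL.wolr0570t6.OMEN_Prdxo.SPIN3c exp27.CBSGRL.wolr0570t6.OMEN_Prdxo.SPIN4`, it would print nothing and return 0 if the two files set the same parameters to the same values.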

Since I haven't seen this problem with other such experiments, it looks like there was simply some problem with SPIN3c as a restart, maybe in the sedgem saving, and I will just have to start over.
