Intel18 + openmp + MOM6 global_ALE_z crashes on theta, lscsky50, theia #1

Open
nikizadehgfdl opened this issue Apr 20, 2018 · 3 comments

Comments

@nikizadehgfdl
Owner

nikizadehgfdl commented Apr 20, 2018

An Intel18 + openmp executable (single-thread test) either crashes or hangs for MOM6 test cases on all three machines: theta (KNL), lscsky50 (skylake) and theia.

Here's the crash output for global_ALE_z test case on theta and lscsky50:

 EKEmin=  1.000000000000000E+016 ResMin=   236869.453598697
 src=   1332071.81173317      ldamping=  8.991153093102879E-082
 gamma-b=  0.832273068599009      gamma-t=  0.901219478800562
 drag_visc=  2.083867476924661E-004 Ubg2=  0.000000000000000E+000
Something has gone very wrong
[NID 02598] 2018-04-20 14:33:31 Apid 4349292: initiated application termination

or for another test (benchmark):

_pmiu_daemon(SIGCHLD): [NID 00471] [c2-0c1s5n3] [Fri Apr 20 16:31:03 2018] PE RANK 19 exit signal Bus error
[NID 00471] 2018-04-20 16:31:03 Apid 4349434: initiated application termination
[NID 00471] 2018-04-20 16:31:04 Apid 4349434: Error detected during page fault processing.  Process terminated via bus error.

on KNL box:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 182461 RUNNING AT lscsky50-d
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================

No such issues for Intel17.

No such issue for non-openmp exec with Intel18.

@byrdman1982

@nikizadehgfdl lscsky50 is a skylake machine, not KNL. Theta is KNL.

Is this compiled with debug (-O0 -g -traceback) options? Do you know where the crash is occurring? You could run it through ddt on theta.

@nikizadehgfdl
Owner Author

nikizadehgfdl commented Apr 20, 2018

@byrdman1982 thanks for pointing that out.
I get the same bad behavior with both repro-openmp (-O2) and prod-openmp (-O3) on the three intel18 platforms.
With debug-openmp (-O0) the model hangs on theta and theia, but it runs fine on the skylake box!

@byrdman1982

@nikizadehgfdl have you tried watching the memory? If it's related to openmp, there could be a data race. We also saw some memory leaks with openmp when using pointers.
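
One rough way to watch the memory of a running executable on a Linux node is to poll its resident set size from /proc. The sketch below is only an illustration and assumes a Linux /proc filesystem and a known process ID; the helper names (parse_rss_kib, sample_rss, grows_steadily) are hypothetical and not part of MOM6 or any Intel tool.

```python
import re
import time
from pathlib import Path


def parse_rss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def sample_rss(pid, samples=5, interval_s=1.0):
    """Collect resident-set-size readings (KiB) for a running process."""
    readings = []
    for _ in range(samples):
        try:
            text = Path(f"/proc/{pid}/status").read_text()
        except OSError:
            break  # process exited or /proc is unavailable (assumption: Linux)
        rss = parse_rss_kib(text)
        if rss is not None:
            readings.append(rss)
        time.sleep(interval_s)
    return readings


def grows_steadily(readings):
    """True if every reading is larger than the one before it."""
    return len(readings) > 1 and all(b > a for a, b in zip(readings, readings[1:]))
```

Calling sample_rss(pid) with the PID of one model rank and passing the result to grows_steadily gives a crude hint of a leak; a steadily growing footprint would be worth following up with a proper memory debugger, while a flat one points elsewhere.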
