Hang for test using nvidia compiler only for certain smaller MPI counts ERS.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst
#6521
This test has still been failing/hanging. Adding a little more detail: these tests seem OK, so it looks like it's a combination of the multi_inst modifier with newer nvidia compilers. These tests pass with nvidia 23.9 as well as 24.5.
And these tests seem to have the same fail/hang issue:
Where the flow might be during the hang:
OK, there might be an issue with how it launches more tasks/jobs: if I force the test to land on one node only, it passes. That is:
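A hedged sketch of how one might pin the case to a single node for this check, assuming a generated CIME case directory (the case path below is hypothetical, not taken from the issue):

```shell
# Illustrative sketch only: force the test onto one full pm-cpu node.
# The case directory path is hypothetical.
cd /path/to/ERS.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst.caseroot
./xmlchange NTASKS=128        # one pm-cpu node instead of 192 tasks (1.5 nodes)
./case.setup --reset && ./case.build && ./case.submit
```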
I created PR #6581 to use 3 full nodes (384 MPIs) instead of the current odd value of 192 MPIs (1.5 nodes). There must have been a reason why I used 192 here -- and indeed a search reminded me of the reason. I want to keep this issue open, as it's still odd that certain MPI counts cause a hang while others don't.
…next (PR #6581): Currently, the tests for this resolution use 192 MPIs on pm-cpu, which is an odd value (1.5 nodes). Here it's being changed to use -3 (i.e. 3 full nodes, or 384 MPIs). Example of a test that would use this layout: SMS.hcru_hcru.IELM. This change is an effective work-around (but not a fix) for #6521, with #6486 in mind as noted below. [bfb]
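The PR's layout change can be sketched as below, assuming CIME's convention that a negative NTASKS value requests whole nodes rather than individual tasks (the case path is hypothetical, not from the PR):

```shell
# Illustrative sketch of the PR #6581 layout change.
# In CIME, a negative NTASKS is interpreted as a node count.
cd /path/to/SMS.hcru_hcru.IELM.caseroot   # hypothetical case directory
./xmlchange NTASKS=-3                     # 3 full pm-cpu nodes = 384 MPI tasks
./case.setup --reset
```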
Merged #6581, so we should not see the issue on cdash.
ERS.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst
It may be that #6687 will also address this issue.
This looks like a new test -- it is failing on pm-cpu with the nvidia compiler. Based on the dates of the log files, it looks like the test is hanging.
Note the current MPI count used by default for this test is 192, which is 1.5 nodes on pm-cpu (128 tasks per node).
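The node arithmetic behind the "odd value" remark can be checked directly; this sketch assumes 128 cores per pm-cpu node, which follows from 192 tasks being 1.5 nodes:

```shell
# 192 tasks / 128 cores-per-node = 1.5 nodes (second node only half filled);
# 384 tasks / 128 = 3.0 nodes (the whole-node layout from PR #6581).
for tasks in 192 384; do
  awk -v t="$tasks" 'BEGIN { printf "%d tasks -> %.1f pm-cpu nodes\n", t, t/128 }'
done
```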