UFS-dev PR#184 #482
Conversation
@mkavulich @dustinswales @scrasmussen @ligiabernardet @lulinxue It seems to work and can run through a case OK, but there are a couple of things we still need to address. The first is the run script: previously one could simply invoke the executable directly, but now I think we'll need to launch it with mpirun (see the sketch below). Secondly, we need to update the CI workflows to load/install MPI for the various compilers if we want CI to continue to work. I briefly looked at this, but figured that one of you could probably do it more efficiently. If you're able to help, please submit a PR into my PR branch (ufs-dev-PR184). This is required to catch the SCM up to the latest physics in ufs/dev, so we need it for the release. |
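A minimal sketch of what this run-script change might look like, assuming the executable is launched from Python via subprocess; the executable path, case arguments, and rank count below are placeholders, not the actual ccpp-scm run-script variables:

```python
# Hedged sketch: prepend an MPI launcher to the SCM executable instead of
# invoking it directly. Names below are placeholders.
import subprocess

scm_executable = "./scm"               # placeholder path to the built SCM executable
case_args = ["-n", "case_config.nml"]  # placeholder case arguments

# Before: the run script could invoke the executable directly
# subprocess.run([scm_executable] + case_args, check=True)

# After: wrap the executable in an MPI launcher (a single rank for the SCM)
mpi_launcher = ["mpirun", "-np", "1"]
subprocess.run(mpi_launcher + [scm_executable] + case_args, check=True)
```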
@grantfirl I'll take a stab at this later today. |
@grantfirl Sorry I missed today's physics meeting; I forgot to RSVP, so my calendar didn't remind me! @dustinswales The build and run are working fine in my existing branch for PR #477 (including framework updates that make MPI mandatory), but I don't have all of the same physics changes as this PR. When I test this PR (and when I attempt to bring these changes into my own branch), I get argument-mismatch compiler errors:
I don't understand why this is happening, though, since it looks like all the necessary changes have been made to convert mpicomm to the "MPI_Comm" type. My build attempt is on Hera here: /scratch1/BMC/hmtb/kavulich/CCPP/workdir/suite_names_update/PR_482/ccpp-scm @grantfirl @dustinswales Does this make sense to either of you? |
@mkavulich Not sure what's happening on your end. I just opened a PR into this branch adding openmpi to the CI tests and replacing "srun -a gmtb" with "mpirun". I still have some tiny CI issues I'm debugging, which should be resolved imminently. |
@grantfirl I figured out the issue: I was compiling with all CCPP suites, and the tests only compile a subset. Compiling with all suites is what triggers the failure. It seems that Ferrier_Aligo is not used in any of our tested CCPP suites (in the SCM or ufs-weather-model), so we didn't catch the failure before now. The question is whether this is worth fixing at this stage. I think it is, but I also don't know what fix is needed... |
Not worth it. Open an issue that the FA MP isn't working with openmpi f08 and we can ping the developer. |
Changes for openmpi
Thanks @dustinswales. I merged in the CI fixes. Note that the run script is now using mpirun. |
@grantfirl Can we add a variable to control the run command? I worry that mpirun will not be supported on all platforms. |
For future reference, for Hera interactive mode, I used: If you don't start an interactive session first, the run will fail without a good error message. I didn't even figure out what had happened until I got an email from rdhpcs saying that using mpirun on a login node is prohibited. |
Sure. What are the options? mpirun, srun, mpiexec? |
We should also play with command-line options for those commands to make sure that we're getting enough written out to stderr/stdout to debug as effectively as before. I think that, by default, I was getting much less useful debugging information using mpirun. |
Also, note from @dustinswales's PR that added the CI fixes into this one: it seems that the MPI overhead is adding about 20% to the computation time on Hera, at least with my test case (ARM_SGP_summer_1997_A with the HR3 suite). |
…ZE; check for errors for MPI_INIT/FINALIZE
I was just thinking of a free-form string that can be specified as an argument to the run script, with mpirun (or whatever is most appropriate) as the default. |
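As a hedged sketch of that idea (the --run_cmd flag name, defaults, and executable path are assumptions, not the actual ccpp-scm interface), the option could be a plain string that is split and prepended to the executable:

```python
# Hedged sketch of a free-form, user-overridable launch command with mpirun
# as the default. Flag name and defaults are illustrative only.
import argparse
import shlex
import subprocess

parser = argparse.ArgumentParser()
parser.add_argument("--run_cmd", default="mpirun -np 1",
                    help="free-form launcher prefix, e.g. 'mpirun -np 1', "
                         "'srun -n 1', 'mpiexec -n 1', or '' for no launcher")
args = parser.parse_args()

executable = "./scm"  # placeholder path to the SCM executable
cmd = shlex.split(args.run_cmd) + [executable]
subprocess.run(cmd, check=True)
```

An empty string would then cover platforms where direct execution still works.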
@dustinswales The DEPHY CI test was fixed by adding an MPI_FINALIZE call. I knew that I had forgotten that. |
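For illustration only, a Python/mpi4py analogue of the MPI_INIT/MPI_FINALIZE pairing being discussed (the SCM itself is Fortran, so this is not the actual fix; in mpi4py, MPI errors surface as exceptions):

```python
# Illustrative analogue of pairing MPI_INIT with MPI_FINALIZE; not the code
# changed in this PR.
import mpi4py
mpi4py.rc.initialize = False  # take manual control of init/finalize
mpi4py.rc.finalize = False
from mpi4py import MPI

def main():
    MPI.Init()
    try:
        comm = MPI.COMM_WORLD
        if comm.Get_rank() == 0:
            print("running on", comm.Get_size(), "rank(s)")
        # ... model work would go here ...
    finally:
        # Every successful init needs a matching finalize, or runs
        # (and CI tests) can fail or hang at exit.
        MPI.Finalize()

if __name__ == "__main__":
    main()
```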
Is this what you had in mind? 81296f2 |
Note, the Hera docs say that |
It seems to work OK without these set, and I don't know what they do. The default is 'off', not 'disable'. Also don't know if those are equivalent. |
That looks great, exactly what I was thinking! For the record, I was able to run most cases on Hera from the login node without having to submit a batch job. A few did fail though, I don't know if that's because of the login node or something else. I'm currently investigating. |
Just to update on this: it looks like there isn't a problem with mpirun on login nodes per se; the jobs that were failing were killed at the system level by Hera because they were using too much memory or other resources on the login nodes. |
@mkavulich @dustinswales This is ready for review, then, I suppose. |
@mkavulich @dustinswales @scrasmussen OK, this is good to know. Even if we only occasionally run into these failures (I guess I happened to pick a particularly memory-hungry combo in my initial test), we should probably address it. I think our options are to 1) recommend starting an interactive compute-node session on HPCs before using the run script, 2) recommend and better support batch scripts for running the SCM on HPCs, or 3) have the run script do 1 or 2 automatically (see the sketch below). Option 3 seems the most user-friendly but will likely require the most ongoing maintenance. My preference is probably option 1. We should be encouraging folks to use compute nodes when using the SCM for real science anyway; you can maybe "get away with" using login nodes for some things, but they're really not supposed to be used for running models, no matter how simple. I'm certainly open to suggestions here. |
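Purely as a sketch of what option 3 could look like (the hostname heuristic, time limit, and sbatch flags are assumptions, not an agreed-on design):

```python
# Hedged sketch of option 3: detect a login node and resubmit the run script
# through the batch system. The hostname pattern and sbatch options are
# placeholders and would differ per platform.
import os
import socket
import subprocess
import sys

def on_login_node() -> bool:
    # Placeholder heuristic; real detection would be platform-specific.
    return "login" in socket.gethostname()

if on_login_node() and os.environ.get("SLURM_JOB_ID") is None:
    # Re-run this same script on a compute node via the scheduler.
    wrap_cmd = " ".join(["python"] + sys.argv)
    subprocess.run(["sbatch", "--ntasks=1", "--time=00:30:00",
                    f"--wrap={wrap_cmd}"], check=True)
    sys.exit(0)

# ... otherwise continue with the normal mpirun-based launch ...
```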
The changes look good to me, but I am curious as to why the rt CI is failing. It looks like it's failing in the plotting script? |
Oh, I didn't even notice. Ya, let's take a look. |
@mkavulich @dustinswales Looks like 4e1ad07 fixed the plotting. I'm guessing the previous merged PR (#483 ) didn't catch this because there were no baseline differences. This PR does change baselines and actually triggered plot generation. |
Looks good, I'll pull that into my branch as well. But now the Nvidia tests are failing at the CMake step due to not finding MPI; do they need openmpi as well, or is there something Nvidia-specific here? |
I think that the Nvidia workflow script needs to add an MPI installation step too. I think we can deal with this in a follow-up PR, perhaps when the GF openACC bugfix comes in (because that also causes the same tests to fail). |
@grantfirl @mkavulich The Nvidia compilers come with their own MPI, so we don't need to install anything else, but it will need some reconfiguring (in a follow-up PR). |
@dustinswales @mkavulich I'll merge this and NCAR/ccpp-physics#1073 when approved, but I'll hold off on NCAR/fv3atm#128 and NCAR/ufs-weather-model#132 as discussed due to the regression testing updates needed. |
Looks good to me, assuming nvidia tests are fixed in subsequent PR
@dustinswales @mkavulich The DEPHY CI test failed after updating the ccpp/physics submodule pointer, but succeeded upon re-running. I don't know what that is about, but I'm going to merge and see if that happens again in subsequent PRs. |
@grantfirl I had noticed the same thing on my PR after I pulled in your latest changes; I didn't think anything of it since it wasn't showing up on your branch. I didn't realize you had seen that happen too. |
I have a hypothesis that I'm going to test. |
Analogous to NCAR/fv3atm#128