UFS-dev PR#184 #482
Conversation
@mkavulich @dustinswales @scrasmussen @ligiabernardet @lulinxue It seems to work and can run through a case OK, but there are a couple of things we still need to address. The first is the run script: previously one could simply invoke the executable directly, but now I think we'll need to launch it with mpirun (see the sketch below). Secondly, we need to update the CI workflows to load/install MPI for the various compilers if we want CI to continue to work. I briefly looked at this, but figured that one of you could probably do it more efficiently. If you're able to help, please submit a PR into my PR branch (ufs-dev-PR184). This is required to catch the SCM up to the latest physics in ufs/dev, so we need it for the release. |
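A minimal sketch of what this run-script change might look like, assuming the executable is launched from Python via subprocess; the executable path, case arguments, and rank count below are placeholders, not the actual ccpp-scm run-script variables:

```python
# Hedged sketch: prepend an MPI launcher to the SCM executable instead of
# invoking it directly. Names below are placeholders.
import subprocess

scm_executable = "./scm"               # placeholder path to the built SCM executable
case_args = ["-n", "case_config.nml"]  # placeholder case arguments

# Before: the run script could invoke the executable directly
# subprocess.run([scm_executable] + case_args, check=True)

# After: wrap the executable in an MPI launcher (a single rank for the SCM)
mpi_launcher = ["mpirun", "-np", "1"]
subprocess.run(mpi_launcher + [scm_executable] + case_args, check=True)
```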
@grantfirl I'll take a stab at this later today. |
@grantfirl Sorry I missed today's physics meeting; I forgot to RSVP, so my calendar didn't remind me! @dustinswales The build and run are working fine in my existing branch for PR #477 (including framework updates that make MPI mandatory), but I don't have all of the same physics changes as this PR. When I test this PR (and when I attempt to bring these changes into my own branch), I get argument-mismatch compiler errors:
I don't understand why this is happening, though, since it looks like all the necessary changes have been made to convert mpicomm to the "MPI_Comm" type. My build attempt is on Hera here: /scratch1/BMC/hmtb/kavulich/CCPP/workdir/suite_names_update/PR_482/ccpp-scm @grantfirl @dustinswales Does this make sense to either of you? |
@mkavulich Not sure what's happening on your end. I just opened a PR into this branch adding openmpi to the CI tests and replacing "srun -a gmtb" with "mpirun". I still have some tiny CI issues I'm debugging, which should be resolved imminently. |
@grantfirl I figured out the issue: I was compiling with all CCPP suites, and the tests only compile a subset. Compiling with all suites is what triggers the failure. It seems that Ferrier_Aligo is not used in any of our tested CCPP suites (in the SCM or ufs-weather-model), so we didn't catch the failure before now. The question is whether this is worth fixing at this stage. I think it is, but I also don't know what fix is needed... |
Not worth it. Open an issue that the FA MP isn't working with openmpi f08 and we can ping the developer. |
Changes for openmpi
Thanks @dustinswales. I merged in the CI fixes. Note that the run script is now using mpirun. |
@grantfirl Can we add a variable to control the run command? I worry that mpirun will not be supported on all platforms. |
For future reference, for Hera interactive mode, I used: If you don't start an interactive session first, the run will fail without a good error message. I didn't even figure out what had happened until I got an email from rdhpcs saying that using mpirun on a login node is prohibited. |
Sure. What are the options? mpirun, srun, mpiexec? |
We should also play with command-line options for those commands to make sure that we're getting enough written out to stderr/stdout to debug as effectively as before. I think that, by default, I was getting much less useful debugging information using mpirun. |
Also, note from @dustinswales's PR that added the CI fixes into this one: it seems that the MPI overhead is adding about 20% to the computation time on Hera, at least with my test case (ARM_SGP_summer_1997_A with the HR3 suite). |
…ZE; check for errors for MPI_INIT/FINALIZE
I was just thinking of a free-form string that can be specified as an argument to the run script, with mpirun (or whatever is most appropriate) as the default. |
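As a hedged sketch of that idea (the --run_cmd flag name, defaults, and executable path are assumptions, not the actual ccpp-scm interface), the option could be a plain string that is split and prepended to the executable:

```python
# Hedged sketch of a free-form, user-overridable launch command with mpirun
# as the default. Flag name and defaults are illustrative only.
import argparse
import shlex
import subprocess

parser = argparse.ArgumentParser()
parser.add_argument("--run_cmd", default="mpirun -np 1",
                    help="free-form launcher prefix, e.g. 'mpirun -np 1', "
                         "'srun -n 1', 'mpiexec -n 1', or '' for no launcher")
args = parser.parse_args()

executable = "./scm"  # placeholder path to the SCM executable
cmd = shlex.split(args.run_cmd) + [executable]
subprocess.run(cmd, check=True)
```

An empty string would then cover platforms where direct execution still works.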
@dustinswales The DEPHY CI test was fixed by adding an MPI_FINALIZE call. I knew that I had forgotten that. |
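For illustration only, a Python/mpi4py analogue of the MPI_INIT/MPI_FINALIZE pairing being discussed (the SCM itself is Fortran, so this is not the actual fix; in mpi4py, MPI errors surface as exceptions):

```python
# Illustrative analogue of pairing MPI_INIT with MPI_FINALIZE; not the code
# changed in this PR.
import mpi4py
mpi4py.rc.initialize = False  # take manual control of init/finalize
mpi4py.rc.finalize = False
from mpi4py import MPI

def main():
    MPI.Init()
    try:
        comm = MPI.COMM_WORLD
        if comm.Get_rank() == 0:
            print("running on", comm.Get_size(), "rank(s)")
        # ... model work would go here ...
    finally:
        # Every successful init needs a matching finalize, or runs
        # (and CI tests) can fail or hang at exit.
        MPI.Finalize()

if __name__ == "__main__":
    main()
```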
Is this what you had in mind? 81296f2 |
Note, the Hera docs say that |
It seems to work OK without these set, and I don't know what they do. The default is 'off', not 'disable'. Also don't know if those are equivalent. |
That looks great, exactly what I was thinking! For the record, I was able to run most cases on Hera from the login node without having to submit a batch job. A few did fail though, I don't know if that's because of the login node or something else. I'm currently investigating. |
Just to update on this: it looks like there isn't a problem with mpirun on login nodes per se; the jobs that were failing were killed at the system level by Hera because they were using too much memory or other resources on the login nodes. |
@mkavulich @dustinswales This is ready for review, then, I suppose. |
@mkavulich @dustinswales @scrasmussen OK, this is good to know. Even if we only occasionally run into these failures (I guess I happened to pick a particularly memory-hungry combo in my initial test), we should probably address it. I think our options are to 1) recommend starting an interactive compute-node session on HPCs before using the run script, 2) recommend and better support batch scripts for running the SCM on HPCs, or 3) have the run script do 1 or 2 automatically (see the sketch below). Option 3 seems the most user-friendly but will likely require the most ongoing maintenance. My preference is probably option 1. We should be encouraging folks to use compute nodes when using the SCM for real science anyway; you can maybe "get away with" using login nodes for some things, but they're really not supposed to be used for running models, no matter how simple. I'm certainly open to suggestions here. |
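Purely as a sketch of what option 3 could look like (the hostname heuristic, time limit, and sbatch flags are assumptions, not an agreed-on design):

```python
# Hedged sketch of option 3: detect a login node and resubmit the run script
# through the batch system. The hostname pattern and sbatch options are
# placeholders and would differ per platform.
import os
import socket
import subprocess
import sys

def on_login_node() -> bool:
    # Placeholder heuristic; real detection would be platform-specific.
    return "login" in socket.gethostname()

if on_login_node() and os.environ.get("SLURM_JOB_ID") is None:
    # Re-run this same script on a compute node via the scheduler.
    wrap_cmd = " ".join(["python"] + sys.argv)
    subprocess.run(["sbatch", "--ntasks=1", "--time=00:30:00",
                    f"--wrap={wrap_cmd}"], check=True)
    sys.exit(0)

# ... otherwise continue with the normal mpirun-based launch ...
```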
The changes look good to me, but I am curious as to why the rt CI is failing. It looks like it's failing in the plotting script? |
Oh, I didn't even notice. Ya, let's take a look. |
@mkavulich @dustinswales Looks like 4e1ad07 fixed the plotting. I'm guessing the previous merged PR (#483 ) didn't catch this because there were no baseline differences. This PR does change baselines and actually triggered plot generation. |
Looks good, I'll pull that into my branch as well. But now the Nvidia tests are failing at the CMake step due to not finding MPI; do they need openmpi as well, or is there something Nvidia-specific here? |
I think that the Nvidia workflow script needs to add an MPI installation step too. I think we can deal with this in a follow-up PR, perhaps when the GF openACC bugfix comes in (because that also causes the same tests to fail). |
@grantfirl @mkavulich The Nvidia compilers come with their own MPI, so we don't need to install anything else, but it will need some reconfiguring (in a follow-up PR). |
@dustinswales @mkavulich I'll merge this and NCAR/ccpp-physics#1073 when approved, but I'll hold off on NCAR/fv3atm#128 and NCAR/ufs-weather-model#132 as discussed due to the regression testing updates needed. |
Looks good to me, assuming nvidia tests are fixed in subsequent PR
@dustinswales @mkavulich The DEPHY CI test failed after updating the ccpp/physics submodule pointer, but succeeded upon re-running. I don't know what that is about, but I'm going to merge and see if that happens again in subsequent PRs. |
@grantfirl I had noticed the same thing on my PR after I pulled in your latest changes; I didn't think anything of it since it wasn't showing up on your branch. I didn't realize you had seen that happen too. |
I have a hypothesis that I'm going to test. |
Analogous to NCAR/fv3atm#128