UFS-dev PR#189 #486

Merged 21 commits into NCAR:main on Jul 17, 2024

Conversation

@grantfirl (Collaborator) commented Jun 6, 2024:

Contains changes from:

NOAA-EMC/fv3atm#816
NOAA-EMC/fv3atm#831
NOAA-EMC/fv3atm#807

Plus:

  1. Update to the SCM CMakeLists.txt to require MPI (should have been implemented in UFS-dev PR#184 #482); see the sketch after this list
  2. Add cdata%thread_cnt initialization
  3. Combined with Github action version updates: Node.js 16 actions are deprecated #478
  4. Combined with Add dependabot.yml to keep Github actions up to date #480
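
Item 1 makes MPI a hard requirement of the SCM build, so configuring without an MPI library should now fail at CMake time instead of silently producing a non-MPI executable. A minimal sketch of the expected behavior; the scm/bin build directory and the exact find_package call are assumptions, not taken from this PR:

```bash
# Sketch only: configure/build the SCM after this change.
# If no MPI installation is found, cmake is expected to stop with an error
# (e.g. at a find_package(MPI REQUIRED) call) rather than continue serially.
cd scm/bin
cmake ../src
make -j4
```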

@grantfirl (Collaborator, author) commented Jun 6, 2024:

@scrasmussen @mkavulich @dustinswales There are several CI issues outstanding:

  1. The Nvidia builds need changes for MPI (the previous failures caused by the physics bug should be fixed by this PR - see UFS-dev PR#189 ccpp-physics#1075)
  2. My hypothesis for the DEPHY CI test fix didn't pan out. I cannot replicate this failure on other platforms, and there isn't much debugging information to go on. This has also been intermittent with previous PRs. I have no idea.
  3. The Dockerfile needs tweaking for MPI (see the sketch at the end of this comment).

Do we want to try to fix any of these in this PR or do it separately?
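
For item 3, the Dockerfile tweak presumably amounts to installing an MPI stack in the image and pointing the build at the MPI compiler wrappers. A hedged sketch of the kind of commands involved; package names and wrapper choices are assumptions, since the actual Dockerfile change is not shown in this thread:

```bash
# Hypothetical container provisioning for an MPI-enabled SCM build.
apt-get update && apt-get install -y libopenmpi-dev openmpi-bin
# Point the build at the MPI compiler wrappers before running cmake.
export CC=mpicc CXX=mpicxx FC=mpif90
```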

@grantfirl (Collaborator, author) commented:

Also, @ligiabernardet @scrasmussen @dustinswales @mkavulich: UFS recently updated their modulefiles for Hera (ufs-community/ufs-weather-model#2093). In particular, the GNU version went from 9 to 13! I'm guessing that we should follow suit. I can do this in this PR. What do you think?

@dustinswales (Collaborator) commented:

@grantfirl We can move to ubuntu24.04 for the RT test, which has GNU 13.
I can help with debugging the CI tests.

@mkavulich (Collaborator) left a review comment:

Looks good

@dustinswales (Collaborator) commented:

@grantfirl I'm still working through the CI issues. Don't let me hold this PR up. I will follow up with a CI PR once it's all working again.

@grantfirl (Collaborator, author) commented:

> @grantfirl I'm still working through the CI issues. Don't let me hold this PR up. I will follow up with a CI PR once it's all working again.

I'm going to update the Hera modulefiles at least before merging.

@grantfirl (Collaborator, author) commented:

@mkavulich We'll need to upload the single-precision artifact to the FTP server so that the single-precision RTs don't fail the RT CI script.

@grantfirl (Collaborator, author) commented:

@mkavulich Could you double-check my changes to the Hera module files? I tried to make them compatible with ufs-community/ufs-weather-model#2093. Was there a reason that we needed cmake 3.28.1 or to separately load miniconda? It looks like this is already included in spack-stack 1.6.0.

FYI, when using Hera GNU, I'm getting cmake warnings about policy CMP0074 for ccpp-framework and ccpp-physics. We may need to open issues in those repos to resolve this warning.
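
For context, CMP0074 controls whether find_package() honors <PackageName>_ROOT variables; the warning disappears once the affected projects set the policy (or require a new enough CMake minimum). A hedged interim workaround is to default the policy on the configure line; the build directory below is an assumption:

```bash
# Sketch: silence the CMP0074 warning at configure time by defaulting the
# policy to NEW, so <PackageName>_ROOT variables are honored by find_package().
cd scm/bin
cmake ../src -DCMAKE_POLICY_DEFAULT_CMP0074=NEW
```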

@mkavulich (Collaborator) commented:

@grantfirl cmake should be set to version 3.23.1; the newer version is the machine default but we should load 3.23.1 which is the spack-stack version, just for consistency across platforms. I don't think there was a need to also load miniconda, I might have copied that incorrectly from an older module file. If the tests are all working without it then I'd leave it out.

Is there a way to get the artifacts from these failed tests? Right now it's failing with an error because it can't download the data. My thinking is I could create a fake baseline file that's just a copy of the double-precision tests for it to download and compare, which should give us a "failed" test but it should complete without an error, which should get us a real Single-Precision artifact we can upload. Does that sound like a good plan?
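
A minimal sketch of that workaround; the artifact and baseline names are assumptions, and the exact upload destination would follow the RT workflow rather than anything agreed here:

```bash
# Sketch: publish a placeholder single-precision baseline by copying the
# double-precision one, so the SP job can download and compare (and fail the
# comparison) instead of erroring out on a missing file.
cp rt-baselines-Release.zip rt-baselines-SinglePrecision.zip   # artifact names assumed
# 1. Upload the placeholder to the baseline location used by ci_run_scm_rts.yml.
# 2. Let the SP CI job run; it should end with a failed comparison, not an error.
# 3. Take the real SP artifact from that run and replace the placeholder with it.
```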

@grantfirl (Collaborator, author) commented:

> @grantfirl cmake should be set to version 3.23.1; the newer version is the machine default but we should load 3.23.1 which is the spack-stack version, just for consistency across platforms. I don't think there was a need to also load miniconda, I might have copied that incorrectly from an older module file. If the tests are all working without it then I'd leave it out.
>
> Is there a way to get the artifacts from these failed tests? Right now it's failing with an error because it can't download the data. My thinking is I could create a fake baseline file that's just a copy of the double-precision tests for it to download and compare, which should give us a "failed" test but it should complete without an error, which should get us a real Single-Precision artifact we can upload. Does that sound like a good plan?

I'm thinking that your proposal would work. I don't know how else to do it.

@mkavulich (Collaborator) commented:

Okay the "fake" artifacts are in place (baseline and plots). Let me know if you need anything else.

@mkavulich (Collaborator) commented:

@grantfirl Can you make a quick change before this PR is merged? I got an email from Lara Ziady suggesting I move our staged artifacts to a new location on the web server. It's a one-line change, and shouldn't need any additional testing assuming the tests still pass after this change (I already copied the artifacts to the new location):

```diff
diff --git a/.github/workflows/ci_run_scm_rts.yml b/.github/workflows/ci_run_scm_rts.yml
index 7ce607c..319576f 100644
--- a/.github/workflows/ci_run_scm_rts.yml
+++ b/.github/workflows/ci_run_scm_rts.yml
@@ -208,7 +208,7 @@ jobs:
     - name: Download SCM RT baselines
       run: |
         cd ${dir_bl}
-        wget https://dtcenter.ucar.edu/ccpp/users/rt/rt-baselines-${{matrix.build-type}}.zip
+        wget https://dtcenter.ucar.edu/ccpp/rt/rt-baselines-${{matrix.build-type}}.zip
         unzip rt-baselines-${{matrix.build-type}}.zip
```

@dustinswales (Collaborator) commented:

> @grantfirl cmake should be set to version 3.23.1; the newer version is the machine default but we should load 3.23.1 which is the spack-stack version, just for consistency across platforms. I don't think there was a need to also load miniconda, I might have copied that incorrectly from an older module file. If the tests are all working without it then I'd leave it out.
>
> Is there a way to get the artifacts from these failed tests? Right now it's failing with an error because it can't download the data. My thinking is I could create a fake baseline file that's just a copy of the double-precision tests for it to download and compare, which should give us a "failed" test but it should complete without an error, which should get us a real Single-Precision artifact we can upload. Does that sound like a good plan?

@mkavulich
When I first created the baseline artifacts for the CI, I ran the CI script with the comparison step commented out. Then I moved the "uncompared" artifact to the FTP server, re-enabled the comparison step, and reran the CI.
For the SP tests you can do the same thing to create the baselines.
Unfortunately, the SP tests use only a subset of the SDFs, so we can't use the same baselines for both SP and the Release/Debug tests. To get around this, one could subset the existing baseline files to include only the tests used by the SP tests and upload that to the FTP server.
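
A hedged sketch of that subsetting step; the baseline zip layout and the case/suite directory names are assumptions (the real list would come from the SP entries in the RT test matrix):

```bash
# Sketch: build an SP-only baseline zip from the full Release baseline by
# keeping only the cases/suites that the single-precision tests actually run.
unzip rt-baselines-Release.zip -d full_bl
mkdir sp_bl
# Hypothetical subset; replace with the cases/suites used by the SP tests.
for case_suite in arm_sgp_SCM_GFS_v16 twpice_SCM_RRFS_v1; do
  cp -r "full_bl/${case_suite}" sp_bl/
done
(cd sp_bl && zip -r ../rt-baselines-SinglePrecision.zip .)
```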

@grantfirl (Collaborator, author) commented:

> @grantfirl Can you make a quick change before this PR is merged? I got an email from Lara Ziady suggesting I move our staged artifacts to a new location on the web server. It's a one-line change, and shouldn't need any additional testing assuming the tests still pass after this change (I already copied the artifacts to the new location):
>
> ```diff
> diff --git a/.github/workflows/ci_run_scm_rts.yml b/.github/workflows/ci_run_scm_rts.yml
> index 7ce607c..319576f 100644
> --- a/.github/workflows/ci_run_scm_rts.yml
> +++ b/.github/workflows/ci_run_scm_rts.yml
> @@ -208,7 +208,7 @@ jobs:
>      - name: Download SCM RT baselines
>        run: |
>          cd ${dir_bl}
> -        wget https://dtcenter.ucar.edu/ccpp/users/rt/rt-baselines-${{matrix.build-type}}.zip
> +        wget https://dtcenter.ucar.edu/ccpp/rt/rt-baselines-${{matrix.build-type}}.zip
>          unzip rt-baselines-${{matrix.build-type}}.zip
> ```

Done.

@grantfirl (Collaborator, author) commented:

@dustinswales @mkavulich @scrasmussen Unfortunately, there are some runtime failures in the CI RTs for some cases/suites. I cannot replicate the failures locally, so I don't know how to debug them. Any ideas?

@dustinswales (Collaborator) commented:

@grantfirl Looking into the CI failures now.

@dustinswales (Collaborator) commented:

@grantfirl I also cannot replicate this failure on Hera.
Some observations:

@dustinswales (Collaborator) commented:

> @grantfirl I also cannot replicate this failure on Hera. Some observations:

@grantfirl Some improvement going from GNU11 -> GNU12:

  • The only failures (5) are for the same case (ARM-SGP), in DEBUG mode.
  • No errors in RELEASE or single-precision mode.
  • Both supported (GFSv16 and RRFS_v1) and unsupported suites fail this time, with error code 136.

I'm going to try GNU13 using Ubuntu24.04 and see what happens.

@dustinswales (Collaborator) commented:

@grantfirl Same story with GNU13: RELEASE and SP pass, but DEBUG errors out with error code 136. I will look into this more later today.
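
As an aside, shell exit codes above 128 encode a fatal signal (code minus 128), so the 136s here correspond to SIGFPE and the 139s reported further down to SIGSEGV. A quick check, not part of this PR:

```bash
# Exit codes > 128 mean the process was killed by signal (code - 128).
kill -l $((136 - 128))   # prints FPE  (floating-point exception)
kill -l $((139 - 128))   # prints SEGV (segmentation fault)
```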

@dustinswales (Collaborator) commented:

@grantfirl For some reason unknown to me, if you apply this change, all the tests run w/o error.

@grantfirl (Collaborator, author) commented Jul 17, 2024:

> @grantfirl For some reason unknown to me, if you apply this change, all the tests run w/o error.

@dustinswales @mkavulich @scrasmussen This doesn't seem to be the case. With 93732db, I'm still seeing some status 136s and 139s, except that the output is more verbose. My hunch is that an MPI issue in the GitHub workflow is somehow causing this. I think we can debug this more effectively using containers after the release. I don't think this should hold up anything, since we can't replicate the failures on any other machine.

@grantfirl (Collaborator, author) commented:

@dustinswales I removed the extra verbosity flag for the RTs for now, since it was just adding length and making it harder to find failures.

@grantfirl merged commit ab270c2 into NCAR:main on Jul 17, 2024.
46 of 48 checks passed.