UFS-dev PR#189 #486

Merged 21 commits into NCAR:main on Jul 17, 2024

Conversation

@grantfirl (Collaborator) commented Jun 6, 2024:

Contains changes from:

NOAA-EMC/fv3atm#816
NOAA-EMC/fv3atm#831
NOAA-EMC/fv3atm#807

Plus:

  1. Update to the SCM CMakeLists.txt to require MPI (should have been implemented in UFS-dev PR#184 #482); see the sketch after this list
  2. Add cdata%thread_cnt initialization
  3. Combined with Github action version updates: Node.js 16 actions are deprecated #478
  4. Combined with Add dependabot.yml to keep Github actions up to date #480
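
Item 1 makes MPI a hard requirement of the SCM build, so configuring without an MPI library should now fail at CMake time instead of silently producing a non-MPI executable. A minimal sketch of the expected behavior; the scm/bin build directory and the exact find_package call are assumptions, not taken from this PR:

```bash
# Sketch only: configure/build the SCM after this change.
# If no MPI installation is found, cmake is expected to stop with an error
# (e.g. at a find_package(MPI REQUIRED) call) rather than continue serially.
cd scm/bin
cmake ../src
make -j4
```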

@grantfirl (Collaborator, author) commented Jun 6, 2024:

@scrasmussen @mkavulich @dustinswales There are several CI issues outstanding:

  1. The Nvidia builds need changes for MPI (the previous failures caused by the physics bug should be fixed by this PR - see UFS-dev PR#189 ccpp-physics#1075)
  2. My hypothesis for the DEPHY CI test fix didn't pan out. I cannot replicate this failure on other platforms, and there isn't much debugging information to go on. This has also been intermittent with previous PRs. I have no idea.
  3. The Dockerfile needs tweaking for MPI (see the sketch at the end of this comment).

Do we want to try to fix any of these in this PR or do it separately?
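
For item 3, the Dockerfile tweak presumably amounts to installing an MPI stack in the image and pointing the build at the MPI compiler wrappers. A hedged sketch of the kind of commands involved; package names and wrapper choices are assumptions, since the actual Dockerfile change is not shown in this thread:

```bash
# Hypothetical container provisioning for an MPI-enabled SCM build.
apt-get update && apt-get install -y libopenmpi-dev openmpi-bin
# Point the build at the MPI compiler wrappers before running cmake.
export CC=mpicc CXX=mpicxx FC=mpif90
```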

@grantfirl (Collaborator, author) commented:

Also, @ligiabernardet @scrasmussen @dustinswales @mkavulich: UFS recently updated their modulefiles for Hera (ufs-community/ufs-weather-model#2093). In particular, the GNU version went from 9 to 13! I'm guessing that we should follow suit. I can do this in this PR. What do you think?

@dustinswales (Collaborator) commented:

@grantfirl We can move to ubuntu24.04 for the RT test, which has GNU 13.
I can help with debugging the CI tests.

@mkavulich (Collaborator) left a review comment:

Looks good

@dustinswales (Collaborator) commented:

@grantfirl I'm still working through the CI issues. Don't let me hold this PR up. I will follow up with a CI PR once it's all working again.

@grantfirl (Collaborator, author) commented:

> @grantfirl I'm still working through the CI issues. Don't let me hold this PR up. I will follow up with a CI PR once it's all working again.

I'm going to update the Hera modulefiles at least before merging.

@grantfirl (Collaborator, author) commented:

@mkavulich We'll need to upload the single-precision artifact to the FTP server so that the single-precision RTs don't fail the RT CI script.

@grantfirl (Collaborator, author) commented:

@mkavulich Could you double-check my changes to the Hera module files? I tried to make them compatible with ufs-community/ufs-weather-model#2093. Was there a reason that we needed cmake 3.28.1 or to separately load miniconda? It looks like this is already included in spack-stack 1.6.0.

FYI, when using Hera GNU, I'm getting cmake warnings about policy CMP0074 for ccpp-framework and ccpp-physics. We may need to open issues in those repos to resolve this warning.
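
For context, CMP0074 controls whether find_package() honors <PackageName>_ROOT variables; the warning disappears once the affected projects set the policy (or require a new enough CMake minimum). A hedged interim workaround is to default the policy on the configure line; the build directory below is an assumption:

```bash
# Sketch: silence the CMP0074 warning at configure time by defaulting the
# policy to NEW, so <PackageName>_ROOT variables are honored by find_package().
cd scm/bin
cmake ../src -DCMAKE_POLICY_DEFAULT_CMP0074=NEW
```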

@mkavulich (Collaborator) commented:

@grantfirl cmake should be set to version 3.23.1; the newer version is the machine default but we should load 3.23.1 which is the spack-stack version, just for consistency across platforms. I don't think there was a need to also load miniconda, I might have copied that incorrectly from an older module file. If the tests are all working without it then I'd leave it out.

Is there a way to get the artifacts from these failed tests? Right now it's failing with an error because it can't download the data. My thinking is I could create a fake baseline file that's just a copy of the double-precision tests for it to download and compare, which should give us a "failed" test but it should complete without an error, which should get us a real Single-Precision artifact we can upload. Does that sound like a good plan?
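
A minimal sketch of that workaround; the artifact and baseline names are assumptions, and the exact upload destination would follow the RT workflow rather than anything agreed here:

```bash
# Sketch: publish a placeholder single-precision baseline by copying the
# double-precision one, so the SP job can download and compare (and fail the
# comparison) instead of erroring out on a missing file.
cp rt-baselines-Release.zip rt-baselines-SinglePrecision.zip   # artifact names assumed
# 1. Upload the placeholder to the baseline location used by ci_run_scm_rts.yml.
# 2. Let the SP CI job run; it should end with a failed comparison, not an error.
# 3. Take the real SP artifact from that run and replace the placeholder with it.
```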

@grantfirl (Collaborator, author) commented:

> @grantfirl cmake should be set to version 3.23.1; the newer version is the machine default but we should load 3.23.1 which is the spack-stack version, just for consistency across platforms. I don't think there was a need to also load miniconda, I might have copied that incorrectly from an older module file. If the tests are all working without it then I'd leave it out.
>
> Is there a way to get the artifacts from these failed tests? Right now it's failing with an error because it can't download the data. My thinking is I could create a fake baseline file that's just a copy of the double-precision tests for it to download and compare, which should give us a "failed" test but it should complete without an error, which should get us a real Single-Precision artifact we can upload. Does that sound like a good plan?

I'm thinking that your proposal would work. I don't know how else to do it.

@mkavulich (Collaborator) commented:

Okay the "fake" artifacts are in place (baseline and plots). Let me know if you need anything else.

@mkavulich (Collaborator) commented:

@grantfirl Can you make a quick change before this PR is merged? I got an email from Lara Ziady suggesting I move our staged artifacts to a new location on the web server. It's a one-line change, and shouldn't need any additional testing assuming the tests still pass after this change (I already copied the artifacts to the new location):

```diff
diff --git a/.github/workflows/ci_run_scm_rts.yml b/.github/workflows/ci_run_scm_rts.yml
index 7ce607c..319576f 100644
--- a/.github/workflows/ci_run_scm_rts.yml
+++ b/.github/workflows/ci_run_scm_rts.yml
@@ -208,7 +208,7 @@ jobs:
     - name: Download SCM RT baselines
       run: |
         cd ${dir_bl}
-        wget https://dtcenter.ucar.edu/ccpp/users/rt/rt-baselines-${{matrix.build-type}}.zip
+        wget https://dtcenter.ucar.edu/ccpp/rt/rt-baselines-${{matrix.build-type}}.zip
         unzip rt-baselines-${{matrix.build-type}}.zip
```

@dustinswales (Collaborator) commented:

> @grantfirl cmake should be set to version 3.23.1; the newer version is the machine default but we should load 3.23.1 which is the spack-stack version, just for consistency across platforms. I don't think there was a need to also load miniconda, I might have copied that incorrectly from an older module file. If the tests are all working without it then I'd leave it out.
>
> Is there a way to get the artifacts from these failed tests? Right now it's failing with an error because it can't download the data. My thinking is I could create a fake baseline file that's just a copy of the double-precision tests for it to download and compare, which should give us a "failed" test but it should complete without an error, which should get us a real Single-Precision artifact we can upload. Does that sound like a good plan?

@mkavulich
When I first created the baseline artifacts for the CI, I ran the CI script with the comparison step commented out. Then I moved the "uncompared" artifact to the FTP server, re-enabled the comparison step, and reran the CI.
For the SP tests you can do the same thing to create the baselines.
Unfortunately, the SP tests use only a subset of the SDFs, so we can't use the same baselines for both SP and the Release/Debug tests. To get around this, one could subset the existing baseline files to include only the tests used by the SP tests and upload that to the FTP server.
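
A hedged sketch of that subsetting step; the baseline zip layout and the case/suite directory names are assumptions (the real list would come from the SP entries in the RT test matrix):

```bash
# Sketch: build an SP-only baseline zip from the full Release baseline by
# keeping only the cases/suites that the single-precision tests actually run.
unzip rt-baselines-Release.zip -d full_bl
mkdir sp_bl
# Hypothetical subset; replace with the cases/suites used by the SP tests.
for case_suite in arm_sgp_SCM_GFS_v16 twpice_SCM_RRFS_v1; do
  cp -r "full_bl/${case_suite}" sp_bl/
done
(cd sp_bl && zip -r ../rt-baselines-SinglePrecision.zip .)
```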

@grantfirl (Collaborator, author) commented:

> @grantfirl Can you make a quick change before this PR is merged? I got an email from Lara Ziady suggesting I move our staged artifacts to a new location on the web server. It's a one-line change, and shouldn't need any additional testing assuming the tests still pass after this change (I already copied the artifacts to the new location):
>
> ```diff
> diff --git a/.github/workflows/ci_run_scm_rts.yml b/.github/workflows/ci_run_scm_rts.yml
> index 7ce607c..319576f 100644
> --- a/.github/workflows/ci_run_scm_rts.yml
> +++ b/.github/workflows/ci_run_scm_rts.yml
> @@ -208,7 +208,7 @@ jobs:
>      - name: Download SCM RT baselines
>        run: |
>          cd ${dir_bl}
> -        wget https://dtcenter.ucar.edu/ccpp/users/rt/rt-baselines-${{matrix.build-type}}.zip
> +        wget https://dtcenter.ucar.edu/ccpp/rt/rt-baselines-${{matrix.build-type}}.zip
>          unzip rt-baselines-${{matrix.build-type}}.zip
> ```

Done.

@grantfirl (Collaborator, author) commented:

@dustinswales @mkavulich @scrasmussen Unfortunately, there are some runtime failures in the CI RTs for some cases/suites. I cannot replicate the failures locally, so I don't know how to debug them. Any ideas?

@dustinswales (Collaborator) commented:

@grantfirl Looking into the CI failures now.

@dustinswales (Collaborator) commented:

@grantfirl I also cannot replicate this failure on Hera.
Some observations:

@dustinswales (Collaborator) commented:

> @grantfirl I also cannot replicate this failure on Hera. Some observations:

@grantfirl Some improvement going from GNU11 -> GNU12:

  • The only failures (5) are for the same case (ARM-SGP), in DEBUG mode.
  • No errors in RELEASE or single-precision mode.
  • Both supported (GFSv16 and RRFS_v1) and unsupported suites fail this time, with error code 136.

I'm going to try GNU13 using Ubuntu24.04 and see what happens.

@dustinswales (Collaborator) commented:

@grantfirl Same story with GNU13: RELEASE and SP pass, but DEBUG errors out with error code 136. I will look into this more later today.
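
As an aside, shell exit codes above 128 encode a fatal signal (code minus 128), so the 136s here correspond to SIGFPE and the 139s reported further down to SIGSEGV. A quick check, not part of this PR:

```bash
# Exit codes > 128 mean the process was killed by signal (code - 128).
kill -l $((136 - 128))   # prints FPE  (floating-point exception)
kill -l $((139 - 128))   # prints SEGV (segmentation fault)
```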

@dustinswales (Collaborator) commented:

@grantfirl For some reason unknown to me, if you apply this change, all the tests run w/o error.

@grantfirl (Collaborator, author) commented Jul 17, 2024:

> @grantfirl For some reason unknown to me, if you apply this change, all the tests run w/o error.

@dustinswales @mkavulich @scrasmussen This doesn't seem to be the case. With 93732db, I'm still seeing some status 136s and 139s, except that the output is more verbose. My hunch is that an MPI issue in the GitHub workflow is somehow causing this. I think we can debug this more effectively using containers after the release. I don't think this should hold up anything, since we can't replicate the failures on any other machine.

@grantfirl (Collaborator, author) commented:

@dustinswales I removed the extra verbosity flag for the RTs for now, since it was just adding length and making it harder to find failures.

@grantfirl merged commit ab270c2 into NCAR:main on Jul 17, 2024.
46 of 48 checks passed.