
Upgrade libraries to spack-stack/1.5.0 to match the UFS versions #866

Merged

Conversation

DavidHuber-NOAA
Collaborator

@DavidHuber-NOAA DavidHuber-NOAA commented Oct 31, 2023

DESCRIPTION OF CHANGES:

This upgrades the libraries to those currently used by the UFS as managed by spack-stack version 1.5.0 on Orion and Hera. In particular, this upgrades netCDF-c to 4.9.2, netCDF-Fortran to 4.6.0, HDF5 to 1.14.0, ip to 4.3.0, w3emc to 2.10.0, and ESMF to 8.4.2. The CI library versions were also updated to match these versions.
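For context, loading the upgraded stack interactively might look like the environment fragment below. This is a non-runnable, site-specific sketch: the `module use` path is a placeholder, and the module names follow spack-stack spelling conventions but are assumptions, not the literal contents of this PR's modulefiles.

```shell
# Hypothetical environment setup against spack-stack/1.5.0.
# The MODULEPATH below is a placeholder; module names are assumed
# spack-stack spellings of the versions listed in this PR.
module use /path/to/spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core
module load netcdf-c/4.9.2 netcdf-fortran/4.6.0 hdf5/1.14.0
module load ip/4.3.0 w3emc/2.10.0 esmf/8.4.2
```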

Lastly, during testing it was determined that two tests needed additional resources: chgres_cube with sigio used close to 60 GB on Orion (memory request increased from 50 GB to 75 GB) and took just under 15 minutes (time request increased from 15 to 20 minutes), and cpld_gridgen at 025 took a little longer than 15 minutes on Orion (time request increased from 15 to 20 minutes).
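The resource bumps described above would look something like the following in a Slurm batch header. This is a sketch only: the job name, executable, and directive layout are placeholders, not the repository's actual regression-test driver scripts.

```shell
#!/bin/bash
# Sketch of the increased requests for chgres_cube with sigio on Orion;
# names below are placeholders, not the repo's actual scripts.
#SBATCH --job-name=chgres_sigio
#SBATCH --mem=75G          # was 50G; the test used close to 60 GB
#SBATCH --time=00:20:00    # was 00:15:00; the test ran just under 15 min
srun ./chgres_cube.exe     # placeholder executable name
```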

TESTS CONDUCTED:

If there are changes to the build or source code, the tests below must be conducted. Contact a repository manager if you need assistance.

  • Compile branch on all Tier 1 machines using Intel (Orion, Jet, Hera and WCOSS2).
  • Compile branch on Hera using GNU.
  • Compile branch in 'Debug' mode on WCOSS2.
  • Run unit tests locally on any Tier 1 machine.
  • Run relevant consistency tests locally on all Tier 1 machines.

Describe any additional tests performed.

DEPENDENCIES:

None

DOCUMENTATION:

N/A

ISSUE:

#859

@DavidHuber-NOAA
Collaborator Author

I do not understand this CI error from the GCC ctests. The same tests passed successfully on commit, but are failing only in this PR and it looks like it is running the same workflows. Is there a difference between the tests run in the PR and those run on commit?

@GeorgeGayno-NOAA
Collaborator

> I do not understand this CI error from the GCC ctests. The same tests passed successfully on commit, but are failing only in this PR and it looks like it is running the same workflows. Is there a difference between the tests run in the PR and those run on commit?

I see a lot of 'illegal' instructions. I can't explain it, but I have seen that error before. Rerunning the workflow usually fixes it.

@DavidHuber-NOAA
Collaborator Author

@GeorgeGayno-NOAA Thanks! I will likely push an update for Jet today, so I will let it run then.

Note that the chgres RTs violate the debug QOS policy, so I changed the
QOS to batch.
@DavidHuber-NOAA
Collaborator Author

Ran regression tests on all tier-1 platforms -- all passed with the only needed changes on Orion (time limits, memory) and Jet (different QOS). The GCC CI ctests continue to fail, but I think that is because the environment needs to be built from scratch again, but the setup job does not clean the environment. @GeorgeGayno-NOAA Any suggestions on how to clean the environment?

@GeorgeGayno-NOAA
Collaborator

> Ran regression tests on all tier-1 platforms -- all passed with the only needed changes on Orion (time limits, memory) and Jet (different QOS). The GCC CI ctests continue to fail, but I think that is because the environment needs to be built from scratch again, but the setup job does not clean the environment. @GeorgeGayno-NOAA Any suggestions on how to clean the environment?

Rerunning that particular job usually fixes the problem.

@aerorahul - Any idea why the GCC test fails occasionally? Example:
https://github.com/ufs-community/UFS_UTILS/actions/runs/6722919365/job/18271879484?pr=866

@GeorgeGayno-NOAA
Collaborator

@DavidHuber-NOAA The head of 'develop' no longer compiles on Orion. I am guessing I need to merge your PR next. When were the libraries used by UFS_UTILS removed?

@aerorahul
Contributor

>> Ran regression tests on all tier-1 platforms -- all passed with the only needed changes on Orion (time limits, memory) and Jet (different QOS). The GCC CI ctests continue to fail, but I think that is because the environment needs to be built from scratch again, but the setup job does not clean the environment. @GeorgeGayno-NOAA Any suggestions on how to clean the environment?
>
> Rerunning that particular job usually fixes the problem.
>
> @aerorahul - Any idea why the GCC test fails occasionally? Example: https://github.com/ufs-community/UFS_UTILS/actions/runs/6722919365/job/18271879484?pr=866

Not really. It is possible the runner is glitchy. I am beginning to dislike the GitHub runners for executing our codes. Using containers and running our codes in them on these runners is a safer bet, IMO.
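A containerized job along these lines might be sketched as follows. The image name, workflow structure, and build/test commands are assumptions for illustration, not an existing workflow in this repository:

```yaml
# Hypothetical GitHub Actions fragment: run the build inside a pinned
# container image so the toolchain does not depend on the runner's
# host environment.
jobs:
  build-gnu:
    runs-on: ubuntu-latest
    container:
      image: ghcr.io/example/spack-stack-gnu:1.5.0   # placeholder image
    steps:
      - uses: actions/checkout@v4
      - name: Build and test
        run: |
          ./build_all.sh                       # assumed build entry point
          ctest --test-dir build --output-on-failure
```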

@DavidHuber-NOAA
Collaborator Author

@GeorgeGayno-NOAA IIRC, EPIC does not actually own /work/noaa/epic-ps, and the libraries EPIC installed there were placed there by accident. I'm not sure exactly when they were deleted, but the 'correct' hpc-stack installation is here: /work/noaa/epic/role-epic/contrib/orion/hpc-stack/intel-2022.1.2/modulefiles/stack.

@DavidHuber-NOAA
Collaborator Author

That said, I do not see netcdf/4.9.2 under that path, so it likely won't work for UFS_utils.

@GeorgeGayno-NOAA
Collaborator

> That said, I do not see netcdf/4.9.2 under that path, so it likely won't work for UFS_utils.

Ok. Is your branch ready to merge?

@GeorgeGayno-NOAA GeorgeGayno-NOAA self-requested a review November 6, 2023 21:11
@DavidHuber-NOAA
Collaborator Author

@GeorgeGayno-NOAA Yes, I just verified there are no new commits in develop that need to be merged/tested. It is ready to merge.

@GeorgeGayno-NOAA
Collaborator

> @GeorgeGayno-NOAA Yes, I just verified there are no new commits in develop that need to be merged/tested. It is ready to merge.

Ok. To be safe I recommend you merge yesterday's updates from 'develop' to your branch.

@DavidHuber-NOAA
Collaborator Author

@GeorgeGayno-NOAA Thanks for the heads up, I missed that. I will rerun RTs on Jet today just to verify everything is still OK.

@DavidHuber-NOAA
Collaborator Author

@GeorgeGayno-NOAA All regression tests passed on Jet after merging in develop. I believe this PR is now ready to merge.

@GeorgeGayno-NOAA
Collaborator

> @GeorgeGayno-NOAA All regression tests passed on Jet after merging in develop. I believe this PR is now ready to merge.

Several regression tests failed on Orion. I suspect the failures do not indicate a problem with your updates. I will confirm tomorrow. @DeniseWorthen - can you check the failed cpld_gridgen tests. My branch is here: /work/noaa/stmp/ggayno/huber/UFS_UTILS/reg_tests/cpld_gridgen

@DeniseWorthen
Contributor

DeniseWorthen commented Nov 7, 2023

@GeorgeGayno-NOAA Are other regression tests on Orion passing, but the cpld_gridgen tests are not? Because I see no reason why these tests in particular would fail comparison. It appears that anything using ESMF weight generation is B4B different. I know when we updated the UWM to SS 1.5.0 we had baseline changes. Do the cpld_gridgen baselines pass on Hera but not on Orion?

@DeniseWorthen
Contributor

DeniseWorthen commented Nov 7, 2023

EDIT: Disregard the below comment about location, I was looking at the hera location.

Regarding non-B4B in UWM w/ the move to SS, I think those changes were all put down to the change in the MAPL library. I'm also remembering that there were issues w/ ESMF meshes (which this code uses) when they first began testing SS for UWM. Are we sure there are no build options required in UFS-UTILS that we're missing?

Another question---UWM uses this location as module file on orion:

prepend_path("MODULEPATH", "/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core")

This doesn't seem to be the same location as in this PR.

@DavidHuber-NOAA
Collaborator Author

@GeorgeGayno-NOAA @DeniseWorthen The regression tests failed for me on Orion yesterday as well -- I should have mentioned that. The develop branch fails to build on Orion due to missing libraries (i.e. the recently deleted tree that was in /work/noaa/epic-ps), but the rt.sh script still attempts to launch the tests on Orion, and they fail for lack of a comparison run. This is why I opted to run the tests on Jet yesterday.

@GeorgeGayno-NOAA
Collaborator

> @GeorgeGayno-NOAA Are other regression tests on Orion passing, but the cpld_gridgen tests are not? Because I see no reason why these tests in particular would fail comparison. It appears that anything using ESMF weight generation is B4B different. I know when we updated the UWM to SS 1.5.0 we had baseline changes. Do the cpld_gridgen baselines pass on Hera but not on Orion?

Other regression tests are failing on Orion, not just cpld_gridgen. A quick look at one of the chgres tests showed very small (i.e., floating-point) differences with the baseline. So I don't think there is any real problem with these updates, but we should check.

@DeniseWorthen
Contributor

I also saw roundoff-level differences in my spot check. I'm curious... the weights files that we just moved into the fix directory are now slightly different. Isn't the idea that these are 'fix' files in conflict with potentially needing new ones each time spack-stack updates?

@GeorgeGayno-NOAA
Collaborator

> I also saw roundoff-level differences in my spot check. I'm curious... the weights files that we just moved into the fix directory are now slightly different. Isn't the idea that these are 'fix' files in conflict with potentially needing new ones each time spack-stack updates?

Agree with your statement. Things change so rapidly at EMC now that the concept of a 'fix' directory can have little meaning.

If the differences are roundoff level, then I would not bother updating the 'fix' directory.

@DavidHuber-NOAA
Collaborator Author

I reran regression tests on Hera after realizing that I was running develop only. All regression tests failed, though consistency differences all appear to be at the roundoff level. The regression tests were performed in /scratch1/NCEPDEV/global/David.Huber/para/stmp/ufs-utils/UFS_UTILS.

Comparing results against @GeorgeGayno-NOAA's tests revealed that only one test differs significantly between Orion and Hera: grid_gen/c96.uniform. Consistency differences on Hera are no larger than 10^-7. On Orion, the grid_gen c96.uniform test is returning very large differences for the generated fix files. All other grid_gen tests seem to be reasonable for both platforms.
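A quick way to classify a reported maximum absolute difference as roundoff or significant, using the 10^-7 bound quoted above, is a small helper like this. The threshold and sample values are illustrative; in practice the difference would be parsed from nccmp output:

```shell
#!/bin/sh
# Classify a max absolute difference (e.g. taken from nccmp -dS output)
# against a roundoff threshold of 1e-7.
within_roundoff() {
  awk -v d="$1" 'BEGIN { exit (d+0 <= 1e-7) ? 0 : 1 }'
}

within_roundoff 3.2e-8 && echo "roundoff"     # below the 1e-7 bound
within_roundoff 0.5    || echo "significant"  # above the 1e-7 bound
```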

@GeorgeGayno-NOAA
Collaborator

> I reran regression tests on Hera after realizing that I was running develop only. All regression tests failed, though consistency differences all appear to be at the roundoff level. The regression tests were performed in /scratch1/NCEPDEV/global/David.Huber/para/stmp/ufs-utils/UFS_UTILS.
>
> Comparing results against @GeorgeGayno-NOAA's tests revealed that only one test differs significantly between Orion and Hera: grid_gen/c96.uniform. Consistency differences on Hera are no larger than 10^-7. On Orion, the grid_gen c96.uniform test is returning very large differences for the generated fix files. All other grid_gen tests seem to be reasonable for both platforms.

The differences on Orion can be explained. The previous merge changed results for the c96.uniform test. However, before I could update the baseline on Orion, the libraries changed so that 'develop' would not compile.

@DavidHuber-NOAA
Collaborator Author

@GeorgeGayno-NOAA OK, that's good to know. Would you like me to rerun regression tests on Jet as well?

@GeorgeGayno-NOAA
Collaborator

> @GeorgeGayno-NOAA OK, that's good to know. Would you like me to rerun regression tests on Jet as well?

I have already run them on Jet: /lfs4/HFIP/emcda/George.Gayno/stmp/huber/UFS_UTILS. I have not checked results yet.

@GeorgeGayno-NOAA
Collaborator

>> @GeorgeGayno-NOAA OK, that's good to know. Would you like me to rerun regression tests on Jet as well?
>
> I have already run them on Jet: /lfs4/HFIP/emcda/George.Gayno/stmp/huber/UFS_UTILS. I have not checked results yet.

Fewer tests failed on Jet -- just the snow2mdl tests and the cpld_gridgen tests. According to the log file, a few cpld_gridgen files differed from the baseline. However, when I manually checked them using nccmp -dmfqS, there were no differences. I checked the snow2mdl file differences using GrADS, and the differences were minor.

I don't see any problems on Jet.

@DavidHuber-NOAA
Collaborator Author

@GeorgeGayno-NOAA Is there anything else that needs to happen for this PR?

@GeorgeGayno-NOAA GeorgeGayno-NOAA merged commit 892b693 into ufs-community:develop Nov 14, 2023
4 checks passed
@DavidHuber-NOAA DavidHuber-NOAA deleted the feature/spack-stack_gw branch December 19, 2023 13:58