
Upgrade libraries to spack-stack/1.5.0 to match the UFS versions #866

Merged

Conversation

DavidHuber-NOAA
Collaborator

@DavidHuber-NOAA DavidHuber-NOAA commented Oct 31, 2023

DESCRIPTION OF CHANGES:

This upgrades the libraries to those currently used by the UFS as managed by spack-stack version 1.5.0 on Orion and Hera. In particular, this upgrades netCDF-c to 4.9.2, netCDF-Fortran to 4.6.0, HDF5 to 1.14.0, ip to 4.3.0, w3emc to 2.10.0, and ESMF to 8.4.2. The CI library versions were also updated to match these versions.
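For context, loading the upgraded stack interactively might look like the environment fragment below. This is a non-runnable, site-specific sketch: the `module use` path is a placeholder, and the module names follow spack-stack spelling conventions but are assumptions, not the literal contents of this PR's modulefiles.

```shell
# Hypothetical environment setup against spack-stack/1.5.0.
# The MODULEPATH below is a placeholder; module names are assumed
# spack-stack spellings of the versions listed in this PR.
module use /path/to/spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core
module load netcdf-c/4.9.2 netcdf-fortran/4.6.0 hdf5/1.14.0
module load ip/4.3.0 w3emc/2.10.0 esmf/8.4.2
```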

Lastly, during testing it was determined that two tests needed additional resources: chgres_cube with sigio used close to 60 GB on Orion (memory request increased from 50 GB to 75 GB) and took just under 15 minutes (time request increased from 15 to 20 minutes), and cpld_gridgen at 025 took a little longer than 15 minutes on Orion (time request increased from 15 to 20 minutes).
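The resource bumps described above would look something like the following in a Slurm batch header. This is a sketch only: the job name, executable, and directive layout are placeholders, not the repository's actual regression-test driver scripts.

```shell
#!/bin/bash
# Sketch of the increased requests for chgres_cube with sigio on Orion;
# names below are placeholders, not the repo's actual scripts.
#SBATCH --job-name=chgres_sigio
#SBATCH --mem=75G          # was 50G; the test used close to 60 GB
#SBATCH --time=00:20:00    # was 00:15:00; the test ran just under 15 min
srun ./chgres_cube.exe     # placeholder executable name
```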

TESTS CONDUCTED:

If there are changes to the build or source code, the tests below must be conducted. Contact a repository manager if you need assistance.

  • Compile branch on all Tier 1 machines using Intel (Orion, Jet, Hera and WCOSS2).
  • Compile branch on Hera using GNU.
  • Compile branch in 'Debug' mode on WCOSS2.
  • Run unit tests locally on any Tier 1 machine.
  • Run relevant consistency tests locally on all Tier 1 machines.

Describe any additional tests performed.

DEPENDENCIES:

None

DOCUMENTATION:

N/A

ISSUE:

#859

@DavidHuber-NOAA
Collaborator Author

I do not understand this CI error from the GCC ctests. The same tests passed successfully on commit, but are failing only in this PR and it looks like it is running the same workflows. Is there a difference between the tests run in the PR and those run on commit?

@GeorgeGayno-NOAA
Collaborator

> I do not understand this CI error from the GCC ctests. The same tests passed successfully on commit, but are failing only in this PR and it looks like it is running the same workflows. Is there a difference between the tests run in the PR and those run on commit?

I see a lot of 'illegal' instructions. I can't explain it, but I have seen that error before. Rerunning the workflow usually fixes it.

@DavidHuber-NOAA
Collaborator Author

@GeorgeGayno-NOAA Thanks! I will likely push an update for Jet today, so I will let it run then.

Note that the chgres RTs violate the debug QOS policy, so I changed the
QOS to batch.
@DavidHuber-NOAA
Collaborator Author

Ran regression tests on all tier-1 platforms -- all passed with the only needed changes on Orion (time limits, memory) and Jet (different QOS). The GCC CI ctests continue to fail, but I think that is because the environment needs to be built from scratch again, but the setup job does not clean the environment. @GeorgeGayno-NOAA Any suggestions on how to clean the environment?

@GeorgeGayno-NOAA
Collaborator

> Ran regression tests on all tier-1 platforms -- all passed with the only needed changes on Orion (time limits, memory) and Jet (different QOS). The GCC CI ctests continue to fail, but I think that is because the environment needs to be built from scratch again, but the setup job does not clean the environment. @GeorgeGayno-NOAA Any suggestions on how to clean the environment?

Rerunning that particular job usually fixes the problem.

@aerorahul - Any idea why the GCC test fails occasionally? Example:
https://github.com/ufs-community/UFS_UTILS/actions/runs/6722919365/job/18271879484?pr=866

@GeorgeGayno-NOAA
Collaborator

@DavidHuber-NOAA The head of 'develop' no longer compiles on Orion. I am guessing I need to merge your PR next. When were the libraries used by UFS_UTILS removed?

@aerorahul
Contributor

>> Ran regression tests on all tier-1 platforms -- all passed with the only needed changes on Orion (time limits, memory) and Jet (different QOS). The GCC CI ctests continue to fail, but I think that is because the environment needs to be built from scratch again, but the setup job does not clean the environment. @GeorgeGayno-NOAA Any suggestions on how to clean the environment?
>
> Rerunning that particular job usually fixes the problem.
>
> @aerorahul - Any idea why the GCC test fails occasionally? Example: https://github.com/ufs-community/UFS_UTILS/actions/runs/6722919365/job/18271879484?pr=866

Not really. It is possible the runner is glitchy. I am beginning to dislike the GitHub runners for executing our codes. Using containers and running our codes in them on these runners is a safer bet, IMO.
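A containerized job along these lines might be sketched as follows. The image name, workflow structure, and build/test commands are assumptions for illustration, not an existing workflow in this repository:

```yaml
# Hypothetical GitHub Actions fragment: run the build inside a pinned
# container image so the toolchain does not depend on the runner's
# host environment.
jobs:
  build-gnu:
    runs-on: ubuntu-latest
    container:
      image: ghcr.io/example/spack-stack-gnu:1.5.0   # placeholder image
    steps:
      - uses: actions/checkout@v4
      - name: Build and test
        run: |
          ./build_all.sh                       # assumed build entry point
          ctest --test-dir build --output-on-failure
```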

@DavidHuber-NOAA
Collaborator Author

@GeorgeGayno-NOAA IIRC, EPIC does not actually own /work/noaa/epic-ps, and the libraries EPIC installed there were placed there by accident. I'm not sure exactly when they were deleted, but the 'correct' hpc-stack installation is here: /work/noaa/epic/role-epic/contrib/orion/hpc-stack/intel-2022.1.2/modulefiles/stack.

@DavidHuber-NOAA
Collaborator Author

That said, I do not see netcdf/4.9.2 under that path, so it likely won't work for UFS_utils.

@GeorgeGayno-NOAA
Collaborator

> That said, I do not see netcdf/4.9.2 under that path, so it likely won't work for UFS_utils.

Ok. Is your branch ready to merge?

@GeorgeGayno-NOAA GeorgeGayno-NOAA self-requested a review November 6, 2023 21:11
@DavidHuber-NOAA
Collaborator Author

@GeorgeGayno-NOAA Yes, I just verified there are no new commits in develop that need to be merged/tested. It is ready to merge.

@GeorgeGayno-NOAA
Collaborator

> @GeorgeGayno-NOAA Yes, I just verified there are no new commits in develop that need to be merged/tested. It is ready to merge.

Ok. To be safe I recommend you merge yesterday's updates from 'develop' to your branch.

@DavidHuber-NOAA
Collaborator Author

@GeorgeGayno-NOAA Thanks for the heads up, I missed that. I will rerun RTs on Jet today just to verify everything is still OK.

@DavidHuber-NOAA
Collaborator Author

@GeorgeGayno-NOAA All regression tests passed on Jet after merging in develop. I believe this PR is now ready to merge.

@GeorgeGayno-NOAA
Collaborator

> @GeorgeGayno-NOAA All regression tests passed on Jet after merging in develop. I believe this PR is now ready to merge.

Several regression tests failed on Orion. I suspect the failures do not indicate a problem with your updates. I will confirm tomorrow. @DeniseWorthen - can you check the failed cpld_gridgen tests. My branch is here: /work/noaa/stmp/ggayno/huber/UFS_UTILS/reg_tests/cpld_gridgen

@DeniseWorthen
Contributor

DeniseWorthen commented Nov 7, 2023

@GeorgeGayno-NOAA Are other regression tests on Orion passing, but the cpld_gridgen tests are not? Because I see no reason why these tests in particular would fail comparison. It appears that anything using ESMF weight generation is B4B different. I know when we updated the UWM to SS 1.5.0 we had baseline changes. Do the cpld_gridgen baselines pass on Hera but not on Orion?

@DeniseWorthen
Contributor

DeniseWorthen commented Nov 7, 2023

EDIT: Disregard the below comment about location, I was looking at the hera location.

Regarding non-B4B in UWM w/ the move to SS, I think those changes were all put down to the change in the MAPL library. I'm also remembering that there were issues w/ ESMF meshes (which this code uses) when they first began testing SS for UWM. Are we sure there are no build options required in UFS-UTILS that we're missing?

Another question---UWM uses this location as module file on orion:

prepend_path("MODULEPATH", "/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core")

This doesn't seem to be the same location as in this PR.

@DavidHuber-NOAA
Collaborator Author

@GeorgeGayno-NOAA @DeniseWorthen The regression tests failed for me on Orion yesterday as well -- I should have mentioned that. The develop branch fails to build on Orion due to missing libraries (i.e. the recently deleted tree that was in /work/noaa/epic-ps), but the rt.sh script still attempts to launch the tests on Orion, and they fail for lack of a comparison run. This is why I opted to run the tests on Jet yesterday.

@GeorgeGayno-NOAA
Collaborator

> @GeorgeGayno-NOAA Are other regression tests on Orion passing, but the cpld_gridgen tests are not? Because I see no reason why these tests in particular would fail comparison. It appears that anything using ESMF weight generation is B4B different. I know when we updated the UWM to SS 1.5.0 we had baseline changes. Do the cpld_gridgen baselines pass on Hera but not on Orion?

Other regression tests are failing on Orion, not just cpld_gridgen. A quick look at one of the chgres tests showed very small (i.e., floating-point) differences with the baseline. So I don't think there is any real problem with these updates, but we should check.

@DeniseWorthen
Contributor

I also saw roundoff-level differences in my spot check. I'm curious... the weights files that we just moved into the fix directory are now slightly different. Isn't the idea that these are 'fix' files in conflict with potentially needing new ones each time spack-stack updates?

@GeorgeGayno-NOAA
Collaborator

> I also saw roundoff-level differences in my spot check. I'm curious... the weights files that we just moved into the fix directory are now slightly different. Isn't the idea that these are 'fix' files in conflict with potentially needing new ones each time spack-stack updates?

Agree with your statement. Things change so rapidly at EMC now that the concept of a 'fix' directory can have little meaning.

If the differences are roundoff level, then I would not bother updating the 'fix' directory.

@DavidHuber-NOAA
Collaborator Author

I reran regression tests on Hera after realizing that I was running develop only. All regression tests failed, though consistency differences all appear to be at the roundoff level. The regression tests were performed in /scratch1/NCEPDEV/global/David.Huber/para/stmp/ufs-utils/UFS_UTILS.

Comparing results against @GeorgeGayno-NOAA's tests revealed that only one test differs significantly between Orion and Hera: grid_gen/c96.uniform. Consistency differences on Hera are no larger than 10^-7. On Orion, the grid_gen c96.uniform test is returning very large differences for the generated fix files. All other grid_gen tests seem to be reasonable for both platforms.
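A quick way to classify a reported maximum absolute difference as roundoff or significant, using the 10^-7 bound quoted above, is a small helper like this. The threshold and sample values are illustrative; in practice the difference would be parsed from nccmp output:

```shell
#!/bin/sh
# Classify a max absolute difference (e.g. taken from nccmp -dS output)
# against a roundoff threshold of 1e-7.
within_roundoff() {
  awk -v d="$1" 'BEGIN { exit (d+0 <= 1e-7) ? 0 : 1 }'
}

within_roundoff 3.2e-8 && echo "roundoff"     # below the 1e-7 bound
within_roundoff 0.5    || echo "significant"  # above the 1e-7 bound
```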

@GeorgeGayno-NOAA
Collaborator

> I reran regression tests on Hera after realizing that I was running develop only. All regression tests failed, though consistency differences all appear to be at the roundoff level. The regression tests were performed in /scratch1/NCEPDEV/global/David.Huber/para/stmp/ufs-utils/UFS_UTILS.
>
> Comparing results against @GeorgeGayno-NOAA's tests revealed that only one test differs significantly between Orion and Hera: grid_gen/c96.uniform. Consistency differences on Hera are no larger than 10^-7. On Orion, the grid_gen c96.uniform test is returning very large differences for the generated fix files. All other grid_gen tests seem to be reasonable for both platforms.

The differences on Orion can be explained. The previous merge changed results for the c96.uniform test. However, before I could update the baseline on Orion, the libraries changed so that 'develop' would not compile.

@DavidHuber-NOAA
Collaborator Author

@GeorgeGayno-NOAA OK, that's good to know. Would you like me to rerun regression tests on Jet as well?

@GeorgeGayno-NOAA
Collaborator

> @GeorgeGayno-NOAA OK, that's good to know. Would you like me to rerun regression tests on Jet as well?

I have already run them on Jet: /lfs4/HFIP/emcda/George.Gayno/stmp/huber/UFS_UTILS. I have not checked results yet.

@GeorgeGayno-NOAA
Collaborator

>> @GeorgeGayno-NOAA OK, that's good to know. Would you like me to rerun regression tests on Jet as well?
>
> I have already run them on Jet: /lfs4/HFIP/emcda/George.Gayno/stmp/huber/UFS_UTILS. I have not checked results yet.

Fewer tests failed on Jet -- just the snow2mdl tests and the cpld_gridgen tests. According to the log file, a few cpld_gridgen files differed from the baseline. However, when I manually checked them using nccmp -dmfqS, there were no differences. I checked the snow2mdl file differences using GrADS, and the differences were minor.

I don't see any problems on Jet.

@DavidHuber-NOAA
Collaborator Author

@GeorgeGayno-NOAA Is there anything else that needs to happen for this PR?

@GeorgeGayno-NOAA GeorgeGayno-NOAA merged commit 892b693 into ufs-community:develop Nov 14, 2023
4 checks passed
@DavidHuber-NOAA DavidHuber-NOAA deleted the feature/spack-stack_gw branch December 19, 2023 13:58