Upgrade libraries to spack-stack/1.5.0 to match the UFS versions #866
I do not understand this CI error from the GCC ctests. The same tests passed successfully on commit, but are failing only in this PR and it looks like it is running the same workflows. Is there a difference between the tests run in the PR and those run on commit?
I see a lot of 'illegal' instructions. I can't explain it, but I have seen that error before. Rerunning the workflow usually fixes it.
@GeorgeGayno-NOAA Thanks! I will likely push an update for Jet today, so I will let it run then.
Note that the chgres RTs violate the debug QOS policy, so I changed the QOS to batch.
Ran regression tests on all tier-1 platforms. All passed; the only changes needed were on Orion (time limits, memory) and Jet (different QOS). The GCC CI ctests continue to fail. I think that is because the environment needs to be rebuilt from scratch and the setup job does not clean the environment. @GeorgeGayno-NOAA Any suggestions on how to clean the environment?
Rerunning that particular job usually fixes the problem. @aerorahul - Any idea why the GCC test fails occasionally? Example:
@DavidHuber-NOAA The head of 'develop' no longer compiles on Orion. I am guessing I need to merge your PR next. When were the libraries used by UFS_UTILS removed?
Not really.
@GeorgeGayno-NOAA IIRC, EPIC does not actually own /work/noaa/epic-ps, and the libraries that EPIC installed there were put there by accident. I'm not sure exactly when they were deleted, but the 'correct' hpc-stack installation is here:
That said, I do not see netcdf/4.9.2 under that path, so it likely won't work for UFS_UTILS.
Ok. Is your branch ready to merge?
@GeorgeGayno-NOAA Yes, I just verified there are no new commits in develop that need to be merged/tested. It is ready to merge.
Ok. To be safe I recommend you merge yesterday's updates from 'develop' to your branch.
@GeorgeGayno-NOAA Thanks for the heads up, I missed that. I will rerun RTs on Jet today just to verify everything is still OK.
@GeorgeGayno-NOAA All regression tests passed on Jet after merging in develop. I believe this PR is now ready to merge.
Several regression tests failed on Orion. I suspect the failures do not indicate a problem with your updates. I will confirm tomorrow. @DeniseWorthen - can you check the failed cpld_gridgen tests? My branch is here:
@GeorgeGayno-NOAA Are other regression tests on Orion passing while the cpld_gridgen tests are not? I see no reason why these tests in particular would fail comparison. It appears that anything using ESMF weight generation is B4B different. I know when we updated the UWM to SS1.5.0 we had baseline changes. Do the cpld_gridgen baselines pass on Hera but not on Orion?
EDIT: Disregard the below comment about location, I was looking at the hera location. Regarding non-B4B in UWM w/ the move to SS, I think those changes were all put down to the change in the MAPL library. I'm also remembering that there were issues w/ ESMF meshes (which this code uses) when they first began testing SS for UWM. Are we sure there are no build options required in UFS-UTILS that we're missing? Another question---UWM uses this location as module file on orion:
This doesn't seem to be the same location as in this PR.
@GeorgeGayno-NOAA @DeniseWorthen The regression tests failed for me on Orion yesterday as well -- I should have spoken to that. Since the develop branch fails to build on Orion due to missing libraries (i.e. the recently deleted tree that was in
Other regression tests are failing on Orion, not just cpld_gridgen. A quick look at one of the chgres tests showed very small (i.e., floating-point) differences from the baseline. So, I don't think there is any real problem with these updates. But we should check.
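For anyone wanting to spot-check a field by hand, here is a minimal sketch of a "roundoff-level" check. It is not the comparison the regression test scripts perform; the function names and the `rel_tol=1e-6` threshold are assumptions chosen for illustration, and the fields are assumed to have been read into flat lists of floats already.

```python
def max_relative_difference(baseline, candidate):
    """Return the largest relative difference between paired values."""
    if len(baseline) != len(candidate):
        raise ValueError("fields have different lengths")
    worst = 0.0
    for b, c in zip(baseline, candidate):
        scale = max(abs(b), abs(c))
        if scale == 0.0:
            continue  # both values are exactly zero, so they agree
        worst = max(worst, abs(b - c) / scale)
    return worst


def is_roundoff_level(baseline, candidate, rel_tol=1e-6):
    """True if every value agrees to within rel_tol (an assumed threshold)."""
    return max_relative_difference(baseline, candidate) <= rel_tol


if __name__ == "__main__":
    # Made-up values, not taken from any UFS_UTILS test output.
    base = [1.0, 250.0, 0.0]
    test = [1.0 + 1e-12, 250.0, 0.0]
    print(is_roundoff_level(base, test))
```

A B4B comparison corresponds to a maximum relative difference of exactly zero; anything above zero but below the chosen threshold is what this thread calls roundoff level.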
I also saw roundoff-level differences in my spot check. I'm curious...the weights files that we just moved into the fix directory are now slightly different. Isn't the idea that they are fix files in conflict with potentially needing new ones each time SS updates?
Agree with your statement. Things change so rapidly at EMC now that the concept of a 'fix' directory can have little meaning. If the differences are roundoff level, then I would not bother updating the 'fix' directory.
I reran regression tests on Hera after realizing that I was running develop only. All regression tests failed, though consistency differences all appear to be at the roundoff level. The regression tests were performed in Comparing results against @GeorgeGayno-NOAA's tests revealed that only one test differs significantly between Orion and Hera:
The differences on Orion can be explained. The previous merge changed results for the c96.uniform test. However, before I could update the baseline on Orion, the libraries changed so that 'develop' would not compile.
@GeorgeGayno-NOAA OK, that's good to know. Would you like me to rerun regression tests on Jet as well?
I have already run them on Jet: /lfs4/HFIP/emcda/George.Gayno/stmp/huber/UFS_UTILS. I have not checked results yet.
Fewer tests failed on Jet - just the snow2mdl tests and the cpld_gridgen tests. According to the log file, a few cpld_gridgen files differed from the baseline. However, when I manually checked them using I don't see any problems on Jet.
@GeorgeGayno-NOAA Is there anything else that needs to happen for this PR?
DESCRIPTION OF CHANGES:
This upgrades the libraries to those currently used by the UFS as managed by spack-stack version 1.5.0 on Orion and Hera. In particular, this upgrades netCDF-c to 4.9.2, netCDF-Fortran to 4.6.0, HDF5 to 1.14.0, ip to 4.3.0, w3emc to 2.10.0, and ESMF to 8.4.2. The CI library versions were also updated to match these versions.
Lastly, testing showed that two tests needed additional resources on Orion. chgres_cube with sigio used close to 60GB of memory (memory request increased from 50GB to 75GB) and took just under 15 minutes (time request increased from 15 to 20 minutes). cpld_gridgen at 025 took a little longer than 15 minutes (time request increased from 15 to 20 minutes).
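As a rough illustration of the margins these changes leave, the sketch below computes the unused fraction of a resource request. The helper name `headroom_fraction` is hypothetical, and the observed values are the approximate figures quoted above ("close to 60GB", "just less than 15 minutes"), so the results are estimates only.

```python
def headroom_fraction(observed, requested):
    """Fraction of the request left unused: (requested - observed) / requested."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return (requested - observed) / requested


if __name__ == "__main__":
    # chgres_cube with sigio on Orion, memory in GB (observed value is approximate).
    print(headroom_fraction(60, 50))  # old request: negative, i.e. over the limit
    print(headroom_fraction(60, 75))  # new request
    # Wall time in minutes for the same test (observed value is approximate).
    print(headroom_fraction(15, 20))
```

A negative result means the job exceeded its request, which is the situation the old 50GB memory limit was in.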
TESTS CONDUCTED:
If there are changes to the build or source code, the tests below must be conducted. Contact a repository manager if you need assistance.
Describe any additional tests performed.
DEPENDENCIES:
None
DOCUMENTATION:
N/A
ISSUE:
#859