[develop] Update ufs-weather-model hash and UPP hash and use upp-addon-env spack-stack environment #1136

MichaelLueken · 2024-10-08T14:29:33Z

DESCRIPTION OF CHANGES:

Update ufs-weather-model hash to 38a29a6 (September 19)
Update UPP hash to 81b38a8 (August 13)
All Tier-1 modulefiles/build_*.lua files have been updated to use the upp-addon-env spack-stack environment
srw_common.lua was updated to use g2/3.5.1 and g2tmpl/1.13.0 (these are required for UPP)
.cicd/JENKINSFILE was updated to replace cheyenne entries with derecho.
The doc/tables/Tests.csv table had nco-mode WE2E tests removed
The doc/UsersGuide/CustomizingTheWorkflow/ConfigWorkflow.rst documentation was updated to updated ush/config_defaults.yaml file.
The .github/CODEOWNERS file was updated to add Bruce Kropp to the list of reviewers
The exregional_plot_allvars.py and exregional_plot_allvars_diff.py scripts were updated to address changes made to the postxconfig-NT-fv3lam.txt file.
Updated ush/config_defaults.yaml to update PE_MEMBER01 calculation and documentation for OMP_NUM_THREADS_RUN_FCST to allow for the run_fcst task to properly run on Tier-1 platforms after updates to allow threading to function properly.
The ush/machine/*.yaml files were updated to allow for the run_fcst task to properly run on Tier-1 platforms after updates to allow threading to function properly.
There are not enough resources on Jet to run the high resolution WE2E tests (136 (ReqNodeNotAvail)). Commented out the tests in the comprehensive.jet test suite and removed one test from the coverage.jet test suite.
The ufs-case-studies WE2E tests are currently failing on Derecho. The failure is due to the file not being available. This is an issue because the file in question is named correctly and is available, but the tests fail in the get_extrn_ics/lbs tasks stating that the files aren't present. Commented out these tests in comprehensive.derecho and moved WE2E tests to remove from coverage.derecho. Issue ufs-case-studies WE2E tests fail on Derecho in get_extrn_ics/lbcs #1144 was opened to track this issue on Derecho.

Type of change

New feature (non-breaking change which adds functionality)

TESTS CONDUCTED:

DOCUMENTATION:

Updated documentation related to the table defining the WE2E tests currently available in the SRW App. The nco-mode WE2E tests had been removed, but were still present in the Tests.csv file. Removed these entries. Additional documentation updates were made following updates to the config_defaults.yaml file.

CHECKLIST

My code follows the style guidelines in the Contributor's Guide
I have performed a self-review of my own code using the Code Reviewer's Guide
My changes generate no new warnings
New and existing tests pass with my changes

…19) and UPP hash to 81b38a8 (Aug 13). Point to upp-addon-env spack-stack environment on Hera. Update srw_common.lua to use g2/3.5.1 and g2tmpl/1.13.0. Updated exregional_plot_allvars.py to handle updates made to postxconfig-NT-fv3lam.txt.

… for Hercules in build_hercules_intel.lua

… in build_jet_intel.lua

… in build_gaea_intel.lua

* .cicd/Jenkinsfile - Replaced cheyenne with derecho in commented sections and commented out Derecho. * doc/tables/Tests.csv - Removed nco-mode WE2E tests since these have been removed from the repository. * modulefiles/build_derecho_intel.lua - Update spack-stack environment to upp-addon-env.

… upp-addon-env in build_noaacloud_intel.lua

* .github/CODEOWNERS - Added Bruce Kropp as a reviewer from the Platform team. * modulefiles/build_orion_intel.lua - Udated spack-stack environment to upp-addon-env.

…ace SLP name and comment out REFC sections

* scripts/exregional_plot_allvars.py - Found method to successfully plot REFC using seek() and readline() pygrib commands * scripts/exregional_plot_allvars_diff.py - Same

MichaelLueken · 2024-10-11T20:57:33Z

I was able to successfully find a method to read the composite radar reflectivity from the post GRIB2 file. The issue was with the pygrib indexing of the GRIB2 file. Using wgrib2, the 37th entry in the post GRIB2 file was the REFC entry:

wgrib2 srw.t00z.prslev.f000.rrfs_conus_25km.grib2 -match REF
37:640531:d=2019070100:REFC:entire atmosphere (considered as a single layer):anl:

Using pygrib to generate an index of the GRIB2 file, we see the following for the 37th entry:

37:5:5 (instant):lambert:atmosphereSingleLayer:level 0 considered as a single layer:fcst time 0 hrs:from 201907010000

Clearly, there is something going on with the pygrib index. To get around this issue, I used the rewind() function to move to the start of the index, followed by seek(36), which moves to the 36th entry in the GRIB2 index. Finally, using readline().values allowed me to read in the information for the 37th (Maximum/Composite radar reflectivity) entry and successfully plot the data.

The following image was generated from the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot
WE2E test using the current develop branch:

This is the image generated using my feature/hash_update branch with the changes to exregional_plot_allvars.py:

The issue has been corrected and the composite radar reflectivity has successfully been plotted.

…ther than the deprecated atmos_nthreads, to correct issue with threading in the weather model

MichaelLueken · 2024-10-15T18:24:06Z

@mkavulich -

While all 6 fundamental WE2E tests successfully pass following the latest updates (example given was run on Hercules):

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              21.74
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20241  COMPLETE               8.50
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              25.67
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024101  COMPLETE              46.18
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20241015114  COMPLETE              88.54
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024101511493  COMPLETE              62.54
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             253.17

the comprehensive tests are failing with the following error:

FATAL from PE 15: mpp_domains_define.inc: At least one pe in pelist is not used by any tile in the mosaic

It isn't clear what the issue is. Since the indicated include file is part of the FV3, I'll reach out to them and see what might be happening for the tests that are now failing with this error message.

mkavulich · 2024-10-15T19:46:59Z

@MichaelLueken Thanks for your work on the OMP problem. I came here to report a similar problem: I applied your changes and noticed there is still at least one fundamental test failing on Hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR. It's failing with the not all the pe_end are in the pelist error, even though all the ufs.configure and other settings appear correct. I can't figure out why as the rest of the fundamental tests are succeeding and only this one is failing, and it's especially worrysome that the failures seem to be different on different platforms. The only thought that immediately comes to mind is maybe there's an off-by-one or similar edge case based on node size?

MichaelLueken · 2024-10-15T20:08:54Z

@mkavulich That could certainly be the issue. Another potential issue is with respect to layout and io_layout in the input.nml file. While layout is set to be LAYOUT_X, LAYOUT_Y, io_layout is set to be 1,1. Should IO_LAYOUT_Y and IO_LAYOUT_X be set to 1, or should the value of IO_LAYOUT_X be set to WRTCMP_write_groups and IO_LAYOUT_Y be set to WRTCMP_write_tasks_per_group? Looking through some of the issues of others who have encountered this error message, it looks like they didn't properly set up one or more of the layout and io_layout entries in input.nml, which is why I'm focusing more on this direction.

MichaelLueken · 2024-10-15T20:56:28Z

Issue #362 was opened in NOAA-GFDL/GFDL_atmos_cubed_sphere asking about the strange FATAL from PE *: mpp_domains_define.inc: At least one pe in pelist is not used by any tile in the mosaic error message.

MichaelLueken · 2024-10-22T18:24:29Z

@mkavulich -

The OMP issue appears to be related to the configuration of the job_card in rocoto. In order for a WE2E test to properly run the run_fcst job, it requires a total number of MPI tasks equal to OMP_NUM_THREADS_RUN_FCST(LAYOUT_XLAYOUT_Y+WRTCMP_write_groups*WRTCMP_write_tasks_per_group). For the grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR WE2E test, PE_MEMBER01 comes out to be 84 MPI tasks. This test was failing on Hera because the run_fcst job was being submitted to only run with --ntasks=84.

For threading purposes, the number of nodes required is equal to the tasks per node (40 on Hera) divided by the number of threads used. This gives us a PPN_RUN_FCST value of 20 (40 tasks per node on Hera/2 threads). The total number of nodes used is equal to the PE_MEMBER01+PPN_RUN_FCST-1/PPN_RUN_FCST. This comes out to 84+20-1/20=103/20~5.

While submitting the run_fcst job, since there are 40 total tasks per node on Hera and we were calling for only 84 tasks, we were only using 3 nodes, with all 40 tasks per node. In reality, we required 5 nodes, with 20 tasks per node. When changing the rocoto XML to use nodes: rather than cores:, the expected 5 nodes were used and the test successfully passed. I'm currently running the comprehensive tests on Hera and things are looking good thus far. Once testing is complete, I'll push the modification and being testing on other machines.

…nsive WE2E tests to successfully run on Hera

… task to run on Orion

… build and run following update to Cray PE. Allow all comprehensive WE2E tests to successfully run on Gaea.

MichaelLueken · 2024-10-24T21:01:30Z

Derecho, Gaea, Hera, and Orion have been successfully updated and tested. There are still issues with Jet (high resolution tests needing 136 tasks are still in queue) and Hercules (AQM). Will continue to monitor these two issues tomorrow.

…cst task to run on Hercules; updates for AQM WE2E

…sing a single thread

…o properly run on Jet

…. The tests are failing due to being unable to find the required files. Rearrange WE2E test suites to allow tests to successfully run on all platforms.

rickgrubin-tomorrow

LGTM

rickgrubin-noaa

LGTM

RatkoVasic-NOAA

Tested on Hera:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              14.56
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20241  COMPLETE              12.92
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              29.21
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024103  COMPLETE              47.55
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20241031170  COMPLETE              26.38
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024103117095  COMPLETE              51.00
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             181.62

Approved.

MichaelLueken · 2024-11-01T14:08:35Z

The Jenkins tests successfully passed for all platforms that they were run on - Derecho, Gaea, Hera GNU, Hera Intel, Jet, and Orion. The Hercules label was down yesterday, leading to the Jenkins tests skipping Hercules.

Manual runs of the WE2E coverage suite on Hercules have successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE_202  COMPLETE              16.64
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202  COMPLETE              15.94
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202  COMPLETE              69.40
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              25.94
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024110108  COMPLETE              17.88
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20241101081  COMPLETE              47.56
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20241101081947  COMPLETE             169.87
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16_202411010819  COMPLETE              40.41
MET_verification_only_vx_20241101081948                            COMPLETE               0.36
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS_20241101081950               COMPLETE               9.21
2019_hurricane_lorenzo_20241101081950                              COMPLETE              73.14
2020_denver_radiation_inversion_20241101081951                     COMPLETE              72.80
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             559.15

Moving forward with merging this PR now.

MichaelLueken and others added 11 commits September 30, 2024 19:27

Merge remote-tracking branch 'upstream/develop' into feature/hash_update

fe148f7

[feature/hash_update] Update spack-stack environment to upp-addon-env…

f3f81a4

… for Hercules in build_hercules_intel.lua

[feature/hash_update] Update spack-stack environment to upp-addon-env…

70cba8e

… in build_jet_intel.lua

[feature/hash_update] Update spack-stack environment to upp-addon-env…

725a33b

… in build_gaea_intel.lua

[feature/hash_update] Update spack-stack environment on NOAA Cloud to…

8326eb7

… upp-addon-env in build_noaacloud_intel.lua

[feature/hash_update] - Update:

6be2a93

* .github/CODEOWNERS - Added Bruce Kropp as a reviewer from the Platform team. * modulefiles/build_orion_intel.lua - Udated spack-stack environment to upp-addon-env.

[feature/hash_update] Updated exregional_plot_allvars_diff.py to repl…

d68d941

…ace SLP name and comment out REFC sections

[feature/hash_update] Updated:

208d2bf

* scripts/exregional_plot_allvars.py - Found method to successfully plot REFC using seek() and readline() pygrib commands * scripts/exregional_plot_allvars_diff.py - Same

Merge remote-tracking branch 'upstream/develop' into feature/hash_update

1c9f349

MichaelLueken mentioned this pull request Oct 15, 2024

SRW forecast runs do not use threading even if OMP_NUM_THREADS_RUN_FCST is set > 1 #1138

Closed

[feature/hash_update] Update threading to use ATM_omp_num_threads, ra…

77e2452

…ther than the deprecated atmos_nthreads, to correct issue with threading in the weather model

MichaelLueken linked an issue Oct 15, 2024 that may be closed by this pull request

SRW forecast runs do not use threading even if OMP_NUM_THREADS_RUN_FCST is set > 1 #1138

Closed

mkavulich mentioned this pull request Oct 16, 2024

[develop] Add Community Fire Behavior Model #1139

Merged

21 tasks

MichaelLueken added 4 commits October 23, 2024 13:32

[feature/hash_update] Applied necessary updates to allow all comprehe…

fc188c0

…nsive WE2E tests to successfully run on Hera

Merge remote-tracking branch 'upstream/develop' into feature/hash_update

7da953b

[feature/hash_update] Update ush/machine/orion.yaml to allow run_fcst…

b01f41f

… task to run on Orion

[feature/hash_update] Update Gaea modulefiles to allow the SRW App to…

d87c930

… build and run following update to Cray PE. Allow all comprehensive WE2E tests to successfully run on Gaea.

MichaelLueken and others added 4 commits October 25, 2024 09:53

[feature/hash_update] Update ush/machine/hercules.yaml to allow run_f…

e829827

…cst task to run on Hercules; updates for AQM WE2E

[feature/hash_update] Update ush/config.aqm.yaml to allow it to run u…

268c40e

…sing a single thread

[feature/hash_update] Update ush/machine/jet.yaml to allow run_fcst t…

3e823c1

…o properly run on Jet

[feature/hash_update] Remove ufs-case-studies WE2E tests from Derecho…

c733c04

…. The tests are failing due to being unable to find the required files. Rearrange WE2E test suites to allow tests to successfully run on all platforms.

MichaelLueken marked this pull request as ready for review October 29, 2024 14:19

rickgrubin-tomorrow approved these changes Oct 31, 2024

View reviewed changes

MichaelLueken marked this pull request as draft October 31, 2024 17:30

MichaelLueken marked this pull request as ready for review October 31, 2024 17:30

MichaelLueken requested a review from rickgrubin-noaa as a code owner October 31, 2024 17:30

rickgrubin-noaa approved these changes Oct 31, 2024

View reviewed changes

RatkoVasic-NOAA approved these changes Oct 31, 2024

View reviewed changes

MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Oct 31, 2024

MichaelLueken merged commit 87b26cc into ufs-community:develop Nov 1, 2024
4 of 5 checks passed

MichaelLueken deleted the feature/hash_update branch November 1, 2024 14:10

MichaelLueken mentioned this pull request Nov 1, 2024

Updates to the authoritative develop branch since the SRW v2.2 release (10/31/2023) #981

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[develop] Update ufs-weather-model hash and UPP hash and use upp-addon-env spack-stack environment #1136

[develop] Update ufs-weather-model hash and UPP hash and use upp-addon-env spack-stack environment #1136

MichaelLueken commented Oct 8, 2024 •

edited

Loading

MichaelLueken commented Oct 11, 2024

MichaelLueken commented Oct 15, 2024

mkavulich commented Oct 15, 2024

MichaelLueken commented Oct 15, 2024

MichaelLueken commented Oct 15, 2024

MichaelLueken commented Oct 22, 2024

MichaelLueken commented Oct 24, 2024

rickgrubin-tomorrow left a comment

rickgrubin-noaa left a comment

RatkoVasic-NOAA left a comment

MichaelLueken commented Nov 1, 2024

[develop] Update ufs-weather-model hash and UPP hash and use upp-addon-env spack-stack environment #1136

[develop] Update ufs-weather-model hash and UPP hash and use upp-addon-env spack-stack environment #1136

Conversation

MichaelLueken commented Oct 8, 2024 • edited Loading

DESCRIPTION OF CHANGES:

Type of change

TESTS CONDUCTED:

DOCUMENTATION:

CHECKLIST

MichaelLueken commented Oct 11, 2024

MichaelLueken commented Oct 15, 2024

mkavulich commented Oct 15, 2024

MichaelLueken commented Oct 15, 2024

MichaelLueken commented Oct 15, 2024

MichaelLueken commented Oct 22, 2024

MichaelLueken commented Oct 24, 2024

rickgrubin-tomorrow left a comment

Choose a reason for hiding this comment

rickgrubin-noaa left a comment

Choose a reason for hiding this comment

RatkoVasic-NOAA left a comment

Choose a reason for hiding this comment

MichaelLueken commented Nov 1, 2024

MichaelLueken commented Oct 8, 2024 •

edited

Loading