
[develop] Update ufs-weather-model hash and UPP hash and use upp-addon-env spack-stack environment #1136

Merged

Conversation

Collaborator

@MichaelLueken MichaelLueken commented Oct 8, 2024

DESCRIPTION OF CHANGES:

  • Update ufs-weather-model hash to 38a29a6 (September 19)
  • Update UPP hash to 81b38a8 (August 13)
  • All Tier-1 modulefiles/build_*.lua files have been updated to use the upp-addon-env spack-stack environment
  • srw_common.lua was updated to use g2/3.5.1 and g2tmpl/1.13.0 (these are required for UPP)
  • .cicd/Jenkinsfile was updated to replace cheyenne entries with derecho.
  • The doc/tables/Tests.csv table had nco-mode WE2E tests removed
  • The doc/UsersGuide/CustomizingTheWorkflow/ConfigWorkflow.rst documentation was updated to reflect the updated ush/config_defaults.yaml file.
  • The .github/CODEOWNERS file was updated to add Bruce Kropp to the list of reviewers
  • The exregional_plot_allvars.py and exregional_plot_allvars_diff.py scripts were updated to address changes made to the postxconfig-NT-fv3lam.txt file.
  • Updated ush/config_defaults.yaml to revise the PE_MEMBER01 calculation and the OMP_NUM_THREADS_RUN_FCST documentation so that the run_fcst task runs properly on Tier-1 platforms now that threading works as intended.
  • The ush/machine/*.yaml files were updated for the same reason, so that the run_fcst task runs properly on Tier-1 platforms with threading enabled.
  • There are not enough resources on Jet to run the high-resolution WE2E tests (the 136-task jobs remain queued with ReqNodeNotAvail). Commented out these tests in the comprehensive.jet test suite and removed one test from the coverage.jet test suite.
  • The ufs-case-studies WE2E tests are currently failing on Derecho. The tests fail in the get_extrn_ics/lbcs tasks, reporting that the required file isn't present, even though the file in question is named correctly and is available. Commented out these tests in comprehensive.derecho and moved the affected WE2E tests out of coverage.derecho. Issue ufs-case-studies WE2E tests fail on Derecho in get_extrn_ics/lbcs #1144 was opened to track this issue on Derecho.

Type of change

  • New feature (non-breaking change which adds functionality)

TESTS CONDUCTED:

  • derecho.intel - Fundamental and Comprehensive WE2E tests were successfully run
  • gaea.intel - Fundamental and Comprehensive WE2E tests were successfully run
  • hera.gnu - Fundamental and Comprehensive WE2E tests were successfully run
  • hera.intel - Fundamental, Comprehensive, AQM WE2E, and AQM sample configuration tests were successfully run
  • hercules.intel - Fundamental, Comprehensive, AQM WE2E, and AQM sample configuration tests were successfully run
  • jet.intel - Fundamental WE2E tests were successfully run
  • orion.intel - Fundamental, Comprehensive, AQM WE2E, and AQM sample configuration tests were successfully run
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DOCUMENTATION:

Updated documentation related to the table defining the WE2E tests currently available in the SRW App. The nco-mode WE2E tests had been removed, but were still present in the Tests.csv file. Removed these entries. Additional documentation updates were made following updates to the config_defaults.yaml file.

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • My changes generate no new warnings
  • New and existing tests pass with my changes

MichaelLueken and others added 11 commits September 30, 2024 19:27
…19) and UPP hash to 81b38a8 (Aug 13). Point to upp-addon-env spack-stack environment on Hera. Update srw_common.lua to use g2/3.5.1 and g2tmpl/1.13.0. Updated exregional_plot_allvars.py to handle updates made to postxconfig-NT-fv3lam.txt.
 * .cicd/Jenkinsfile - Replaced cheyenne with derecho in commented
   sections and commented out Derecho.
 * doc/tables/Tests.csv - Removed nco-mode WE2E tests since these have
   been removed from the repository.
 * modulefiles/build_derecho_intel.lua - Update spack-stack environment
   to upp-addon-env.
 * .github/CODEOWNERS - Added Bruce Kropp as a reviewer from the Platform team.
 * modulefiles/build_orion_intel.lua - Updated spack-stack environment to upp-addon-env.
 * scripts/exregional_plot_allvars.py - Found method to successfully plot REFC using seek() and readline() pygrib commands
 * scripts/exregional_plot_allvars_diff.py - Same
@MichaelLueken
Collaborator Author

I was able to successfully find a method to read the composite radar reflectivity from the post GRIB2 file. The issue was with the pygrib indexing of the GRIB2 file. Using wgrib2, the 37th entry in the post GRIB2 file was the REFC entry:

wgrib2 srw.t00z.prslev.f000.rrfs_conus_25km.grib2 -match REF
37:640531:d=2019070100:REFC:entire atmosphere (considered as a single layer):anl:

Using pygrib to generate an index of the GRIB2 file, we see the following for the 37th entry:

37:5:5 (instant):lambert:atmosphereSingleLayer:level 0 considered as a single layer:fcst time 0 hrs:from 201907010000

Clearly, there is something going on with the pygrib index. To get around this issue, I used the rewind() function to move to the start of the index, followed by seek(36), which moves to the 36th entry in the GRIB2 index. Finally, using readline().values allowed me to read in the information for the 37th (Maximum/Composite radar reflectivity) entry and successfully plot the data.
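
For reference, a minimal sketch of this workaround (the file name is taken from the wgrib2 example above, and pygrib is assumed to be available; the actual logic in exregional_plot_allvars.py is more involved):

import pygrib

# Open the post-processed GRIB2 file (name from the wgrib2 example above).
grbs = pygrib.open("srw.t00z.prslev.f000.rrfs_conus_25km.grib2")

grbs.rewind()            # go back to the start of the GRIB2 index
grbs.seek(36)            # position the iterator just before the 37th message
refc = grbs.readline()   # read the 37th message (composite reflectivity)
refc_vals = refc.values  # 2-D array of reflectivity values, ready to plot

print(refc)              # inventory line as pygrib (mis)labels it
print(refc_vals.shape)

grbs.close()

Reading the message by position sidesteps the mislabeled pygrib inventory entry, since the message order in the file still matches the wgrib2 listing.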

The following image was generated from the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot WE2E test using the current develop branch:
FV3_GFS_v17_p8_refc_baseline

This is the image generated using my feature/hash_update branch with the changes to exregional_plot_allvars.py:
FV3_GFS_v17_p8_refc_hash_update

The issue has been corrected and the composite radar reflectivity has successfully been plotted.

…ther than the deprecated atmos_nthreads, to correct issue with threading in the weather model
@MichaelLueken
Collaborator Author

@mkavulich -

While all 6 fundamental WE2E tests successfully pass following the latest updates (example given was run on Hercules):

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              21.74
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20241  COMPLETE               8.50
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              25.67
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024101  COMPLETE              46.18
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20241015114  COMPLETE              88.54
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024101511493  COMPLETE              62.54
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             253.17

the comprehensive tests are failing with the following error:

FATAL from PE 15: mpp_domains_define.inc: At least one pe in pelist is not used by any tile in the mosaic

It isn't clear what the issue is. Since the indicated include file is part of FV3, I'll reach out to that team and see what might be happening in the tests that are now failing with this error message.

@mkavulich
Collaborator

@MichaelLueken Thanks for your work on the OMP problem. I came here to report a similar problem: I applied your changes and noticed there is still at least one fundamental test failing on Hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR. It's failing with the not all the pe_end are in the pelist error, even though all the ufs.configure and other settings appear correct. I can't figure out why, as the rest of the fundamental tests are succeeding and only this one is failing, and it's especially worrisome that the failures seem to be different on different platforms. The only thought that immediately comes to mind is that maybe there's an off-by-one or similar edge case based on node size?

@MichaelLueken
Collaborator Author

@mkavulich That could certainly be the issue. Another potential issue is with respect to layout and io_layout in the input.nml file. While layout is set to be LAYOUT_X, LAYOUT_Y, io_layout is set to be 1,1. Should IO_LAYOUT_Y and IO_LAYOUT_X be set to 1, or should the value of IO_LAYOUT_X be set to WRTCMP_write_groups and IO_LAYOUT_Y be set to WRTCMP_write_tasks_per_group? Looking through some of the issues of others who have encountered this error message, it looks like they didn't properly set up one or more of the layout and io_layout entries in input.nml, which is why I'm focusing more on this direction.

@MichaelLueken
Collaborator Author

Issue #362 was opened in NOAA-GFDL/GFDL_atmos_cubed_sphere asking about the strange FATAL from PE *: mpp_domains_define.inc: At least one pe in pelist is not used by any tile in the mosaic error message.

@MichaelLueken
Collaborator Author

@mkavulich -

The OMP issue appears to be related to the configuration of the job_card in rocoto. In order for a WE2E test to properly run the run_fcst job, it requires a total number of MPI tasks equal to OMP_NUM_THREADS_RUN_FCST * (LAYOUT_X * LAYOUT_Y + WRTCMP_write_groups * WRTCMP_write_tasks_per_group). For the grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR WE2E test, PE_MEMBER01 comes out to be 84 MPI tasks. This test was failing on Hera because the run_fcst job was being submitted to only run with --ntasks=84.

For threading purposes, the number of tasks per node is equal to the cores per node (40 on Hera) divided by the number of threads used. This gives us a PPN_RUN_FCST value of 20 (40 cores per node on Hera / 2 threads). The total number of nodes used is equal to (PE_MEMBER01 + PPN_RUN_FCST - 1) / PPN_RUN_FCST. This comes out to (84 + 20 - 1) / 20 = 103 / 20 = 5 with integer division.

While submitting the run_fcst job, since there are 40 total tasks per node on Hera and we were calling for only 84 tasks, we were only using 3 nodes, packed at 40 tasks per node. In reality, we required 5 nodes, with 20 tasks per node. When changing the rocoto XML to use nodes: rather than cores:, the expected 5 nodes were used and the test successfully passed. I'm currently running the comprehensive tests on Hera and things are looking good thus far. Once testing is complete, I'll push the modification and begin testing on other machines.
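
To make the arithmetic concrete, here is a small illustrative sketch (not part of the SRW scripts). The layout and write-component values are hypothetical, chosen only so the example reproduces the 84-task, 20-tasks-per-node, 5-node figures quoted above:

def run_fcst_resources(layout_x, layout_y, write_groups, write_tasks_per_group,
                       omp_num_threads, cores_per_node):
    """Illustrative reconstruction of the run_fcst resource arithmetic."""
    # Total MPI tasks requested for the member, including the thread factor.
    pe_member01 = omp_num_threads * (layout_x * layout_y
                                     + write_groups * write_tasks_per_group)
    # Tasks per node when threading: cores per node divided by threads.
    ppn_run_fcst = cores_per_node // omp_num_threads
    # Nodes, rounded up: (PE_MEMBER01 + PPN_RUN_FCST - 1) // PPN_RUN_FCST.
    nnodes = (pe_member01 + ppn_run_fcst - 1) // ppn_run_fcst
    return pe_member01, ppn_run_fcst, nnodes

# Hypothetical Hera-like example: 84 tasks, 20 tasks per node -> 5 nodes.
print(run_fcst_resources(layout_x=6, layout_y=6, write_groups=1,
                         write_tasks_per_group=6, omp_num_threads=2,
                         cores_per_node=40))  # (84, 20, 5)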

@MichaelLueken
Collaborator Author

Derecho, Gaea, Hera, and Orion have been successfully updated and tested. There are still issues with Jet (high resolution tests needing 136 tasks are still in queue) and Hercules (AQM). Will continue to monitor these two issues tomorrow.

MichaelLueken and others added 4 commits October 25, 2024 09:53
…cst task to run on Hercules; updates for AQM WE2E
…. The tests are failing due to being unable to find the required files. Rearrange WE2E test suites to allow tests to successfully run on all platforms.

@rickgrubin-tomorrow rickgrubin-tomorrow left a comment


LGTM

@MichaelLueken MichaelLueken marked this pull request as draft October 31, 2024 17:30
@MichaelLueken MichaelLueken marked this pull request as ready for review October 31, 2024 17:30
Collaborator

@rickgrubin-noaa rickgrubin-noaa left a comment


LGTM

Collaborator

@RatkoVasic-NOAA RatkoVasic-NOAA left a comment


Tested on Hera:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              14.56
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20241  COMPLETE              12.92
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              29.21
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024103  COMPLETE              47.55
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20241031170  COMPLETE              26.38
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024103117095  COMPLETE              51.00
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             181.62

Approved.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Oct 31, 2024
@MichaelLueken
Collaborator Author

The Jenkins tests successfully passed on all platforms where they were run - Derecho, Gaea, Hera GNU, Hera Intel, Jet, and Orion. The Hercules Jenkins label was down yesterday, so the Jenkins tests skipped Hercules.

Manual runs of the WE2E coverage suite on Hercules have successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE_202  COMPLETE              16.64
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202  COMPLETE              15.94
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202  COMPLETE              69.40
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              25.94
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024110108  COMPLETE              17.88
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20241101081  COMPLETE              47.56
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20241101081947  COMPLETE             169.87
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16_202411010819  COMPLETE              40.41
MET_verification_only_vx_20241101081948                            COMPLETE               0.36
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS_20241101081950               COMPLETE               9.21
2019_hurricane_lorenzo_20241101081950                              COMPLETE              73.14
2020_denver_radiation_inversion_20241101081951                     COMPLETE              72.80
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             559.15

Moving forward with merging this PR now.

@MichaelLueken MichaelLueken merged commit 87b26cc into ufs-community:develop Nov 1, 2024
4 of 5 checks passed
@MichaelLueken MichaelLueken deleted the feature/hash_update branch November 1, 2024 14:10
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Development

Successfully merging this pull request may close these issues.

SRW forecast runs do not use threading even if OMP_NUM_THREADS_RUN_FCST is set > 1