Create a new Pair-Stat tool to compute statistics for already paired forecast and observation data #3006

Closed
9 of 22 tasks
JohnHalleyGotway opened this issue Nov 4, 2024 · 9 comments · Fixed by #3057
Labels
MET: Statistics priority: high High Priority reporting: NRL METplus Naval Research Laboratory METplus Project requestor: Navy/NRL Naval Research Laboratory type: new feature Make it do something new

Comments

@JohnHalleyGotway
Collaborator

JohnHalleyGotway commented Nov 4, 2024

Describe the New Feature

Create a new statistics tool named Pair-Stat to compute statistics for already paired forecast and observation data. The initial version of this tool should support the following input datasets, although additional ones can be added in the future:

  1. IODA NetCDF files from the JEDI data assimilation system
  2. the ASCII MPR line type written by the Point-Stat tool
  3. Python embedding to supply MPR data

This new tool is driven primarily by the need to compute statistics for the already paired data in IODA files. Also supporting the MPR line type makes the functionality of this tool intersect with Stat-Analysis, which can already derive statistics from MPR data. The goal is to make the configuration of this tool more user-friendly, so that users do not need to wade through the details of defining many Stat-Analysis jobs.

The functionality of this tool overlaps substantially with Point-Stat, although Pair-Stat will do no interpolation and no matching to message types. However, care should be taken to support filtering the input data:

  • vertically by model level... separately or aggregating multiple levels together
  • spatially by defining geographic masking regions and/or computing statistics separately for each station
  • temporally since data for multiple times can be passed as input
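
For illustration only, the three kinds of filtering listed above could be sketched as follows. This is a minimal Python sketch, not MET code: the `Pair` record layout, the `select` and `group_by_station` helpers, and the use of a simple lat/lon box in place of a real masking region are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Pair:
    """One already-matched forecast/observation pair (hypothetical layout)."""
    station: str
    lat: float
    lon: float
    level: float      # vertical coordinate, e.g. pressure in hPa
    valid: datetime
    fcst: float
    obs: float


def select(pairs, level_range=None, box=None, time_range=None):
    """Keep the pairs that pass every filter supplied (None means no filter).

    box is (lat_min, lat_max, lon_min, lon_max), a stand-in for a mask.
    """
    kept = []
    for p in pairs:
        if level_range and not (level_range[0] <= p.level <= level_range[1]):
            continue
        if box and not (box[0] <= p.lat <= box[1] and box[2] <= p.lon <= box[3]):
            continue
        if time_range and not (time_range[0] <= p.valid <= time_range[1]):
            continue
        kept.append(p)
    return kept


def group_by_station(pairs):
    """Split pairs by station so statistics can be computed per station."""
    groups = defaultdict(list)
    for p in pairs:
        groups[p.station].append(p)
    return dict(groups)


# Made-up sample data, for demonstration only.
PAIRS = [
    Pair("A", 40.0, -105.0, 850.0, datetime(2024, 11, 4, 0), 1.0, 1.5),
    Pair("A", 40.0, -105.0, 500.0, datetime(2024, 11, 4, 12), 2.0, 2.5),
    Pair("B", 10.0, 20.0, 850.0, datetime(2024, 11, 4, 0), 3.0, 2.0),
]

low_levels = select(PAIRS, level_range=(800.0, 900.0))
per_station = group_by_station(PAIRS)
```

Passing a wider `level_range` aggregates several levels into one group, while calling `select` once per level keeps them separate.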

In the configuration file, let users define a list of variable names to be processed, or allow for an empty list to process all variables found in the input.
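
The "empty list means everything" rule could look roughly like this. It is a hypothetical Python helper, not the tool's implementation; in particular, silently skipping requested names that are not found in the input is an assumption of this sketch.

```python
def variables_to_process(requested, available):
    """Return the variable names to process.

    An empty request selects every variable found in the input. Otherwise
    only the requested names that exist in the input are kept, in the
    order requested (an assumption of this sketch).
    """
    if not requested:
        return list(available)
    return [name for name in requested if name in available]


found = ["specificHumidity", "airTemperature"]
everything = variables_to_process([], found)
only_one = variables_to_process(["airTemperature"], found)
```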

Remember to add a new chapter to the MET User's Guide for the new Pair-Stat tool.

List of questions to be considered:

  • Should external climatology data be supported?
  • Should sample data percentile thresholds be supported?

Acceptance Testing

List input data types and sources.
Describe tests required for new functionality.

Time Estimate

Estimate the amount of work required here.
Issues should represent approximately 1 to 3 days of work.

Sub-Issues

Consider breaking the new feature down into sub-issues.

  • Add a checkbox for each sub-issue here.

Relevant Deadlines

Work described in this issue should be completed by 12/30/2024

Funding Source

NRL METplus 7730022

Define the Metadata

Assignee

  • Select engineer(s) or no engineer required
  • Select scientist(s) or no scientist required

Labels

  • Review default alert labels
  • Select component(s)
  • Select priority
  • Select requestor(s)

Milestone and Projects

  • Select Milestone as a MET-X.Y.Z version, Consider for Next Release, or Backlog of Development Ideas
  • For a MET-X.Y.Z version, select the MET-X.Y.Z Development project

Define Related Issue(s)

Consider the impact to the other METplus components.

New Feature Checklist

See the METplus Workflow for details.

  • Complete the issue definition above, including the Time Estimate and Funding source.
  • Fork this repository or create a branch of develop.
    Branch name: feature_<Issue Number>_<Description>
  • Complete the development and test your changes.
  • Add/update log messages for easier debugging.
  • Add/update unit tests.
  • Add/update documentation.
  • Push local changes to GitHub.
  • Submit a pull request to merge into develop.
    Pull request: feature <Issue Number> <Description>
  • Define the pull request metadata, as permissions allow.
    Select: Reviewer(s) and Development issue
    Select: Milestone as the next official version
    Select: MET-X.Y.Z Development project for development toward the next official release
  • Iterate until the reviewer(s) accept and merge your changes.
  • Delete your fork or branch.
  • Close this issue.
@DanielAdriaansen
Contributor

Funding source added and deadline added.

@DanielAdriaansen
Contributor

Work in #3007 will support IODA files with Pair-Stat.

JohnHalleyGotway added a commit that referenced this issue Nov 5, 2024
…ol with all instances of point_stat renamed as pair_stat.
JohnHalleyGotway added a commit that referenced this issue Nov 19, 2024
JohnHalleyGotway added a commit that referenced this issue Nov 19, 2024
@JohnHalleyGotway
Collaborator Author

@willmayfield I'm wondering about the use of a grid within the Pair-Stat tool.

One of the first things done in the other MET statistics tools (e.g. Point-Stat, Grid-Stat, Series-Analysis, MODE, ...) is deciding on a common grid to be used for the verification. That can be defined as the "forecast" grid, "observation" grid, or some other grid, defined by its name, grid specification string, or the path to a gridded data file. All gridded data is regridded to the common vx grid prior to being used and that includes:

  • gridded forecast data
  • gridded observation data, when applicable
  • gridded climo data
  • land/sea mask data
  • topography data
  • gridded masking regions created by Gen-Vx-Mask

Since Pair-Stat won't use gridded forecast/observation data, defining a verification grid is NOT REQUIRED. Instead, when extracting data from climo, land/sea mask, topography, or gridded mask files, we could just use whatever grid that data happens to be defined on and interpolate to the (lat, lon) location of the pair.

The advantage is that avoiding those regridding steps will be a little faster and will introduce less "interpolation error".
The disadvantage is that it'll be less consistent with the logic of the other MET statistics tools.
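
To make the "interpolate directly from the native grid" option concrete, here is a minimal Python sketch. It assumes a regular lat/lon grid and bilinear interpolation; both are choices made for this example, and nothing in this thread specifies the interpolation method the tool would actually use. The `bilinear` function and its arguments are hypothetical.

```python
def bilinear(field, lat0, lon0, dlat, dlon, lat, lon):
    """Interpolate a gridded field to one (lat, lon) point.

    field[j][i] holds the value at latitude lat0 + j*dlat and
    longitude lon0 + i*dlon. No intermediate regridding is done.
    """
    ny, nx = len(field), len(field[0])
    y = (lat - lat0) / dlat   # fractional row index
    x = (lon - lon0) / dlon   # fractional column index
    if not (0 <= y <= ny - 1 and 0 <= x <= nx - 1):
        raise ValueError("point lies outside the grid")
    j = min(int(y), ny - 2)   # lower-left corner of the enclosing cell
    i = min(int(x), nx - 2)
    fy, fx = y - j, x - i     # weights within the cell, each in [0, 1]
    south = (1 - fx) * field[j][i] + fx * field[j][i + 1]
    north = (1 - fx) * field[j + 1][i] + fx * field[j + 1][i + 1]
    return (1 - fy) * south + fy * north


# Made-up 2x2 climatology field on a 1-degree grid starting at (0N, 0E).
climo = [[0.0, 1.0],
         [2.0, 3.0]]
value_at_pair = bilinear(climo, 0.0, 0.0, 1.0, 1.0, lat=0.5, lon=0.5)
```

Each gridded input (climo, land/sea mask, topography, masks) would be sampled this way on its own grid, which is where the saving in regridding steps comes from.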

Shall I proceed WITHOUT defining a common "verification grid"?
Or should I use one to maintain more consistency with the logic of other tools?

@JohnHalleyGotway
Collaborator Author

JohnHalleyGotway commented Nov 22, 2024

As discussed on Nov 22, 2024 with @DanielAdriaansen and @willmayfield, recommend NOT using a common verification grid since not doing so seems to be the simpler approach. If adding this functionality back in is requested in the future, it can be added at that time.

JohnHalleyGotway added a commit that referenced this issue Dec 3, 2024
…DataConfig_default file to store default settings for reading IODA data.
@JohnHalleyGotway
Collaborator Author

JohnHalleyGotway commented Dec 4, 2024

As discussed on Dec 4, 2024 with @georgemccabe, for setting up config options to filter input paired data, recommend:

  1. Reusing the existing mpr_column and mpr_thresh config options from Point-Stat and Grid-Stat to filter numeric columns (or differences or abs value of differences) from MPR data.
  2. Adding new mpr_str_inc and mpr_str_exc config options to filter input paired data by string matching inclusion and exclusion. These are arrays of dictionaries with name and value entries:
mpr_str_inc = [ { name = "DESC"; value = "NA"; } ];
mpr_str_exc = [ { name = "VX_MASK"; value = "CONUS"; } ];

Note that this introduces some inconsistency, since mpr_str_inc/exc are arrays of dictionaries while mpr_column/thresh are arrays of strings and thresholds. However, we agree that this is a preferable design, and users will set these via METplus Wrappers anyway.
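
The intended effect of these options can be illustrated with a small Python sketch. It is not the MET implementation. The exact-match comparison, the AND-ing of multiple entries, and the simple threshold parser are assumptions of this example, and it handles plain columns only (not the differences or absolute differences mentioned above). `parse_thresh` and `keep` are hypothetical names.

```python
import operator

# Comparison symbols accepted by this sketch (longest symbols listed first).
OPS = [(">=", operator.ge), ("<=", operator.le), ("==", operator.eq),
       ("!=", operator.ne), (">", operator.gt), ("<", operator.lt)]


def parse_thresh(text):
    """Turn a string such as '>281' into (comparison function, limit)."""
    for symbol, func in OPS:
        if text.startswith(symbol):
            return func, float(text[len(symbol):])
    raise ValueError(f"unrecognized threshold: {text!r}")


def keep(row, str_inc=(), str_exc=(), columns=(), thresholds=()):
    """Decide whether one row of paired data (a dict of strings) is kept."""
    for entry in str_inc:       # every inclusion entry must match
        if row.get(entry["name"]) != entry["value"]:
            return False
    for entry in str_exc:       # any exclusion match rejects the row
        if row.get(entry["name"]) == entry["value"]:
            return False
    for column, thresh in zip(columns, thresholds):
        compare, limit = parse_thresh(thresh)
        if not compare(float(row[column]), limit):
            return False
    return True


# Settings mirroring the config example above, applied to made-up rows.
str_inc = [{"name": "DESC", "value": "NA"}]
str_exc = [{"name": "VX_MASK", "value": "CONUS"}]
rows = [
    {"DESC": "NA", "VX_MASK": "FULL", "FCST": "280.0", "OBS": "281.5"},
    {"DESC": "NA", "VX_MASK": "CONUS", "FCST": "279.0", "OBS": "280.0"},
    {"DESC": "RUN1", "VX_MASK": "FULL", "FCST": "282.0", "OBS": "283.0"},
]
kept = [r for r in rows if keep(r, str_inc, str_exc, ["OBS"], [">281"])]
```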

JohnHalleyGotway added a commit that referenced this issue Dec 4, 2024
@JohnHalleyGotway
Collaborator Author

As discussed on Dec 6, 2024 (see meeting notes), add a new group_name config option to specify the group name from which the variable name should be extracted.

@willmayfield

willmayfield commented Dec 9, 2024

@JohnHalleyGotway After our discussion on Friday, I dug into some of the files in https://github.com/JCSDA-internal/ufo-data/tree/develop/testinput_tier_1.

An instructive file might be amsua_n19_hofxnm_2018041500_m_rttovcpp.nc4.

This file has one variable, brightness_temperature, with observation group ObsValue, possible "forecast" groups HofX and MPASJEDIHofX, dimension "Location" (size 100), as well as "Channel" (size 15) which may be desired to specify for the verification task. Channel takes values in the MetaData group along with coordinates of height, latitude, longitude, and datetime.

There are several other MetaData variables available, such as sensorZenithAngle(Location), sensorPolarizationDirection(Channel), etc. I am not sure whether it would be desirable to use them in, for example, a filter job. That may need to be left to the user to perform independently.

For a very simple file with a more traditional variable, you could look at sondes_q_obs_2020121500_singular.nc4.

This file has the variable specificHumidity, with groups ObsValue, hofx, GsiHofx, etc, and within MetaData there are variables datetime, latitude, longitude, and possible vertical coordinates height, pressure, and stationElevation. There is also, for example, MetaData information in stationIdentification, which again might be useful in a filter job, but I'm not sure if that's within our immediate scope of capabilities.

Please let me know if you have any questions or would like to discuss (I'll find a meeting time in the next few days either way).

@JohnHalleyGotway
Collaborator Author

As discussed on 20241211 (see meeting notes), @willmayfield encountered some data that includes an additional dimension named Channel with different satellite irradiance channels. Recommend adding new config options to enable the user to specify the name and value for extra dimensions, like channel.
Some thoughts:

  • Could provide matched dimension name/value pairs (e.g. ioda_dim_name = [ "Channel" ]; ioda_dim_value = [ "@2" ];)
  • Or could define this in the config file as a map of name/value pairs.
  • Recommend processing individual channels separately and NOT combining pairs from multiple channels.
  • Recommend INCLUDING that channel name and value in the output variable names. For example, instead of writing FCST_VAR as brightnessTemperature write it as brightnessTemperature_Channel4.
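
A rough Python sketch of the last two recommendations follows: pairs are extracted one channel at a time, and the channel is appended to the output variable name. The nested-list data layout, selection by channel value, and both helper names are assumptions for illustration; the config syntax for picking a channel is still open in the notes above.

```python
def channel_pairs(fcst, obs, channels, channel):
    """Return (fcst, obs) pairs for one channel only.

    fcst and obs are [location][channel] nested lists and channels lists
    the channel value stored at each position along the Channel dimension.
    """
    k = channels.index(channel)
    return [(f_row[k], o_row[k]) for f_row, o_row in zip(fcst, obs)]


def output_var_name(var_name, dim_name, dim_value):
    """Append the extra dimension's name and value to the variable name."""
    return f"{var_name}_{dim_name}{dim_value}"


# Made-up data: 2 locations by 3 channels.
channels = [1, 2, 4]
fcst = [[210.0, 220.0, 230.0],
        [211.0, 221.0, 231.0]]
obs = [[209.5, 220.5, 229.0],
       [212.0, 220.0, 232.5]]

pairs_ch4 = channel_pairs(fcst, obs, channels, 4)
name_ch4 = output_var_name("brightnessTemperature", "Channel", 4)
```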

@JohnHalleyGotway
Collaborator Author

JohnHalleyGotway commented Jan 8, 2025

Met on Jan 8, 2025 to discuss the development status. Based on available funds, recommend that we NOT support -format python or -format ioda in the initial version.

  • For beta1:

    • Use few remaining hours to finish up MPR support.
    • Finalize initial version of the documentation for pair_stat.
      • Chapter is provided but config file details should be added by future work.
    • Have tool error out if -format python or -format ioda are requested.
    • Add a unit test for pair_stat.
      • Examples of calling it with -format mpr, -format python, and -format ioda are added, but the last 2 error out as expected.
    • Additional development details:
      • Change -outdir path to -out base since we have no reference timing info with which to populate output file names.
    • Merge #3006 changes into develop.
    • Include METplus pair_stat wrapper in beta1.
  • For beta2 - write a new MET GitHub issue to document details:

    • Coordinate with @DanielAdriaansen and @michelleharrold to identify remaining NRL funds (prior to March 30, 2025), DTC funds, or alternative project funds to enhance this tool.
    • Add in support for -format python (DTC) and -format ioda (NRL), as funds allow.
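
The beta1 behavior for the -format option described above (accept mpr, error out for python and ioda) amounts to a check like the following. This is a toy Python illustration with a hypothetical `check_format` helper, not the tool's actual argument handling or its error messages.

```python
SUPPORTED_FORMATS = {"mpr"}
PLANNED_FORMATS = {"python", "ioda"}   # deferred beyond beta1


def check_format(fmt):
    """Return fmt if it is usable, otherwise raise a descriptive error."""
    if fmt in SUPPORTED_FORMATS:
        return fmt
    if fmt in PLANNED_FORMATS:
        raise NotImplementedError(f"-format {fmt} is not yet supported")
    raise ValueError(f"unknown -format value: {fmt}")


accepted = check_format("mpr")
```

A unit test can then call the check with each of the three values and confirm that the last two fail, which mirrors the test plan in the list above.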

Future related work:

  • Could add support for -format scm for Single-Column-Model output.
  • Enhance Point-Stat to write NetCDF matched pair output and enhance Pair-Stat to read that output.

JohnHalleyGotway added a commit that referenced this issue Jan 15, 2025
…tat-Analysis does. Committing the current state of this branch prior to merging in changes from the #3007 feature branch and completing development for the beta1 cycle.
JohnHalleyGotway added a commit that referenced this issue Jan 16, 2025
…t tests demonstrating those errors. Also note this in the user's guide.
JohnHalleyGotway added a commit that referenced this issue Jan 21, 2025
…king all that well right now because it's based on G004.
JohnHalleyGotway added a commit that referenced this issue Jan 21, 2025
…degree. This still isn't good enough though. Instead, we need to get rid of the reference grid altogether and keep track of the grid information separately for each mask.
@JohnHalleyGotway JohnHalleyGotway linked a pull request Jan 21, 2025 that will close this issue
17 tasks
JohnHalleyGotway added a commit that referenced this issue Jan 22, 2025
JohnHalleyGotway added a commit that referenced this issue Jan 23, 2025
* Per #3006, add new pair_stat tool as a full copy of the point_stat tool with all instances of point_stat renamed as pair_stat.

* Per #3006, add pair_stat to the list of things for which no 'make test' command is run.

* Per #3006, saving work in progress prior to seneca reboot

* Per #3006, revert back to using FileType instead of GrdFileType. That change was not meaningful or warranted.

* Per #3006, revert back to using FileType instead of GrdFileType. That change was not meaningful or warranted.

* Per #3006, committing changes since the code is compiling. Added IODADataConfig_default file to store default settings for reading IODA data.

* Per #3006, starting to tweak config options. Saving progress while it's successfully compiling

* Per #3006, add fcst.pairs and obs.pairs config entries.

* #3007 Added vx_ioda

* #3007 Added vx_ioda

* #3007 Added vx_ioda

* #3007 Derived from IODADataConfInfo

* #3007 Reduced the code smells (SonarQube findings)

* #3007 Added station_value_base_t and point_pair_t

* Initial release

* #3007 Changed back the location of nc_point_obs.set_nc_out_data

* Changed station_value_base_t::clear() to station_value_base_t::clear_base()

* Changed API names

* #3007 Reduced code smells

* #3007 Cleanup

* #3007 Changed API for IODADataConfInfo

* #3007 Renamed ioda_file to ioda_reader

* #3007 Corrected comment

* #3007 Added -lvx_statistics again

* #3007 Added get_nc_data(NcVar *, unixtime)

* #3007 Cleanup

* #3007 Added add_to_unixtime(unixtime)

* #3007 Reduced the complexity of read_time. Added read_time_as_number

* #3007 Added read_time_as_number

* #3007 Added add_to_unixtime(unixtime)

* #3007 Cleanup

* #3007 Set bad_data_int to qc_buf

* #3007 Cleanup

* Per #3006, define new GrdFileType::FileType_Pairs enumerated value to be used in the pair_stat tool.

* Per #3006, update pair_stat to use the newly added GrdFileType::FileType_Pairs enumerated value.

* #3007 Temporarily removed pair_stat

* Per #3006, rerun bootstrap on seneca to incorporate the compilation of the vx_ioda library.

* Per #3006, make docs build without warning

* Per #3006, saving compiling state

* Per #3006, use ConcatString instead of std::string for consistency.

* Per #3006, work in progress

* Unrelated to #3006, but fix typo in log message.

* Per #3006, default_column_union was defined in 2 spots. Renaming one of them to avoid compilation conflict.

* Per #3006, move StatHdrInfo out of aggr_stat_line.h/.cc and into vx_stat_out/stat_hdr_info.h/.cc. This makes it available to both Stat-Analysis and the Pair-Stat tool to track the unique STAT header elements read.

* Per #3006, remove the unused land/topo/msg_type type config options from the pair_stat tool's configuration file and code that parses it. If needed, we can add it back in the future.

* Per #3006, update VarInfoPairs::set_dict() to also call VarInfo::set_magic().

* Per #3006, since python_line.h lives in src/basic/vx_util, the vx_util library now also depends on the

* Per #3006, saving off version that compiles before trying changes that may not.

* #3007 Deleted commented out code

* Changed data type (float to double)

* #3007 Resolved SonarQube finding

* Per #3006, added logic to track unique header input columns like Stat-Analysis does. Committing the current state of this branch prior to merging in changes from the #3007 feature branch and completing development for the beta1 cycle.

* Per #3006, fix indexing for vx_opt

* Per #3006, error out for -format ioda and -format python

* Per #3006, make -format ioda or -format python error out, but add unit tests demonstrating those errors. Also note this in the user's guide.

* Per #3006, replace -outdir with -out and remove output_prefix config option.

* Per #3006, remove output_prefix from Pair-Stat config files.

* Per #3006, update unit_pair_stat.xml to use the -out option.

* Per #3006, working version. However, the filtering by grid is not working all that well right now because it's based on G004.

* Per #3006, expand the Pair-Stat example.

* Per #3006, switch from using global 0.5 degree reference grid to 0.1 degree. This still isn't good enough though. Instead, we need to get rid of the reference grid altogether and keep track of the grid information separately for each mask.

* Per #3006, fix for loop typo in 3 spots

* Per #3006, remove one line from bad merge

* Per #3006, SonarQube updates.

* Per #3006, more SonarQube fixes

* Per #3006, remove destructor as recommended by SonarQube

---------

Co-authored-by: Howard Soh <[email protected]>
Co-authored-by: MET Tools Test Account <[email protected]>
@JohnHalleyGotway JohnHalleyGotway changed the title Create new Pair-Stat tool to compute statistics for already paired forecast and observation data Create a new Pair-Stat tool to compute statistics for already paired forecast and observation data Jan 24, 2025
@JohnHalleyGotway JohnHalleyGotway removed the alert: NEED MORE DEFINITION Not yet actionable, additional definition required label Jan 27, 2025
@github-project-automation github-project-automation bot moved this to 🩺 Needs Triage in METplus-6.1.0 Development Jan 28, 2025
@JohnHalleyGotway JohnHalleyGotway moved this from 🩺 Needs Triage to 🏁 Done in METplus-6.1.0 Development Jan 28, 2025