diff --git a/README.md b/README.md deleted file mode 100644 index a3f4ba2..0000000 --- a/README.md +++ /dev/null @@ -1,4071 +0,0 @@ -CFA Forecast Tools (Python) -================ - - - - polars.config.Config - -Summary of `forecasttools-py`: - -- A Python package. -- Primarily supports the Short Term Forecast’s team. -- Intended to support wider Real Time Monitoring branch operations. -- Has tools for pre- and post-processing. - - Conversion of `az.InferenceData` forecast to Hubverse format. - - Addition of time and or dates to `az.InferenceData`. - -Notes: - -- This repository is a WORK IN PROGRESS. - -# Installation - -Install `forecasttools` via: - - pip3 install git+https://github.com/CDCgov/forecasttools-py@main - -# Vignettes - -- [Format Arviz Forecast Output For FluSight - Submission](https://github.com/CDCgov/forecasttools-py/blob/main/notebooks/flusight_from_idata.qmd) -- [Community Meeting Utilities Demonstration - (2024-11-19)](https://github.com/CDCgov/forecasttools-py/blob/main/notebooks/forecasttools_community_demo_2024-11-19.qmd) - -*Coming soon as webpages, once [Issue -26](https://github.com/CDCgov/forecasttools-py/issues/26) is completed*. - -# Datasets - -Within `forecasttools-py`, one finds several packaged datasets. These -datasets can aid with experimentation; some are directly necessary to -other utilities provided by `forecasttools-py`. - -``` python -import forecasttools -``` - -Summary of datasets: - -- `forecasttools.location_table` - - A Polars dataframe of location abbreviations, codes, and names for - Hubverse formatted forecast submissions. -- `forecasttools.example_flusight_submission` - - An example Hubverse formatted influenza forecast submission (as a - Polars dataframe) submitted to the FluSight Hub. -- `forecasttools.nhsn_hosp_COVID` - - A Polars dataframe of NHSN COVID hospital admissions data. -- `forecasttools.nhsn_hosp_flu` - - A Polars dataframe of NHSN influenza hospital admissions data. -- `forecasttools.nhsn_flu_forecast_wo_dates` - - An `az.InferenceData` object containing a forecast made using NSHN - influenza data for Texas. -- `forecasttools.nhsn_flu_forecast_w_dates` - - An modified (with dates as coordinates) `az.InferenceData` object - containing a forecast made using NSHN influenza data for Texas. - -See below for more information on the datasets. - -## Location Table - -The location table contains abbreviations, codes, and extended names for -the US jurisdictions for which the FluSight and COVID forecasting hubs -require users to generate forecasts. - -The location table is stored in `forecasttools` as a `polars` dataframe -and is accessed via: - -``` python -loc_table = forecasttools.location_table -loc_table -``` - -
-
location_codeshort_namelong_name
strstrstr
"US""US""United States"
"01""AL""Alabama"
"02""AK""Alaska"
"04""AZ""Arizona"
"05""AR""Arkansas"
"66""GU""Guam"
"69""MP""Northern Mariana Islands"
"72""PR""Puerto Rico"
"74""UM""U.S. Minor Outlying Islands"
"78""VI""U.S. Virgin Islands"
- -Using `./forecasttools/data.py`, the location table was created by -running the following: - -``` python -make_census_dataset( - file_save_path=os.path.join( - os.getcwd(), - "location_table.csv" - ), -) -``` - -## Example FluSight Hub Submission - -The example FluSight submission comes from the [following 2023-24 -submission](https://raw.githubusercontent.com/cdcepi/FluSight-forecast-hub/main/model-output/cfa-flumech/2023-10-14-cfa-flumech.csv). - -The example FluSight submission is stored in `forecasttools` as a -`polars` dataframe and is accessed via: - -``` python -submission = forecasttools.example_flusight_submission -submission -``` - -
-
reference_datetargethorizontarget_end_datelocationoutput_typeoutput_type_idvalue
strstri64strstrstrf64f64
"2023-10-14""wk inc flu hosp"-1"2023-10-07""01""quantile"0.017.670286
"2023-10-14""wk inc flu hosp"-1"2023-10-07""01""quantile"0.0259.968043
"2023-10-14""wk inc flu hosp"-1"2023-10-07""01""quantile"0.0512.022354
"2023-10-14""wk inc flu hosp"-1"2023-10-07""01""quantile"0.114.497646
"2023-10-14""wk inc flu hosp"-1"2023-10-07""01""quantile"0.1516.119813
"2023-10-14""wk inc flu hosp"2"2023-10-28""US""quantile"0.852451.874899
"2023-10-14""wk inc flu hosp"2"2023-10-28""US""quantile"0.92806.928588
"2023-10-14""wk inc flu hosp"2"2023-10-28""US""quantile"0.953383.74799
"2023-10-14""wk inc flu hosp"2"2023-10-28""US""quantile"0.9753940.392536
"2023-10-14""wk inc flu hosp"2"2023-10-28""US""quantile"0.994761.757385
- -Using `data.py`, the example FluSight submission was created by running -the following: - -``` python -get_and_save_flusight_submission( - file_save_path=os.path.join( - os.getcwd(), - "example_flusight_submission.csv" - ), -) -``` - -## NHSN COVID And Flu Hospital Admissions - -NHSN hospital admissions fitting data for COVID and Flu is included in -`forecasttools-py` as well, for user experimentation. - -This data: - -- Is current as of `2024-04-27` -- Comes from the website [HealthData.gov COVID-19 Reported Patient - Impact and Hospital Capacity by State - Timeseries](https://healthdata.gov/Hospital/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/g62h-syeh). - -For influenza, the `previous_day_admission_influenza_confirmed` column -is retained and for COVID the -`previous_day_admission_adult_covid_confirmed` column is retained. As -can be seen in the example below, some early dates for each jurisdiction -do not have data. - -The fitting data is stored in `forecasttools` as a `polars` dataframe -and is accessed via: - -``` python -# access COVID data -covid_nhsn_data = forecasttools.nhsn_hosp_COVID - -# access flu data -flu_nhsn_data = forecasttools.nhsn_hosp_flu - -# display flu data -flu_nhsn_data -``` - -
-
statedatehosp
strstrstr
"AK""2020-03-23"null
"AK""2020-03-24"null
"AK""2020-03-25"null
"AK""2020-03-26"null
"AK""2020-03-27"null
"WY""2024-04-23""1"
"WY""2024-04-24""1"
"WY""2024-04-25""0"
"WY""2024-04-26""0"
"WY""2024-04-27""0"
- -The data was created by placing a csv file called -`NHSN_RAW_20240926.csv` (the full NHSN dataset) into `./forecasttools/` -and running, in `data.py`, the following: - -``` python -# generate COVID dataset -make_nshn_fitting_dataset( - dataset="COVID", - nhsn_dataset_path="NHSN_RAW_20240926.csv", - file_save_path=os.path.join( - os.getcwd(), - "nhsn_hosp_COVID.csv" - ) -) - -# generate flu dataset -make_nshn_fitting_dataset( - dataset="flu", - nhsn_dataset_path="NHSN_RAW_20240926.csv", - file_save_path=os.path.join( - os.getcwd(), - "nhsn_hosp_flu.csv" - ) -) -``` - -## Influenza Hospitalizations Forecast(s) - -Two example forecasts stored in Arviz `InferenceData` objects are -included for vignettes and user experimentation. Both are 28 day -influenza hospital admissions forecasts for Texas made using a spline -regression model fitted to NHSN data between 2022-08-08 and 2022-12-08. -The only difference between the forecasts is that -`example_flu_forecast_w_dates.nc` has had dates added as its coordinates -(this is not a native Arviz feature). The `idata` objects which includes -the observed data and posterior predictive samples is given below: - - Inference data with groups: - > posterior - > posterior_predictive - > log_likelihood - > sample_stats - > prior - > prior_predictive - > observed_data - -The forecast `idata`s are accessed via: - -``` python -# idata with dates as coordinates -idata_w_dates = forecasttools.nhsn_flu_forecast_w_dates -idata_w_dates -``` - -
-
-
arviz.InferenceData
-
- -
- - -``` python -# show dates -print(idata_w_dates["observed_data"]["obs"]["obs_dim_0"][:15]) -``` - - Size: 120B - 2022-08-08 2022-08-09 2022-08-10 2022-08-11 ... 2022-08-20 2022-08-21 2022-08-22 - Coordinates: - * obs_dim_0 (obs_dim_0) datetime64[ns] 120B 2022-08-08 ... 2022-08-22 - -``` python -# idata without dates as coordinates -idata_wo_dates = forecasttools.nhsn_flu_forecast_wo_dates -print(idata_wo_dates["observed_data"]["obs"]["obs_dim_0"][:15]) -``` - - Size: 120B - 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 - Coordinates: - * obs_dim_0 (obs_dim_0) int64 120B 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 - -The forecast was generated following the creation of `nhsn_hosp_flu.csv` -(see previous section) by running `data.py` with the following added: - -``` python -make_forecast( - nhsn_data=forecasttools.nhsn_hosp_flu, - start_date="2022-08-08", - end_date="2022-12-08", - juris_subset=["TX"], - forecast_days=28, - save_path="../forecasttools/example_flu_forecast_w_dates.nc", - save_idata=True, - use_log=False, -) -``` - -(note: `make_forecast` is no longer included in `forecasttools-py`, -given the expectation that no one would ever call it; however, for -reproducibility’s sake, the following is included here) - -
- -Some Of The Forecast Code - - -``` python -""" -Creating a new idata object with -dates to change the functionality -of idata_w_dates_to_df. -""" - -# %% IMPORTS - -import os -from datetime import datetime, timedelta - -import arviz as az -import forecasttools -import jax.numpy as jnp -import jax.random as jr -import matplotlib.pyplot as plt -import numpy as np -import numpyro -import numpyro.distributions as dist -import patsy -import polars as pl -from numpy.typing import NDArray - -# %% CHECK FILE PATH - - -def check_file_save_path( - file_save_path: str, -) -> None: - """ - Checks whether a file path is valid. - - file_save_path - The file path to be checked. - """ - directory = os.path.dirname(file_save_path) - if not os.path.exists(directory): - raise FileNotFoundError(f"Directory does not exist: {directory}") - if not os.access(directory, os.W_OK): - raise PermissionError(f"Directory is not writable: {directory}") - if os.path.exists(file_save_path): - raise FileExistsError(f"File already exists at: {file_save_path}") - - -# %% SPLINE REGRESSION MODEL - - -def model(basis_matrix, y=None): - # priors - shift = numpyro.sample("shift", dist.Normal(0.0, 2.0)) - beta_coeffs = numpyro.sample( - "beta_coeffs", - dist.Normal(jnp.zeros(basis_matrix.shape[1]), 2.0), - ) - shift_mu = jnp.dot(basis_matrix, beta_coeffs) + shift - mu_exp = jnp.exp(shift_mu) - alpha = numpyro.sample("alpha", dist.Exponential(1.0)) - # likelihood - numpyro.sample( - "obs", - dist.NegativeBinomial2(mu_exp, alpha), - obs=y, - ) - - -# %% SPLINE BASIS MATRIX - - -def spline_basis(X, degree: int = 4, df: int = 8) -> NDArray: - basis = patsy.dmatrix( - "bs(x, df=df, degree=degree, include_intercept=True) - 1", - {"x": X, "df": df, "degree": degree}, - return_type="matrix", - ) - return np.array(basis) - - -# %% PLOT AND OR SAVE FORECAST - - -def plot_and_or_save_forecast( - idata: az.InferenceData, - X: NDArray, - y: NDArray, - title: str, - start_date: str, - end_date: str, - last_fit: int, - X_act: NDArray, - y_act: NDArray, - save_to_pdf: bool = False, - use_log: bool = False, -): - """ - Includes hard-coded variables. For the - author's testing and no more. - """ - x_data = idata.posterior_predictive["obs_dim_0"] - y_data = idata.posterior_predictive["obs"] - fig, axes = plt.subplots(1, 1, figsize=(8, 6)) - az.plot_hdi( - x_data, - y_data, - hdi_prob=0.95, - color="skyblue", - smooth=False, - fill_kwargs={ - "alpha": 0.2, - "label": "95% Credible", - }, - ax=axes, - ) - az.plot_hdi( - x_data, - y_data, - hdi_prob=0.75, - color="skyblue", - smooth=False, - fill_kwargs={ - "alpha": 0.4, - "label": "75% Credible", - }, - ax=axes, - ) - az.plot_hdi( - x_data, - y_data, - hdi_prob=0.5, - color="C0", - smooth=False, - fill_kwargs={ - "alpha": 0.6, - "label": "50% Credible", - }, - ax=axes, - ) - axes.plot( - X, - y, - marker="o", - color="black", - linewidth=1.0, - markersize=3.0, - label="Observed", - ) - if (X_act is not None) and (y_act is not None): - axes.plot( - X_act, - y_act, - marker="o", - color="red", - linewidth=1.0, - markersize=3.0, - label="Actual", - ) - if use_log: - axes.set_yscale("log") - axes.set_ylabel( - "(Log) Hospital Admissions", - fontsize=17.5, - ) - if not use_log: - axes.set_ylabel("Hospital Admissions", fontsize=17.5) - median_ts = y_data.median(dim=["chain", "draw"]) - axes.plot( - x_data, - median_ts, - color="blue", - label="Median", - ) - axes.legend() - axes.axvline(last_fit, color="black", linestyle="--") - axes.set_title( - f"{title}", - fontsize=20, - ) - axes.set_xlabel("Time", fontsize=17.5) - - plt.show() - - -# %% ADD DATES TO AN INFERENCE DATA OBJECT - - -def add_dates_to_idata_object( - idata: az.InferenceData, - start_date: str, -) -> az.InferenceData: - """ - Takes an InferenceData object w/ - observed_data and posterior_predictive - groups and adds date indexing - """ - pass - - -# %% MAKE A FORECAST - - -def make_forecast( - nhsn_data: str, - start_date: str, - end_date: str, - juris_subset: list[str], - forecast_days: int, - save_path: str = os.path.join(os.getcwd(), "forecast.nc"), - show_plot: bool = True, - save_idata: bool = False, - use_log: bool = False, -) -> None: - """ - Generates a forecast for specified - dates using a spline regression model. - """ - # check dataset path - check_file_save_path(save_path) - # clean data and organize data, cleaning null values - nhsn_data = nhsn_data.with_columns( - pl.col("hosp").cast(pl.Int64), - pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"), - ).filter( - pl.col("hosp").is_not_null(), - pl.col("state").is_in(juris_subset), - ) - nhsn_data_ready = nhsn_data.filter( - ( - pl.col("date") - >= pl.lit(start_date).str.strptime(pl.Date, "%Y-%m-%d") - ) - & ( - pl.col("date") - <= pl.lit(end_date).str.strptime(pl.Date, "%Y-%m-%d") - ) - ) - # get the actual values, if they exist - try: - forecast_end_date = datetime.strptime( - end_date, "%Y-%m-%d" - ) + timedelta(days=forecast_days) - nhsn_data_actual = nhsn_data.filter( - ( - pl.col("date") - >= pl.lit(end_date).str.strptime(pl.Date, "%Y-%m-%d") - ) - & (pl.col("date") <= pl.lit(forecast_end_date)) - ) - except Exception as e: - nhsn_data_actual = None - print(f"The following error occurred: {e}") - # define some shared inference values - random_seed = 2134312 - num_samples = 1000 - num_warmup = 500 - # get posterior samples and make forecasts for each selected state - for state in juris_subset: - # get the state data - state_nhsn = nhsn_data_ready.filter(pl.col("state") == state) - # get observation (fitting) data y, X - y = state_nhsn["hosp"].to_numpy() - X = np.arange(y.shape[0]) - # set up inference, NUTS/MCMC - kernel = numpyro.infer.NUTS( - model=model, - max_tree_depth=12, - target_accept_prob=0.85, - init_strategy=numpyro.infer.init_to_uniform(), - ) - mcmc = numpyro.infer.MCMC( - kernel, - num_warmup=num_warmup, - num_samples=num_samples, - ) - # create spline basis for obs period and forecast period - last = X[-1] - X_future = np.hstack( - ( - X, - np.arange( - last + 1, - last + 1 + forecast_days, - ), - ) - ) - sbm = spline_basis(X_future) - # get posterior samples - mcmc.run( - rng_key=jr.key(random_seed), - basis_matrix=sbm[: len(X)], - y=y, - ) - posterior_samples = mcmc.get_samples() - # get prior predictive - prior_pred = numpyro.infer.Predictive(model, num_samples=num_samples)( - rng_key=jr.key(random_seed), - basis_matrix=sbm[: len(X)], - ) - # get posterior predictive forecast - posterior_pred_for = numpyro.infer.Predictive( - model, - posterior_samples=posterior_samples, - )( - rng_key=jr.key(random_seed), - basis_matrix=sbm, - ) - # create initial inference data object(s) and store - idata = az.from_numpyro( - posterior=mcmc, - posterior_predictive=posterior_pred_for, - prior=prior_pred, - ) - # get actual data, if it exists - if isinstance(nhsn_data_actual, pl.DataFrame): - actual_data = nhsn_data_actual.filter(pl.col("state") == state) - y_act = actual_data["hosp"].to_numpy() - X_act = np.arange(last - 1, last + forecast_days) - if not isinstance(nhsn_data_actual, pl.DataFrame): - y_act = None - X_act = None - # add dates to idata object - - # save idata object(s) - if save_idata: - idata.to_netcdf(save_path) - # plot forecast (if desired) from idata light - if show_plot: - plot_and_or_save_forecast( - idata=idata, - X=X, - y=y, - title=f"Hospital Admissions ({state}, {start_date}-{end_date})", - start_date=start_date, - end_date=end_date, - last_fit=last, - X_act=X_act, - y_act=y_act, - use_log=use_log, - ) - - -# %% EXECUTE MODE - -make_forecast( - nhsn_data=forecasttools.nhsn_hosp_flu, - start_date="2022-08-08", - end_date="2022-12-08", - juris_subset=["TX"], - forecast_days=28, - save_path="../forecasttools/example_flu_forecast_w_dates.nc", - save_idata=False, - use_log=True, -) -``` - -
- -The forecast looks like: - -
- - -
- -# CDC Open Source Considerations - -**General disclaimer** This repository was created for use by CDC -programs to collaborate on public health related projects in support of -the [CDC mission](https://www.cdc.gov/about/organization/mission.htm). -GitHub is not hosted by the CDC, but is a third party website used by -CDC and its partners to share information and collaborate on software. -CDC use of GitHub does not imply an endorsement of any one particular -service, product, or enterprise. - -
- -Rules, Policy, And Collaboration - - -- [Open Practices](./rules-and-policies/open_practices.md) -- [Rules of Behavior](./rules-and-policies/rules_of_behavior.md) -- [Thanks and Acknowledgements](./rules-and-policies/thanks.md) -- [Disclaimer](DISCLAIMER.md) -- [Contribution Notice](CONTRIBUTING.md) -- [Code of Conduct](./rules-and-policies/code-of-conduct.md) - -
-
- -Public Domain Standard Notice - -This repository constitutes a work of the United States Government and -is not subject to domestic copyright protection under 17 USC § 105. This -repository is in the public domain within the United States, and -copyright and related rights in the work worldwide are waived through -the [CC0 1.0 Universal public domain -dedication](https://creativecommons.org/publicdomain/zero/1.0/). All -contributions to this repository will be released under the CC0 -dedication. By submitting a pull request you are agreeing to comply with -this waiver of copyright interest. -
-
- -License Standard Notice - - -The repository utilizes code licensed under the terms of the Apache -Software License and therefore is licensed under ASL v2 or later. - -This source code in this repository is free: you can redistribute it -and/or modify it under the terms of the Apache Software License version -2, or (at your option) any later version. - -This source code in this repository is distributed in the hope that it -will be useful, but WITHOUT ANY WARRANTY; without even the implied -warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -Apache Software License for more details. - -You should have received a copy of the Apache Software License along -with this program. If not, see -http://www.apache.org/licenses/LICENSE-2.0.html - -The source code forked from other open source projects will inherit its -license. -
-
- -Privacy Standard Notice - -This repository contains only non-sensitive, publicly available data and -information. All material and community participation is covered by the -[Disclaimer](DISCLAIMER.md) and [Code of Conduct](code-of-conduct.md). -For more information about CDC’s privacy policy, please visit -[http://www.cdc.gov/other/privacy.html](https://www.cdc.gov/other/privacy.html). -
-
- -Contributing Standard Notice - - -Anyone is encouraged to contribute to the repository by -[forking](https://help.github.com/articles/fork-a-repo) and submitting a -pull request. (If you are new to GitHub, you might start with a [basic -tutorial](https://help.github.com/articles/set-up-git).) By contributing -to this project, you grant a world-wide, royalty-free, perpetual, -irrevocable, non-exclusive, transferable license to all users under the -terms of the [Apache Software License -v2](http://www.apache.org/licenses/LICENSE-2.0.html) or later. - -All comments, messages, pull requests, and other submissions received -through CDC including this GitHub page may be subject to applicable -federal law, including but not limited to the Federal Records Act, and -may be archived. Learn more at . -
-
- -Records Management Standard Notice - -This repository is not a source of government records, but is a copy to -increase collaboration and collaborative potential. All government -records will be published through the [CDC web -site](http://www.cdc.gov). -
-
- -Additional Standard Notices - -Please refer to [CDC’s Template -Repository](https://github.com/CDCgov/template) for more information -about [contributing to this -repository](https://github.com/CDCgov/template/blob/main/CONTRIBUTING.md), -[public domain notices and -disclaimers](https://github.com/CDCgov/template/blob/main/DISCLAIMER.md), -and [code of -conduct](https://github.com/CDCgov/template/blob/main/code-of-conduct.md). -