![chart](https://github.com/reichlab/covid19-forecast-hub/raw/master/visualization/vis-master/chart.png)
We are grateful to the teams who have generated these forecasts. They have invested a huge amount of time and effort, on very short timelines, to operationalize these important real-time forecasts. The groups have graciously and courageously made their public data available under different terms and licenses. You will find the licenses (when provided) within the model-specific folders in the `data-raw` directory. Please consult these licenses before using these data to ensure that you follow the terms under which they were released.
We have stored the raw datafiles here as they were made available on the various websites or provided directly to us. We are working on creating standardized versions of these files and on building a queryable API for easy access to the data contained in the forecasts.
Different groups are making forecasts at different times, and for different geographic scales. After looking over what groups are doing, we have settled (for the time being) on the following specifications, although not all models make forecasts for each of the following locations and targets.
What do we consider to be "gold standard" death data? We will use the daily reports containing death data from the JHU CSSE group as the gold standard reference data for deaths in the US. Note that there are non-trivial differences (especially in daily incident death data) between the JHU data and another commonly used source from the New York Times. The team at UTexas-Austin is tracking this issue on a separate GitHub repository.
When will forecast data be updated? We will be storing any new forecasts from each group as they are either provided to us directly (by pull request) or available for download online. We will attempt to make every version of each team's forecasts available in "processed" form in the GitHub repo. Teams are encouraged to submit data as often as they have it available, although we only support one upload per day. Every Monday at 6pm ET, we will update our ensemble forecast and interactive visualization using the most recent forecast from each team. Therefore, at the very least we encourage teams to provide a new forecast on Mondays that uses the most recent data. Depending on how the project evolves, we may add additional weekly builds for the ensemble and visualization.
What locations will have forecasts?
Forecasts may be submitted for any location that can be tagged with a FIPS code. Currently, our focus is on cataloguing forecasts for the United States, although we are starting to look at global forecast data as well. For the US, we are collecting forecast data at the national level (e.g., FIPS code = "US") and state level (FIPS code as a 2-digit character string). A file with FIPS codes for states in the US is available through the `fips_codes` dataset in the `tigris` R package, and is saved as a public CSV file. Please note that FIPS codes should be read in as characters to preserve any leading zeroes.
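For example, with the `readr` package one might read such a file as follows. This is a minimal sketch; the file name `locations.csv` and the column name `location` are placeholders, not the actual layout of the published CSV.

```r
# Minimal sketch: read a FIPS locations file while preserving leading zeroes.
# "locations.csv" and the "location" column name are illustrative placeholders.
library(readr)

locations <- read_csv(
  "locations.csv",
  col_types = cols(location = col_character())  # keep codes like "06" as strings
)
```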
How will probabilistic forecasts be represented?
Forecasts will be represented in a standard format using quantile-based representations of predictive distributions. We encourage all groups to make available the following 23 quantiles for each distribution: `c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)`. If this is infeasible, we ask teams to prioritize making available at least the following quantiles: `c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99)`. One goal of this effort is to create probabilistic ensemble forecasts, and having high-resolution component distributions will provide data to create better ensembles.
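For reference, the two quantile sets above can be written out directly in R:

```r
# The 23-quantile set we encourage all groups to provide
quantiles_full <- c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)
length(quantiles_full)  # 23

# The reduced 9-quantile set teams should prioritize if the full set is infeasible
quantiles_reduced <- c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99)
```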
What forecast targets will be stored? We will store forecasts on 1 through 130 day-ahead incident and cumulative deaths, 1 through 20 week-ahead incident and cumulative deaths, and 1 through 130 day-ahead incident hospitalizations. The targets should be labeled in files as, e.g., "1 day ahead inc death", "1 day ahead cum death", "1 wk ahead inc death", "1 wk ahead cum death", or "1 wk ahead inc hosp".
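As an illustrative sketch (not an official validation list), the death-target labels follow a simple pattern that can be generated programmatically:

```r
# Illustrative: enumerate the death-target labels described above.
day_death_targets  <- c(paste(1:130, "day ahead inc death"),
                        paste(1:130, "day ahead cum death"))
week_death_targets <- c(paste(1:20, "wk ahead inc death"),
                        paste(1:20, "wk ahead cum death"))
head(day_death_targets)   # "1 day ahead inc death" "2 day ahead inc death" ...
```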
To be clear about how the time periods relate to the time at which a forecast was made, we provide the following specifications (which are subject to change or re-evaluation as we get further into the project). Every submitted forecast will have an associated `forecast_date` that corresponds to the day the forecast was made. For day-ahead forecasts with a forecast date of a Monday, a 1 day ahead forecast corresponds to incident deaths on Tuesday or cumulative deaths by the end of Tuesday, a 2 day ahead forecast to Wednesday, and so on.
For week-ahead forecasts, we will use the specification of epidemiological weeks (EWs) defined by the US CDC. There are standard software packages to convert from dates to epidemic weeks and vice versa, e.g., `MMWRweek` for R, and `pymmwr` and `epiweeks` for Python.
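For example, using the `MMWRweek` package in R (the dates below reuse the 2020-04-20 / 2020-04-25 example from the data model section):

```r
# Convert a date to its MMWR epidemic week, and an epidemic week back to a date.
library(MMWRweek)

MMWRweek(as.Date("2020-04-20"))
#>   MMWRyear MMWRweek MMWRday
#> 1     2020       17       2     (Monday of EW17)

MMWRweek2Date(MMWRyear = 2020, MMWRweek = 17, MMWRday = 7)
#> "2020-04-25"                    (Saturday ending EW17)
```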
For week-ahead forecasts with a `forecast_date` of Sunday or Monday of EW12, a 1 week ahead forecast corresponds to EW12 and should have a `target_end_date` of the Saturday of EW12. For week-ahead forecasts with a `forecast_date` of Tuesday through Saturday of EW12, a 1 week ahead forecast corresponds to EW13 and should have a `target_end_date` of the Saturday of EW13. A week-ahead forecast should represent the total number of incident deaths or hospitalizations within a given epiweek (from Sunday through Saturday, inclusive) or the cumulative number of deaths reported on the Saturday of a given epiweek. We have created a csv file describing forecast collection dates and the dates to which forecasts refer.
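The rule above can be expressed compactly in code. Below is a minimal sketch using the `MMWRweek` package; the helper name `one_wk_ahead_end_date` is our own, and the sketch ignores the year-boundary case where the next epiweek falls in a new MMWR year.

```r
# Sketch: target_end_date for a "1 wk ahead" target, following the rule above.
library(MMWRweek)

one_wk_ahead_end_date <- function(forecast_date) {
  mw <- MMWRweek(as.Date(forecast_date))        # MMWRday: 1 = Sunday, ..., 7 = Saturday
  weeks_ahead <- if (mw$MMWRday <= 2) 0 else 1  # Sun/Mon -> same epiweek; Tue-Sat -> next
  # Note: does not handle rollover into a new MMWR year at week 52/53.
  MMWRweek2Date(mw$MMWRyear, mw$MMWRweek + weeks_ahead, MMWRday = 7)
}

one_wk_ahead_end_date("2020-04-20")  # Monday of EW17  -> "2020-04-25" (Saturday of EW17)
one_wk_ahead_end_date("2020-04-21")  # Tuesday of EW17 -> "2020-05-02" (Saturday of EW18)
```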
Most groups are providing their forecasts in a quantile-based format. We have developed a general data model that can be used to represent all of the forecasts that have been made publicly available. The tabular version of the data model is a simple, long-form data format, with seven required columns and one optional column.
- `forecast_date`: the date on which the forecast was made, in `YYYY-MM-DD` format. This should correspond to (and be redundant with) the date in the filename, but is included here by request from some analysts.
- `target`: a unique id for the target.
- `target_end_date`: the date corresponding to the end time of the target, in `YYYY-MM-DD` format. E.g., if the target is "1 wk ahead inc hosp" and this forecast is submitted on Monday `2020-04-20`, then this field should correspond to the Saturday that ends the current week, `2020-04-25`.
- `location`: a unique id for the location (we have standardized to FIPS codes).
- `location_name`: (optional) a human-readable name for the location, if desired. Note that the `location` column is considered authoritative, and for programmatic reading and importing of data this column will be ignored.
- `type`: one of either `"point"` or `"quantile"`.
- `quantile`: a value between 0 and 1 (inclusive), stating which quantile is displayed in this row. If `type=="point"`, then `NA`.
- `value`: a numeric value representing the value of the quantile function evaluated at the probability specified in `quantile`.

For example, if `quantile` is 0.3 and `value` is 10, then this row is saying that the 30th percentile of the distribution is 10. If `type` is `"point"` and `value` is 15, then this row is saying that the point estimate from this model is 15.
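To make the format concrete, here are a few illustrative rows for a single location and target; the numbers are invented for illustration only.

```r
# Illustrative rows in the long format described above; values are made up.
library(tibble)

example_rows <- tribble(
  ~forecast_date, ~target,                ~target_end_date, ~location, ~type,      ~quantile, ~value,
  "2020-04-20",   "1 wk ahead cum death", "2020-04-25",     "US",      "point",    NA,        50000,
  "2020-04-20",   "1 wk ahead cum death", "2020-04-25",     "US",      "quantile", 0.025,     42000,
  "2020-04-20",   "1 wk ahead cum death", "2020-04-25",     "US",      "quantile", 0.5,       50000,
  "2020-04-20",   "1 wk ahead cum death", "2020-04-25",     "US",      "quantile", 0.975,     61000
)
```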
Raw data from the `data-raw` subfolders will be processed and put into corresponding subfolders in `data-processed`. All files must follow the format outlined above. A template file in the correct format, covering two targets in a single location, has been included for clarity.
Each file must have a specific naming scheme that encodes when the forecast was made and which model made it. Files will follow the naming scheme `YYYY-MM-DD-[team]-[model].csv`, where `YYYY-MM-DD` is the date of the Monday on which the forecast was collected. For now, we will only accept a single file per Monday for a given model (in general, this will be the most recent file generated by that team). For example, for a forecast generated by the `CU` team's `80contact` model on Sunday, April 5, 2020, the filename would be `2020-04-06-CU-80contact.csv`.
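Below is a small sketch of how one might construct and sanity-check such a file name; the helper and the regular expression are illustrative, not part of an official validation tool.

```r
# Sketch: build and loosely validate a file name following YYYY-MM-DD-[team]-[model].csv
make_filename <- function(forecast_date, team, model) {
  sprintf("%s-%s-%s.csv", forecast_date, team, model)
}
make_filename("2020-04-06", "CU", "80contact")
#> "2020-04-06-CU-80contact.csv"

# Loose pattern check (assumes team and model names contain only word characters)
grepl("^\\d{4}-\\d{2}-\\d{2}-\\w+-\\w+\\.csv$", "2020-04-06-CU-80contact.csv")
#> TRUE
```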
So far, we have identified a number of experienced teams that are creating forecasts of COVID-19-related deaths in the US and globally. The groups whose forecasts are currently standardized and included in the repository are (with data reuse license):
- Columbia University (Apache 2.0)
- GLEAM from Northeastern University (CC-BY-4.0)
- IHME (CC-AT-NC4.0)
- LANL (custom)
- Imperial (none given)
- MIT (Apache 2.0)
- Notre Dame (none given)
- University of Geneva / Swiss Data Science Center (none given)
- University of Massachusetts - Expert Model (MIT)
- University of Massachusetts - Mechanistic Bayesian model (MIT)
- University of Texas-Austin (BSD-3)
- YYG (MIT)
- COVIDhub ensemble forecast: this is a combination of the above models.
Participating teams must provide a metadata file (see example), including methodological detail about their approach and a link to a file (or a file itself) describing the methods used.
Carefully curating these datasets into a standard format has taken a Herculean team effort. The following lists those who have helped out, in reverse alphabetical order:
- Nutcha Wattanachit (ensemble model, data processing)
- Nicholas Reich (project lead, ensemble model, data processing)
- Jarad Niemi (data processing and organization)
- Khoa Le (validation, automation)
- Katie House (visualization, validation, project management)
- Matt Cornell (validation, Zoltar integration)
- Andrea Brennen (metadata curation)
- Johannes Bracher (evaluation, data processing)