
Open source aggregation pipeline for public COVID-19 data, including hospitalization/ICU/ventilator numbers for many countries.


thepanacealab/open-covid-19-data

 
 


Open COVID-19 Data

Google Research's Open COVID-19 Data project is an open source pipeline that aggregates public COVID-19 data sources into a single dataset. The data includes time series for COVID-19 cases, deaths, tests, hospitalizations, discharges, intensive care unit (ICU) cases, ventilator cases, and government interventions, as well as Google's Community Mobility Reports and Search Trends symptoms dataset.

About

COVID-19 data is published by many distinct sources in highly heterogeneous formats. The goal of this pipeline is to accept data in many different formats and to process it into a standardized and consistent schema. A consistent schema lets researchers build models quickly, and the pipeline is designed so that engineers can add new data sources quickly.

The pipeline supports three ways of ingesting data:

  • Automatic downloads: data that can be downloaded as a .csv or .xlsx file from a consistent URL
  • Manual downloads: data that can be downloaded as a .csv or .xlsx file, but must be downloaded manually because the URL changes
  • Scraped data: data that is not machine-readable and must be scraped by a human (e.g. from charts, tables, PDFs, or occasionally tweets)

For each data source, this repository has a configuration file located in src/config/sources that specifies how the pipeline should map the original data into our schema. Raw data is fetched from the data source and written into a directory within data/inputs. Exported data that has been transformed into our schema is found in the data/exports directory.

Using the data

Latest data

If you just want to use the latest data for models, visualizations, or research, we provide aggregated data files under different licenses. This is to provide you with options so that you can use data with a license that is acceptable for your use case, while respecting the original licenses of the data sources.

  • Aggregated data under a CC-BY license can be downloaded from this link.
  • Aggregated data under a CC-BY-SA license can be downloaded from this link.
  • Aggregated data under a CC-BY-NC license can be downloaded from this link.
  • There are two data sources released under Google Terms of Service. To download or use the data, you must agree to the Google Terms of Service.
    • Google's Community Mobility Reports can be downloaded from this directory
    • Google's Search Trends Symptoms Dataset can be downloaded from this directory

Attributions and Licenses

Please see the Data Sources section of this README to note the attributions and licenses for each source.

Data Schema

Locations

Every location is assigned an open_covid_region_code, which is a unique hierarchical location code that can be used to join data across tables in this repository. The full list of locations that are assigned an open_covid_region_code can be found at data/exports/locations/locations.csv. Where available, we also provide a datacommons_id and wikidata_id field for each location.

Each open_covid_region_code has up to three levels:

  • The first-level region codes are ISO-3166-1 codes, e.g. IT for Italy
  • The second-level region codes are, by default, ISO-3166-2 codes. For example, US-AL for Alabama. However, in some locations, COVID-19 data is reported in administrative regions other than ISO-3166-2, so the choice of sub-country regions is informed partially by data availability.
  • Third-level regions include cities and counties. Within the United States, counties are coded using FIPS 6-4 codes.
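
As a rough illustration of the hierarchy, the sketch below splits a region code into its levels. It assumes that levels are joined with hyphens (as in US-AL) and that a third-level code appends one more hyphen-separated segment. The third-level code used in the usage note is hypothetical, and the RegionCode type and parse_region_code function are not part of this repository.

```python
from typing import NamedTuple, Optional


class RegionCode(NamedTuple):
    """Hypothetical container for the levels of an open_covid_region_code."""
    country: str                # first level, e.g. "IT"
    subdivision: Optional[str]  # second level, e.g. "US-AL", if present
    locality: Optional[str]     # full third-level code, if present


def parse_region_code(code: str) -> RegionCode:
    # Assumption: levels are separated by hyphens and there are at most three.
    parts = code.split("-")
    if not 1 <= len(parts) <= 3 or not all(parts):
        raise ValueError(f"unexpected region code: {code!r}")
    country = parts[0]
    subdivision = "-".join(parts[:2]) if len(parts) >= 2 else None
    locality = code if len(parts) == 3 else None
    return RegionCode(country, subdivision, locality)
```

For example, parse_region_code("US-AL") yields country "US" and subdivision "US-AL", which makes it easy to roll second-level rows up to their country. Check locations.csv for the actual third-level format before relying on this.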

Dates

All dates are mapped to ISO 8601 format during data loading, e.g. 2020-08-15.
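
A minimal sketch of this kind of normalization, using only the Python standard library. The function names here are illustrative stand-ins, not the helpers in date_utils.py, and the input formats are assumed examples.

```python
from datetime import date, datetime


def to_iso_date(raw: str, date_format: str) -> str:
    """Parse a date string in a source-specific format and return YYYY-MM-DD."""
    return datetime.strptime(raw, date_format).date().isoformat()


def to_iso_date_from_parts(year, month, day) -> str:
    """Build YYYY-MM-DD when a source stores year, month, and day separately."""
    return date(int(year), int(month), int(day)).isoformat()
```

For instance, to_iso_date("15/08/2020", "%d/%m/%Y") and to_iso_date_from_parts("2020", "8", "15") both produce the string 2020-08-15.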

For Data Owners

We have carefully checked the license and attribution information on each data source included in this repository, and in many cases have contacted the data owners directly to ask how they would like to be attributed.

If you are the owner of a data source included here and would like us to remove data, add or alter an attribution, or add or alter license information, please do not hesitate to email us at [email protected] and we will happily consider your request.

Development

If you would like to run the pipeline locally or to contribute to the codebase, here are instructions for installation and adding new data sources.

Installation

To install Python dependencies:

pip install pandas xlrd pyyaml python3-wget

Usage

To run the main script that runs the entire pipeline on the data that is in data/inputs:

python src/scripts/export_data.py

In addition, there are two scripts that can be run to fetch new data and write it into data/inputs.

To fetch data that can be automatically downloaded:

python src/scripts/fetch_automatic_downloads.py

To fetch data from a spreadsheet in data/inputs/scraped/spreadsheets/:

python src/scripts/fetch_scraped_data.py

Pipeline Structure

The pipeline is structured so that raw data is always fetched into data/inputs before being consumed by the rest of the pipeline. Data sources for each data type are then loaded into pandas dataframes with a standardized schema for dates, locations, and columns. These dataframes are joined into a single dataframe, which is then exported.
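
The join step can be pictured with the dependency-free sketch below. It is not the pipeline's code: the real pipeline works on pandas dataframes, whereas this version uses plain dictionaries, and the "date" key and the data column names in the usage note are assumptions made for illustration.

```python
from collections import defaultdict

KEY_COLUMNS = ("open_covid_region_code", "date")  # "date" is an assumed column name


def join_sources(*sources):
    """Outer-join lists of row dicts on (region code, date)."""
    joined = defaultdict(dict)
    for rows in sources:
        for row in rows:
            key = (row["open_covid_region_code"], row["date"])
            # Keep only the data columns; the key columns are re-added below.
            values = {k: v for k, v in row.items() if k not in KEY_COLUMNS}
            joined[key].update(values)
    return [
        {"open_covid_region_code": region, "date": day, **values}
        for (region, day), values in sorted(joined.items(), key=lambda kv: kv[0])
    ]
```

Given one source with a hypothetical hospitalized_current column and another with icu_current, rows that share a region code and date are merged into a single row. Rows present in only one source are kept with just the columns that source provides.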

Adding a new data source

Before adding a new data source, we go through an internal approval within Google to ensure compliance with licensing and terms. Once a data source is approved, you can add the data to the pipeline as follows:

1. Register new data types in src/config/data.yaml:
  • If the source includes a data type that isn't yet included in the data schema, register the data type in the schema by adding an entry to src/config/data.yaml.
2. Add a new yaml file to src/config/sources.
  • Specify the fetch parameters:
    • source_url: where to download the data
    • method: one of AUTOMATIC_DOWNLOAD, MANUAL_DOWNLOAD, SCRAPED, STATIC
    • file: filename for the data source
  • Specify the load parameters.
    • function: which function in load_functions.py to use to load the data. Most data sources can be loaded with default_load_function, but some data sources will have formatting that requires implementing a new function in load_functions.py.
    • read: data sources are read using the pandas.read_csv() or pandas.read_excel() functions. The read field accepts key/value parameters that are passed to the appropriate pandas read function.
    • dates:
      • columns: list of column names in the original data source that are passed as arguments to a function that returns the date in ISO 8601 format. This is often just a single column, but sometimes the year, month, and day are in separate columns in the original data.
      • date_format: the format of the date in the original data source
      • parse_function: most dates can be parsed using the default function in date_utils.py. If the data source has a date format that requires a parser that doesn't exist in date_utils.py, implement a separate function in that file.
    • regions:
      • mapping_keys: if a data source contains multiple regions but not ISO-3166 codes for the regions, the locations file at data/exports/locations/locations.csv must contain a column or list of columns that can uniquely map the locations in the data to the region_code for that location. The mapping_keys field takes key/value pairs where the key is the column in the locations file, and the value is the name of the column in the original data source.
  • Specify the data parameters:
    • These parameters follow the data schema specified in src/config/data.yaml, where the keys come from the data schema and the values are the column name in the original data source for the corresponding data.
  • Specify the attribution parameters. These are used to generate the data source section of the README. The fields for existing data sources serve as an example of what to include.
  • Specify the license parameters. These are used to generate the LICENSE file. The fields for existing data sources serve as an example of what to include.
  • Specify the cc_by and cc_by_sa fields: we produce two aggregated CSV files, one licensed under CC-BY and the other under CC-BY-SA. These fields specify whether the data can appear in each file.
3. Update docs and licenses:
  • When you run src/scripts/export_data.py, it will update the README.md as well as the LICENSE files within data/exports.
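
To make the mapping_keys idea from step 2 concrete, here is a small sketch of how such a mapping could be applied. It is an assumption-laden illustration rather than the repository's implementation: both function names are hypothetical, as are the region_name and Province column names in the usage note.

```python
def build_region_lookup(locations, mapping_keys):
    """Index location rows by the locations-file columns named in mapping_keys.

    locations: list of dicts, one per row of the locations file.
    mapping_keys: {column in locations file: column in original data source}.
    """
    location_columns = list(mapping_keys)
    lookup = {}
    for loc in locations:
        key = tuple(loc[col] for col in location_columns)
        if key in lookup:
            # The mapping columns must identify each location uniquely.
            raise ValueError(f"mapping keys are not unique: {key}")
        lookup[key] = loc["region_code"]
    return lookup


def assign_region_codes(rows, locations, mapping_keys):
    """Return copies of the source rows with a region_code attached."""
    lookup = build_region_lookup(locations, mapping_keys)
    source_columns = list(mapping_keys.values())
    # A source row with no matching location raises KeyError here.
    return [
        {**row, "region_code": lookup[tuple(row[col] for col in source_columns)]}
        for row in rows
    ]
```

With mapping_keys set to {"region_name": "Province"}, a source row whose Province is "Lombardia" is matched to the location row whose region_name is "Lombardia" and picks up that row's region_code. Failing loudly on duplicate or unmatched keys is a design choice in this sketch, so that mapping problems surface before export.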

Authors

This repository was created and is maintained by Katie Everett, Dan Nanas, Maddy Myers (UCSD), Sumit Arora, and Ian Fischer.

Data Sources

Australia

Source name: covid19data.com.au (link)
Link to data: https://www.covid19data.com.au/hospitalisations-icu
Description: Data is scraped manually from the charts provided at the source link. Data for Australia consists of time series data for current hospitalizations, ICU and ventilator cases.
License: Creative Commons Attribution 4.0 International (link)
Last accessed: 2020-08-31

COVIDTracking

Source name: COVID-19 Tracking Project (link)
Link to data: https://github.com/COVID19Tracking/covid-tracking-data/tree/master/data
Description: Data is downloaded automatically from the source link. Data for the United States consists of time series data for current and cumulative hospitalizations.
License: Apache 2.0 (link)
Last accessed: 2020-09-01

Colombia

Original data source: GOV.CO (link)
Link to original data: https://www.datos.gov.co/Salud-y-Protecci-n-Social/Casos-positivos-de-COVID-19-en-Colombia/gt2j-8ykr/data
Data aggregated by: COVID-19 Colombia (link)
License: Creative Commons Attribution-ShareAlike 4.0 International (link)
Last accessed: 2020-09-01

Czech Republic

Source name: National Health Information System, Regional Hygiene Stations, Ministry of Health of the Czech Republic (link)
Link to data: https://onemocneni-aktualne.mzcr.cz/covid-19
Description: Data is scraped manually from the charts provided at the source link. Data for the Czech Republic consists of time series data for current ICU cases, and current and cumulative hospitalizations.
Citation:

Komenda M., Karolyi M., Bulhart V., Žofka J., Brauner T., Hak J., Jarkovský J., Mužík J., Blaha M., Kubát J., Klimeš D., Langhammer P., Daňková Š ., Májek O., Bartůňková M., Dušek L. COVID ‑ 19: Přehled aktuální situace v ČR. Onemocnění aktuálně [online]. Praha: Ministerstvo zdravotnictví ČR, 2020 [cit. 25.04.2020]. Dostupné z: https://onemocneni-aktualne.mzcr.cz/covid-19. Vývoj: společné pracoviště ÚZIS ČR a IBA LF MU. ISSN 2694-9423.

Last accessed: 2020-08-31

Denmark

Source name: Statens Serum Institute (link)
Link to data: https://www.sst.dk/da/corona/tal-og-overvaagning
Description: Data is manually scraped from charts at the source link. Data for Denmark consists of time series data for current hospitalizations and ICU cases.
Last accessed: 2020-08-31

Finland

Source name: Finnish Institute for Health and Welfare (link)
Link to data: https://thl.fi/en/web/infectious-diseases/what-s-new/coronavirus-covid-19-latest-updates
License: Creative Commons Attribution 4.0 International (link)
Last accessed: 2020-08-31

France

Source name: data.gouv.fr (link)
Link to data: https://www.data.gouv.fr/en/datasets/donnees-hospitalieres-relatives-a-lepidemie-de-covid-19/
Description: Data is scraped manually from the charts provided at the source link. Data for France consists of time series data for cumulative hospitalizations and ICU cases.
License: Open License 2.0 (link)
Last accessed: 2020-09-01

Google's COVID19 Community Mobility Reports

Source name: Google's COVID19 Community Mobility Reports (link)
Link to data: https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv
Help Center: https://support.google.com/covid19-mobility
Description: These Community Mobility Reports aim to provide insights into what has changed in response to policies aimed at combating COVID-19. The reports chart movement trends over time by geography, across different categories of places.
Terms: In order to download or use the data or reports, you must agree to the Google Terms of Service.
License: Google Terms of Service (link)
Citation:

Google LLC "Google COVID-19 Community Mobility Reports".
https://www.google.com/covid19/mobility/ Accessed: <date>.

Last accessed: 2020-08-28

Google's COVID19 Search Trends symptoms dataset

Source name: Google's COVID19 Search Trends symptoms dataset (link)
Link to data: http://goo.gle/covid19symptomdataset
Description: The COVID-19 Search Trends symptoms dataset shows aggregated, anonymized trends in Google searches for symptoms, signs and some health conditions. The dataset provides a daily or weekly time series for each region showing the relative volume of searches for each symptom.
Terms: In order to download or use the data or reports, you must agree to the Google Terms of Service.
License: Google Terms of Service (link)
Citation:

Google LLC "Google COVID-19 Search Trends symptoms dataset".
http://goo.gle/covid19symptomdataset, Accessed: <date>.

Last accessed: 2020-08-30

Iceland

Source name: Directorate of Health in Iceland (Embaetti landlaeknis) (link)
Link to data: https://www.covid.is/data
Description: Data is downloaded manually from the source link. Data for Iceland consists of time series data for current ICU cases, and current and cumulative hospitalizations.
Last accessed: 2020-06-22

Ireland

Source name: Health Protection Surveillance Centre (link)
Link to data: https://www.hpsc.ie/a-z/respiratory/coronavirus/novelcoronavirus/casesinireland/epidemiologyofcovid-19inireland/
Description: Data is scraped manually from daily situation reports. Data for Ireland consists of time series data for cumulative hospitalizations.
License: Creative Commons Attribution ShareAlike 3.0 (link)
Last accessed: 2020-08-31

Italy

Source name: Dipartimento della Protezione Civile (link)
Link to data: https://github.com/pcm-dpc/COVID-19
Description: Data is downloaded automatically from the source repository. Data for Italy consists of time series data for current hospitalizations, but we can also compute cumulative hospitalizations.
License: Creative Commons Attribution 4.0 International (link)
Last accessed: 2020-09-01

Japan

Source name: Toyo Keizai Online (link)
Link to data: https://github.com/kaz-ogiwara/covid19
Copyright notice: Copyright (c) 2020 Kazuki OGIWARA / 荻原 和樹
Description: Data is downloaded automatically from the source repository. Data for Japan consists of time series data for current hospitalizations and ICU cases.
License: MIT (link)
Last accessed: 2020-08-03

Luxembourg

Source name: Luxembourg Ministry of Health (link)
Link to data: https://data.public.lu/fr/datasets/donnees-covid19/#_
Description: Data is downloaded automatically from the source link. Data for Luxembourg consists of time series data for current hospitalizations and ICU cases.
License: Creative Commons Zero 1.0 Universal (link)
Last accessed: 2020-09-01

Moldova

Source name: Ministry of Health, Labour and Social Protection (link)
Link to data: https://msmps.gov.md/ro/advanced-page-type/comunicate-de-presa
Last accessed: 2020-08-31

Netherlands

Source name: National Institute for Public Health and The Environment (link)
Link to data: https://www.rivm.nl/coronavirus-covid-19/grafieken
Description: Data is downloaded manually from the source link. Data for the Netherlands consists of time series data for current hospitalizations.
Last accessed: 2020-06-29

New Zealand

Source name: New Zealand Ministry of Health (link)
Link to data: https://www.health.govt.nz/our-work/diseases-and-conditions/covid-19-novel-coronavirus/covid-19-current-situation/covid-19-current-cases
Last accessed: 2020-08-31

Norway

Source name: Norwegian Institute of Public Health (link)
Link to data: https://www.fhi.no/en/id/infectious-diseases/coronavirus/daily-reports/daily-reports-COVID19/
Last accessed: 2020-06-22

Our World in Data

Source name: Our World in Data (link)
Link to data: https://github.com/owid/covid-19-data/tree/master/public/data
License: Creative Commons Attribution 4.0 International (link)
Citation:

Data from Our World in Data has been collected, aggregated, and documented by Diana Beltekian, Daniel Gavrilov, Charlie Giattino, Joe Hasell, Bobbie Macdonald, Edouard Mathieu, Esteban Ortiz-Ospina, Hannah Ritchie, and Max Roser.

Last accessed: 2020-09-01

Oxford Covid-19 Government Response Tracker

Source name: Oxford Covid-19 Government Response Tracker (link)
Link to data: https://github.com/OxCGRT/covid-policy-tracker/blob/master/data/OxCGRT_latest.csv
License: Creative Commons Attribution 4.0 International (link)
Citation:

Thomas Hale, Sam Webster, Anna Petherick, Toby Phillips, and Beatriz Kira. (2020). Oxford COVID-19 Government Response Tracker. Blavatnik School of Government.

Last accessed: 2020-09-01

Philippines

Source name: Philippines Department of Health (link)
Link to data: http://www.doh.gov.ph/covid19tracker
Last accessed: 2020-08-31

Spain

Source name: Ministerio de Sanidad, Consumo y Bienestar Social (link)
Link to data: https://cnecovid.isciii.es/covid19/resources/agregados.csv
Description: The data is downloaded automatically from the source link. Due to regional differences in hospitalization reporting, we do not aggregate across regions to produce country-level statistics for Spain.
Last accessed: 2020-09-01

Sweden

Source name: Public Health Agency of Sweden (link)
Link to data: https://www.arcgis.com/sharing/rest/content/items/b5e7488e117749c19881cce45db13f7e/data
Description: Data is downloaded automatically from the source link. Data for Sweden consists of time series data for current ICU cases.
Last accessed: 2020-09-01

Switzerland

Source name: Switzerland Federal Office of Public Health BAG (link)
Link to data: https://www.bag.admin.ch/bag/de/home/krankheiten/ausbrueche-epidemien-pandemien/aktuelle-ausbrueche-epidemien/novel-cov/situation-schweiz-und-international.html
Last accessed: 2020-06-29

The New York Times

Source name: The New York Times COVID-19 Data (link)
Link to data: https://github.com/nytimes/covid-19-data
License: Creative Commons Attribution-NonCommercial 4.0 International (link)
Citation:

Data from The New York Times, based on reports from state and local health agencies.

Last accessed: 2020-09-01

United Kingdom

Source name: GOV.UK (link)
Link to data: https://www.gov.uk/government/publications/
Description: Data is downloaded manually from the publications provided at the source link. Data is aggregated across regions in England and reported at the country level for England, Scotland, Wales and Northern Ireland. Data consists of time series data for current hospitalizations.
License: Open Government License 3.0 (link)
Last accessed: 2020-06-23
