Adding a regional data source
Thank you for contributing to covidregionaldata! Please make sure you have read our contributing guide before reading on (see it here).
If you are adding data for an individual country, read on. If you wish to add national level data (data spanning multiple countries), check out the guide on adding national level data in addition to reading this.
Our datasets are implemented using R6 methods. You can read more about these methods here.
This document will not describe in detail the mechanics of the top-level DataClass class, but the main (and sometimes only) thing you need to do to add a new data source to covidregionaldata is to create a new class which inherits from DataClass and "fills out" the framework it provides. Lots of the hard work of downloading, processing and returning data is done by functions in the DataClass ancestor or by additional functions in utils.R.
With this basis, there are two approaches to making a new class. The first is to work from the CountryTemplate.R template. The second is to model your class on an existing one.
To help get you started we have provided a template here. As a first step, copy this template into the R folder of your local copy of covidregionaldata and rename it to the country or data source you are adding support for. You should also rename all the CountryTemplate uses in the template (using camel case in code, or title case where written as text). For the next steps, see here for an example of a simple dataset class and here for a complex class.
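For orientation, a new source class might look roughly like the sketch below. This is not the package's actual template: the class name, URL and the raw column name (province) are placeholders, and CountryTemplate.R defines further fields that are omitted here. Running it standalone would require covidregionaldata (for DataClass) and the pipe from the package imports.

# hypothetical skeleton of a new source class; names and URL are placeholders
MyCountry <- R6::R6Class("MyCountry",
  inherit = DataClass,
  public = list(
    # named list of URLs fetched by DataClass$download into self$data$raw
    common_data_urls = list(
      main = "https://example.org/covid19-cases.csv"
    ),
    # turn the raw tibble(s) into a single cleaned tibble in self$data$clean
    clean_common = function() {
      self$data$clean <- self$data$raw$main %>%
        dplyr::mutate(date = lubridate::as_date(date)) %>%
        dplyr::rename(region_level_1 = province) # 'province' is a placeholder
    }
  )
)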
covidregionaldata already provides data from several different sources, and classes have been written to handle a variety of inputs. Someone has probably already done something similar to what you need to do; in that case you can see which steps have been put into each method, and which methods have been left out. This approach lets you start with a framework that already has the parts you need, and you will be able to see what other classes have done in each function to get an idea of what you need to do for yours.
- Q1: Do you have data only for level 1?
  - Yes
    - Model based on Cuba or Canada or SouthAfrica or Italy or India [or ...]
  - No
    - Q2: Is your data available as separate sources for levels 1 and 2?
      - Model on USA, look at France
    - Q3: Is your data only available for level 2 and then aggregated to get level 1 data?
      - Model on Lithuania or Brazil or Germany
    - Q4: Is some data available only at certain levels?
      - Model on Belgium (see also France)
    - Q5: Are you doing something different from all the others?
      - Look at UK, but this probably isn't what you want to do.
Most users will never create DataClass objects or interact with them directly, but will rely on their work when they call get_regional_data. When get_regional_data is called for a particular source, the source is identified and an object of the correct class is created. The params specified in the call to get_regional_data are transferred into the class's internal fields (e.g. self$verbose) and then get is called, which in turn calls the following methods in the following hierarchical order.
- get
  - download
  - clean
    - clean_common
    - clean_level_1 OR clean_level_2
  - process
  - filter
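For context, this chain is triggered by an ordinary user call; a hedged example (check the package documentation for the full set of arguments):

library(covidregionaldata)

# a typical user call that drives the chain above (Italy, level 1 regions)
italy <- get_regional_data(country = "Italy", level = "1")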
In most cases you will need to implement at least clean_common. clean_common and the level-specific clean functions (where they exist) should take the data from self$data$raw (where there may be several named tibbles coming from separate data sources) and put it into self$data$clean as a single tibble, with the data columns renamed and formatted (conversion of dates, standardization of names) and the columns level_1_region_code and region_level_1 present. For level 2 data, region_level_2 is also required; level_2_region_code is optional.
You may only need to provide one of clean_level_1 and clean_level_2; these are called after clean_common, so any cleaning logic common to both levels should go in clean_common.
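As a concrete sketch, a clean_common along these lines reads the single raw tibble and produces the required columns. The raw column names (fecha, provincia, casos) and the region_codes lookup are hypothetical, not from any real source.

clean_common = function() {
  self$data$clean <- self$data$raw$main %>%
    dplyr::transmute(
      date = lubridate::as_date(fecha),  # hypothetical raw column
      region_level_1 = provincia,        # hypothetical raw column
      cases_new = as.numeric(casos)      # hypothetical raw column
    ) %>%
    # region_codes is a hypothetical lookup tibble mapping region_level_1
    # to level_1_region_code (see the ISO 3166-2 example further down)
    dplyr::left_join(region_codes, by = "region_level_1")
}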
You will only need a custom download method if your data is not available at static URLs in CSV format (as in the code for Mexico). If you provide a new method for download, call super$download() first within it.
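If you do need one, a sketch might look like this; the post-processing step is purely illustrative.

download = function() {
  # run the standard DataClass download first so self$data$raw is populated
  super$download()
  # then apply any source-specific fix-ups, e.g. dropping a stray first row
  self$data$raw$main <- dplyr::slice(self$data$raw$main, -1)
}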
process does much of the work of filling in NA values, adding empty columns and providing totals where requested. You should not need to write new methods for clean, filter or process.
You need an open and accessible data source, preferably in the form of a CSV file that is updated on a regular basis and can be downloaded from a fixed (or predictable) URL. You will place these URLs into the common_data_urls named list, into the level-specific data URL named lists, or into some combination of the two.
DataClass$download will download all files listed in common_data_urls and place the contents of each into self$data$raw, each as a tibble named after the corresponding entry in the common_data_urls list, so
common_data_urls = list(
"main" = "https://covid19cubadata.github.io/data/covid19-casos.csv"
)
results in the data from https://covid19cubadata.github.io/data/covid19-casos.csv being downloaded and placed into self$data$raw$main.
Below is a list of the columns which get_regional_data will return for each country. You probably won't have data for all these columns, and you do not need to generate empty or NA columns. Gaps in your data will be filled with NA, and cumulative sums will be calculated where necessary.
At a minimum, your get_regional_data_* function should provide date, one of region_level_1 or region_level_2 (as appropriate), one of level_1_region_code or level_2_region_code (as appropriate), and at least one of cases_new, cases_total, deaths_new, deaths_total, recovered_new, recovered_total, tested_new and tested_total.
- date: the date that the counts were reported (YYYY-MM-DD).
- region_level_1: the level 1 region name. This column will be named differently for different countries (e.g. state, province), but this renaming is done by the function which calls your get_regional_data_* function, based on what is present in get_info_covidregionaldata (see below).
- level_1_region_code: a standard code for the level 1 region. The column name reflects the specific administrative code used. Typically data returns the ISO 3166-2 standard, although where this is not available the column will be named differently to reflect its source.
- region_level_2: the level 2 region name. This column will be named differently for different countries (e.g. city, county). This renaming is done by DataClass functions based on what is stored in the class.
- level_2_region_code: a standard code for the level 2 region. The column will be named differently for different countries (e.g. fips in the USA).
- cases_new: new reported cases for that day
- cases_total: total reported cases up to and including that day
- deaths_new: new reported deaths for that day
- deaths_total: total reported deaths up to and including that day
- recovered_new: new reported recoveries for that day
- recovered_total: total reported recoveries up to and including that day
- hosp_new: new reported hospitalisations for that day
- hosp_total: total reported hospitalisations up to and including that day (note this is the cumulative total of new reported hospitalisations, not the total currently in hospital)
- tested_new: tests for that day
- tested_total: total tests completed up to and including that day
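Putting the minimum requirements together, a cleaned level 1 tibble might look like the following. The values are made up for illustration, using the Lithuanian counties listed later in this guide.

# illustrative example of a minimal cleaned tibble (values are made up)
tibble::tibble(
  date = lubridate::ymd(c("2021-03-01", "2021-03-02")),
  region_level_1 = c("Alytus County", "Kaunas County"),
  level_1_region_code = c("LT-AL", "LT-KU"),
  cases_new = c(12, 7)
)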
The R6 structure of DataClass means that you probably don't have to write code to download the data. By listing the source URLs for CSV files of your data in the common_data_urls and/or the level_data_urls named lists, you invoke a generic download function which downloads the contents of each URL and places it into self$data$raw$[name].
Looking at an example from Belgium, the following instructs the download function to download data on cases (main) and hospitalisations (hosp) into self$data$raw$main and self$data$raw$hosp for every invocation. If data for level 1 is being generated, then self$data$raw$deaths is also filled with the contents of the specified CSV file.
#' @field common_data_urls List of named links to raw data that are common
#' across levels.
common_data_urls = list(
"main" = "https://epistat.sciensano.be/Data/COVID19BE_CASES_AGESEX.csv",
"hosp" = "https://epistat.sciensano.be/Data/COVID19BE_HOSP.csv"
),
#' @field level_data_urls List of named links to raw data specific to
#' each level of regions. For Belgium, there are only additional data for
#' level 1 regions.
level_data_urls = list(
"1" = list(
"deaths" = "https://epistat.sciensano.be/Data/COVID19BE_MORT.csv"
)
),
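With fields like these in place, a level-specific clean method can then use the extra tibble. The following is a simplified sketch rather than the package's actual Belgium code; the DATE, REGION and DEATHS column names should be checked against the source file.

clean_level_1 = function() {
  # only level 1 runs have self$data$raw$deaths, per level_data_urls above
  deaths <- self$data$raw$deaths %>%
    dplyr::mutate(DATE = lubridate::as_date(DATE)) %>%
    dplyr::group_by(DATE, REGION) %>%
    dplyr::summarise(deaths_new = sum(DEATHS), .groups = "drop")

  # add the level 1 deaths onto the data already cleaned by clean_common
  self$data$clean <- self$data$clean %>%
    dplyr::left_join(deaths, by = c("date" = "DATE", "region_level_1" = "REGION"))
}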
For more complex examples, have a look at the code for Mexico and the UK.
Clean your data. You should probably use lubridate::as_date (or another suitable function) to generate the date field. You may need to convert or adjust local region names. You may want to remove fields which are of no use to the end user (e.g. codes used for regions which do not correspond to the ISO 3166 standard or to the codes you will return). General practice is to only return the data fields which covidregionaldata processes and provides, as listed above. If you don't have all these fields, they will be calculated where possible or replaced with NA where appropriate.
Ideally we use ISO 3166-2 codes for sub-national regions at levels 1 and 2. Wikipedia provides a list of countries with the ISO 3166-2 codes available. The ISO provides an online browsing platform of country codes giving authoritative data, but the results mix levels 1 and 2 making it less straightforward for conversion and use.
The following code, adjusted from a version for France, was initially used to create lookup tables of Lithuanian municipality and county codes.
iso_3166_2_url <- "https://en.wikipedia.org/wiki/ISO_3166-2:LT"
iso_3166_2_table <- iso_3166_2_url %>%
xml2::read_html() %>%
rvest::html_nodes(xpath = '//*[@id=\"mw-content-text\"]/div/table') %>%
rvest::html_table(fill = TRUE)
iso_3166_2_table
#> [[1]]
#> # A tibble: 10 x 3
#> Code `Subdivision Name (lt)` `Subdivision Name (en)[note 1]`
#> <chr> <chr> <chr>
#> 1 LT-AL Alytaus apskritis Alytus County
#> 2 LT-KU Kauno apskritis Kaunas County
#> 3 LT-KL Klaipėdos apskritis Klaipėda County
#> 4 LT-MR Marijampolės apskritis Marijampolė County
#> 5 LT-PN Panevėžio apskritis Panevėžys County
#> 6 LT-SA Šiaulių apskritis Šiauliai County
#> 7 LT-TA Tauragės apskritis Tauragė County
#> 8 LT-TE Telšių apskritis Telšiai County
#> 9 LT-UT Utenos apskritis Utena County
#> 10 LT-VL Vilniaus apskritis Vilnius County
#>
#> [[2]]
#> # A tibble: 60 x 3
#> Code `Subdivision name` `Subdivision category`
#> <chr> <chr> <chr>
#> 1 LT-01 Akmenė district municipality
#> 2 LT-02 Alytaus miestas city municipality
#> 3 LT-03 Alytus district municipality
#> 4 LT-04 Anykščiai district municipality
#> 5 LT-05 Birštono municipality
#> 6 LT-06 Biržai district municipality
#> 7 LT-07 Druskininkai municipality
#> 8 LT-08 Elektrėnai municipality
#> 9 LT-09 Ignalina district municipality
#> 10 LT-10 Jonava district municipality
#> # … with 50 more rows
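From here, the county table can be turned into a lookup and joined onto your cleaned data. A sketch follows; cleaned is a hypothetical tibble that already contains region_level_1.

# build a name -> code lookup from the first (county-level) table
county_codes <- iso_3166_2_table[[1]] %>%
  dplyr::select(
    level_1_region_code = Code,
    region_level_1 = `Subdivision Name (en)[note 1]`
  )

# attach ISO 3166-2 codes to cleaned data that already has region_level_1
cleaned <- cleaned %>%
  dplyr::left_join(county_codes, by = "region_level_1")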
lintr will pedantically check your code for style. Note that the default line length for covidregionaldata is now set to 120, so you can ignore some of the warnings about line lengths.
lintr("R/CountryName.R")
styler will apply most of the fixes which you would otherwise have to do by hand to make lintr happier with your code.
styler::style_file("R/CountryName.R")
Your data source or your region names may use non-ASCII characters. The prefixer RStudio add-in has a handy tool for converting non-ASCII characters to escaped versions.
This is a work in progress. Please comment in this issue if interested in expanding this guide.