Using str_c(), str_glue(), and unite() to combine strings.
-
-
Using str_order() to arrange strings.
-
-
Using str_split() and separate() to split strings.
+
Using str_c(), str_glue(), and unite() to combine strings
+
Using str_order() to arrange strings
+
Using str_split() and separate() to split strings
@@ -959,7 +952,7 @@
Dynamic strings
str_glue("Data include {nrow(linelist)} cases and are current to {format(Sys.Date(), '%d %b %Y')}.")
-
Data include 5888 cases and are current to 30 Sep 2024.
+
Data include 5888 cases and are current to 18 Oct 2024.
An alternative format is to use placeholders within the brackets and define the code in separate arguments at the end of the str_glue() function, as below. This can improve code readability if the text is long.
Linelist as of 18 Oct 2024.
Last case hospitalized on 30 Apr 2015.
256 cases are missing date of onset and not shown
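As a minimal sketch of the placeholder form described above, the example below uses a small made-up data frame. The object `demo_linelist`, its columns, and the placeholder names are assumptions for illustration only, not the handbook's linelist.

```r
library(stringr)

# Hypothetical toy data standing in for the linelist
demo_linelist <- data.frame(
  case_id    = 1:4,
  date_onset = as.Date(c("2015-04-01", NA, "2015-04-10", NA))
)

# Placeholders inside the braces are defined as named arguments at the end
msg <- str_glue(
  "Linelist has {n_cases} cases.\n{n_missing_onset} cases are missing date of onset and not shown",
  n_cases         = nrow(demo_linelist),
  n_missing_onset = sum(is.na(demo_linelist$date_onset))
)

msg
```

Keeping the calculations in named arguments leaves the text itself short and easier to proofread.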
@@ -987,8 +980,8 @@
Dynamic strings
-
-
+
+
Use str_glue_data(), which is specially made for taking data from data frame rows:
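A minimal sketch of str_glue_data(), assuming a small made-up data frame (`case_table` and its columns are hypothetical): each row of the data frame produces one string.

```r
library(stringr)

# Hypothetical toy data frame (not the handbook's example data)
case_table <- data.frame(
  zone      = c("Zone 1", "Zone 2"),
  new_cases = c(3, 0)
)

# Column names in braces are looked up in the data frame, row by row
zone_text <- str_glue_data(case_table, "{zone}: {new_cases} new cases")

zone_text
```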
@@ -1053,8 +1046,8 @@
Unite columns
Here is the example data frame:
-
-
+
+
Below, we unite the three symptom columns:
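The sketch below shows one way unite() from tidyr could be applied to three symptom columns. The data frame `symptom_df`, its values, and the new column name `all_symptoms` are assumptions made for illustration.

```r
library(tidyr)

# Hypothetical toy data; column names and values are assumptions
symptom_df <- data.frame(
  case_ID = 1:3,
  sym_1   = c("jaundice", "chills", NA),
  sym_2   = c("fever", NA, "cough"),
  sym_3   = c("chills", "aches", NA),
  outcome = c("Recover", "Death", "Recover")
)

united <- unite(
  symptom_df,
  col = "all_symptoms",            # name of the new united column
  c("sym_1", "sym_2", "sym_3"),    # columns to unite
  sep    = ", ",                   # separator placed between values
  remove = TRUE,                   # drop the input columns
  na.rm  = TRUE                    # leave NA values out of the united string
)

united
```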
@@ -1168,8 +1161,8 @@
Split columns
Let’s say we have a simple data frame df (defined and united in the unite section) containing a case_ID column, one character column with many symptoms, and one outcome column. Our goal is to separate the symptoms column into many columns - each one containing one symptom.
-
-
+
+
Assuming the data are piped into separate(), first provide the column to be separated. Then provide into = as a vector c( ) containing the new column names, as shown below.
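A minimal sketch of this call, assuming a made-up data frame (`symptom_df`, its values, and the new column names are hypothetical). The fill = "right" argument is an addition here so that rows with fewer symptoms are padded with NA.

```r
library(tidyr)

# Hypothetical toy data with several symptoms in one character column
symptom_df <- data.frame(
  case_ID  = 1:2,
  symptoms = c("jaundice, fever, chills", "chills, aches"),
  outcome  = c("Recover", "Death")
)

split_df <- separate(
  symptom_df,
  symptoms,                              # column to be separated
  into = c("sym_1", "sym_2", "sym_3"),   # names of the new columns
  sep  = ", ",                           # pattern to split on
  fill = "right"                         # pad short rows with NA on the right
)

split_df
```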
@@ -1433,20 +1426,17 @@
Extract by character position
Use str_sub() to return only a part of a string. The function takes three main arguments:
-
the character vector(s).
-
-
start position.
-
end position.
+
The character vector(s)
+
Start position
+
End position
A few notes on position numbers:
-
If a position number is positive, the position is counted starting from the left end of the string.
-
-
If a position number is negative, it is counted starting from the right end of the string.
-
-
Position numbers are inclusive.
+
If a position number is positive, the position is counted starting from the left end of the string
-
Positions extending beyond the string will be truncated (removed).
+
If a position number is negative, it is counted starting from the right end of the string
+
Position numbers are inclusive
+
Positions extending beyond the string will be truncated (removed)
Below are some examples applied to the string “pneumonia”:
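A short sketch of how the position rules play out on this string (the particular start and end values are chosen here for illustration):

```r
library(stringr)

str_sub("pneumonia", 1, 3)    # positions 1 to 3 from the left: "pne"
str_sub("pneumonia", 2, 2)    # start and end are inclusive: "n"
str_sub("pneumonia", -3, -1)  # negative positions count from the right: "nia"
str_sub("pneumonia", 4, 15)   # an end beyond the string is truncated: "umonia"
```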
@@ -1859,13 +1849,10 @@
-
Character sets.
-
-
Meta characters.
-
-
Quantifiers.
-
-
Groups.
+
Character sets
+
Meta characters
+
Quantifiers
+
Groups
Character sets
Character sets are a way of listing options for a character match, within brackets. A match is triggered if any of the characters within the brackets is found in the string. For example, to look for vowels one could use this character set: “[aeiou]”. Some other common character sets are:
@@ -1948,15 +1935,11 @@
will return instances of two capital A letters.
-
-
"A{2,4}" will return instances of between two and four capital A letters (do not put spaces!).
-
-
"A{2,}" will return instances of two or more capital A letters.
-
-
"A+" will return instances of one or more capital A letters (group extended until a different character is encountered).
-
-
Precede with an * asterisk to return zero or more matches (useful if you are not sure the pattern is present).
+
"A{2}" will return instances of two capital A letters
+
"A{2,4}" will return instances of between two and four capital A letters (do not put spaces!)
+
"A{2,}" will return instances of two or more capital A letters
+
"A+" will return instances of one or more capital A letters (group extended until a different character is encountered)
+
Precede with an * asterisk to return zero or more matches (useful if you are not sure the pattern is present)
Using the + plus symbol as a quantifier, the match will continue until a different character is encountered. For example, this expression will return all words (alpha characters): "[A-Za-z]+"
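The quantifiers above can be tried on a small test string. This is a sketch only; the strings and object names are made up for illustration, and str_extract_all() from stringr is used to show every match.

```r
library(stringr)

x <- "A AA AAAAA"

exactly_two <- str_extract_all(x, "A{2}")[[1]]    # "AA" "AA" "AA"
two_to_four <- str_extract_all(x, "A{2,4}")[[1]]  # "AA" "AAAA"
two_or_more <- str_extract_all(x, "A{2,}")[[1]]   # "AA" "AAAAA"
one_or_more <- str_extract_all(x, "A+")[[1]]      # "A" "AA" "AAAAA"

# All words (runs of alpha characters) in a sentence
words <- str_extract_all("I have 2 cats", "[A-Za-z]+")[[1]]  # "I" "have" "cats"
```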
Such chains utilize dplyr “verb” functions and the magrittr pipe operator %>%. This pipe begins with the “raw” data (“linelist_raw.xlsx”) and ends with a “clean” R data frame (linelist) that can be used, saved, exported, etc.
In a cleaning pipeline the order of the steps is important. Cleaning steps might include:
-
Importing of data.
-
-
Column names cleaned or changed.
-
-
De-duplication.
-
-
Column creation and transformation (e.g. re-coding or standardising values).
-
-
Rows filtered or added.
+
Importing of data
+
Column names cleaned or changed
+
De-duplication
+
Column creation and transformation (e.g. re-coding or standardising values)
+
Rows filtered or added
@@ -1033,8 +1029,8 @@
Import
You can view the first 50 rows of the data frame below. Note: the base R function head() allows you to view just the first n rows in the R console.
-
-
+
+
@@ -1480,19 +1476,16 @@
Other statistical software such as SAS and STATA use “labels” that co-exist as longer printed versions of the shorter column names. While R does offer the possibility of adding column labels to the data, this is not emphasized in most practice. To make column names “printer-friendly” for figures, one typically adjusts their display within the plotting commands that create the outputs (e.g. axis or legend titles of a plot, or column headers in a printed table - see the scales section of the ggplot tips page and Tables for presentation pages). If you want to assign column labels in the data, read more online here and here.
Because R column names are used very often, they must have “clean” syntax. We suggest the following:
-
Short names.
-
No spaces (replace with underscores _ ).
-
No unusual characters (&, #, <, >, …).
-
-
Similar style nomenclature (e.g. all date columns named like date_onset, date_report, date_death…).
+
Short names
+
No spaces (replace with underscores _ )
+
No unusual characters (&, #, <, >, …)
+
Similar style nomenclature (e.g. all date columns named like date_onset, date_report, date_death…)
The column names of linelist_raw are printed below using names() from base R. We can see that initially:
-
Some names contain spaces (e.g. infection date).
-
-
Different naming patterns are used for dates (date onset vs. infection date).
-
-
There must have been a merged header across the two last columns in the .xlsx. We know this because the name of two merged columns (“merged_header”) was assigned by R to the first column, and the second column was assigned a placeholder name “…28” (as it was then empty and is the 28th column).
+
Some names contain spaces (e.g. infection date)
+
Different naming patterns are used for dates (date onset vs. infection date)
+
There must have been a merged header across the two last columns in the .xlsx. We know this because the name of two merged columns (“merged_header”) was assigned by R to the first column, and the second column was assigned a placeholder name “…28” (as it was then empty and is the 28th column)
names(linelist_raw)
@@ -1511,15 +1504,11 @@
Automatic cleaning
The function clean_names() from the package janitor standardizes column names and makes them unique by doing the following:
-
Converts all names to consist of only underscores, numbers, and letters.
-
-
Accented characters are transliterated to ASCII (e.g. German o with umlaut becomes “o”, Spanish “enye” becomes “n”).
-
-
Capitalization preference for the new column names can be specified using the case = argument (“snake” is default, alternatives include “sentence”, “title”, “small_camel”…).
-
-
You can specify specific name replacements by providing a vector to the replace = argument (e.g. replace = c(onset = "date_of_onset")).
-
Converts all names to consist of only underscores, numbers, and letters
+
Accented characters are transliterated to ASCII (e.g. German o with umlaut becomes “o”, Spanish “enye” becomes “n”)
+
Capitalization preference for the new column names can be specified using the case = argument (“snake” is default, alternatives include “sentence”, “title”, “small_camel”…)
+
You can specify specific name replacements by providing a vector to the replace = argument (e.g. replace = c(onset = "date_of_onset"))
Below, the cleaning pipeline begins by using clean_names() on the raw linelist.
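As a small sketch of what clean_names() does with its defaults, the example below uses a made-up data frame with untidy names (`raw_names_df` and its columns are assumptions, not the handbook's linelist_raw).

```r
library(janitor)

# Hypothetical data frame with a space and mixed capitalization in its names
raw_names_df <- data.frame(
  `infection date` = "2014-05-08",
  `Date Onset`     = "2014-05-13",
  check.names = FALSE   # keep the untidy names exactly as typed
)

cleaned <- clean_names(raw_names_df)

names(cleaned)   # "infection_date" "date_onset"
```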
@@ -1606,11 +1595,9 @@
Transition to R, merged cells can be nice for human reading of data, but are not “tidy data” and cause many problems for machine reading of data. R cannot accommodate merged cells.
Remind people doing data entry that human-readable data is not the same as machine-readable data. Strive to train users about the principles of tidy data. If at all possible, try to change procedures so that data arrive in a tidy format without merged cells.
-
Each variable must have its own column.
-
-
Each observation must have its own row.
-
-
Each value must have its own cell.
+
Each variable must have its own column
+
Each observation must have its own row
+
Each value must have its own cell
When using rio’s import() function, the value in a merged cell will be assigned to the first cell and subsequent cells will be empty.
One solution to deal with merged cells is to import the data with the function readWorkbook() from the package openxlsx. Set the argument fillMergedCells = TRUE. This gives the value in a merged cell to all cells within the merge range.
@@ -1683,35 +1670,35 @@
“tidyselect
Here are other “tidyselect” helper functions that also work within dplyr functions like select(), across(), and summarise():
-
everything() - all other columns not mentioned.
+
everything() - all other columns not mentioned
-
last_col() - the last column.
-
where() - applies a function to all columns and selects those which are TRUE.
+
last_col() - the last column
+
where() - applies a function to all columns and selects those which are TRUE
-
contains() - columns containing a character string.
+
contains() - columns containing a character string
-
example: select(contains("time")).
+
example: select(contains("time"))
-
starts_with() - matches to a specified prefix.
+
starts_with() - matches to a specified prefix
-
example: select(starts_with("date_")).
+
example: select(starts_with("date_"))
-
ends_with() - matches to a specified suffix.
+
ends_with() - matches to a specified suffix
-
example: select(ends_with("_post")).
+
example: select(ends_with("_post"))
-
matches() - to apply a regular expression (regex).
+
matches() - to apply a regular expression (regex)
-
example: select(matches("[pt]al")).
+
example: select(matches("[pt]al"))
-
num_range() - a numerical range like x01, x02, x03.
+
num_range() - a numerical range like x01, x02, x03
-
any_of() - matches IF column exists but returns no error if it is not found.
+
any_of() - matches IF column exists but returns no error if it is not found
In addition, use normal operators such as c() to list several columns, : for consecutive columns, ! for opposite, & for AND, and | for OR.
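A few of these helpers are sketched below on a made-up data frame (`demo` and its columns are assumptions for illustration). Only the resulting column names are kept, to make the selection easy to see.

```r
library(dplyr)

# Hypothetical toy data
demo <- data.frame(
  case_id     = c("a1", "b2"),
  date_onset  = as.Date(c("2014-05-13", "2014-05-14")),
  date_report = as.Date(c("2014-05-15", "2014-05-16")),
  age         = c(2, 31)
)

date_cols    <- demo %>% select(starts_with("date_")) %>% names()
numeric_cols <- demo %>% select(where(is.numeric)) %>% names()
safe_cols    <- demo %>% select(any_of(c("age", "not_a_column"))) %>% names()
not_dates    <- demo %>% select(!contains("date")) %>% names()
age_first    <- demo %>% select(age, everything()) %>% names()
```

Note that any_of() silently skips "not_a_column", whereas naming a missing column directly in select() would raise an error.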
@@ -1901,8 +1888,8 @@
New columns
Review the new columns. For demonstration purposes, only the new columns and the columns used to create them are shown:
-
-
+
+
TIP: A variation on mutate() is the function transmute(). This function adds a new column just like mutate(), but also drops/removes all other columns that you do not mention within its parentheses.
@@ -1995,13 +1982,11 @@
a
across() functions
You can read the documentation with ?across for details on how to provide functions to across(). A few summary points: there are several ways to specify the function(s) to perform on a column and you can even define your own functions:
-
You can provide the function name alone (e.g. mean or as.character).
-
-
You can provide the function in purrr-style (e.g. ~ mean(.x, na.rm = TRUE)) (see this page).
-
-
You can specify multiple functions by providing a list (e.g. list(mean = mean, n_miss = ~ sum(is.na(.x)))).
+
You can provide the function name alone (e.g. mean or as.character)
+
You can provide the function in purrr-style (e.g. ~ mean(.x, na.rm = TRUE)) (see this page)
+
You can specify multiple functions by providing a list (e.g. list(mean = mean, n_miss = ~ sum(is.na(.x))))
-
If you provide multiple functions, multiple transformed columns will be returned per input column, with unique names in the format col_fn. You can adjust how the new columns are named with the .names = argument using glue syntax (see page on Characters and strings) where {.col} and {.fn} are shorthand for the input column and function.
+
If you provide multiple functions, multiple transformed columns will be returned per input column, with unique names in the format col_fn. You can adjust how the new columns are named with the .names = argument using glue syntax (see page on Characters and strings) where {.col} and {.fn} are shorthand for the input column and function
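The points above can be sketched with a made-up data frame (`demo` and its columns are assumptions). Two functions are supplied as a named list, and .names = controls how the output columns are labelled.

```r
library(dplyr)

# Hypothetical toy data with some missing values
demo <- data.frame(
  age   = c(10, NA, 30),
  wt_kg = c(40, 60, NA)
)

summary_tbl <- demo %>%
  summarise(across(
    c(age, wt_kg),                                # columns to operate on
    list(mean   = ~ mean(.x, na.rm = TRUE),       # purrr-style function
         n_miss = ~ sum(is.na(.x))),              # count of missing values
    .names = "{.col}_{.fn}"                       # e.g. age_mean, age_n_miss
  ))

names(summary_tbl)
```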
Here are a few scenarios where you need to re-code (change) values:
-
to edit one specific value (e.g. one date with an incorrect year or format).
-
-
to reconcile values not spelled the same.
-
to create a new column of categorical values.
-
-
to create a new column of numeric categories (e.g. age categories).
+
to edit one specific value (e.g. one date with an incorrect year or format)
+
to reconcile values not spelled the same
+
to create a new column of categorical values
+
to create a new column of numeric categories (e.g. age categories)
Specific values
@@ -2193,8 +2176,8 @@
Specific values
By logic
Below we demonstrate how to re-code values in a column using logic and conditions:
-
Using replace(), ifelse() and if_else() for simple logic.
-
Using case_when() for more complex logic.
+
Using replace(), ifelse() and if_else() for simple logic
+
Using case_when() for more complex logic
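A compact sketch of these functions side by side, using a made-up data frame (`demo`, its columns, and the category labels are assumptions for illustration):

```r
library(dplyr)

# Hypothetical toy data
demo <- data.frame(
  age    = c(5, 17, 40, NA),
  gender = c("m", "f", NA, "f")
)

recoded <- demo %>%
  mutate(
    # replace(): overwrite values where a logical condition is TRUE
    gender = replace(gender, is.na(gender), "unknown"),

    # if_else(): one condition, two outcomes; NA in the condition stays NA
    child = if_else(age < 18, "Yes", "No"),

    # case_when(): several conditions, evaluated in order
    age_group = case_when(
      age < 18  ~ "Child",
      age >= 18 ~ "Adult",
      TRUE      ~ "Missing"   # everything not matched above, including NA
    )
  )

recoded
```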
@@ -2321,11 +2304,9 @@
Cleaning di
Create a cleaning dictionary with 3 columns:
-
A “from” column (the incorrect value).
-
-
A “to” column (the correct value).
-
-
A column specifying the column for the changes to be applied (or “.global” to apply to all columns).
+
A “from” column (the incorrect value)
+
A “to” column (the correct value)
+
A column specifying the column for the changes to be applied (or “.global” to apply to all columns)
Note: .global dictionary entries will be overridden by column-specific dictionary entries.
@@ -2360,8 +2341,8 @@
Cleaning di
Now scroll to the right to see how values have changed - particularly gender (lowercase to uppercase), and all the symptoms columns have been transformed from yes/no to 1/0.
-
-
+
+
Note that your column names in the cleaning dictionary must correspond to the names at this point in your cleaning script. See this online reference for the linelist package for more details.
@@ -2436,13 +2417,10 @@
Add to pipe
8.9 Numeric categories
Here we describe some special approaches for creating categories from numerical columns. Common examples include age categories, groups of lab values, etc. Here we will discuss:
-
age_categories(), from the epikit package.
-
-
cut(), from base R.
-
-
case_when().
-
-
quantile breaks with quantile() and ntile().
+
age_categories(), from the epikit package
+
cut(), from base R
+
case_when()
+
quantile breaks with quantile() and ntile()
Review distribution
@@ -2576,13 +2554,11 @@
age_catego
cut()
cut() is a base R alternative to age_categories(), but I think you will see why age_categories() was developed to simplify this process. Some notable differences from age_categories() are:
-
You do not need to install/load another package.
-
-
You can specify whether groups are open/closed on the right/left.
+
You do not need to install/load another package
+
You can specify whether groups are open/closed on the right/left
-
You must provide accurate labels yourself.
-
-
If you want 0 included in the lowest group you must specify this.
+
You must provide accurate labels yourself
+
If you want 0 included in the lowest group you must specify this
The basic syntax within cut() is to first provide the numeric column to be cut (age_years), and then the breaks argument, which is a numeric vector c() of break points. Using cut(), the resulting column is a factor whose levels follow the order of the breaks (set ordered_result = TRUE if you need an ordered factor).
By default, the categorization occurs so that the right/upper side is closed and inclusive (and the left/lower side is open, or exclusive). This is the opposite behavior from the age_categories() function. The default labels use the notation “(A, B]”, which means A is not included but B is. Reverse this behavior by providing the right = FALSE argument.
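The effect of these arguments can be checked on a few values with base R alone. This is a sketch; the ages and break points are made up for illustration.

```r
ages   <- c(0, 5, 10, 30)
breaks <- c(0, 5, 10, 50)

# Default (right = TRUE): intervals look like (A,B], so 0 falls outside and becomes NA
default_cut <- cut(ages, breaks = breaks)

# include.lowest = TRUE closes the lowest interval so that 0 is included: [0,5]
with_lowest <- cut(ages, breaks = breaks, include.lowest = TRUE)

# right = FALSE flips the intervals to [A,B)
left_closed <- cut(ages, breaks = breaks, right = FALSE)

data.frame(ages, default_cut, with_lowest, left_closed)
```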
As you specify the column to evaluate, you may want to use the “tidyselect” helper functions described in the select() section of this page. You just have to make one adjustment (because you are not using them within a dplyr function like select() or summarise()).
Put the column-specification criteria within the dplyr function c_across(). This is because c_across (documentation) is designed to work with rowwise() specifically. For example, the following code:
-
Applies rowwise() so the following operation (sum()) is applied within each row (not summing entire columns).
-
-
Creates new column num_NA_dates, defined for each row as the number of columns (with name containing “date”) for which is.na() evaluated to TRUE (they are missing data).
-
-
ungroup() to remove the effects of rowwise() for subsequent steps.
+
Applies rowwise() so the following operation (sum()) is applied within each row (not summing entire columns)
+
Creates new column num_NA_dates, defined for each row as the number of columns (with name containing “date”) for which is.na() evaluated to TRUE (they are missing data)
+
ungroup() to remove the effects of rowwise() for subsequent steps
linelist %>%
@@ -3973,7 +3947,7 @@
-
+
+
@@ -872,8 +873,8 @@
Re-format valu
View the new data. Note the two columns towards the right end - the pasted combined values, and the list.
-
-
+
+
@@ -1564,7 +1565,7 @@
var lightboxQuarto = GLightbox({"descPosition":"bottom","openEffect":"zoom","selector":".lightbox","closeEffect":"zoom","loop":false});
(function() {
let previousOnload = window.onload;
window.onload = () => {
diff --git a/html_outputs/new_pages/combination_analysis_files/figure-html/unnamed-chunk-1-1.png b/html_outputs/new_pages/combination_analysis_files/figure-html/unnamed-chunk-1-1.png
index eba3e4ee..9877da62 100644
Binary files a/html_outputs/new_pages/combination_analysis_files/figure-html/unnamed-chunk-1-1.png and b/html_outputs/new_pages/combination_analysis_files/figure-html/unnamed-chunk-1-1.png differ
diff --git a/html_outputs/new_pages/data_table.html b/html_outputs/new_pages/data_table.html
index fb6526dc..6c3752f8 100644
--- a/html_outputs/new_pages/data_table.html
+++ b/html_outputs/new_pages/data_table.html
@@ -285,7 +285,7 @@
The Epidemiologist R Handbook
-
+
@@ -771,12 +771,12 @@
The structure is DT[i, j, by], which has 3 parts: the i, j and by arguments. The i argument allows for subsetting of required rows, the j argument allows you to operate on columns, and the by argument allows you to operate on columns by groups.
This page will address the following topics:
-
Importing data and use of fread() and fwrite().
-
Selecting and filtering rows using the i argument.
-
Using helper functions %like%, %chin%, %between%.
-
Selecting and computing on columns using the j argument.
-
Computing by groups using the by argument.
-
Adding and updating data to data tables using :=.
+
Importing data and use of fread() and fwrite()
+
Selecting and filtering rows using the i argument
+
Using helper functions %like%, %chin%, %between%
+
Selecting and computing on columns using the j argument
+
Computing by groups using the by argument
+
Adding and updating data to data tables using :=
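To make the DT[i, j, by] structure concrete, here is a minimal sketch on a made-up data table (`demo_dt`, its columns, and the output column names are assumptions for illustration).

```r
library(data.table)

# Hypothetical toy data table
demo_dt <- data.table(
  hospital = c("Port Hospital", "Port Hospital", "Central Hospital", "Other"),
  age      = c(10, 30, 50, 70)
)

result <- demo_dt[
  hospital %like% "Hospital",                  # i: keep rows where hospital contains "Hospital"
  .(n_cases = .N, mean_age = mean(age)),       # j: compute on columns
  by = hospital                                # by: do the computation per group
]

result
```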
@@ -833,9 +833,9 @@
linelist[hospital %like%"Hospital"] #filter rows where the hospital variable contains “Hospital”
@@ -891,9 +891,9 @@
linelist[, .N, .(hospital)] #the number of cases by hospital
diff --git a/html_outputs/new_pages/data_used.html b/html_outputs/new_pages/data_used.html
index c059ba3d..8cd7cc5a 100644
--- a/html_outputs/new_pages/data_used.html
+++ b/html_outputs/new_pages/data_used.html
@@ -289,7 +289,7 @@
The Epidemiologist R Handbook
NOTE: Structured contact tracing data from other software (e.g. KoBo, DHIS2 Tracker, CommCare) may look different. If you would like to contribute alternative sample data or content for this page, please contact us.
# assign the current time to a column
-time_now <-Sys.time()
-time_now
+
# assign the current time to a column
+time_now <-Sys.time()
+time_now
-
[1] "2024-09-30 19:33:19 PDT"
-
-
# use with_tz() to assign a new timezone to the column, while CHANGING the clock time
-time_london_real <-with_tz(time_now, "Europe/London")
-
-# use force_tz() to assign a new timezone to the column, while KEEPING the clock time
-time_london_local <-force_tz(time_now, "Europe/London")
-
-
-# note that as long as the computer that was used to run this code is NOT set to London time,
-# there will be a difference in the times
-# (the number of hours difference from the computers time zone to london)
-time_london_real - time_london_local
+
[1] "2024-10-14 17:00:55 PDT"
+
+
# use with_tz() to assign a new timezone to the column, while CHANGING the clock time
+time_london_real <-with_tz(time_now, "Europe/London")
+
+# use force_tz() to assign a new timezone to the column, while KEEPING the clock time
+time_london_local <-force_tz(time_now, "Europe/London")
+
+
+# note that as long as the computer that was used to run this code is NOT set to London time,
+# there will be a difference in the times
+# (the number of hours difference from the computers time zone to london)
+time_london_real - time_london_local
Time difference of 8 hours
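Because Sys.time() depends on the computer running the code, a reproducible sketch can fix the starting time and time zone instead. The object name `time_pdt` and the choice of "America/Los_Angeles" are assumptions here, picked to mirror the PDT output shown above.

```r
library(lubridate)

# A fixed clock time in Pacific time, instead of Sys.time()
time_pdt <- ymd_hms("2024-10-14 17:00:55", tz = "America/Los_Angeles")

# with_tz(): same instant, clock time CHANGES to the new time zone
time_london_real <- with_tz(time_pdt, "Europe/London")

# force_tz(): clock time is KEPT, so the instant itself changes
time_london_local <- force_tz(time_pdt, "Europe/London")

format(time_london_real,  "%Y-%m-%d %H:%M:%S")   # "2024-10-15 01:00:55"
format(time_london_local, "%Y-%m-%d %H:%M:%S")   # "2024-10-14 17:00:55"

time_london_real - time_london_local             # 8 hours
```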
@@ -1468,8 +1486,8 @@
-
-
+
+
When using lag() or lead() the order of rows in the dataframe is very important! - pay attention to whether your dates/numbers are ascending or descending.
@@ -1482,25 +1500,25 @@
-
counts <- counts %>%
-mutate(cases_prev_wk =lag(cases_wk, n =1))
+
counts <- counts %>%
+mutate(cases_prev_wk =lag(cases_wk, n =1))
-
-
+
+
Next, create a new column which is the difference between the two cases columns:
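A minimal sketch of both steps together, assuming a made-up weekly counts table (`demo_counts` and the column `case_diff` are hypothetical names). The rows are arranged first because lag() depends on row order.

```r
library(dplyr)

# Hypothetical weekly case counts
demo_counts <- data.frame(
  week     = 1:4,
  cases_wk = c(10, 15, 12, 20)
)

demo_counts <- demo_counts %>%
  arrange(week) %>%                                   # make sure weeks are ascending
  mutate(
    cases_prev_wk = lag(cases_wk, n = 1),             # value from the previous row
    case_diff     = cases_wk - cases_prev_wk          # change from the previous week
  )

demo_counts
```

The first row has no previous week, so both new columns are NA there.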
You can also positively specify the columns to consider. Below, only rows that have the same values in the name and purpose columns are returned. Notice how “amrish” now has dupe_count equal to 3 to reflect his three “contact” encounters.
There are several variations. Each should be provided with a column and a number of rows to return (to n =).
-
slice_min() and slice_max() keep only the row(s) with the minimum or maximum value(s) of the specified column. This also works to return the “min” and “max” of ordered factors.
+
slice_min() and slice_max() keep only the row(s) with the minimum or maximum value(s) of the specified column. This also works to return the “min” and “max” of ordered factors
-
slice_head() and slice_tail() - keep only the first or last row(s).
+
slice_head() and slice_tail() - keep only the first or last row(s)
-
slice_sample() - keep only a random sample of the rows.
+
slice_sample() - keep only a random sample of the rows
-
obs %>%
-slice_max(encounter, n =1) # return rows with the largest encounter number
+
obs %>%
+slice_max(encounter, n =1) # return rows with the largest encounter number
recordID personID name date time encounter purpose symptoms_ever
1 5 2 amrish 2020-01-05 16:10 3 case Yes
@@ -1073,13 +1082,10 @@
TIP: When using slice_max() and slice_min(), be sure to specify/write the n = (e.g. n = 2, not just 2). Otherwise you may get an error Error:…is not empty.
NOTE: You may encounter the function top_n(), which has been superseded by the slice functions.
@@ -1094,16 +1100,16 @@
Slice with gr
CAUTION: If using arrange(), specify .by_group = TRUE to have the data arranged within each group.
DANGER: If with_ties = FALSE, the first row of a tie is kept. This may be deceptive. See how for Mariah, she has two encounters on her latest date (6 Jan) and the first (earliest) one was kept. Likely, we want to keep her later encounter on that day. See how to “break” these ties in the next example.
-
obs %>%
-group_by(name) %>%# group the rows by 'name'
-slice_max(date, # keep row per group with maximum date value
-n =1, # keep only the single highest row
-with_ties = F) # if there's a tie (of date), take the first row
+
obs %>%
+group_by(name) %>%# group the rows by 'name'
+slice_max(date, # keep row per group with maximum date value
+n =1, # keep only the single highest row
+with_ties = F) # if there's a tie (of date), take the first row
-
-
+
+
Above, for example we can see that only Amrish’s row on 5 Jan was kept, and only Brian’s row on 7 Jan was kept. See the original data.
@@ -1111,20 +1117,20 @@
Slice with gr
Multiple slice statements can be run to “break ties”. In this case, if a person has multiple encounters on their latest date, the encounter with the latest time is kept (lubridate::hm() is used to convert the character times to a sortable time class).
Note how now, the one row kept for “Mariah” on 6 Jan is encounter 3 from 08:32, not encounter 2 at 07:25.
-
# Example of multiple slice statements to "break ties"
-obs %>%
-group_by(name) %>%
-
-# FIRST - slice by latest date
-slice_max(date, n =1, with_ties =TRUE) %>%
-
-# SECOND - if there is a tie, select row with latest time; ties prohibited
-slice_max(lubridate::hm(time), n =1, with_ties =FALSE)
+
# Example of multiple slice statements to "break ties"
+obs %>%
+group_by(name) %>%
+
+# FIRST - slice by latest date
+slice_max(date, n =1, with_ties =TRUE) %>%
+
+# SECOND - if there is a tie, select row with latest time; ties prohibited
+slice_max(lubridate::hm(time), n =1, with_ties =FALSE)
-
-
+
+
In the example above, it would also have been possible to slice by encounter number, but we showed the slice on date and time for example purposes.
@@ -1141,28 +1147,28 @@
Keep all
In the original data frame, mark rows as appropriate with case_when(), based on whether their record unique identifier (recordID in this example) is present in the reduced data frame.
-
# 1. Define data frame of rows to keep for analysis
-obs_keep <- obs %>%
-group_by(name) %>%
-slice_max(encounter,
-n =1,
-with_ties =FALSE) # keep only latest encounter per person
-
-
-# 2. Mark original data frame
-obs_marked <- obs %>%
-
-# make new dup_record column
-mutate(dup_record =case_when(
-
-# if record is in obs_keep data frame
- recordID %in% obs_keep$recordID ~"For analysis",
-
-# all else marked as "Ignore" for analysis purposes
-TRUE~"Ignore"))
-
-# print
-obs_marked
+
# 1. Define data frame of rows to keep for analysis
+obs_keep <- obs %>%
+group_by(name) %>%
+slice_max(encounter,
+n =1,
+with_ties =FALSE) # keep only latest encounter per person
+
+
+# 2. Mark original data frame
+obs_marked <- obs %>%
+
+# make new dup_record column
+mutate(dup_record =case_when(
+
+# if record is in obs_keep data frame
+ recordID %in% obs_keep$recordID ~"For analysis",
+
+# all else marked as "Ignore" for analysis purposes
+TRUE~"Ignore"))
+
+# print
+obs_marked
recordID personID name date time encounter purpose symptoms_ever
1 1 1 adam 2020-01-01 09:00 1 contact <NA>
@@ -1208,8 +1214,8 @@
This involves the function rowSums() from base R. Also used is ., which within piping refers to the data frame at that point in the pipe (in this case, it is being subset with brackets []).
Scroll to the right to see more rows
-
# create a "key variable completeness" column
-# this is a *proportion* of the columns designated as "key_cols" that have non-missing values
-
-key_cols =c("personID", "name", "symptoms_ever")
-
-obs %>%
-mutate(key_completeness =rowSums(!is.na(.[,key_cols]))/length(key_cols))
+
# create a "key variable completeness" column
+# this is a *proportion* of the columns designated as "key_cols" that have non-missing values
+
+key_cols =c("personID", "name", "symptoms_ever")
+
+obs %>%
+mutate(key_completeness =rowSums(!is.na(.[,key_cols]))/length(key_cols))
How to “roll-up” values from multiple rows into just one row, with some variations.
-
-
Once you have “rolled-up” values, how to overwrite/prioritize the values in each cell.
+
How to “roll-up” values from multiple rows into just one row, with some variations
+
Once you have “rolled-up” values, how to overwrite/prioritize the values in each cell
This tab uses the example dataset from the Preparation tab.
@@ -1255,66 +1260,64 @@
Roll-up values into one row
The code example below uses group_by() and summarise() to group rows by person, and then paste together all unique values within the grouped rows. Thus, you get one summary row per person. A few notes:
-
A suffix is appended to all new columns (“_roll” in this example).
-
-
If you want to show only unique values per cell, then wrap the na.omit() with unique().
-
-
na.omit() removes NA values; if this is not desired, drop it and use paste0(.x) alone.
+
A suffix is appended to all new columns (“_roll” in this example)
+
If you want to show only unique values per cell, then wrap the na.omit() with unique()
+
na.omit() removes NA values; if this is not desired, drop it and use paste0(.x) alone
-
# "Roll-up" values into one row per group (per "personID")
-cases_rolled <- obs %>%
-
-# create groups by name
-group_by(personID) %>%
-
-# order the rows within each group (e.g. by date)
-arrange(date, .by_group =TRUE) %>%
-
-# For each column, paste together all values within the grouped rows, separated by ";"
-summarise(
-across(everything(), # apply to all columns
-~paste0(na.omit(.x), collapse ="; "))) # function is defined which combines non-NA values
+
# "Roll-up" values into one row per group (per "personID")
+cases_rolled <- obs %>%
+
+# create groups by name
+group_by(personID) %>%
+
+# order the rows within each group (e.g. by date)
+arrange(date, .by_group =TRUE) %>%
+
+# For each column, paste together all values within the grouped rows, separated by ";"
+summarise(
+across(everything(), # apply to all columns
+~paste0(na.omit(.x), collapse ="; "))) # function is defined which combines non-NA values
The result is one row per group (ID), with entries arranged by date and pasted together. Scroll to the left to see more rows
# Variation - show unique values only
-cases_rolled <- obs %>%
-group_by(personID) %>%
-arrange(date, .by_group =TRUE) %>%
-summarise(
-across(everything(), # apply to all columns
-~paste0(unique(na.omit(.x)), collapse ="; "))) # function is defined which combines unique non-NA values
+
# Variation - show unique values only
+cases_rolled <- obs %>%
+group_by(personID) %>%
+arrange(date, .by_group =TRUE) %>%
+summarise(
+across(everything(), # apply to all columns
+~paste0(unique(na.omit(.x)), collapse ="; "))) # function is defined which combines unique non-NA values
-
-
+
+
This variation appends a suffix to each column.
In this case “_roll” to signify that it has been rolled:
If you then want to evaluate all of the rolled values, and keep only a specific value (e.g. “best” or “maximum” value), you can use mutate() across the desired columns, to implement case_when(), which uses str_detect() from the stringr package to sequentially look for string patterns and overwrite the cell content.
-
# CLEAN CASES
-#############
-cases_clean <- cases_rolled %>%
-
-# clean Yes-No-Unknown vars: replace text with "highest" value present in the string
-mutate(across(c(contains("symptoms_ever")), # operates on specified columns (Y/N/U)
-list(mod =~case_when( # adds suffix "_mod" to new cols; implements case_when()
-
-str_detect(.x, "Yes") ~"Yes", # if "Yes" is detected, then cell value converts to yes
-str_detect(.x, "No") ~"No", # then, if "No" is detected, then cell value converts to no
-str_detect(.x, "Unknown") ~"Unknown", # then, if "Unknown" is detected, then cell value converts to Unknown
-TRUE~as.character(.x)))), # then, if anything else if it kept as is
-.keep ="unused") # old columns removed, leaving only _mod columns
+
# CLEAN CASES
+#############
+cases_clean <- cases_rolled %>%
+
+# clean Yes-No-Unknown vars: replace text with "highest" value present in the string
+mutate(across(c(contains("symptoms_ever")), # operates on specified columns (Y/N/U)
+list(mod =~case_when( # adds suffix "_mod" to new cols; implements case_when()
+
+str_detect(.x, "Yes") ~"Yes", # if "Yes" is detected, then cell value converts to yes
+str_detect(.x, "No") ~"No", # then, if "No" is detected, then cell value converts to no
+str_detect(.x, "Unknown") ~"Unknown", # then, if "Unknown" is detected, then cell value converts to Unknown
+TRUE~as.character(.x)))), # otherwise, the value is kept as is
+.keep ="unused") # old columns removed, leaving only _mod columns
Now you can see in the column symptoms_ever that if the person EVER said “Yes” to symptoms, then only “Yes” is displayed.
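The roll-up and cleaning steps above can be sketched end to end on a small toy dataset. The data frame obs below is hypothetical (three people, one Yes/No/Unknown column), and for brevity the sketch cleans a single column directly rather than using across() with a suffix list. It assumes the dplyr and stringr packages are installed.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: repeated observations per person
obs <- tibble(
  personID      = c(1, 1, 2, 2, 3),
  date          = as.Date(c("2020-01-02", "2020-01-01", "2020-01-01",
                            "2020-01-03", "2020-01-01")),
  symptoms_ever = c("Yes", "No", "Unknown", "No", NA)
)

# Roll up: one row per person, unique non-NA values pasted together in date order
cases_rolled <- obs %>%
  group_by(personID) %>%
  arrange(date, .by_group = TRUE) %>%
  summarise(across(everything(),
                   ~ paste0(unique(na.omit(.x)), collapse = "; ")))

# Clean: keep only the "highest" value found in the rolled string
cases_clean <- cases_rolled %>%
  mutate(symptoms_ever_mod = case_when(
    str_detect(symptoms_ever, "Yes")     ~ "Yes",      # any "Yes" wins
    str_detect(symptoms_ever, "No")      ~ "No",       # otherwise any "No"
    str_detect(symptoms_ever, "Unknown") ~ "Unknown",  # otherwise "Unknown"
    TRUE                                 ~ symptoms_ever))  # anything else kept as is
```

Because case_when() evaluates its conditions in order, the order of the str_detect() lines defines the ranking of the values. A person with only missing values ends up with an empty string after the roll-up, which may need separate handling.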
diff --git a/html_outputs/new_pages/diagrams.html b/html_outputs/new_pages/diagrams.html
index 8df5589e..5e758448 100644
--- a/html_outputs/new_pages/diagrams.html
+++ b/html_outputs/new_pages/diagrams.html
@@ -331,7 +331,7 @@
The Epidemiologist R Handbook
-
+
@@ -830,11 +830,9 @@
35Import data
The first 50 rows of the linelist are displayed below.
-
-
+
+
@@ -876,15 +874,12 @@
The function grViz() is used to create a “Graphviz” diagram. This function accepts a character string input containing instructions for making the diagram. Within that string, the instructions are written in a different language, called DOT - it is quite easy to learn the basics.
Basic structure
-
Open the instructions grViz(".
-
-
Specify directionality and name of the graph, and open brackets, e.g. digraph my_flow_chart {.
-
Graph statement (layout, rank direction).
-
-
Nodes statements (create nodes).
-
Edges statements (gives links between nodes).
-
-
Close the instructions }").
+
Open the instructions grViz("
+
Specify directionality and name of the graph, and open brackets, e.g. digraph my_flow_chart {
+
Graph statement (layout, rank direction)
+
Nodes statements (create nodes)
+
Edges statements (gives links between nodes)
+
Close the instructions }")
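As a minimal sketch of the structure listed above, the DOT instructions can be written as a single character string and then passed to grViz(). The graph name and the node names a, b, and c are placeholders, and the call to grViz() is guarded so that the snippet still runs if the DiagrammeR package is not installed.

```r
# DOT instructions as one character string:
# graph statement, then node statements, then edge statements
dot_code <- "
digraph my_flow_chart {
  graph [layout = dot, rankdir = LR]
  node [shape = box]
  a; b; c
  a -> b -> c
}"

# Render only if DiagrammeR is available (assumption: it may not be installed)
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  flow_chart <- DiagrammeR::grViz(dot_code)  # print flow_chart to display the diagram
}
```

Keeping the DOT code in its own object makes long diagrams easier to edit and reuse.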
Simple examples
@@ -903,8 +898,8 @@
Simple examples
a -> b -> c}")
-
-
+
+
An example with perhaps a bit more applied public health context:
@@ -936,8 +931,8 @@
Simple examples
}")
-
-
+
+
@@ -1075,8 +1070,8 @@
Complex exampl
-
-
+
+
Sub-graph clusters
@@ -1148,8 +1143,8 @@
Complex exampl
-
-
+
+
Node shapes
@@ -1174,8 +1169,8 @@
Complex exampl
saved_plot
-
-
+
+
@@ -1221,8 +1216,8 @@
Plotting
The dataset now looks like this:
-
-
+
+
Now plot the Sankey diagram with geom_alluvium() and geom_stratum(). You can read more about each argument by running ?geom_alluvium and ?geom_stratum in the console.
@@ -1288,8 +1283,8 @@
Here is the events dataset we begin with:
-
-
+
+
@@ -1317,8 +1312,8 @@
#printpp
-
-
+
+
@@ -1931,7 +1926,7 @@
diff --git a/html_outputs/new_pages/editorial_style.html b/html_outputs/new_pages/editorial_style.html
index d2db1876..0c40bf3c 100644
--- a/html_outputs/new_pages/editorial_style.html
+++ b/html_outputs/new_pages/editorial_style.html
@@ -286,7 +286,7 @@
The Epidemiologist R Handbook
-
+
diff --git a/html_outputs/new_pages/epicurves.html b/html_outputs/new_pages/epicurves.html
index 335d225f..cdec069d 100644
--- a/html_outputs/new_pages/epicurves.html
+++ b/html_outputs/new_pages/epicurves.html
@@ -317,7 +317,7 @@
The Epidemiologist R Handbook
-
+
@@ -885,8 +885,8 @@
Import data
The first 50 rows are displayed below.
-
-
+
+
Case counts aggregated by hospital
@@ -902,8 +902,8 @@
Import data
The first 50 rows are displayed below:
-
-
+
+
@@ -950,11 +950,9 @@
To produce an epicurve with ggplot() there are three main elements:
-
A histogram, with linelist cases aggregated into “bins” distinguished by specific “break” points.
-
-
Scales for the axes and their labels.
-
-
Themes for the plot appearance, including titles, labels, captions, etc.
+
A histogram, with linelist cases aggregated into “bins” distinguished by specific “break” points
+
Scales for the axes and their labels
+
Themes for the plot appearance, including titles, labels, captions, etc
Specify case bins
@@ -1045,13 +1043,11 @@
Specify case
Let’s unpack the rather daunting code above:
-
The “from” value (earliest date of the sequence) is created as follows: the minimum date value (min() with na.rm=TRUE) in the column date_onset is fed to floor_date() from the lubridate package. floor_date() set to “week” returns the start date of that cases’s “week”, given that the start day of each week is a Monday (week_start = 1).
-
-
Likewise, the “to” value (end date of the sequence) is created using the inverse function ceiling_date() to return the Monday after the last case.
+
The “from” value (earliest date of the sequence) is created as follows: the minimum date value (min() with na.rm=TRUE) in the column date_onset is fed to floor_date() from the lubridate package. floor_date() set to “week” returns the start date of that case’s “week”, given that the start day of each week is a Monday (week_start = 1)
+
Likewise, the “to” value (end date of the sequence) is created using the inverse function ceiling_date() to return the Monday after the last case.
+
The “by” argument of seq.Date() can be set to any number of days, weeks, or months
-
The “by” argument of seq.Date() can be set to any number of days, weeks, or months.
-
-
Use week_start = 7 for Sunday weeks.
+
Use week_start = 7 for Sunday weeks
As we will use these date vectors throughout this page, we also define one for the whole outbreak (the above is for Central Hospital only).
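The construction unpacked above can be sketched on a small vector of onset dates. The dates below are made up for illustration, and the object names are placeholders; the sketch assumes the lubridate package is installed.

```r
library(lubridate)

# Hypothetical onset dates, including a missing value
onset_dates <- as.Date(c("2014-05-08", "2014-05-20", "2014-06-12", NA))

# Weekly breaks from the Monday before the first case to the Monday after the last case
weekly_breaks <- seq.Date(
  from = floor_date(min(onset_dates, na.rm = TRUE), unit = "week", week_start = 1),
  to   = ceiling_date(max(onset_dates, na.rm = TRUE), unit = "week", week_start = 1),
  by   = "week")

weekly_breaks
```

Changing week_start = 1 to week_start = 7 in both calls should give Sunday-based breaks instead.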
@@ -1068,15 +1064,11 @@
Specify case
Weekly epicurve example
Below is detailed example code to produce weekly epicurves for Monday weeks, with aligned bars, date labels, and vertical gridlines. This section is for the user who needs code quickly. To understand each aspect (themes, date labels, etc.) in-depth, continue to the subsequent sections. Of note:
-
The histogram bin breaks are defined with seq.Date() as explained above to begin the Monday before the earliest case and to end the Monday after the last case.
-
-
The interval of date labels is specified by date_breaks = within scale_x_date().
-
-
The interval of minor vertical gridlines between date labels is specified to date_minor_breaks =.
-
-
We use closed = "left" in the geom_histogram() to ensure the date are counted in the correct bins.
-
-
expand = c(0,0) in the x and y scales removes excess space on each side of the axes, which also ensures the date labels begin from the first bar.
+
The histogram bin breaks are defined with seq.Date() as explained above to begin the Monday before the earliest case and to end the Monday after the last case
+
The interval of date labels is specified by date_breaks = within scale_x_date()
+
The interval of minor vertical gridlines between date labels is specified to date_minor_breaks =
+
We use closed = "left" in the geom_histogram() to ensure the dates are counted in the correct bins
+
expand = c(0,0) in the x and y scales removes excess space on each side of the axes, which also ensures the date labels begin from the first bar
# TOTAL MONDAY WEEK ALIGNMENT
@@ -1144,9 +1136,8 @@
Weekly
Sunday weeks
To achieve the above plot for Sunday weeks, a few modifications are needed, because date_breaks = "weeks" only works for Monday weeks.
-
The break points of the histogram bins must be set to Sundays (week_start = 7).
-
-
Within scale_x_date(), the similar date breaks should be provided to breaks = and minor_breaks = to ensure the date labels and vertical gridlines align on Sundays.
+
The break points of the histogram bins must be set to Sundays (week_start = 7)
+
Within scale_x_date(), similar date breaks should be provided to breaks = and minor_breaks = to ensure the date labels and vertical gridlines align on Sundays
For example, the scale_x_date() command for Sunday weeks could look like this:
@@ -1175,11 +1166,10 @@
Sunday weeks
Group/color by value
The histogram bars can be colored by group and “stacked”. To designate the grouping column, make the following changes. See the ggplot basics page for details.
-
Within the histogram aesthetic mapping aes(), map the column name to the group = and fill = arguments.
+
Within the histogram aesthetic mapping aes(), map the column name to the group = and fill = arguments
-
Remove any fill = argument outside of aes(), as it will override the one inside.
-
-
Arguments insideaes() will apply by group, whereas any outside will apply to all bars (e.g. you may still want color = outside, so each bar has the same border).
+
Remove any fill = argument outside of aes(), as it will override the one inside
+
Arguments inside aes() will apply by group, whereas any outside will apply to all bars (e.g. you may still want color = outside, so each bar has the same border)
Here is what the aes() command would look like to group and color the bars by gender:
@@ -1215,18 +1205,14 @@
Group/color
Adjust colors
-
To manually set the fill for each group, use scale_fill_manual() (note: scale_color_manual() is different!).
+
To manually set the fill for each group, use scale_fill_manual() (note: scale_color_manual() is different!)
-
Use the values = argument to apply a vector of colors.
-
-
Use na.value = to specify a color for NA values.
-
-
Use the labels = argument to change the text of legend items. To be safe, provide as a named vector like c("old" = "new", "old" = "new") or adjust the values in the data itself.
-
-
Use name = to give a proper title to the legend.
-
+
Use the values = argument to apply a vector of colors
+
Use na.value = to specify a color for NA values
+
Use the labels = argument to change the text of legend items. To be safe, provide as a named vector like c("old" = "new", "old" = "new") or adjust the values in the data itself
+
Use name = to give a proper title to the legend
-
For more tips on color scales and palettes, see the page on ggplot basics.
+
For more tips on color scales and palettes, see the page on ggplot basics
ggplot(data = linelist) +# begin with linelist (many hospitals)
@@ -1342,13 +1328,13 @@
Adjust level
Adjust legend
Read more about legends and scales in the ggplot tips page. Here are a few highlights:
-
Edit legend title either in the scale function or with labs(fill = "Legend title") (if your are using color = aesthetic, then use labs(color = "")).
+
Edit the legend title either in the scale function or with labs(fill = "Legend title") (if you are using the color = aesthetic, then use labs(color = ""))
-
theme(legend.title = element_blank()) to have no legend title.
+
theme(legend.title = element_blank()) to have no legend title
-
theme(legend.position = "top") (“bottom”, “left”, “right”, or “none” to remove the legend).
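A short sketch of these legend adjustments on a toy dataset follows. The data frame and the column names date_onset and hospital are invented for illustration, and the sketch assumes ggplot2 is installed.

```r
library(ggplot2)

# Hypothetical mini-linelist
toy <- data.frame(
  date_onset = as.Date("2014-05-05") + c(0, 1, 3, 8, 9, 15),
  hospital   = c("A", "B", "A", "B", "A", "B"))

# Stacked weekly histogram with a renamed legend placed above the plot
p <- ggplot(toy, aes(x = date_onset, fill = hospital)) +
  geom_histogram(binwidth = 7, color = "black") +
  labs(fill = "Hospital") +            # legend title
  theme(legend.position = "top")       # or "bottom", "left", "right", "none"

# Same plot with the legend title removed entirely
p_no_title <- p + theme(legend.title = element_blank())
```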
Often instead of a linelist, you begin with aggregated counts from facilities, districts, etc. You can make an epicurve with ggplot() but the code will be slightly different. This section will utilize the count_data dataset that was imported earlier, in the data preparation section. This dataset is the linelist aggregated to day-hospital counts. The first 50 rows are displayed below.
-
-
+
+
Plotting daily counts
We can plot a daily epicurve from these daily counts. Here are the differences to the code:
-
Within the aesthetic mapping aes(), specify y = as the counts column (in this case, the column name is n_cases).
-
Add the argument stat = "identity" within geom_histogram(), which specifies that bar height should be the y = value, not the number of rows as is the default.
+
Within the aesthetic mapping aes(), specify y = as the counts column (in this case, the column name is n_cases)
+
Add the argument stat = "identity" within geom_histogram(), which specifies that bar height should be the y = value, not the number of rows as is the default
-
Add the argument width = to avoid vertical white lines between the bars. For daily data set to 1. For weekly count data set to 7. For monthly count data, white lines are an issue (each month has different number of days) - consider transforming your x-axis to a categorical ordered factor (months) and using geom_col().
+
Add the argument width = to avoid vertical white lines between the bars. For daily data, set it to 1. For weekly count data, set it to 7. For monthly count data, white lines are an issue (each month has a different number of days); consider transforming your x-axis to a categorical ordered factor (months) and using geom_col()
ggplot(data = count_data) +
@@ -1677,8 +1663,8 @@
Plotting
The first 50 rows of count_data_weekly are displayed below. You can see that the counts have been aggregated into weeks. Each week is displayed by the first day of the week (Monday by default).
-
-
+
+
Now plot so that x = the epiweek column. Remember to add y = the counts column to the aesthetic mapping, and add stat = "identity" as explained above.
@@ -1723,11 +1709,9 @@
Plotting
Moving averages
See the page on Moving averages for a detailed description and several options. Below is one option for calculating moving averages with the package slider. In this approach, the moving average is calculated in the dataset prior to plotting:
-
Aggregate the data into counts as necessary (daily, weekly, etc.) (see Grouping data page).
-
-
Create a new column to hold the moving average, created with slide_index() from slider package.
-
-
Plot the moving average as a geom_line() on top of (after) the epicurve histogram.
+
Aggregate the data into counts as necessary (daily, weekly, etc.) (see Grouping data page)
+
Create a new column to hold the moving average, created with slide_index() from slider package
+
Plot the moving average as a geom_line() on top of (after) the epicurve histogram
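The three steps above might look like the sketch below, using made-up daily counts. The column names and the 7-day window are assumptions for illustration; slide_index_dbl() is the variant of slide_index() that returns a numeric vector, and geom_col() is used here in place of a histogram with stat = "identity". The sketch assumes dplyr, slider, and ggplot2 are installed.

```r
library(dplyr)
library(slider)
library(ggplot2)

# Step 1: data already aggregated to daily counts (hypothetical values)
daily_counts <- tibble(
  date_onset = as.Date("2014-05-01") + 0:9,
  n_cases    = 1:10)

# Step 2: 7-day moving average (current day plus the 6 days before)
rolled <- daily_counts %>%
  mutate(avg_7day = slide_index_dbl(
    .x = n_cases,          # values to average
    .i = date_onset,       # date index that defines the window
    .f = mean,             # function applied to each window
    .before = 6,           # include the 6 days before each date
    .complete = FALSE))    # also return averages for incomplete early windows

# Step 3: bars first, then the moving average line on top
p_ma <- ggplot(rolled, aes(x = date_onset)) +
  geom_col(aes(y = n_cases), width = 1, fill = "grey70") +
  geom_line(aes(y = avg_7day), color = "red")
```

Setting .complete = TRUE instead would return NA for the first six days, where a full 7-day window is not yet available.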
strip.position = (position of the strip “bottom”, “top”, “left”, or “right”).
+
strip.position = (position of the strip: “bottom”, “top”, “left”, or “right”)
Strip labels
Labels of the facet plots can be modified through the “labels” of the column as a factor, or by the use of a “labeller”.
@@ -2064,22 +2048,19 @@
Use annotate():
-
For a line use annotate(geom = "segment"). Provide x, xend, y, and yend. Adjust size, linetype (lty), and color.
-
+
For a line use annotate(geom = "segment"). Provide x, xend, y, and yend. Adjust size, linetype (lty), and color
For a rectangle use annotate(geom = "rect"). Provide xmin/xmax/ymin/ymax. Adjust color and alpha.
-
Group the data by tentative status and color those bars differently.
+
Group the data by tentative status and color those bars differently
CAUTION: You might try geom_rect() to draw a rectangle, but adjusting the transparency does not work in a linelist context. This function overlays one rectangle for each observation/row! Use either a very low alpha (e.g. 0.01), or another approach.
Using annotate()
-
Within annotate(geom = "rect"), the xmin and xmax arguments must be given inputs of class Date.
-
-
Note that because these data are aggregated into weekly bars, and the last bar extends to the Monday after the last data point, the shaded region may appear to cover 4 weeks.
-
Within annotate(geom = "rect"), the xmin and xmax arguments must be given inputs of class Date
+
Note that because these data are aggregated into weekly bars, and the last bar extends to the Monday after the last data point, the shaded region may appear to cover 4 weeks
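As a rough sketch of the annotate() approach, the example below shades the most recent week of a toy epicurve. The dates and the shaded window are invented for illustration; note that xmin and xmax are given as Date values. It assumes ggplot2 is installed.

```r
library(ggplot2)

# Hypothetical onset dates spanning four weeks
toy <- data.frame(
  date_onset = as.Date("2014-05-05") + c(0, 2, 7, 9, 14, 16, 21, 23))

p_shaded <- ggplot(toy, aes(x = date_onset)) +
  geom_histogram(binwidth = 7, fill = "grey50") +
  annotate(
    geom  = "rect",
    xmin  = as.Date("2014-05-26"),   # must be class Date
    xmax  = as.Date("2014-06-02"),   # must be class Date
    ymin  = 0,
    ymax  = Inf,                     # extend to the top of the panel
    fill  = "red",
    alpha = 0.2)                     # a single rectangle, so alpha behaves as expected
```

Unlike geom_rect() applied to a linelist, annotate() draws the rectangle once, which is why the transparency setting works here.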
Case counts are aggregated into weeks for aesthetic reasons. See Epicurves page (aggregated data tab) for details.
-
-
A geom_area() line is used instead of a histogram, as the faceting approach below does not work well with histograms.
+
Case counts are aggregated into weeks for aesthetic reasons. See Epicurves page (aggregated data tab) for details
+
A geom_area() line is used instead of a histogram, as the faceting approach below does not work well with histograms
Aggregate to weekly counts
@@ -2375,8 +2355,8 @@
The first 10 rows are shown below:
-
-
+
+
This cumulative column can then be plotted against date_onset, using geom_line():
@@ -2402,7 +2382,7 @@
32.7 incidence2
Below we demonstrate how to make epicurves using the incidence2 package. The authors of this package have tried to allow the user to create and modify epicurves without needing to know ggplot2 syntax. Much of this page is adapted from the package vignettes, which can be found at the incidence2 GitHub page.
-
To create an epicurve with incidence2 you need to have a column with a date value (it does not need to be of the class Date, but it should have ordering) and a column with a count variable (what is being counted). It should also not have any duplicated rows.
+
To create an epicurve with incidence2 you need a column with a date value (it does not need to be of class Date, but it should have a numeric or logical order, e.g. “Week1”, “Week2”, etc.) and a column with a count variable (what is being counted). The data should also not contain any duplicated rows.
To create this, we can use the function incidence(), which summarises our data in a format that can be used to create epicurves. incidence() takes a number of arguments; type ?incidence in your R console to learn more.
#Load package
@@ -2508,9 +2488,9 @@
Groups
Groups are specified in the incidence() command, and can be used to color the bars or to facet the data. To specify groups in your data provide the column name(s) to the groups = argument in the incidence() command (no quotes around the column name). If specifying multiple columns, put their names within c().
You can specify that cases with missing values in the grouping columns be listed as a distinct NA group by setting na_as_group = TRUE. Otherwise, they will be excluded from the plot.
-
To color the bars by a grouping column, you must again provide the column name to fill = in the plot() command.
+
To color the bars by a grouping column, you must again provide the column name to fill = in the plot() command
-
To facet based on a grouping column, see the section below on facets with incidence2.
+
To facet based on a grouping column, see the section below on facets with incidence2
In the example below, the cases in the whole outbreak are grouped by their age category. Missing values are included as a group. The epicurve interval is weeks.
@@ -2564,11 +2544,9 @@
Groups
Filtered data
To plot the epicurve of a subset of data:
-
Filter the linelist data.
-
-
Provide the filtered data to the incidence() command.
-
-
Plot the incidence object.
+
Filter the linelist data
+
Provide the filtered data to the incidence() command
+
Plot the incidence object
The example below uses data filtered to show only cases at Central Hospital.
@@ -2596,8 +2574,8 @@
Aggregated co
For example, this data frame count_data is the linelist aggregated into daily counts by hospital. The first 50 rows look like this:
-
-
+
+
If you are beginning your analysis with daily count data like the dataset above, your incidence() command to convert this to a weekly epicurve by hospital would look like this:
@@ -2647,7 +2625,7 @@
Facets/sm
-
Note that the package ggtree (used for displaying phylogenetic trees) also has a function facet_plot() - this is why we specified incidence2::facet_plot() above.
+
Note that the package ggtree (used for displaying phylogenetic trees) also has a function facet_plot().
Modifications with plot() and using ggplot2
@@ -3278,7 +3256,7 @@
diff --git a/html_outputs/new_pages/factors.html b/html_outputs/new_pages/factors.html
index 7b8460d5..e3e67dd6 100644
--- a/html_outputs/new_pages/factors.html
+++ b/html_outputs/new_pages/factors.html
@@ -289,7 +289,7 @@
The Epidemiologist R Handbook
-
+
@@ -1938,7 +1938,7 @@
diff --git a/html_outputs/new_pages/flexdashboard.html b/html_outputs/new_pages/flexdashboard.html
index 9e96f90f..f8d78298 100644
--- a/html_outputs/new_pages/flexdashboard.html
+++ b/html_outputs/new_pages/flexdashboard.html
@@ -317,7 +317,7 @@
The Epidemiologist R Handbook
-
+
@@ -1920,7 +1920,7 @@
-
+
+
@@ -893,15 +888,15 @@
General cleani
Here are some examples of this in action:
-
# make display version of columns with more friendly names
-linelist <- linelist %>%
-mutate(
-gender_disp =case_when(gender =="m"~"Male", # m to Male
- gender =="f"~"Female", # f to Female,
-is.na(gender) ~"Unknown"), # NA to Unknown
-
-outcome_disp =replace_na(outcome, "Unknown") # replace NA outcome with "unknown"
- )
+
# make display version of columns with more friendly names
+linelist <- linelist %>%
+mutate(
+gender_disp =case_when(gender =="m"~"Male", # m to Male
+ gender =="f"~"Female", # f to Female,
+is.na(gender) ~"Unknown"), # NA to Unknown
+
+outcome_disp =replace_na(outcome, "Unknown") # replace NA outcome with "unknown"
+ )
@@ -918,32 +913,32 @@
Pivoting longer
For example, say that we want to plot data that are in a “wide” format, such as for each case in the linelist and their symptoms. Below we create a mini-linelist called symptoms_data that contains only the case_id and symptoms columns.
Here is how the first 50 rows of this mini-linelist look - see how they are formatted “wide” with each symptom as a column:
-
-
+
+
If we wanted to plot the number of cases with specific symptoms, we are limited by the fact that each symptom is a specific column. However, we can pivot the symptoms columns to a longer format like this:
-
symptoms_data_long <- symptoms_data %>%# begin with "mini" linelist called symptoms_data
-
-pivot_longer(
-cols =-case_id, # pivot all columns except case_id (all the symptoms columns)
-names_to ="symptom_name", # assign name for new column that holds the symptoms
-values_to ="symptom_is_present") %>%# assign name for new column that holds the values (yes/no)
-
-mutate(symptom_is_present =replace_na(symptom_is_present, "unknown")) # convert NA to "unknown"
+
symptoms_data_long <- symptoms_data %>%# begin with "mini" linelist called symptoms_data
+
+pivot_longer(
+cols =-case_id, # pivot all columns except case_id (all the symptoms columns)
+names_to ="symptom_name", # assign name for new column that holds the symptoms
+values_to ="symptom_is_present") %>%# assign name for new column that holds the values (yes/no)
+
+mutate(symptom_is_present =replace_na(symptom_is_present, "unknown")) # convert NA to "unknown"
Here are the first 50 rows. Note that each case has 5 rows - one for each possible symptom. The new columns symptom_name and symptom_is_present are the result of the pivot. Note that this format may not be very useful for other operations, but is useful for plotting.
-
-
+
+
@@ -963,13 +958,13 @@
A simple example of skeleton code is as follows. We will explain each component in the sections below.
-
# plot data from my_data columns as red points
-ggplot(data = my_data) +# use the dataset "my_data"
-geom_point( # add a layer of points (dots)
-mapping =aes(x = col1, y = col2), # "map" data column to axes
-color ="red") +# other specification for the geom
-labs() +# here you add titles, axes labels, etc.
-theme() # here you adjust color, font, size etc of non-data plot elements (axes, title, etc.)
+
# plot data from my_data columns as red points
+ggplot(data = my_data) +# use the dataset "my_data"
+geom_point( # add a layer of points (dots)
+mapping =aes(x = col1, y = col2), # "map" data column to axes
+color ="red") +# other specification for the geom
+labs() +# here you add titles, axes labels, etc.
+theme() # here you adjust color, font, size etc of non-data plot elements (axes, title, etc.)
@@ -978,8 +973,8 @@
-
# This will create plot that is a blank canvas
-ggplot(data = linelist)
+
# This will create plot that is a blank canvas
+ggplot(data = linelist)
@@ -1008,9 +1003,9 @@
<
Below, in the ggplot() command the data are set as the case linelist. In the mapping = aes() argument the column age is mapped to the x-axis, and the column wt_kg is mapped to the y-axis.
After a +, the plotting commands continue. A shape is created with the “geom” function geom_point(). This geom inherits the mappings from the ggplot() command above - it knows the axis-column assignments and proceeds to visualize those relationships as points on the canvas.
As another example, the following commands utilize the same data, a slightly different mapping, and a different geom. The geom_histogram() function only requires a column mapped to the x-axis, as the counts y-axis is generated automatically.
In the second example, the histogram requires only the x-axis mapped to a column. The histogram binwidth =, color =, fill = (internal color), and alpha = are again set within the geom to static values.
-
# scatterplot
-ggplot(data = linelist,
-mapping =aes(x = age, y = wt_kg)) +# set data and axes mapping
-geom_point(color ="darkgreen", size =0.5, alpha =0.2) # set static point aesthetics
-
-# histogram
-ggplot(data = linelist,
-mapping =aes(x = age)) +# set data and axes
-geom_histogram( # display histogram
-binwidth =7, # width of bins
-color ="red", # bin line color
-fill ="blue", # bin interior color
-alpha =0.1) # bin transparency
+
# scatterplot
+ggplot(data = linelist,
+mapping =aes(x = age, y = wt_kg)) +# set data and axes mapping
+geom_point(color ="darkgreen", size =0.5, alpha =0.2) # set static point aesthetics
+
+# histogram
+ggplot(data = linelist,
+mapping =aes(x = age)) +# set data and axes
+geom_histogram( # display histogram
+binwidth =7, # width of bins
+color ="red", # bin line color
+fill ="blue", # bin interior color
+alpha =0.1) # bin transparency
@@ -1110,25 +1105,25 @@
Scaled
In the second example two new plot aesthetics are also mapped to columns (color = and size =), while the plot aesthetics shape = and alpha = are mapped to static values outside of any mapping = aes() function.
-
# scatterplot
-ggplot(data = linelist, # set data
-mapping =aes( # map aesthetics to column values
-x = age, # map x-axis to age
-y = wt_kg, # map y-axis to weight
-color = age)
- ) +# map color to age
-geom_point() # display data as points
-
-# scatterplot
-ggplot(data = linelist, # set data
-mapping =aes( # map aesthetics to column values
-x = age, # map x-axis to age
-y = wt_kg, # map y-axis to weight
-color = age, # map color to age
-size = age)) +# map size to age
-geom_point( # display data as points
-shape ="diamond", # points display as diamonds
-alpha =0.3) # point transparency at 30%
+
# scatterplot
+ggplot(data = linelist, # set data
+mapping =aes( # map aesthetics to column values
+x = age, # map x-axis to age
+y = wt_kg, # map y-axis to weight
+color = age)
+ ) +# map color to age
+geom_point() # display data as points
+
+# scatterplot
+ggplot(data = linelist, # set data
+mapping =aes( # map aesthetics to column values
+x = age, # map x-axis to age
+y = wt_kg, # map y-axis to weight
+color = age, # map color to age
+size = age)) +# map size to age
+geom_point( # display data as points
+shape ="diamond", # points display as diamonds
+alpha =0.3) # point transparency at 30%
@@ -1147,18 +1142,18 @@
Scaled
Note: Axes assignments are always assigned to columns in the data (not to static values), and this is always done within mapping = aes().
It becomes important to keep track of your plot layers and aesthetics when making more complex plots - for example plots with multiple geoms. In the example below, the size = aesthetic is assigned twice - once for geom_point() and once for geom_smooth() - both times as a static value.
-
ggplot(data = linelist,
-mapping =aes( # map aesthetics to columns
-x = age,
-y = wt_kg,
-color = age_years)
- ) +
-geom_point( # add points for each row of data
-size =1,
-alpha =0.5) +
-geom_smooth( # add a trend line
-method ="lm", # with linear method
-size =2) # size (width of line) of 2
+
ggplot(data = linelist,
+mapping =aes( # map aesthetics to columns
+x = age,
+y = wt_kg,
+color = age_years)
+ ) +
+geom_point( # add points for each row of data
+size =1,
+alpha =0.5) +
+geom_smooth( # add a trend line
+method ="lm", # with linear method
+size =2) # size (width of line) of 2
@@ -1178,16 +1173,16 @@
Where to
Likewise, data = specified in the top ggplot() will apply by default to any geom below, but you could also specify data for each geom (but this is more difficult).
Thus, each of the following commands will create the same plot:
-
# These commands will produce the exact same plot
-ggplot(data = linelist,
-mapping =aes(x = age)) +
-geom_histogram()
-
-ggplot(data = linelist) +
-geom_histogram(mapping =aes(x = age))
-
-ggplot() +
-geom_histogram(data = linelist, mapping =aes(x = age))
+
# These commands will produce the exact same plot
+ggplot(data = linelist,
+mapping =aes(x = age)) +
+geom_histogram()
+
+ggplot(data = linelist) +
+geom_histogram(mapping =aes(x = age))
+
+ggplot() +
+geom_histogram(data = linelist, mapping =aes(x = age))
@@ -1196,9 +1191,9 @@
Groups
Assign the “grouping” column to the appropriate plot aesthetic, within a mapping = aes(). Above, we demonstrated this using continuous values when we assigned point size = to the column age. However this works the same way for discrete/categorical columns.
For example, if you want points to be displayed by gender, you would set mapping = aes(color = gender). A legend automatically appears. This assignment can be made within the mapping = aes() in the top ggplot() command (and be inherited by the geom), or it could be set in a separate mapping = aes() within the geom. Both approaches are shown below:
-
ggplot(data = linelist,
-mapping =aes(x = age, y = wt_kg, color = gender)) +
-geom_point(alpha =0.5)
+
ggplot(data = linelist,
+mapping =aes(x = age, y = wt_kg, color = gender)) +
+geom_point(alpha =0.5)
-
# This alternative code produces the same plot
-ggplot(data = linelist,
-mapping =aes(x = age, y = wt_kg)) +
-geom_point(
-mapping =aes(color = gender),
-alpha =0.5)
+
# This alternative code produces the same plot
+ggplot(data = linelist,
+mapping =aes(x = age, y = wt_kg)) +
+geom_point(
+mapping =aes(color = gender),
+alpha =0.5)
Note that depending on the geom, you will need to use different arguments to group the data. For geom_point() you will most likely use color =, shape = or size =. Whereas for geom_bar() you are more likely to use fill =. This just depends on the geom and what plot aesthetic you want to reflect the groupings.
For your information - the most basic way of grouping the data is by using only the group = argument within mapping = aes(). However, this by itself will not change the colors, fill, or shapes. Nor will it create a legend. Yet the data are grouped, so statistical displays may be affected.
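To illustrate the point about group = on its own, here is a small sketch with made-up data (the column names week, n, and district are placeholders). It assumes ggplot2 is installed. One line is drawn per district, but because no color, fill, or shape is mapped, the lines look identical and no legend is created.

```r
library(ggplot2)

# Hypothetical weekly counts for two districts
toy <- data.frame(
  week     = rep(1:4, times = 2),
  n        = c(2, 4, 3, 5, 1, 2, 4, 3),
  district = rep(c("North", "South"), each = 4))

# Grouping only: two separate lines, same appearance, no legend
p_grouped <- ggplot(toy, aes(x = week, y = n, group = district)) +
  geom_line()

# For comparison: mapping color = also groups the data and adds a legend
p_colored <- ggplot(toy, aes(x = week, y = n, color = district)) +
  geom_line()
```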
@@ -1247,15 +1242,15 @@
Facets can quickly contain an overwhelming amount of information - it’s good to ensure you don’t have too many levels of each variable that you choose to facet by. Here are some quick examples with the malaria dataset (see Download handbook and data) which consists of daily case counts of malaria for facilities, by age group.
Below we import and do some quick modifications for simplicity:
-
# These data are daily counts of malaria cases, by facility-day
-malaria_data <-import(here("data", "malaria_facility_count_data.rds")) %>%# import
-select(-submitted_date, -Province, -newid) # remove unneeded columns
+
# These data are daily counts of malaria cases, by facility-day
+malaria_data <-import(here("data", "malaria_facility_count_data.rds")) %>%# import
+select(-submitted_date, -Province, -newid) # remove unneeded columns
The first 50 rows of the malaria data are below. Note there is a column malaria_tot, but also columns for counts by age group (these will be used in the second, facet_grid() example).
-
-
+
+
@@ -1263,16 +1258,16 @@
facet_wrap()
For the moment, let’s focus on the columns malaria_tot and District. Ignore the age-specific count columns for now. We will plot epidemic curves with geom_col(), which produces a column for each day at the specified y-axis height given in column malaria_tot (the data are already daily counts, so we use geom_col() - see the “Bar plot” section below).
When we add the command facet_wrap(), we specify a tilde and then the column to facet on (District in this case). You can place another column on the left side of the tilde, which creates one facet for each combination of the two columns, but we recommend you do this with facet_grid() instead. In this use case, one facet is created for each unique value of District.
-
# A plot with facets by district
-ggplot(malaria_data,
-mapping =aes(x = data_date, y = malaria_tot)) +
-geom_col(width =1, fill ="darkred") +# plot the count data as columns
-theme_minimal() +# simplify the background panels
-labs( # add plot labels, title, etc.
-x ="Date of report",
-y ="Malaria cases",
-title ="Malaria cases by district") +
-facet_wrap(~District) # the facets are created
+
# A plot with facets by district
+ggplot(malaria_data,
+mapping =aes(x = data_date, y = malaria_tot)) +
+geom_col(width =1, fill ="darkred") +# plot the count data as columns
+theme_minimal() +# simplify the background panels
+labs( # add plot labels, title, etc.
+x ="Date of report",
+y ="Malaria cases",
+title ="Malaria cases by district") +
+facet_wrap(~District) # the facets are created
@@ -1286,36 +1281,36 @@
facet_wrap()
facet_grid()
We can use a facet_grid() approach to cross two variables. Let’s say we want to cross District and age. Well, we need to do some data transformations on the age columns to get these data into ggplot-preferred “long” format. The age groups all have their own columns - we want them in a single column called age_group and another called num_cases. See the page on Pivoting data for more information on this process.
-
malaria_age <- malaria_data %>%
-select(-malaria_tot) %>%
-pivot_longer(
-cols =c(starts_with("malaria_rdt_")), # choose columns to pivot longer
-names_to ="age_group", # column names become age group
-values_to ="num_cases"# values to a single column (num_cases)
- ) %>%
-mutate(
-age_group =str_replace(age_group, "malaria_rdt_", ""),
-age_group = forcats::fct_relevel(age_group, "5-14", after =1))
+
malaria_age <- malaria_data %>%
+select(-malaria_tot) %>%
+pivot_longer(
+cols =c(starts_with("malaria_rdt_")), # choose columns to pivot longer
+names_to ="age_group", # column names become age group
+values_to ="num_cases"# values to a single column (num_cases)
+ ) %>%
+mutate(
+age_group =str_replace(age_group, "malaria_rdt_", ""),
+age_group = forcats::fct_relevel(age_group, "5-14", after =1))
Now the first 50 rows of data look like this:
-
-
+
+
When you pass the two variables to facet_grid(), the easiest approach is to use formula notation (e.g. x ~ y), where x defines the rows and y defines the columns. Here is the plot, using facet_grid() to show the plots for each combination of the columns age_group and District.
-
ggplot(malaria_age,
-mapping =aes(x = data_date, y = num_cases)) +
-geom_col(fill ="darkred", width =1) +
-theme_minimal() +
-labs(
-x ="Date of report",
-y ="Malaria cases",
-title ="Malaria cases by district and age group"
- ) +
-facet_grid(District ~ age_group)
+
ggplot(malaria_age,
+mapping =aes(x = data_date, y = num_cases)) +
+geom_col(fill ="darkred", width =1) +
+theme_minimal() +
+labs(
+x ="Date of report",
+y ="Malaria cases",
+title ="Malaria cases by district and age group"
+ ) +
+facet_grid(District ~ age_group)
@@ -1331,16 +1326,16 @@
Free or fixe
When using facet_wrap() or facet_grid(), we can add scales = "free_y" to “free” or release the y-axes of the panels to scale appropriately to their data subset. This is particularly useful if the actual counts are small for one of the subcategories and trends are otherwise hard to see. Instead of “free_y” we can also write “free_x” to do the same for the x-axis (e.g. for dates) or “free” for both axes. Note that in facet_grid, the y scales will be the same for facets in the same row, and the x scales will be the same for facets in the same column.
When using facet_grid only, we can add space = "free_y" or space = "free_x" so that the actual height or width of the facet is weighted to the values of the figure within. This only works if scales = "free" (y or x) is already applied.
-
# Free y-axis
-ggplot(malaria_data,
-mapping =aes(x = data_date, y = malaria_tot)) +
-geom_col(width =1, fill ="darkred") +# plot the count data as columns
-theme_minimal() +# simplify the background panels
-labs( # add plot labels, title, etc.
-x ="Date of report",
-y ="Malaria cases",
-title ="Malaria cases by district - 'free' x and y axes") +
-facet_wrap(~District, scales ="free") # the facets are created
+
# Free y-axis
+ggplot(malaria_data,
+mapping =aes(x = data_date, y = malaria_tot)) +
+geom_col(width =1, fill ="darkred") +# plot the count data as columns
+theme_minimal() +# simplify the background panels
+labs( # add plot labels, title, etc.
+x ="Date of report",
+y ="Malaria cases",
+title ="Malaria cases by district - 'free' x and y axes") +
+facet_wrap(~District, scales ="free") # the facets are created
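The space = argument described above can be sketched in the same way. This is an illustration only: the data frame toy_counts and its values are hypothetical, chosen so that one district has more facilities than the other.

```r
library(ggplot2)

# Hypothetical toy counts: "North" has four facilities, "South" has two
toy_counts <- data.frame(
  district = c("North", "North", "North", "North", "South", "South"),
  facility = c("A", "B", "C", "D", "E", "F"),
  cases    = c(120, 80, 95, 60, 8, 5)
)

free_space_plot <- ggplot(data = toy_counts,
                          mapping = aes(x = cases, y = facility)) +
  geom_col(fill = "darkred") +
  facet_grid(district ~ .,          # one row of panels per district
             scales = "free_y",     # each panel shows only its own facilities
             space  = "free_y")     # panel height follows the number of facilities

free_space_plot
```

With space = "free_y", the "North" panel should be drawn roughly twice as tall as the "South" panel, so the bars have the same thickness in both.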
@@ -1374,13 +1369,13 @@
Saving plots
By default when you run a ggplot() command, the plot will be printed to the Plots RStudio pane. However, you can also save the plot as an object by using the assignment operator <- and giving it a name. Then it will not print unless the object name itself is run. You can also print it by wrapping the plot name with print(), but this is only necessary in certain circumstances such as if the plot is created inside a for loop used to print multiple plots at once (see Iteration, loops, and lists page).
One nice thing about ggplot2 is that you can define a plot (as above), and then add layers to it starting with its name. You do not have to repeat all the commands that created the original plot!
For example, to modify the plot age_by_wt that was defined above, to include a vertical line at age 50, we would just add a + and begin adding additional layers to the plot.
-
age_by_wt +
-geom_vline(xintercept =50)
+
age_by_wt +
+geom_vline(xintercept =50)
@@ -1421,7 +1416,7 @@
Exporting plots
You can export as png, pdf, jpeg, tiff, bmp, svg, or several other file types, by specifying the file extension in the file path.
-
You can also specify the arguments width =, height =, and units = (either “in”, “cm”, or “mm”). You can also specify dpi = with a number for plot resolution (e.g. 300). See the function details by entering ?ggsave or reading the documentation online.
+
You can also specify the arguments width =, height =, and units = (either “in”, “cm”, or “mm”). You can also specify dpi = with a number for plot resolution (e.g. 300). You can also change the background of your plot with the argument bg =, where you specify the colour, e.g. bg = "white". See the function details by entering ?ggsave or reading the documentation online.
Remember that you can use here() syntax to provide the desired file path. See the Import and export page for more information.
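A minimal sketch of these ggsave() arguments is shown below. The plot toy_plot and the file name are hypothetical, and the file is written to a temporary folder so that the example does not depend on a project structure. In a real project you would likely build the path with here().

```r
library(ggplot2)

# Hypothetical small plot to export
toy_plot <- ggplot(data = data.frame(x = 1:3, y = c(2, 4, 6)),
                   mapping = aes(x = x, y = y)) +
  geom_point()

# The file extension (.png) determines the file type
out_file <- file.path(tempdir(), "toy_plot.png")

ggsave(
  filename = out_file,
  plot     = toy_plot,
  width    = 15,        # plot width...
  height   = 10,        # ...and height, in the units given below
  units    = "cm",
  dpi      = 300,       # resolution
  bg       = "white")   # background colour
```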
@@ -1440,22 +1435,22 @@
-
age_by_wt <-ggplot(
-data = linelist, # set data
-mapping =aes( # map aesthetics to column values
-x = age, # map x-axis to age
-y = wt_kg, # map y-axis to weight
-color = age)) +# map color to age
-geom_point() +# display data as points
-labs(
-title ="Age and weight distribution",
-subtitle ="Fictional Ebola outbreak, 2014",
-x ="Age in years",
-y ="Weight in kilos",
-color ="Age",
-caption = stringr::str_glue("Data as of {max(linelist$date_hospitalisation, na.rm=T)}"))
-
-age_by_wt
+
age_by_wt <-ggplot(
+data = linelist, # set data
+mapping =aes( # map aesthetics to column values
+x = age, # map x-axis to age
+y = wt_kg, # map y-axis to weight
+color = age)) +# map color to age
+geom_point() +# display data as points
+labs(
+title ="Age and weight distribution",
+subtitle ="Fictional Ebola outbreak, 2014",
+x ="Age in years",
+y ="Weight in kilos",
+color ="Age",
+caption = stringr::str_glue("Data as of {max(linelist$date_hospitalisation, na.rm=T)}"))
+
+age_by_wt
@@ -1471,7 +1466,7 @@
30.9 Themes
One of the best parts of ggplot2 is the amount of control you have over the plot - you can define anything! As mentioned above, the design of the plot that is not related to the data shapes/geometries is adjusted within the theme() function. Examples include the plot background color, the presence/absence of gridlines, and the font/size/color/alignment of text (titles, subtitles, captions, axis text…). These adjustments can be done in one of two ways:
-
Add a complete theme. * theme_() function to make sweeping adjustments - these include. theme_classic(), theme_minimal(), theme_dark(), theme_light()theme_grey(), theme_bw() among others.
+
Add a complete theme. The theme_*() family of functions makes sweeping adjustments - these include theme_classic(), theme_minimal(), theme_dark(), theme_light(), theme_grey(), and theme_bw(), among others.
Adjust each tiny aspect of the plot individually within theme().
@@ -1480,29 +1475,29 @@
Complete themes
As they are quite straightforward, we will demonstrate the complete theme functions below and will not describe them further here. Note that any micro-adjustments with theme() should be made after use of a complete theme.
The subtitle is italicized with element_text(face = "italic").
-
age_by_wt +
-theme_classic() +# pre-defined theme adjustments
-theme(
-legend.position ="bottom", # move legend to bottom
-
-plot.title =element_text(size =30), # size of title to 30
-plot.caption =element_text(hjust =0), # left-align caption
-plot.subtitle =element_text(face ="italic"), # italicize subtitle
-
-axis.text.x =element_text(color ="red", size =15, angle =90), # adjusts only x-axis text
-axis.text.y =element_text(size =15), # adjusts only y-axis text
-
-axis.title =element_text(size =20) # adjusts both axes titles
- )
+
age_by_wt +
+theme_classic() +# pre-defined theme adjustments
+theme(
+legend.position ="bottom", # move legend to bottom
+
+plot.title =element_text(size =30), # size of title to 30
+plot.caption =element_text(hjust =0), # left-align caption
+plot.subtitle =element_text(face ="italic"), # italicize subtitle
+
+axis.text.x =element_text(color ="red", size =15, angle =90), # adjusts only x-axis text
+axis.text.y =element_text(size =15), # adjusts only y-axis text
+
+axis.title =element_text(size =20) # adjusts both axes titles
+ )
@@ -1661,23 +1656,23 @@
The pipes that pass the dataset from function-to-function will transition to + once the ggplot() function is called. Note that in this case, there is no need to specify the data = argument, as this is automatically defined as the piped-in dataset.
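The transition from pipes to + can be sketched as follows. The data frame toy_linelist is hypothetical and stands in for linelist. Note that ggplot() receives the filtered data from the pipe, so no data = argument is written.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy linelist with one missing weight
toy_linelist <- data.frame(
  age   = c(5, 17, 34, 52, 70),
  wt_kg = c(18, 55, 68, 80, NA)
)

piped_plot <- toy_linelist %>%                  # pipes (%>%) pass the data along...
  filter(!is.na(wt_kg)) %>%                     # ...through any data-cleaning steps
  ggplot(mapping = aes(x = age, y = wt_kg)) +   # after ggplot(), layers are added with +
  geom_point()

piped_plot
```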
Heat plots for three continuous variables (linked to Heat plots page)
+
Heat plots for three continuous variables (linked to Heat plots page).
Histograms
@@ -1716,30 +1711,30 @@
Histograms
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
If you do not want to specify a number of bins to bins =, you could alternatively specify binwidth = in the units of the axis. We give a few examples showing different bins and bin widths:
-
# A) Regular histogram
-ggplot(data = linelist,
-mapping =aes(x = age)) +# provide x variable
-geom_histogram() +
-labs(title ="A) Default histogram (30 bins)")
-
-# B) More bins
-ggplot(data = linelist,
-mapping =aes(x = age)) +# provide x variable
-geom_histogram(bins =50) +
-labs(title ="B) Set to 50 bins")
-
-# C) Fewer bins
-ggplot(data = linelist,
-mapping =aes(x = age)) +# provide x variable
-geom_histogram(bins =5) +
-labs(title ="C) Set to 5 bins")
-
-
-# D) More bins
-ggplot(data = linelist,
-mapping =aes(x = age)) +# provide x variable
-geom_histogram(binwidth =1) +
-labs(title ="D) binwidth of 1")
+
# A) Regular histogram
+ggplot(data = linelist,
+mapping =aes(x = age)) +# provide x variable
+geom_histogram() +
+labs(title ="A) Default histogram (30 bins)")
+
+# B) More bins
+ggplot(data = linelist,
+mapping =aes(x = age)) +# provide x variable
+geom_histogram(bins =50) +
+labs(title ="B) Set to 50 bins")
+
+# C) Fewer bins
+ggplot(data = linelist,
+mapping =aes(x = age)) +# provide x variable
+geom_histogram(bins =5) +
+labs(title ="C) Set to 5 bins")
+
+
+# D) More bins
+ggplot(data = linelist,
+mapping =aes(x = age)) +# provide x variable
+geom_histogram(binwidth =1) +
+labs(title ="D) binwidth of 1")
To get smoothed proportions, you can use geom_density():
-
# Frequency with proportion axis, smoothed
-ggplot(data = linelist,
-mapping =aes(x = age)) +
-geom_density(size =2, alpha =0.2) +
-labs(title ="Proportional density")
-
-# Stacked frequency with proportion axis, smoothed
-ggplot(data = linelist,
-mapping =aes(x = age, fill = gender)) +
-geom_density(size =2, alpha =0.2, position ="stack") +
-labs(title ="'Stacked' proportional densities")
+
# Frequency with proportion axis, smoothed
+ggplot(data = linelist,
+mapping =aes(x = age)) +
+geom_density(size =2, alpha =0.2) +
+labs(title ="Proportional density")
+
+# Stacked frequency with proportion axis, smoothed
+ggplot(data = linelist,
+mapping =aes(x = age, fill = gender)) +
+geom_density(size =2, alpha =0.2, position ="stack") +
+labs(title ="'Stacked' proportional densities")
@@ -1807,29 +1802,29 @@
Histograms
Each is shown below (note the use of color = vs. fill = in each):
-
# "Stacked" histogram
-ggplot(data = linelist,
-mapping =aes(x = age, fill = gender)) +
-geom_histogram(binwidth =2) +
-labs(title ="'Stacked' histogram")
-
-# Frequency
-ggplot(data = linelist,
-mapping =aes(x = age, color = gender)) +
-geom_freqpoly(binwidth =2, size =2) +
-labs(title ="Freqpoly")
-
-# Frequency with proportion axis
-ggplot(data = linelist,
-mapping =aes(x = age, y =after_stat(density), color = gender)) +
-geom_freqpoly(binwidth =5, size =2) +
-labs(title ="Proportional freqpoly")
-
-# Frequency with proportion axis, smoothed
-ggplot(data = linelist,
-mapping =aes(x = age, y =after_stat(density), fill = gender)) +
-geom_density(size =2, alpha =0.2) +
-labs(title ="Proportional, smoothed with geom_density()")
+
# "Stacked" histogram
+ggplot(data = linelist,
+mapping =aes(x = age, fill = gender)) +
+geom_histogram(binwidth =2) +
+labs(title ="'Stacked' histogram")
+
+# Frequency
+ggplot(data = linelist,
+mapping =aes(x = age, color = gender)) +
+geom_freqpoly(binwidth =2, size =2) +
+labs(title ="Freqpoly")
+
+# Frequency with proportion axis
+ggplot(data = linelist,
+mapping =aes(x = age, y =after_stat(density), color = gender)) +
+geom_freqpoly(binwidth =5, size =2) +
+labs(title ="Proportional freqpoly")
+
+# Frequency with proportion axis, smoothed
+ggplot(data = linelist,
+mapping =aes(x = age, y =after_stat(density), fill = gender)) +
+geom_density(size =2, alpha =0.2) +
+labs(title ="Proportional, smoothed with geom_density()")
@@ -1878,16 +1873,16 @@
Box plots
When using geom_boxplot() to create a box plot, you generally map only one axis (x or y) within aes(). The axis specified determines if the plots are horizontal or vertical.
In most geoms, you create a plot per group by mapping an aesthetic like color = or fill = to a column within aes(). However, for box plots you achieve this by assigning the grouping column to the un-assigned axis (x or y). Below is code for a boxplot of all age values in the dataset, and second is code to display one box plot for each gender in the dataset. Note that NA (missing) values will appear as a separate box plot unless removed. In this example we also set the fill to the column gender so each plot is a different color - but this is not necessary.
-
# A) Overall boxplot
-ggplot(data = linelist) +
-geom_boxplot(mapping =aes(y = age)) +# only y axis mapped (not x)
-labs(title ="A) Overall boxplot")
-
-# B) Box plot by group
-ggplot(data = linelist, mapping =aes(y = age, x = gender, fill = gender)) +
-geom_boxplot() +
-theme(legend.position ="none") +# remove legend (redundant)
-labs(title ="B) Boxplot by gender")
+
# A) Overall boxplot
+ggplot(data = linelist) +
+geom_boxplot(mapping =aes(y = age)) +# only y axis mapped (not x)
+labs(title ="A) Overall boxplot")
+
+# B) Box plot by group
+ggplot(data = linelist, mapping =aes(y = age, x = gender, fill = gender)) +
+geom_boxplot() +
+theme(legend.position ="none") +# remove legend (redundant)
+labs(title ="B) Boxplot by gender")
@@ -1909,23 +1904,23 @@
Box plots
Violin, jitter, and sina plots
Below is code for creating violin plots (geom_violin) and jitter plots (geom_jitter) to show distributions. You can specify that the fill or color is also determined by the data, by inserting these options within aes().
-
# A) Jitter plot by group
-ggplot(data = linelist %>%drop_na(outcome), # remove missing values
-mapping =aes(y = age, # Continuous variable
-x = outcome, # Grouping variable
-color = outcome)) +# Color variable
-geom_jitter() +# Create the violin plot
-labs(title ="A) jitter plot by gender")
-
-
-
-# B) Violin plot by group
-ggplot(data = linelist %>%drop_na(outcome), # remove missing values
-mapping =aes(y = age, # Continuous variable
-x = outcome, # Grouping variable
-fill = outcome)) +# fill variable (color)
-geom_violin() +# create the violin plot
-labs(title ="B) violin plot by gender")
+
# A) Jitter plot by group
+ggplot(data = linelist %>%drop_na(outcome), # remove missing values
+mapping =aes(y = age, # Continuous variable
+x = outcome, # Grouping variable
+color = outcome)) +# Color variable
+geom_jitter() +# Create the jitter plot
+labs(title ="A) jitter plot by gender")
+
+
+
+# B) Violin plot by group
+ggplot(data = linelist %>%drop_na(outcome), # remove missing values
+mapping =aes(y = age, # Continuous variable
+x = outcome, # Grouping variable
+fill = outcome)) +# fill variable (color)
+geom_violin() +# create the violin plot
+labs(title ="B) violin plot by gender")
You can combine the two using the geom_sina() function from the ggforce package. A sina plot arranges the jittered points in the shape of the violin plot. When overlaid on the violin plot (adjusting the transparencies) this can be easier to visually interpret.
-
# A) Sina plot by group
-ggplot(
-data = linelist %>%drop_na(outcome),
-mapping =aes(y = age, # numeric variable
-x = outcome)) +# group variable
-geom_violin(
-mapping =aes(fill = outcome), # fill (color of violin background)
-color ="white", # white outline
-alpha =0.2) +# transparency
-geom_sina(
-size=1, # Change the size of the jitter
-mapping =aes(color = outcome)) +# color (color of dots)
-scale_fill_manual( # Define fill for violin background by death/recover
-values =c("Death"="#bf5300",
-"Recover"="#11118c")) +
-scale_color_manual( # Define colours for points by death/recover
-values =c("Death"="#bf5300",
-"Recover"="#11118c")) +
-theme_minimal() +# Remove the gray background
-theme(legend.position ="none") +# Remove unnecessary legend
-labs(title ="B) violin and sina plot by gender, with extra formatting")
+
# A) Sina plot by group
+ggplot(
+data = linelist %>%drop_na(outcome),
+mapping =aes(y = age, # numeric variable
+x = outcome)) +# group variable
+geom_violin(
+mapping =aes(fill = outcome), # fill (color of violin background)
+color ="white", # white outline
+alpha =0.2) +# transparency
+geom_sina(
+size=1, # Change the size of the jitter
+mapping =aes(color = outcome)) +# color (color of dots)
+scale_fill_manual( # Define fill for violin background by death/recover
+values =c("Death"="#bf5300",
+"Recover"="#11118c")) +
+scale_color_manual( # Define colours for points by death/recover
+values =c("Death"="#bf5300",
+"Recover"="#11118c")) +
+theme_minimal() +# Remove the gray background
+theme(legend.position ="none") +# Remove unnecessary legend
+labs(title ="B) violin and sina plot by gender, with extra formatting")
And now the data are re-plotted, with location_name being an ordered factor:
@@ -1975,7 +1969,7 @@
diff --git a/html_outputs/new_pages/heatmaps_files/figure-html/unnamed-chunk-36-1.png b/html_outputs/new_pages/heatmaps_files/figure-html/unnamed-chunk-36-1.png
index 168ca59f..8a4364a4 100644
Binary files a/html_outputs/new_pages/heatmaps_files/figure-html/unnamed-chunk-36-1.png and b/html_outputs/new_pages/heatmaps_files/figure-html/unnamed-chunk-36-1.png differ
diff --git a/html_outputs/new_pages/heatmaps_files/figure-html/unnamed-chunk-37-1.png b/html_outputs/new_pages/heatmaps_files/figure-html/unnamed-chunk-37-1.png
index 95c42cdf..af72c647 100644
Binary files a/html_outputs/new_pages/heatmaps_files/figure-html/unnamed-chunk-37-1.png and b/html_outputs/new_pages/heatmaps_files/figure-html/unnamed-chunk-37-1.png differ
diff --git a/html_outputs/new_pages/help.html b/html_outputs/new_pages/help.html
index f5531305..dd86c56c 100644
--- a/html_outputs/new_pages/help.html
+++ b/html_outputs/new_pages/help.html
@@ -7,7 +7,7 @@
-48 Getting help – The Epidemiologist R Handbook
+49 Getting help – The Epidemiologist R Handbook
@@ -1883,8 +1883,8 @@
gtsummary
-
-
19.2.1 Cross-tabulation
+
+
19.1.1 Cross-tabulation
The gtsummary package also allows us to quickly and easily create tables of counts. This can be useful for quickly summarising the data, and putting it in context with the regression we have carried out.
Here we define stratified regression as the process of carrying out separate regression analyses on different “groups” of data.
Sometimes in your analysis, you will want to investigate whether the relationship between an outcome and other variables differs between strata. The strata could be, for example, gender, age group, or source of infection.
To do this, you will want to split your dataset into the strata of interest. For example, creating two separate datasets of gender == "f" and gender == "m", would be done by:
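A minimal sketch of such a split is shown below, assuming the dplyr package is available. The data frame toy_linelist and its values are hypothetical. The object names f_linelist and m_linelist are chosen for illustration only.

```r
library(dplyr)

# Hypothetical toy linelist with a gender column
toy_linelist <- data.frame(
  gender  = c("f", "m", "f", "m", NA),
  age     = c(23, 41, 35, 58, 12),
  outcome = c(1, 0, 0, 1, 1)
)

# One dataset per stratum; rows with missing gender fall into neither
f_linelist <- toy_linelist %>%
  filter(gender == "f")

m_linelist <- toy_linelist %>%
  filter(gender == "m")
```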
@@ -2588,8 +2588,8 @@
dplyr::select(explanatory_vars, outcome) ## select variables of interest
Once this has been done, you can carry out your regression in either base R or gtsummary.
-
-
19.3.1 base R
+
+
19.2.1 base R
To carry this out in base R, you run two different regressions, one where gender == "f" and one where gender == "m".
#Run model for f
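The two stratified regressions can be sketched in base R as below. This is an illustration under stated assumptions, not the page's own code: the data frame toy, its values, and the model formula outcome ~ age are all hypothetical.

```r
# Hypothetical toy data: binary outcome, age, and gender
toy <- data.frame(
  gender  = rep(c("f", "m"), each = 8),
  age     = c(20, 25, 30, 35, 40, 45, 50, 55,
              22, 27, 32, 37, 42, 47, 52, 57),
  outcome = c(0, 0, 1, 0, 1, 0, 1, 1,
              0, 1, 0, 0, 1, 1, 0, 1)
)

# Run model for f
f_model <- glm(outcome ~ age,
               family = "binomial",
               data = subset(toy, gender == "f"))

# Run model for m
m_model <- glm(outcome ~ age,
               family = "binomial",
               data = subset(toy, gender == "m"))

# Odds ratios for each stratum
exp(coef(f_model))
exp(coef(m_model))
```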
@@ -2618,8 +2618,8 @@
-
-
19.3.2 gtsummary
+
+
19.2.2 gtsummary
The same approach is repeated using gtsummary. However, with gtsummary it is easier to produce publication-ready tables, and you can compare the two tables with the function tbl_merge().
#Run model for f
@@ -2653,23 +2653,23 @@
#Print
f_and_m_table
-
-
@@ -3177,8 +3177,8 @@
-
-
19.4 Multivariable
+
+
19.3 Multivariable
For multivariable analysis, we again present two approaches:
This section shows how to produce a plot with the outputs of your regression. There are two options: you can build a plot yourself using ggplot2, or use a meta-package called easystats (a package that includes many packages).
See the page on ggplot basics if you are unfamiliar with the ggplot2 plotting package.
@@ -4690,8 +4690,8 @@
easy
-
-
19.6 Model performance
+
+
19.5 Model performance
Once you have built your regression models, you may want to assess how well the model has fit the data. There are many different approaches to do this, and many different metrics with which to assess your model fit, and how it compares with other model formulations. How you assess your model fit will depend on your model, the data, and the context in which you are conducting your work.
While there are many different functions, and many different packages, to assess model fit, one package that nicely combines several different metrics and approaches into a single source is the performance package. This package allows you to assess model assumptions (such as linearity, homogeneity, highlight outliers, etc.) and check how well the model performs (Akaike Information Criterion values, R2, RMSE, etc) with a few simple functions.
Unfortunately, we are unable to use this package with gtsummary, but it readily accepts objects generated by other packages such as stats, lmerMod and tidymodels. Here we will demonstrate its application using the function glm() for a multivariable regression. To do this we can use the function performance() to assess model fit, and compare_performance() to compare the two models.
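A minimal sketch of these two steps is below, assuming the performance package is installed. It uses R's built-in mtcars data rather than the handbook's linelist, and the two model formulas are chosen for illustration only. The sketch calls model_performance(), the long-form name that the package documents alongside performance().

```r
library(performance)

# Two illustrative logistic regressions on the built-in mtcars data
model_small <- glm(am ~ wt,      family = "binomial", data = mtcars)
model_large <- glm(am ~ wt + hp, family = "binomial", data = mtcars)

# Fit metrics (e.g. AIC, RMSE) for a single model
model_performance(model_small)

# Side-by-side comparison of the two models
comparison <- compare_performance(model_small, model_large)
comparison
```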
@@ -4746,8 +4746,8 @@
For further reading on the performance package, and the model tests you can carry out, see their github.
-
-
19.7 Resources
+
+
19.6 Resources
The content of this page was informed by these resources and vignettes online:
To create a R Markdown output, you need to have the following installed:
-
The rmarkdown package (knitr will also be installed automatically).
-
-
Pandoc, which should come installed with RStudio. If you are not using RStudio, you can download Pandoc here: http://pandoc.org.
+
The rmarkdown package (knitr will also be installed automatically)
+
Pandoc, which should come installed with RStudio. If you are not using RStudio, you can download Pandoc here.
If you want to generate PDF output (a bit trickier), you will need to install LaTeX. For R Markdown users who have not installed LaTeX before, we recommend that you install TinyTeX. You can use the following commands:
@@ -929,7 +929,7 @@
YAML metadata
The YAML should begin with metadata for the document. The order of these primary YAML parameters (not indented) does not matter. For example:
title: "My document"
author: "Me"
-date:"2024-10-01"
+date:"2024-10-18"
You can use R code in YAML values by writing it as in-line code (preceded by r within back-ticks) but also within quotes (see above example for date:).
In the image above, because we clicked that our default output would be an html file, we can see that the YAML says output: html_document. However we can also change this to say powerpoint_presentation or word_document or even pdf_document.
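To illustrate the YAML fields discussed above, the sketch below writes a small Rmd file with a changed output: value and an in-line R date, then reads the header back with rmarkdown::yaml_front_matter(). The file is hypothetical and is written to a temporary location. Nothing is rendered here, so Pandoc is not needed.

```r
library(rmarkdown)

# Write a hypothetical Rmd file with a YAML header
rmd_file <- tempfile(fileext = ".Rmd")

writeLines(c(
  "---",
  'title: "My document"',
  'author: "Me"',
  'date: "`r Sys.Date()`"',     # in-line R code, within quotes
  "output: word_document",      # changed from html_document
  "---",
  "",
  "Body text."
), rmd_file)

# Read the YAML metadata back as a named list
front_matter <- yaml_front_matter(rmd_file)

front_matter$output
```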
@@ -945,9 +945,9 @@
New lines
Case
Surround your normal text with these characters to change how it appears in the output.
-
Underscores (_text_) or single asterisk (*text*) to italicise.
-
Double asterisks (**text**) for bold text.
-
Back-ticks (`text`) to display text as code.
+
Underscores (_text_) or single asterisk (*text*) to italicise
+
Double asterisks (**text**) for bold text
+
Back-ticks (`text`) to display text as code
The actual appearance of the font can be set by using specific templates (specified in the YAML metadata; see example tabs).
@@ -999,28 +999,21 @@
Code chunks
You can create a new chunk by typing it out yourself, by using the keyboard shortcut “Ctrl + Alt + i” (or Cmd + Option + i on Mac), or by clicking the green ‘insert a new code chunk’ icon at the top of your script editor.
Some notes about the contents of the curly brackets { }:
-
They start with ‘r’ to indicate that the language name within the chunk is R.
-
After the r you can optionally write a chunk “name” – these are not necessary but can help you organise your work. Note that if you name your chunks, you should ALWAYS use unique names or else R will complain when you try to render.
+
They start with ‘r’ to indicate that the language name within the chunk is R
+
After the r you can optionally write a chunk “name” – these are not necessary but can help you organise your work. Note that if you name your chunks, you should ALWAYS use unique names or else R will complain when you try to render
The curly brackets can include other options too, written as tag=value, such as:
-
eval = FALSE to not run the R code.
-
-
echo = FALSE to not print the chunk’s R source code in the output document.
-
-
warning = FALSE to not print warnings produced by the R code.
-
-
message = FALSE to not print any messages produced by the R code.
-
-
include = either TRUE/FALSE whether to include chunk outputs (e.g. plots) in the document.
-
out.width = and out.height = - provide in style out.width = "75%".
-
-
fig.align = "center" adjust how a figure is aligned across the page.
-
-
fig.show='hold' if your chunk prints multiple figures and you want them printed next to each other (pair with out.width = c("33%", "67%"). Can also set as fig.show='asis' to show them below the code that generates them, 'hide' to hide, or 'animate' to concatenate multiple into an animation.
-
-
A chunk header must be written in one line.
-
Try to avoid periods, underscores, and spaces. Use hyphens ( - ) instead if you need a separator.
+
eval = FALSE to not run the R code
+
echo = FALSE to not print the chunk’s R source code in the output document
+
warning = FALSE to not print warnings produced by the R code
+
message = FALSE to not print any messages produced by the R code
+
include = TRUE or FALSE to set whether chunk outputs (e.g. plots) are included in the document
+
out.width = and out.height = - provide in style out.width = "75%"
+
fig.align = "center" adjust how a figure is aligned across the page
+
fig.show='hold' if your chunk prints multiple figures and you want them printed next to each other (pair with out.width = c("33%", "67%")). Can also set as fig.show='asis' to show them below the code that generates them, 'hide' to hide, or 'animate' to concatenate multiple into an animation
+
A chunk header must be written in one line
+
Try to avoid periods, underscores, and spaces. Use hyphens ( - ) instead if you need a separator
Read more extensively about the knitr options here.
Some of the above options can be configured with point-and-click using the setting buttons at the top right of the chunk. Here, you can specify which parts of the chunk you want the rendered document to include, namely the code, the outputs, and the warnings. This will come out as written preferences within the curly brackets, e.g. echo=FALSE if you specify you want to ‘Show output only’.
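The same options can also be set once for the whole document rather than chunk by chunk. A minimal sketch, assuming it is placed in a setup chunk near the top of the Rmd file, uses knitr::opts_chunk$set(). The particular values chosen here are for illustration only.

```r
library(knitr)

# Set default options for all chunks in the document;
# individual chunks can still override them in their own { }
opts_chunk$set(
  echo      = FALSE,      # do not print R source code
  warning   = FALSE,      # do not print warnings
  message   = FALSE,      # do not print messages
  fig.align = "center",   # center figures on the page
  out.width = "75%")      # figure width in the output

# Check the current default for one option
opts_chunk$get("echo")
```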
@@ -1129,6 +1122,16 @@
Tabbed sections
You can add an additional option .tabset-pills after .tabset to give the tabs themselves a “pilled” appearance. Be aware that when viewing the tabbed HTML output, the Ctrl+f search functionality will only search “active” tabs, not hidden tabs.
+
+
+
remedy
+
remedy is an addin for RStudio which helps with writing R Markdown scripts. It provides a user interface and a series of keyboard shortcuts to format your text.
+
This package is installed directly from GitHub.
+
+
remotes::install_github("ThinkR-open/remedy")
+
+
Once installed, the package does not need to be re-loaded. It will automatically load when you start RStudio.
Everything you need to run the R markdown is imported or created within the Rmd file, including all the code chunks and package loading. This “self-contained” approach is appropriate when you do not need to do much data processing (e.g. it brings in a clean or semi-clean data file) and the rendering of the R Markdown will not take too long.
In this scenario, one logical organization of the R Markdown script might be:
-
Set global knitr options.
-
-
Load packages.
-
-
Import data.
-
-
Process data.
-
-
Produce outputs (tables, plots, etc.).
-
-
Save outputs, if applicable (.csv, .png, etc.).
+
Set global knitr options
+
Load packages
+
Import data
+
Process data
+
Produce outputs (tables, plots, etc.)
+
Save outputs, if applicable (.csv, .png, etc.)
Source other files
One variation of the “self-contained” approach is to have R Markdown code chunks “source” (run) other R scripts. This can make your R Markdown script less cluttered, simpler, and easier to organize. It can also help if you want to display final figures at the beginning of the report. In this approach, the final R Markdown script simply combines pre-processed outputs into a document.
One way to do this is by providing the R scripts (file path and name with extension) to the base R command source().
-
source("your-script.R", local = knitr::knit_global())
-# or sys.source("your-script.R", envir = knitr::knit_global())
+
source("your-script.R", local = knitr::knit_global())
+# or sys.source("your-script.R", envir = knitr::knit_global())
Note that when using source() within the R Markdown, the external files will still be run during the course of rendering your Rmd file. Therefore, each script is run every time you render the report. Thus, having these source() commands within the R Markdown does not speed up your run time, nor does it greatly assist with de-bugging, as any errors produced will still be printed when rendering the R Markdown.
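As a minimal sketch of how source() places objects into a chosen environment, the base R example below writes a throwaway script to a temporary file and sources it into a new environment. The script contents and the object name clean_n are hypothetical; inside an R Markdown chunk you would likely pass knitr::knit_global() instead of report_env.

```r
# Write a small, hypothetical helper script to a temporary file
script_path <- tempfile(fileext = ".R")
writeLines("clean_n <- 40 + 2", script_path)

# Source the script into a dedicated environment
# (in an R Markdown chunk this would be knitr::knit_global())
report_env <- new.env()
source(script_path, local = report_env)

# Objects created by the script are now available in that environment
report_env$clean_n
```

The script is executed in full each time source() is called, which is why this approach does not shorten rendering time.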
An alternative is to utilize the child = knitr option.
@@ -1221,16 +1219,15 @@
Runfile
For instance, you can load the packages, load and clean the data, and even create the graphs of interest prior to render(). These steps can occur in the R script, or in other scripts that are sourced. As long as these commands occur in the same RStudio session and objects are saved to the environment, the objects can then be called within the Rmd content. Then the R Markdown itself will only be used for the final step - to produce the output with all the pre-processed objects. This is much easier to de-bug if something goes wrong.
This approach is helpful for the following reasons:
-
More informative error messages - these messages will be generated from the R script, not the R Markdown. R Markdown errors tend to tell you which chunk had a problem, but will not tell you which line.
-
-
If applicable, you can run long processing steps in advance of the render() command - they will run only once.
+
More informative error messages - these messages will be generated from the R script, not the R Markdown. R Markdown errors tend to tell you which chunk had a problem, but will not tell you which line
+
If applicable, you can run long processing steps in advance of the render() command - they will run only once
In the example below, we have a separate R script in which we pre-process a data object into the R Environment and then render the “create_output.Rmd” using render().
-
data <- import("datafile.csv") %>%   # Load data and save to environment
-  select(age, hospital, weight)      # Select limited columns
-
-rmarkdown::render(input = "create_output.Rmd")   # Render the Rmd file
+
data <- import("datafile.csv") %>%   # Load data and save to environment
+  select(age, hospital, weight)      # Select limited columns
+
+rmarkdown::render(input = "create_output.Rmd")   # Render the Rmd file
@@ -1266,16 +1263,13 @@
Option 1:
Option 2: render() command
Another way to produce your R Markdown output is to run the render() function (from the rmarkdown package). You must execute this command outside the R Markdown script - so either in a separate R script (often called a “run file”), or as a stand-alone command in the R Console.
-
rmarkdown::render(input = "my_report.Rmd")
+
rmarkdown::render(input = "my_report.Rmd")
As with “knit”, the default settings will save the Rmd output to the same folder as the Rmd script, with the same file name (aside from the file extension). For instance, “my_report.Rmd” when knitted will create “my_report.docx” if you are knitting to a Word document. However, by using render() you have the option to use different settings. render() can accept arguments including:
-
output_format = This is the output format to convert to (e.g. "html_document", "pdf_document", "word_document", or "all"). You can also specify this in the YAML inside the R Markdown script.
-
-
output_file = This is the name of the output file (and file path). This can be created via R functions like here() or str_glue() as demonstrated below.
-
-
output_dir = This is an output directory (folder) to save the file. This allows you to choose an alternative other than the directory the Rmd file is saved to.
-
+
output_format = This is the output format to convert to (e.g. "html_document", "pdf_document", "word_document", or "all"). You can also specify this in the YAML inside the R Markdown script
+
output_file = This is the name of the output file (and file path). This can be created via R functions like here() or str_glue() as demonstrated below
+
output_dir = This is an output directory (folder) to save the file. This allows you to choose an alternative other than the directory the Rmd file is saved to
output_options = You can provide a list of options that will override those in the script YAML (e.g. )
output_yaml = You can provide the path to a .yml file that contains YAML specifications
@@ -1285,9 +1279,9 @@
Option
As one example, to improve version control, the following command will save the output file within an ‘outputs’ sub-folder, with the current date in the file name. To create the file name, the function str_glue() from the stringr package is used to ‘glue’ together static strings (written plainly) with dynamic R code (written in curly brackets). For instance, if it is April 10th 2021, the file name from below will be “Report_2021-04-10.docx”. See the page on Characters and strings for more details on str_glue().
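A minimal sketch of the file-name construction, assuming a fixed example date so that the result is reproducible. Base R's sprintf() and format() stand in for str_glue() here so that the snippet runs without extra packages; the render() call is shown only as a comment because it needs an actual "my_report.Rmd" file.

```r
# Fixed example date (in practice you would likely use Sys.Date())
report_date <- as.Date("2021-04-10")

# Build the output file name from static text and the formatted date
output_name <- sprintf("Report_%s.docx", format(report_date, "%Y-%m-%d"))

# Place the file within an 'outputs' sub-folder
output_path <- file.path("outputs", output_name)
output_path

# The path could then be supplied to render(), for example:
# rmarkdown::render(input = "my_report.Rmd", output_file = output_path)
```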
As the file renders, the RStudio Console will show you the rendering progress up to 100%, and a final message to indicate that the rendering is complete.
@@ -1309,13 +1303,13 @@
Setting para
Option 1: Set parameters within YAML
Edit the YAML to include a params: option, with indented statements for each parameter you want to define. In this example we create parameters date and hospital, for which we specify values. These values are subject to change each time the report is run. If you use the “Knit” button to produce the output, the parameters will have these default values. Likewise, if you use render() the parameters will have these default values unless otherwise specified in the render() command.
In the background, these parameter values are contained within a read-only list called params. Thus, you can insert the parameter values in R code as you would another R object/value in your environment. Simply type params$ followed by the parameter name. For example params$hospital to represent the hospital name (“Central Hospital” by default).
Note that parameters can also hold values true or false, and so these can be included in your knitr options for an R chunk. For example, you can set {r, eval=params$run} instead of {r, eval=FALSE}, and now whether the chunk runs or not depends on the value of a parameter run:.
Note that for parameters that are dates, they will be input as a string. So for params$date to be interpreted in R code it will likely need to be wrapped with as.Date() or a similar function to convert to class Date.
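As a small sketch of the date conversion described above, the example below uses an ordinary list as a stand-in for the read-only params list that R Markdown creates from the YAML. The 14-day window is a hypothetical use, included only to show that date arithmetic works after conversion.

```r
# Stand-in for the read-only `params` list that R Markdown creates from the YAML
params <- list(date = "2021-04-10", hospital = "Central Hospital")

# Convert the date string to class Date before using it in date arithmetic
report_date <- as.Date(params$date)
class(report_date)

# Example use: the start of a 14-day window ending on the report date
window_start <- report_date - 14
window_start
```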
However, typing values into this pop-up window is subject to error and spelling mistakes. You may prefer to add restrictions to the values that can be entered through drop-down menus. You can do this by adding in the YAML several specifications for each params: entry.
-
label: is the title for that particular drop-down menu.
-
-
value: is the default (starting) value.
-
-
input: set to select for drop-down menu.
-
-
choices: provide the eligible values in the drop-down menu.
+
label: is the title for that particular drop-down menu
+
value: is the default (starting) value
+
input: set to select for drop-down menu
+
choices: provide the eligible values in the drop-down menu
Below, these specifications are written for the hospital parameter.
-
---
-title: Surveillance report
-output: html_document
-params:
-  date: 2021-04-10
-  hospital:
-    label: "Town:"
-    value: Central Hospital
-    input: select
-    choices: [Central Hospital, Military Hospital, Port Hospital, St. Mark's Maternity Hospital (SMMH)]
----
+
---
+title: Surveillance report
+output: html_document
+params:
+  date: 2021-04-10
+  hospital:
+    label: "Town:"
+    value: Central Hospital
+    input: select
+    choices: [Central Hospital, Military Hospital, Port Hospital, St. Mark's Maternity Hospital (SMMH)]
+---
When knitting (either via the ‘knit with parameters’ button or by render()), the pop-up window will have drop-down options to select from.
@@ -1415,14 +1406,14 @@
If you are rendering an R Markdown file with render() from a separate script, you can actually create the impact of parameterization without using the params: functionality.
For instance, in the R script that contains the render() command, you can simply define hospital and date as two R objects (values) before the render() command. In the R Markdown, you would not need to have a params: section in the YAML, and we would refer to the date object rather than params$date and hospital rather than params$hospital.
-
# This is an R script that is separate from the R Markdown
-
-# define R objects
-hospital <- "Central Hospital"
-date <- "2021-04-10"
-
-# Render the R markdown
-rmarkdown::render(input = "create_output.Rmd")
+
# This is an R script that is separate from the R Markdown
+
+# define R objects
+hospital <- "Central Hospital"
+date <- "2021-04-10"
+
+# Render the R markdown
+rmarkdown::render(input = "create_output.Rmd")
Following this approach means you cannot “knit with parameters”, use the GUI, or include knitting options within the parameters. However, it allows for simpler code, which may be advantageous.
@@ -1433,10 +1424,10 @@
We may want to run a report multiple times, varying the input parameters, to produce a report for each jurisdiction/unit. This can be done using tools for iteration, which are explained in detail in the page on Iteration, loops, and lists. Options include the purrr package, or use of a for loop as explained below.
Below, we use a simple for loop to generate a surveillance report for all hospitals of interest. This is done with one command (instead of manually changing the hospital parameter one-at-a-time). The command to render the reports must exist in a separate script outside the report Rmd. This script will also contain defined objects to “loop through” - today’s date, and a vector of hospital names to loop through.
We then feed these values one-at-a-time into the render() command using a loop, which runs the command once for each value in the hospitals vector. The letter i represents the index position (1 through 4) of the hospital currently being used in that iteration, such that hospital_list[1] would be “Central Hospital”. This information is supplied in two places in the render() command:
@@ -1445,12 +1436,12 @@
To params = such that the Rmd uses the hospital name internally whenever the params$hospital value is called (e.g. to filter the dataset to the particular hospital only). In this example, four files would be created - one for each hospital.
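The loop described above might be sketched as follows. This is an illustration under stated assumptions, not the handbook's exact code: the Rmd file name "surveillance_report.Rmd" and the file-name pattern are hypothetical, base R's paste0() is used to build the names, and the render() call is commented out because it requires an Rmd file with a hospital parameter.

```r
# Vector of hospital names to loop through (taken from the YAML example)
hospital_list <- c("Central Hospital", "Military Hospital", "Port Hospital",
                   "St. Mark's Maternity Hospital (SMMH)")

# Today's date, used in every file name
report_date <- Sys.Date()

# Collect the output file names as the loop runs
output_files <- character(length(hospital_list))

for (i in seq_along(hospital_list)) {
  # First use of the hospital name: in the output file name
  output_files[i] <- paste0("Report_", hospital_list[i], "_", report_date, ".docx")

  # Second use: as a parameter passed to the Rmd (hypothetical file name;
  # uncomment once "surveillance_report.Rmd" exists with a `hospital` parameter)
  # rmarkdown::render(
  #   input       = "surveillance_report.Rmd",
  #   output_file = output_files[i],
  #   params      = list(hospital = hospital_list[i])
  # )
}

output_files
```

Using seq_along() rather than 1:4 keeps the loop correct if hospitals are added to or removed from the vector.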
Unfortunately, editing powerpoint files is slightly less flexible:
A first level header (# Header 1) will automatically become the title of a new slide,
-
A ## Header 2 text will not come up as a subtitle but as text within the slide’s main textbox (unless you find a way to manipulate the Master view).
-
Outputted plots and tables will automatically go into new slides. You will need to combine them, for instance with the patchwork package to combine ggplots, so that they show up on the same page. See this blog post about using the patchwork package to put multiple images on one slide.
+
A ## Header 2 text will not come up as a subtitle but as text within the slide’s main textbox (unless you find a way to manipulate the Master view)
+
Outputted plots and tables will automatically go into new slides. You will need to combine them, for instance with the patchwork package to combine ggplots, so that they show up on the same page. See this blog post about using the patchwork package to put multiple images on one slide
See the officer package for a tool to work more in-depth with powerpoint presentations.
@@ -1502,17 +1493,17 @@
Powerpoint
Integrating templates into the YAML
Once a template is prepared, the detail of this can be added in the YAML of the Rmd underneath the ‘output’ line and underneath where the document type is specified (which goes to a separate line itself). Note reference_doc can be used for powerpoint slide templates.
It is easiest to save the template in the same folder as where the Rmd file is (as in the example below), or in a subfolder within.
Highlight: Configuring this changes the look of highlighted text (e.g. code within chunks that are shown). Supported styles include default, tango, pygments, kate, monochrome, espresso, zenburn, haddock, breezedark, and textmate.
Here is an example of how to integrate the above options into the YAML.
Below are two examples of HTML outputs which both have floating tables of contents, but different theme and highlight styles selected:
@@ -1569,14 +1560,12 @@
HTML widgets
HTML widgets for R are a special class of R packages that enable increased interactivity by utilizing JavaScript libraries. You can embed them in HTML R Markdown outputs.
Some common examples of these widgets include:
-
Plotly (used in this handbook page and in the Interactive plots page).
DT (datatable()) (used to show dynamic tables with filter, sort, etc.).
+
Leaflet (used in the GIS Basics page of this handbook)
+
dygraphs (useful for interactively showing time series data)
+
DT (datatable()) (used to show dynamic tables with filter, sort, etc.)
The ggplotly() function from plotly is particularly easy to use. See the Interactive plots page.
@@ -2184,7 +2173,7 @@
-
+
+
@@ -861,8 +859,8 @@
Load popul
-
-
+
+
@@ -872,15 +870,15 @@
Load death co
Deaths in Country A
-
-
+
+
Deaths in Country B
-
-
+
+
@@ -911,8 +909,8 @@
Cl
The combined population data now look like this (click through to see countries A and B):
-
-
+
+
And now we perform similar operations on the two deaths datasets.
@@ -928,8 +926,8 @@
Cl
The deaths data now look like this, and contain data from both countries:
-
-
+
+
We now join the deaths and population data based on common columns Country, age_cat5, and Sex. This adds the column Deaths.
@@ -957,8 +955,8 @@
Cl
-
-
+
+
CAUTION: If you have few deaths per stratum, consider using 10-, or 15-year categories, instead of 5-year categories for age.
@@ -972,8 +970,8 @@
Load
-
-
+
+
@@ -990,7 +988,7 @@
Clea
age_cat5 = str_replace_all(age_cat5, "plus", ""),     # remove "plus"
age_cat5 = str_replace_all(age_cat5, " ", "")) %>%    # remove " " space
-rename(pop = WorldStandardPopulation) # change col name to "pop"
+rename(pop = WorldStandardPopulation)
CAUTION: If you try to use str_replace_all() to remove a plus symbol, it won’t work because it is a special symbol. “Escape” the specialness by putting two back slashes in front, as in str_replace_all(column, "\\+", "").
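A small sketch of the escaping rule, using hypothetical age-category values. Base R's gsub() stands in for str_replace_all() so that the snippet runs without extra packages; the same "\\+" pattern is expected to work in str_replace_all().

```r
# Age categories as they might appear before cleaning (hypothetical values)
age_cat5 <- c("0-4", "5-9", "85+")

# In a regular expression "+" is a quantifier ("one or more"),
# so it must be escaped with two backslashes to match a literal plus sign
cleaned <- gsub("\\+", "", age_cat5)
cleaned
```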
@@ -1005,8 +1003,8 @@
Create dataset wit
This complete dataset looks like this:
-
-
+
+
diff --git a/html_outputs/new_pages/stat_tests.html b/html_outputs/new_pages/stat_tests.html
index 656d638c..b5efa37b 100644
--- a/html_outputs/new_pages/stat_tests.html
+++ b/html_outputs/new_pages/stat_tests.html
@@ -317,7 +317,7 @@
The Epidemiologist R Handbook
-
+
@@ -860,8 +860,8 @@
Import data
The first 50 rows of the linelist are displayed below.
-
-
+
+
@@ -1113,23 +1113,23 @@
Chi-squared
1323 missing rows in the "outcome" column have been removed.
-
-
@@ -1650,23 +1650,23 @@
T-tests
1323 missing rows in the "outcome" column have been removed.
-
-
@@ -2171,23 +2171,23 @@
Wilcox
1323 missing rows in the "outcome" column have been removed.
-
-
@@ -2692,23 +2692,23 @@
Kruskal-w
1323 missing rows in the "outcome" column have been removed.
diff --git a/html_outputs/new_pages/transition_to_R.html b/html_outputs/new_pages/transition_to_R.html
index 219f4f16..b47605b7 100644
--- a/html_outputs/new_pages/transition_to_R.html
+++ b/html_outputs/new_pages/transition_to_R.html
@@ -283,7 +283,7 @@
The Epidemiologist R Handbook
-
+
@@ -759,7 +759,7 @@
4read this article comparing R, SPSS, SAS, STATA, and Python).
+
R was introduced in the late 1990s and has since grown dramatically in scope. Its capabilities are so extensive that commercial alternatives have reacted to R developments in order to stay competitive! (read this article comparing R, SPSS, SAS, Stata, and Python).
Moreover, R is much easier to learn than it was 10 years ago. Previously, R had a reputation of being difficult for beginners. It is now much easier with friendly user-interfaces like RStudio, intuitive code like the tidyverse, and many tutorial resources.
Do not be intimidated - come discover the world of R!
@@ -796,8 +796,8 @@
Tidy data
An example of “tidy” data would be the case linelist used throughout this handbook - each variable is contained within one column, each observation (one case) has its own row, and every value is in just one cell. Below you can view the first 50 rows of the linelist:
-
-
+
+
The main reason you might encounter non-tidy data is because many Excel spreadsheets are designed to prioritize easy reading by humans, not easy reading by machines/software.
@@ -864,11 +864,12 @@
Stata to R.
General notes
-
STATA
+
Stata
R
@@ -907,7 +908,7 @@
-
STATA
+
Stata
R
@@ -930,7 +931,7 @@
-
STATA
+
Stata
R
@@ -961,7 +962,7 @@
-
STATA
+
Stata
R
@@ -996,7 +997,7 @@
-
STATA
+
Stata
R
@@ -1178,7 +1179,7 @@
4.4 Data interoperability
-
See the Import and export page for details on how the R package rio can import and export files such as STATA .dta files, SAS .xpt and.sas7bdat files, SPSS .por and.sav files, and many others.
+
See the Import and export page for details on how the R package rio can import and export files such as Stata .dta files, SAS .xpt and.sas7bdat files, SPSS .por and.sav files, and many others.
This page reviews the essentials of R. It is not intended to be a comprehensive tutorial, but it provides the basics and can be useful for refreshing your memory. The section on Resources for learning links to more comprehensive tutorials.
+
Parts of this page have been adapted with permission from the R4Epis project.
+
See the page on Transition to R for tips on switching to R from Stata, SAS, or Excel.
+
+
+
3.1 Why use R?
+
As stated on the R project website, R is a programming language and environment for statistical computing and graphics. It is highly versatile, extendable, and community-driven.
+
Cost
+
R is free to use! There is a strong ethic in the community of free and open-source material.
+
Reproducibility
+
Conducting your data management and analysis through a programming language (compared to Excel or another primarily point-click/manual tool) enhances reproducibility, makes error-detection easier, and eases your workload.
+
Community
+
The R community of users is enormous and collaborative. New packages and tools to address real-life problems are developed daily, and vetted by the community of users. As one example, R-Ladies is a worldwide organization whose mission is to promote gender diversity in the R community, and is one of the largest organizations of R users. It likely has a chapter near you!
+
+
+
3.2 Key terms
+
RStudio - RStudio is a Graphical User Interface (GUI) for easier use of R. Read more in the RStudio section.
+
Objects - Everything you store in R - datasets, variables, a list of village names, a total population number, even outputs such as graphs - are objects which are assigned a name and can be referenced in later commands. Read more in the Objects section.
+
Functions - A function is a code operation that accepts inputs and returns a transformed output. Read more in the Functions section.
+
Packages - An R package is a shareable bundle of functions. Read more in the Packages section.
+
Scripts - A script is the document file that holds your commands. Read more in the Scripts section.
+
+
+
3.3 Resources for learning
+
+
Resources within RStudio
+
Help documentation
+
Search the RStudio “Help” tab for documentation on R packages and specific functions. This is within the pane that also contains Files, Plots, and Packages (typically in the lower-right pane). As a shortcut, you can also type the name of a package or function into the R console after a question-mark to open the relevant Help page. Do not include parentheses.
+
For example: ?filter or ?diagrammeR.
+
Interactive tutorials
+
There are several ways to learn R interactively within RStudio.
+
RStudio itself offers a Tutorial pane that is powered by the learnr R package. Simply install this package and open a tutorial via the new “Tutorial” tab in the upper-right RStudio pane (which also contains Environment and History tabs).
+
The R package swirl offers interactive courses in the R Console. Install and load this package, then run the command swirl() (empty parentheses) in the R console. You will see prompts appear in the Console. Respond by typing in the Console. It will guide you through a course of your choice.
+
+
+
Cheatsheets
+
There are many PDF “cheatsheets” available on the RStudio website, for example:
+
+
Factors with forcats package.
+
+
Dates and times with lubridate package.
+
+
Strings with stringr package.
+
+
Iterative operations with purrr package.
+
+
Data import.
+
+
Data transformation cheatsheet with dplyr package.
+
+
R Markdown (to create documents like PDF, Word, Powerpoint…).
+
+
Shiny (to build interactive web apps).
+
+
Data visualization with ggplot2 package.
+
+
Cartography (GIS).
+
+
leaflet package (interactive maps).
+
+
Python with R (reticulate package).
+
+
This is an online R resource specifically for Excel users.
A definitive text is the R for Data Science book by Garrett Grolemund and Hadley Wickham.
+
The R4Epis project website aims to “develop standardised data cleaning, analysis and reporting tools to cover common types of outbreaks and population-based surveys that would be conducted in an MSF emergency response setting”. You can find R basics training materials, templates for RMarkdown reports on outbreaks and surveys, and tutorials to help you set them up.
Permissions
+Note that you should install R and RStudio to a drive where you have read and write permissions. Otherwise, your ability to install R packages (a frequent occurrence) will be impacted. If you encounter problems, try opening RStudio by right-clicking the icon and selecting “Run as administrator”. Other tips can be found in the page R on network drives.
+
How to update R and RStudio
+
Your version of R is printed to the R Console at start-up. You can also run sessionInfo().
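As a brief sketch, the version information can also be retrieved programmatically with base R. The package queried here, stats, is chosen only because it ships with every R installation; any installed package name could be used instead.

```r
# R version as a single string
R.version.string

# Version of an installed package (here "stats", which ships with R)
packageVersion("stats")
```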
+
To update R, go to the website mentioned above and re-install R. Alternatively, you can use the installr package (on Windows) by running installr::updateR(). This will open dialog boxes to help you download the latest R version and update your packages to the new R version. More details can be found in the installr documentation.
+
Be aware that the old R version will still exist in your computer. You can temporarily run an older version of R by clicking “Tools” -> “Global Options” in RStudio and choosing an R version. This can be useful if you want to use a package that has not been updated to work on the newest version of R.
+
To update RStudio, you can go to the website above and re-download RStudio. Another option is to click “Help” -> “Check for Updates” within RStudio, but this may not show the very latest updates.
+
To see which versions of R, RStudio, or packages were used when this Handbook was made, see the page on Editorial and technical notes.
+
+
+
Other software you may need to install
+
+
TinyTeX (for compiling an RMarkdown document to PDF).
+
+
Pandoc (for compiling RMarkdown documents).
+
+
RTools (for building packages for R).
+
+
phantomjs (for saving still images of animated networks, such as transmission chains).
+
+
+
TinyTeX
+
TinyTeX is a custom LaTeX distribution, useful when trying to produce PDFs from R.
+See https://yihui.org/tinytex/ for more information.
+
To install TinyTeX from R:
+
+
install.packages('tinytex')
+tinytex::install_tinytex()
+# to uninstall TinyTeX, run tinytex::uninstall_tinytex()
+
+
+
+
Pandoc
+
Pandoc is a document converter, a separate software from R. It comes bundled with RStudio and should not need to be downloaded. It helps the process of converting R Markdown documents to formats like .pdf and adding complex functionality.
+
+
+
RTools
+
RTools is a collection of software for building packages for R
This is often used to take “screenshots” of webpages. For example when you make a transmission chain with epicontacts package, an HTML file is produced that is interactive and dynamic. If you want a static image, it can be useful to use the webshot package to automate this process. This will require the external program “phantomjs”. You can install phantomjs via the webshot package with the command webshot::install_phantomjs().
+
+
+
+
+
+
3.5 RStudio
+
+
RStudio orientation
+
First, open RStudio.
+
As their icons can look very similar, be sure you are opening RStudio and not R.
+
For RStudio to work you must also have R installed on the computer (see above for installation instructions).
+
RStudio is an interface (GUI) for easier use of R. You can think of R as being the engine of a vehicle, doing the crucial work, and RStudio as the body of the vehicle (with seats, accessories, etc.) that helps you actually use the engine to move forward! You can see the complete RStudio user-interface cheatsheet (PDF) here.
+
By default RStudio displays four rectangular panes.
+
+
+
+
+
+
+
+
+
+
TIP: If your RStudio displays only one left pane it is because you have no scripts open yet.
+
The Source Pane
+This pane, by default in the upper-left, is a space to edit, run, and save your scripts. Scripts contain the commands you want to run. This pane can also display datasets (data frames) for viewing.
+
For Stata users, this pane is similar to your Do-file and Data Editor windows.
+
The R Console Pane
+
The R Console, by default the left or lower-left pane in RStudio, is the home of the R “engine”. This is where the commands are actually run and non-graphic outputs and error/warning messages appear. You can directly enter and run commands in the R Console, but realize that these commands are not saved as they are when running commands from a script.
+
If you are familiar with Stata, the R Console is like the Command Window and also the Results Window.
+
The Environment Pane
+This pane, by default in the upper-right, is most often used to see brief summaries of objects in the R Environment in the current session. These objects could include imported, modified, or created datasets, parameters you have defined (e.g. a specific epi week for the analysis), or vectors or lists you have defined during analysis (e.g. names of regions). You can click on the arrow next to a data frame name to see its variables.
+
In Stata, this is most similar to the Variables Manager window.
+
This pane also contains History, where you can see commands that you ran previously. It also has a “Tutorial” tab where you can complete interactive R tutorials if you have the learnr package installed. It also has a “Connections” pane for external connections, and can have a “Git” pane if you choose to interface with GitHub.
+
Plots, Viewer, Packages, and Help Pane
+The lower-right pane includes several important tabs. Typical plot graphics including maps will display in the Plot pane. Interactive or HTML outputs will display in the Viewer pane. The Help pane can display documentation and help files. The Files pane is a browser which can be used to open or delete files. The Packages pane allows you to see, install, update, delete, load/unload R packages, and see which version of the package you have. To learn more about packages see the packages section below.
+
This pane contains the Stata equivalents of the Plots Manager and Project Manager windows.
+
+
+
RStudio settings
+
Change RStudio settings and appearance in the Tools drop-down menu, by selecting Global Options. There you can change the default settings, including appearance/background color.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Restart
+
If your R freezes, you can re-start R by going to the Session menu and clicking “Restart R”. This avoids the hassle of closing and opening RStudio.
+
CAUTION: Everything in your R environment will be removed when you do this.
+
+
+
Keyboard shortcuts
+
Some very useful keyboard shortcuts are below. See all the keyboard shortcuts for Windows, Mac, and Linux in the RStudio user interface cheatsheet.
+
+
+
+
+
+
+
+
+
+
Windows/Linux
+
Mac
+
Action
+
+
+
+
+
+
Esc
+
Esc
+
Interrupt current command (useful if you accidentally ran an incomplete command and cannot escape seeing “+” in the R console).
+
+
+
Ctrl+s
+
Cmd+s
+
Save (script).
+
+
+
Tab
+
Tab
+
Auto-complete.
+
+
+
Ctrl + Enter
+
Cmd + Enter
+
Run current line(s)/selection of code.
+
+
+
Ctrl + Shift + c
+
Cmd + Shift + c
+
Comment/uncomment the highlighted lines.
+
+
+
Alt + -
+
Option + -
+
Insert <-.
+
+
+
Ctrl + Shift + m
+
Cmd + Shift + m
+
Insert %>%.
+
+
+
Ctrl + l
+
Cmd + l
+
Clear the R console.
+
+
+
Ctrl + Alt + b
+
Cmd + Option + b
+
Run from start to current line.
+
+
+
Ctrl + Alt + t
+
Cmd + Option + t
+
Run the current code section (R Markdown).
+
+
+
Ctrl + Alt + i
+
Cmd + Option + i
+
Insert code chunk (into R Markdown).
+
+
+
Ctrl + Alt + c
+
Cmd + Option + c
+
Run current code chunk (R Markdown).
+
+
+
up/down arrows in R console
+
Same
+
Toggle through recently run commands.
+
+
+
Shift + up/down arrows in script
+
Same
+
Select multiple code lines.
+
+
+
Ctrl + f
+
Cmd + f
+
Find and replace in current script.
+
+
+
Ctrl + Shift + f
+
Cmd + Shift + f
+
Find in files (search/replace across many scripts).
+
+
+
Alt + l
+
Cmd + Option + l
+
Fold selected code.
+
+
+
Shift + Alt + l
+
Cmd + Shift + Option+l
+
Unfold selected code.
+
+
+
+
TIP: Use your Tab key when typing to engage RStudio’s auto-complete functionality. This can prevent spelling errors. Press Tab while typing to produce a drop-down menu of likely functions and objects, based on what you have typed so far.
+
+
+
+
+
3.6 Functions
+
Functions are at the core of using R. Functions are how you perform tasks and operations. Many functions come installed with R, many more are available for download in packages (explained in the packages section), and you can even write your own custom functions!
+
This basics section on functions explains:
+
+
What a function is and how it works.
+
+
What function arguments are.
+
+
How to get help understanding a function.
+
+
A quick note on syntax: In this handbook, functions are written in code-text with open parentheses, like this: filter(). As explained in the packages section, functions are downloaded within packages. In this handbook, package names are written in bold, like dplyr. Sometimes in example code you may see the function name linked explicitly to the name of its package with two colons (::) like this: dplyr::filter(). The purpose of this linkage is explained in the packages section.
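A short sketch of the two-colon notation, using the stats package because it ships with R and so needs no installation. The same pattern applies to dplyr::filter() once dplyr is installed.

```r
# Call a function through its package name with two colons
stats::median(c(1, 5, 9))

# The same function called without the package prefix
median(c(1, 5, 9))
```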
+
+
+
Simple functions
+
A function is like a machine that receives inputs, carries out an action with those inputs, and produces an output. What the output is depends on the function.
+
Functions typically operate upon some object placed within the function’s parentheses. For example, the function sqrt() calculates the square root of a number:
+
+
sqrt(49)
+
+
[1] 7
+
+
+
The object provided to a function also can be a column in a dataset (see the Objects section for detail on all the kinds of objects). Because R can store multiple datasets, you will need to specify both the dataset and the column. One way to do this is using the $ notation to link the name of the dataset and the name of the column (dataset$column). In the example below, the function summary() is applied to the numeric column age in the dataset linelist, and the output is a summary of the column’s numeric and missing values.
+
+
# Print summary statistics of column 'age' in the dataset 'linelist'
+summary(linelist$age)
+
+
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
+ 0.00 6.00 13.00 16.07 23.00 84.00 86
+
+
+
NOTE: Behind the scenes, a function represents complex additional code that has been wrapped up for the user into one easy command.
+
+
+
+
Functions with multiple arguments
+
Functions often ask for several inputs, called arguments, located within the parentheses of the function, usually separated by commas.
+
+
Some arguments are required for the function to work correctly, others are optional.
+
+
Optional arguments have default settings.
+
+
Arguments can take character, numeric, logical (TRUE/FALSE), and other inputs.
+
+
Here is a fun fictional function, called oven_bake(), as an example of a typical function. It takes an input object (e.g. a dataset, or in this example “dough”) and performs operations on it as specified by additional arguments (minutes = and temperature =). The output can be printed to the console, or saved as an object using the assignment operator <-.
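Because oven_bake() is fictional, the sketch below is only one hypothetical way such a function could be defined. The default values and the text it returns are assumptions made for illustration.

```r
# Hypothetical definition of the fictional oven_bake() function
oven_bake <- function(dough, minutes = 35, temperature = 180) {
  # dough is required; minutes and temperature are optional (they have defaults)
  paste0("Bread made from ", dough, ", baked for ", minutes,
         " minutes at ", temperature, " degrees")
}

# Print the output to the console, using the default settings
oven_bake("sourdough")

# Save the output as an object with <-, overriding one default
my_bread <- oven_bake("rye", minutes = 50)
my_bread
```

Here dough has no default, so it must always be supplied, while minutes = and temperature = only need to be written when you want something other than their defaults.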
+
+
+
+
+
+
+
+
+
+
In a more realistic example, the age_pyramid() command below produces an age pyramid plot based on defined age groups and a binary split column, such as gender. The function is given three arguments within the parentheses, separated by commas. The values supplied to the arguments establish linelist as the data frame to use, age_cat5 as the column to count, and gender as the binary column to use for splitting the pyramid by color.
+
+
# Create an age pyramid
+age_pyramid(data = linelist, age_group = "age_cat5", split_by = "gender")
+
+
+
+
+
+
+
+
+
The above command can be equivalently written as below, in a longer style with a new line for each argument. This style can be easier to read, and easier to write “comments” with # to explain each part (commenting extensively is good practice!). To run this longer command you can highlight the entire command and click “Run”, or just place your cursor in the first line and then press the Ctrl and Enter keys simultaneously.
+
+
# Create an age pyramid
+age_pyramid(
+  data = linelist,          # use case linelist
+  age_group = "age_cat5",   # provide age group column
+  split_by = "gender"       # use gender column for two sides of pyramid
+ )
+
+
+
+
+
+
+
+
+
The first half of an argument assignment (e.g. data =) does not need to be specified if the arguments are written in a specific order (specified in the function’s documentation). The below code produces the exact same pyramid as above, because the function expects the argument order: data frame, age_group variable, split_by variable.
+
+
# This command will produce the exact same graphic as above
+age_pyramid(linelist, "age_cat5", "gender")
+
+
A more complex age_pyramid() command might include the optional arguments to:
+
+
Show proportions instead of counts (set proportional = TRUE when the default is FALSE)
+
+
Specify the two colors to use (pal = is short for “palette” and is supplied with a vector of two color names. See the objects page for how the function c() makes a vector)
+
+
NOTE: For arguments that you specify with both parts of the argument (e.g. proportional = TRUE), their order among all the arguments does not matter.
+
+
age_pyramid(
+  linelist,                    # use case linelist
+  "age_cat5",                  # age group column
+  "gender",                    # split by gender
+  proportional = TRUE,         # percents instead of counts
+  pal = c("orange", "purple")  # colors
+ )
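The rules about argument names and order can also be tried with a function that ships with R. The sketch below uses the base R function seq(), whose first three arguments are from =, to =, and by =; it is only an illustration and is not part of the age pyramid example.

```r
# Named arguments: their order does not matter
a <- seq(from = 1, to = 10, by = 3)
b <- seq(by = 3, to = 10, from = 1)

# Unnamed arguments: they must follow the documented order (from, to, by)
c_positional <- seq(1, 10, 3)

a                            # 1 4 7 10
identical(a, b)              # TRUE
identical(a, c_positional)   # TRUE
```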
+
+
+
+
+
+
+
+
+
TIP: Remember that you can put ? before a function name to see which arguments the function can take, which of them are required, and which have default values. For example, ?age_pyramid.
+
+
+
+
Writing Functions
+
R is a language that is oriented around functions, so you should feel empowered to write your own functions. Creating functions brings several advantages:
+
+
Facilitate modular programming: the separation of code into independent and manageable pieces.
+
+
Replace repetitive copy-and-paste, which can be error prone.
+
+
Give pieces of code memorable names.
+
+
How to write a function is covered in-depth in the Writing functions page.
An R package is a shareable bundle of code and documentation that contains pre-defined functions. Users in the R community develop packages all the time, catered to specific problems, so it is likely that one can help with your work! You will install and use hundreds of packages in your use of R.
+
On installation, R contains “base” packages and functions that perform common elementary tasks. But many R users create specialized functions, which are verified by the R community and which you can download as a package for your own use. In this handbook, package names are written in bold. One of the more challenging aspects of R is that there are often many functions or packages to choose from to complete a given task.
+
+
Install and load
+
Functions are contained within packages which can be downloaded (“installed”) to your computer from the internet. Once a package is downloaded, it is stored in your “library”. You can then access the functions it contains during your current R session by “loading” the package.
+
Think of R as your personal library: When you download a package, your library gains a new book of functions, but each time you want to use a function in that book, you must borrow, “load”, that book from your library.
+
In summary: to use the functions available in an R package, 2 steps must be implemented:
+
+
The package must be installed (once), and
+
+
The package must be loaded (each R session)
+
+
+
Your library
+
Your “library” is actually a folder on your computer, containing a folder for each package that has been installed. Find out where R is installed in your computer, and look for a folder called “library”. For example: R\4.4.1\library (the 4.4.1 is the R version - you’ll have a different library for each R version you’ve downloaded).
+
You can print the file path to your library by entering .libPaths() (empty parentheses). This becomes especially important if working with R on network drives.
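A minimal sketch of inspecting your library from the console, using only base R functions (the output depends on your own computer, so none is shown here):

```r
# Print the folder(s) where R looks for installed packages
.libPaths()

# List the names of the first few packages installed in your library
head(rownames(installed.packages()))
```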
+
+
+
Install from CRAN
+
Most often, R users download packages from CRAN. CRAN (Comprehensive R Archive Network) is an online public warehouse of R packages that have been published by R community members.
+
Are you worried about viruses and security when downloading a package from CRAN? Read this article on the topic.
+
+
+
How to install and load
+
In this handbook, we suggest using the pacman package (short for “package manager”). It offers a convenient function p_load() which will install a package if necessary and load it for use in the current R session.
+
The syntax is quite simple. Just list the names of the packages within the p_load() parentheses, separated by commas. This command will install the rio, tidyverse, and here packages if they are not yet installed, and will load them for use. This makes the p_load() approach convenient and concise if sharing scripts with others.
+
Note that package names are case-sensitive.
+
+
# Install (if necessary) and load packages for use
+pacman::p_load(rio, tidyverse, here)
+
+
Here we have used the syntax pacman::p_load() which explicitly writes the package name (pacman) prior to the function name (p_load()), connected by two colons ::. This syntax is useful because it also loads the pacman package (assuming it is already installed).
+
There are alternative base R functions that you will see often. The base R function for installing a package is install.packages(). The name of the package to install must be provided in the parentheses in quotes. If you want to install multiple packages in one command, they must be listed within a character vector c().
+
Note: this command installs a package, but does not load it for use in the current session.
+
+
# install a single package with base R
+install.packages("tidyverse")
+
+# install multiple packages with base R
+install.packages(c("tidyverse", "rio", "here"))
+
+
Installation can also be accomplished point-and-click by going to the RStudio “Packages” pane and clicking “Install” and searching for the desired package name.
+
The base R function to load a package for use (after it has been installed) is library(). It can load only one package at a time (another reason to use p_load()). You can provide the package name with or without quotes.
+
+
# load packages for use, with base R
+library(tidyverse)
+library(rio)
+library(here)
+
+
To check whether a package is installed or loaded, you can view the Packages pane in RStudio. If the package is installed, it is shown there with version number. If its box is checked, it is loaded for the current session.
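The same checks can be made with code rather than the Packages pane. This is a minimal sketch with base R functions, using the stats package (which ships with R) as the example; swap in the package name you are interested in.

```r
# Is the package installed? requireNamespace() returns TRUE or FALSE
is_installed <- requireNamespace("stats", quietly = TRUE)
is_installed

# Is the package attached (loaded with library() or p_load()) in this session?
is_attached <- "package:stats" %in% search()
is_attached
```

Note that requireNamespace() makes the package's functions reachable with the :: syntax but does not attach the package the way library() does.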
+
Install from Github
+
Sometimes, you need to install a package that is not yet available from CRAN. Or perhaps the package is available on CRAN but you want the development version with new features not yet offered in the more stable published CRAN version. These are often hosted on the website github.com in a free, public-facing code “repository”. Read more about Github in the handbook page on Version control and collaboration with Git and Github.
+
To download R packages from Github, you can use the function p_load_gh() from pacman, which will install the package if necessary, and load it for use in your current R session. Alternatives to install include using the remotes or devtools packages. Read more about all the pacman functions in the package documentation.
+
To install from Github, you have to provide more information. You must provide:
+
+
The Github ID of the repository owner
+
The name of the repository that contains the package
+
+
Optional: The name of the “branch” (specific development version) you want to download
+
+
In the examples below, the first word in the quotation marks is the Github ID of the repository owner, after the slash is the name of the repository (the name of the package).
+
+
# install/load the epicontacts package from its Github repository
+p_load_gh("reconhub/epicontacts")
+
+
If you want to install from a “branch” (version) other than the main branch, add the branch name after an “@”, after the repository name.
+
+
# install the "timeline" branch of the epicontacts package from Github
+p_load_gh("reconhub/epicontacts@timeline")
+
+
If there is no difference between the Github version and the version on your computer, no action will be taken. You can “force” a re-install by instead using p_load_current_gh(), or by using p_load_gh() with the argument update = TRUE. Read more about pacman in this online vignette.
For clarity in this handbook, functions are sometimes preceded by the name of their package using the :: symbol in the following way: package_name::function_name()
+
Once a package is loaded for a session, this explicit style is not necessary. One can just use function_name(). However writing the package name is useful when a function name is common and may exist in multiple packages (e.g. plot()). Writing the package name will also load the package if it is not already loaded.
+
+
# This command uses the package "rio" and its function "import()" to import a dataset
+linelist <- rio::import("linelist.xlsx", which = "Sheet1")
+
+
+
+
Function help
+
To read more about a function, you can search for it in the Help tab of the lower-right RStudio pane. You can also run a command like ?thefunctionname (for example, to get help for the function p_load you would write ?p_load) and the Help page will appear in the Help pane. Finally, try searching online for resources.
+
+
+
Update packages
+
You can update packages by re-installing them. You can also click the green “Update” button in your RStudio Packages pane to see which packages have new versions to install. Be aware that your old code may need to be updated if there is a major revision to how a function works!
+
+
+
Delete packages
+
Use p_delete() from pacman, or remove.packages() from base R.
+
+
+
Dependencies
+
Packages often depend on other packages to work. These are called dependencies. If a dependency fails to install, then the package depending on it may also fail to install.
+
See the dependencies of a package with p_depends(), and see which packages depend on it with p_depends_reverse().
+
+
+
Masked functions
+
It is not uncommon that two or more packages contain the same function name. For example, the package dplyr has a filter() function, but so does the package stats. The default filter() function depends on the order these packages are first loaded in the R session - the later one will be the default for the command filter().
+
You can check the order in your Environment pane of RStudio - click the drop-down for “Global Environment” and see the order of the packages. Functions from packages lower on that drop-down list will mask functions of the same name in packages that appear higher in the drop-down list. When first loading a package, R will warn you in the console if masking is occurring, but this can be easy to miss.
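The order can also be checked from the console. A small sketch using base R functions only; in a fresh session without dplyr loaded, the default filter() is expected to come from stats.

```r
# Attached packages, in the order R searches them for a function name
search()

# Which package supplies the version of filter() that R will currently use?
default_filter_pkg <- environmentName(environment(filter))
default_filter_pkg
```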
+
+
+
+
+
+
+
+
+
+
Here are ways you can fix masking:
+
+
Specify the package name in the command. For example, use dplyr::filter()
+
+
Re-arrange the order in which the packages are loaded (e.g. within p_load()), and start a new R session
+
+
+
+
Detach / unload
+
To detach (unload) a package, use this command, with the correct package name and only one colon. Note that this may not resolve masking.
+
+
detach(package:PACKAGE_NAME_HERE, unload=TRUE)
+
+
+
+
Install older version
+
See this guide to install an older version of a particular package.
+
+
+
Suggested packages
+
See the page on Suggested packages for a listing of packages we recommend for everyday epidemiology.
+
+
+
+
+
3.8 Scripts
+
Scripts are a fundamental part of programming. They are documents that hold your commands (e.g. functions to create and modify datasets, print visualizations, etc). You can save a script and run it again later. There are many advantages to storing and running your commands from a script (vs. typing commands one-by-one into the R console “command line”):
+
+
Portability - you can share your work with others by sending them your scripts.
+
+
Reproducibility - so that you and others know exactly what you did.
+
+
Version control - so you can track changes made by yourself or colleagues.
+
+
Commenting/annotation - to explain to your colleagues what you have done.
+
+
+
Commenting
+
In a script you can also annotate (“comment”) around your R code. Commenting is helpful to explain to yourself and other readers what you are doing. You can add a comment by typing the hash symbol (#) and writing your comment after it. The commented text will appear in a different color than the R code.
+
Any code written after the # will not be run. Therefore, placing a # before code is also a useful way to temporarily block a line of code (“comment out”) if you do not want to delete it. You can comment out/in multiple lines at once by highlighting them and pressing Ctrl+Shift+c (Cmd+Shift+c in Mac).
+
+
# A comment can be on a line by itself
+# import data
linelist <- import("linelist_raw.xlsx") %>%   # a comment can also come after code
+  # filter(age > 50)                          # it can also be used to deactivate / remove a line of code
+  count()
+
+
There are a few general ideas to follow when writing your scripts in order to make them accessible:
+- Add comments on what you are doing and on why you are doing it.
+- Break your code into logical sections.
+- Accompany your code with a text step-by-step description of what you are doing (e.g. numbered steps).
+
+
+
Style
+
It is important to be conscious of your coding style - especially if working on a team. We advocate for the tidyverse style guide. There are also packages such as styler and lintr which help you conform to this style.
+
A few very basic points to make your code readable to others:
+* When naming objects, use only lowercase letters, numbers, and underscores _, e.g. my_data
+* Use frequent spaces, including around operators, e.g. n = 1 and age_new <- age_old + 3
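As a quick illustration of these two points, the lines below produce the same values, but the second version is easier to read. The object names are made up for this example.

```r
# Harder to read: no spaces, mixed-case name
AgeNew<-c(21,34,56)+3

# Easier to read: lowercase names with underscores, spaces around operators
age_old <- c(21, 34, 56)
age_new <- age_old + 3
```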
+
+
+
Example Script
+
Below is an example of a short R script. Remember, the better you succinctly explain your code in comments, the more your colleagues will like you!
+
+
+
+
+
+
+
+
+
+
+
+
+
R markdown and Quarto
+
R Markdown and Quarto scripts are types of R script in which the script itself becomes an output document (PDF, Word, HTML, Powerpoint, etc.). These are incredibly useful and versatile tools often used to create dynamic and automated reports.
+
Even this website and handbook is produced with Quarto scripts!
+
It is worth noting that beginner R users can also use R Markdown - do not be intimidated! To learn more, see the handbook page on Reports with R Markdown documents.
+
+
+
+
R notebooks
+
There is no difference between writing in an R Markdown document vs. an R notebook. However, the execution of the document differs slightly. See this site for more details.
+
+
+
+
Shiny
+
Shiny apps/websites are contained within one script, which must be named app.R. This file has three components:
In previous versions, the above file was split into two files (ui.R and server.R)
+
+
+
Code folding
+
You can collapse portions of code to make your script easier to read.
+
To do this, create a text header with #, write your header, and follow it with at least 4 of either dashes (-), hashes (#) or equals (=). When you have done this, a small arrow will appear in the “gutter” to the left (by the row number). You can click this arrow and the code below until the next header will collapse and a dual-arrow icon will appear in its place.
+
To expand the code, either click the arrow in the gutter again, or the dual-arrow icon. There are also keyboard shortcuts as explained in the RStudio section of this page.
+
By creating headers with #, you will also activate the Table of Contents at the bottom of your script (see below) that you can use to navigate your script. You can create sub-headers by adding more # symbols, for example # for primary, ## for secondary, and ### for tertiary headers.
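A short sketch of what such headers might look like in a script. The section names and the code under them are placeholders chosen for illustration.

```r
# Load packages ----
library(stats)

## Settings ====
current_week <- "2018-W10"

# Summarise data ####
summary(c(1, 2, 3))
```

Each of the three header lines ends with at least four dashes, equals signs, or hashes, so each should become a collapsible section, with "Settings" shown as a secondary header.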
+
Below are two versions of an example script. On the left is the original with commented headers. On the right, four dashes have been written after each header, making them collapsible. Two of them have been collapsed, and you can see that the Table of Contents at the bottom now shows each section.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Other areas of code that are automatically eligible for folding include “braced” regions with brackets { } such as function definitions or conditional blocks (if else statements). You can read more about code folding at the RStudio site.
+
+
+
+
+
+
+
3.9 Working directory
+
The working directory is the root folder location used by R for your work. By default, R saves new files and outputs to this location, and looks for files to import (e.g. datasets) here as well.
+
The working directory appears in grey text at the top of the RStudio Console pane. You can also print the current working directory by running getwd() (leave the parentheses empty).
+
+
+
+
+
+
+
+
+
+
+
Recommended approach
+
See the page on R projects for details on our recommended approach to managing your working directory.
+
+
A common, efficient, and trouble-free way to manage your working directory and file paths is to combine these 3 elements in an R project-oriented workflow:
+
+
An R Project to store all your files (see page on R projects)
+
In brief, your work becomes specific to your computer. This means that file paths used to import and export files need to be changed if used on a different computer, or by different collaborators.
+
As noted above, although we do not recommend this approach in most circumstances, you can use the command setwd() with the desired folder file path in quotations, for example:
+
+
setwd("C:/Documents/R Files/My analysis")
+
+
DANGER: Setting a working directory with setwd() can be “brittle” if the file path is specific to one computer. Instead, use file paths relative to an R Project root directory, such as with the here package.
+
+
+
+
Set manually
+
To set the working directory manually (the point-and-click equivalent of setwd()), click the Session drop-down menu and go to “Set Working Directory” and then “Choose Directory”. This will set the working directory for that specific R session. Note: if using this approach, you will have to do this manually each time you open RStudio.
+
+
+
+
Within an R project
+
If using an R project, the working directory will default to the R project root folder that contains the “.rproj” file. This will apply if you open RStudio by clicking open the R Project (the file with “.rproj” extension).
+
+
+
+
Working directory in an R markdown
+
In an R markdown script, the default working directory is the folder the Rmarkdown file (.Rmd) is saved within. If using an R project and here package, this does not apply and the working directory will be here() as explained in the R projects page.
+
If you want to change the working directory of a stand-alone R markdown (not in an R project), note that setwd() will only apply to the specific code chunk in which it is run. To make the change for all code chunks in an R markdown, edit the setup chunk to add the root.dir = parameter, such as below:
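A sketch of what such a line in the setup chunk might look like, assuming the knitr package (which renders R Markdown documents). The folder path is a placeholder, not a real location.

```r
# Placed in the setup chunk of the R Markdown document.
# "path/to/my_analysis" is a placeholder; replace it with your own folder.
knitr::opts_knit$set(root.dir = "path/to/my_analysis")
```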
It is much easier to just use the R markdown within an R project and use the here package.
+
+
+
+
Providing file paths
+
Perhaps the most common source of frustration for an R beginner (at least on a Windows machine) is typing in a file path to import or export data. There is a thorough explanation of how to best input file paths in the Import and export page, but here are a few key points:
+
Broken paths
+
Below is an example of an “absolute” or “full address” file path. These will likely break if used by another computer. One exception is if you are using a shared/network drive.
If typing in a file path, be aware of the direction of the slashes.
+
Use forward slashes (/) to separate the components (“data/provincial.csv”). For Windows users, the default way that file paths are displayed is with back slashes (\) - so you will need to change the direction of each slash. If you use the here package as described in the R projects page the slash direction is not an issue.
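If you are not using here, base R can still help with slash direction. The sketch below is one possible approach; the file name is the one used in the text above and no file is actually read.

```r
# file.path() joins folder and file names with forward slashes
path <- file.path("data", "provincial.csv")
path

# Convert a Windows-style path: "\\" in R code stands for one back slash
windows_path <- "data\\provincial.csv"
fixed_path <- gsub("\\", "/", windows_path, fixed = TRUE)
fixed_path
```

Both path and fixed_path end up as "data/provincial.csv".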
+
Relative file paths
+
We generally recommend providing “relative” file paths instead - that is, the path relative to the root of your R Project. You can do this using the here package as explained in the R projects page. A relative file path might look like this:
+
+
# Import csv linelist from the data/clean/linelists/ sub-folders of an R project
+linelist <- import(here("data", "clean", "linelists", "marin_country.csv"))
+
+
Even if using relative file paths within an R project, you can still use absolute paths to import and export data outside your R project.
+
+
+
+
+
3.10 Objects
+
Everything in R is an object, and R is an “object-oriented” language. These sections will explain:
+
+
How to create objects (<-).
+
Types of objects (e.g. data frames, vectors, etc.).
+
+
How to access subparts of objects (e.g. variables in a dataset).
+
+
Classes of objects (e.g. numeric, logical, integer, double, character, factor).
Everything you store in R - datasets, variables, a list of village names, a total population number, even outputs such as graphs - are objects which are assigned a name and can be referenced in later commands.
+
An object exists when you have assigned it a value (see the assignment section below). When it is assigned a value, the object appears in the Environment (see the upper right pane of RStudio). It can then be operated upon, manipulated, changed, and re-defined.
+
+
+
+
Defining objects (<-)
+
Create objects by assigning them a value with the <- operator.
+You can think of the assignment operator <- as the words “is defined as”. Assignment commands generally follow a standard order:
+
object_name <- value (or process/calculation that produces a value)
+
For example, you may want to record the current epidemiological reporting week as an object for reference in later code. In this example, the object current_week is created when it is assigned the value "2018-W10" (the quote marks make this a character value). The object current_week will then appear in the RStudio Environment pane (upper-right) and can be referenced in later commands.
+
See the R commands and their output in the boxes below.
+
+
current_week <- "2018-W10"   # this command creates the object current_week by assigning it a value
+current_week                 # this command prints the current value of current_week object in the console
+
+
[1] "2018-W10"
+
+
+
NOTE: The [1] in the R console output simply indicates that you are viewing the first item of the output.
+
CAUTION: An object’s value can be over-written at any time by running an assignment command to re-define its value. Thus, the order of the commands run is very important.
+
The following command will re-define the value of current_week:
+
+
current_week <- "2018-W51"   # assigns a NEW value to the object current_week
+current_week                 # prints the current value of current_week in the console
+
+
[1] "2018-W51"
+
+
+
Equals signs =
+
You will also see equals signs in R code:
+
+
A double equals sign == between two objects or values asks a logical question: “is this equal to that?”.
+
+
You will also see equals signs within functions, used to specify values of function arguments (read about these in the Functions section above), for example max(age, na.rm = TRUE).
+
+
You can use a single equals sign = in place of <- to create and define objects, but this is discouraged. You can read about why this is discouraged here.
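The three uses can be compared side by side. A minimal sketch; the ages vector is made up for illustration.

```r
current_week <- "2018-W10"     # <- assigns a value to an object

current_week == "2018-W10"     # == asks a logical question; returns TRUE
current_week == "2018-W51"     # returns FALSE

ages <- c(12, NA, 40)
max(ages, na.rm = TRUE)        # = sets the value of the argument na.rm; returns 40
```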
+
+
Datasets
+
Datasets are also objects (typically “data frames”) and must be assigned names when they are imported. In the code below, the object linelist is created and assigned the value of a CSV file imported with the rio package and its import() function.
+
+
# linelist is created and assigned the value of the imported CSV file
+linelist <- import("my_linelist.csv")
+
+
You can read more about importing and exporting datasets in the section on Import and export.
+
CAUTION: A quick note on naming of objects:
+
+
Object names must not contain spaces; use an underscore (_) or a period (.) instead of a space.
+
+
Object names are case-sensitive (meaning that Dataset_A is different from dataset_A).
+
Object names must begin with a letter (they cannot begin with a number like 1, 2 or 3).
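These naming rules can be explored with the base R function make.names(), which converts any text into a syntactically valid object name. A short sketch, with example names made up for illustration:

```r
# A name that already follows the rules is returned unchanged
make.names("linelist_clean")

# A space is replaced by a period
make.names("my data")

# A name beginning with a number gets an "X" added in front
make.names("1st_week")

# Names are case-sensitive: these are two different objects
Dataset_A <- 1
dataset_A <- 2
Dataset_A == dataset_A
```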
+
+
Outputs
+
Outputs like tables and plots provide an example of how outputs can be saved as objects, or just be printed without being saved. A cross-tabulation of gender and outcome using the base R function table() can be printed directly to the R console (without being saved).
+
+
# printed to R console only
+table(linelist$gender, linelist$outcome)
+
+
+ Death Recover
+ f 1227 953
+ m 1228 950
+
+
+
But the same table can be saved as a named object. Then, optionally, it can be printed.
+
+
# save
+gen_out_table <- table(linelist$gender, linelist$outcome)
+
+# print
+gen_out_table
+
+
+ Death Recover
+ f 1227 953
+ m 1228 950
+
+
+
Columns
+
Columns in a dataset are also objects and can be defined, over-written, and created as described below in the section on Columns.
+
You can use the assignment operator from base R to create a new column. Below, the new column bmi (Body Mass Index) is created, and for each row the new value is the result of a mathematical operation on the row’s values in the wt_kg and ht_cm columns.
+
+
# create new "bmi" column using base R syntax
+linelist$bmi <- linelist$wt_kg / (linelist$ht_cm/100)^2
+
+
However, in this handbook, we emphasize a different approach to defining columns, which uses the function mutate() from the dplyr package and piping with the pipe operator (%>%). The syntax is easier to read and there are other advantages explained in the page on Cleaning data and core functions. You can read more about piping in the Piping section below.
+
+
# create new "bmi" column using dplyr syntax
+linelist <- linelist %>%
+mutate(bmi = wt_kg / (ht_cm/100)^2)
+
+
+
+
+
Object structure
+
Objects can be a single piece of data (e.g. my_number <- 24), or they can consist of structured data.
+
The graphic below is borrowed from this online R tutorial. It shows some common data structures and their names. Not included in this image is spatial data, which is discussed in the GIS basics page.
+
+
+
+
+
+
+
+
+
+
In epidemiology (and particularly field epidemiology), you will most commonly encounter data frames and vectors:
+
+
+
+
+
+
+
+
+
| Common structure | Explanation | Example |
+|---|---|---|
+| Vectors | A container for a sequence of singular objects, all of the same class (e.g. numeric, character). | “Variables” (columns) in data frames are vectors (e.g. the column age_years). |
+| Data frames | Vectors (e.g. columns) that are bound together that all have the same number of rows. | linelist is a data frame. |
+
+
+
+
Note that to create a vector that “stands alone” (is not part of a data frame) the function c() is used to combine the different elements. For example, to create a vector of colors for a plot’s color scale: vector_of_colors <- c("blue", "red2", "orange", "grey")
+
+
+
+
Object classes
+
All the objects stored in R have a class which tells R how to handle the object. There are many possible classes, but common ones include:
+
+
+
+
| Class | Explanation | Examples |
+|---|---|---|
+| Character | These are text/words/sentences “within quotation marks”. Math cannot be done on these objects. | “Character objects are in quotation marks” |
+| Integer | Numbers that are whole only (no decimals). | -5, 14, or 2000 |
+| Numeric | These are numbers and can include decimals. If within quotation marks they will be considered character class. | 23.1 or 14 |
+| Factor | These are vectors that have a specified order or hierarchy of values. | A variable of economic status with ordered values |
+| Date | Once R is told that certain data are Dates, these data can be manipulated and displayed in special ways. See the page on Working with dates for more information. | 2018-04-12 or 15/3/1954 or Fri 4 Jan 1980 |
+| Logical | Values must be one of the two special values TRUE or FALSE (note these are not “TRUE” and “FALSE” in quotation marks). | TRUE or FALSE |
+| data.frame | A data frame is how R stores a typical dataset. It consists of vectors (columns) of data bound together, that all have the same number of observations (rows). | The example AJS dataset named linelist_raw contains 68 variables with 300 observations (rows) each. |
+| tibble | tibbles are a variation on data frame, the main operational difference being that they print more nicely to the console (display first 10 rows and only columns that fit on the screen). | Any data frame, list, or matrix can be converted to a tibble with as_tibble() |
+| list | A list is like a vector, but it can hold other objects of different classes. | A list could hold a single number, and a data frame, and a vector, and even another list within it! |
+
+
+
+
+
You can test the class of an object by providing its name to the function class(). Note: you can reference a specific column within a dataset using the $ notation to separate the name of the dataset and the name of the column.
+
+
class(linelist) # class should be a data frame or tibble
+
+
[1] "data.frame"
+
+
class(linelist$age) # class should be numeric
+
+
[1] "numeric"
+
+
class(linelist$gender) # class should be character
+
+
[1] "character"
+
+
+
Sometimes, a column will be converted to a different class automatically by R. Watch out for this! For example, if you have a vector or column of numbers, but a character value is inserted… the entire column will change to class character.
+
+
num_vector <- c(1, 2, 3, 4, 5) # define vector as all numbers
+class(num_vector) # vector is numeric class
+
+
[1] "numeric"
+
+
num_vector[3] <- "three" # convert the third element to a character
+class(num_vector) # vector is now character class
+
+
[1] "character"
+
+
+
One common example of this is when manipulating a data frame in order to print a table - if you make a total row and try to paste/glue together percents in the same cell as numbers (e.g. 23 (40%)), the entire numeric column above will convert to character and can no longer be used for mathematical calculations.
+
Sometimes, you will need to convert objects or columns to another class.
+
+
+
+
Function
+
Action
+
+
+
+
+
as.character()
+
Converts to character class
+
+
+
as.numeric()
+
Converts to numeric class
+
+
+
as.integer()
+
Converts to integer class
+
+
+
as.Date()
+
Converts to Date class - Note: see section on dates for details
+
+
+
factor()
+
Converts to factor - Note: re-defining order of value levels requires extra arguments
+
+
+
+
Likewise, there are base R functions to check whether an object IS of a specific class, such as is.numeric(), is.character(), is.double(), is.factor(), is.integer()
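The conversion and checking functions can be tried together on a small example. This is only a sketch: the vector ages_chr is invented for illustration and stands in for a column that was imported as text.

```r
# Ages that were read in as text (a made-up example)
ages_chr <- c("12", "45", "7")

class(ages_chr)        # "character"
is.numeric(ages_chr)   # FALSE

# Convert to numeric so the values can be used in calculations
ages_num <- as.numeric(ages_chr)

is.numeric(ages_num)   # TRUE
mean(ages_num)         # now a mean can be calculated

# Convert back to character if needed
ages_back <- as.character(ages_num)
is.character(ages_back) # TRUE
```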
A column in a data frame is technically a “vector” (see table above) - a series of values that must all be the same class (either character, numeric, logical, etc).
+
A vector can exist independent of a data frame, for example a vector of column names that you want to include as explanatory variables in a model. To create a “stand alone” vector, use the c() function as below:
+
+
# define the stand-alone vector of character values
+explanatory_vars <- c("gender", "fever", "chills", "cough", "aches", "vomit")
+
+# print the values in this named vector
+explanatory_vars
Columns in a data frame are also vectors and can be called, referenced, extracted, or created using the $ symbol. The $ symbol connects the name of the column to the name of its data frame. In this handbook, we try to use the word “column” instead of “variable”.
+
+
# Retrieve the length of the vector age
+length(linelist$age) # (age is a column in the linelist data frame)
+
+
By typing the name of the data frame followed by $ you will also see a drop-down menu of all columns in the data frame. You can scroll through them using your arrow key, select one with your Enter key, and avoid spelling mistakes!
+
+
+
+
+
+
+
+
+
+
ADVANCED TIP: Some more complex objects (e.g. a list, or an epicontacts object) may have multiple levels which can be accessed through multiple dollar signs. For example epicontacts$linelist$date_onset
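To illustrate chaining dollar signs, here is a small sketch that uses a plain nested list in place of a real epicontacts object. The object outbreak and its contents are hypothetical; a real epicontacts object has its own structure.

```r
# A nested list standing in for a more complex object (hypothetical data)
outbreak <- list(
  linelist = data.frame(
    case_id    = c("A1", "A2"),
    date_onset = as.Date(c("2014-05-01", "2014-05-03"))
  ),
  contacts = data.frame(from = "A1", to = "A2")
)

# First $ reaches the data frame inside the list, second $ reaches its column
onset_dates <- outbreak$linelist$date_onset
onset_dates
```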
+
+
+
+
Access/index with brackets ([ ])
+
You may need to view parts of objects, also called “indexing”, which is often done using the square brackets [ ]. Using $ on a data frame to access a column is also a type of indexing.
+
+
my_vector <- c("a", "b", "c", "d", "e", "f") # define the vector
+my_vector[5] # print the 5th element
+
+
[1] "e"
+
+
+
Square brackets also work to return specific parts of a returned output, such as the output of a summary() function:
+
+
# All of the summary
+summary(linelist$age)
+
+
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
+ 0.00 6.00 13.00 16.07 23.00 84.00 86
+
+
# Just the second element of the summary, with name (using only single brackets)
+summary(linelist$age)[2]
+
+
1st Qu.
+ 6
+
+
# Just the second element, without name (using double brackets)
+summary(linelist$age)[[2]]
+
+
[1] 6
+
+
# Extract an element by name, without showing the name
+summary(linelist$age)[["Median"]]
+
+
[1] 13
+
+
+
Brackets also work on data frames to view specific rows and columns. You can do this using the syntax data frame[rows, columns]:
+
+
# View a specific row (2) from dataset, with all columns (don't forget the comma!)
+linelist[2,]
+
+# View all rows, but just one column
+linelist[, "date_onset"]
+
+# View values from row 2 and columns 5 through 10
+linelist[2, 5:10]
+
+# View values from row 2 and columns 5 through 10 and 18
+linelist[2, c(5:10, 18)]
+
+# View rows 2 through 20, and specific columns
+linelist[2:20, c("date_onset", "outcome", "age")]
+
+# View rows and columns based on criteria
+# *** Note the data frame must still be named in the criteria!
+linelist[linelist$age > 25, c("date_onset", "outcome", "age")]
+
+# Use View() to see the outputs in the RStudio Viewer pane (easier to read)
+# *** Note the capital "V" in View() function
+View(linelist[2:20, "date_onset"])
+
+# Save as a new object
+new_table <- linelist[2:20, c("date_onset")]
+
+
Note that you can also achieve the above row/column indexing on data frames and tibbles using dplyr syntax (functions filter() for rows, and select() for columns). Read more about these core functions in the Cleaning data and core functions page.
+
To filter based on “row number”, you can use the dplyr function row_number() with empty parentheses as part of a logical filtering statement. Often you will use the %in% operator and a range of numbers as part of that logical statement, as shown below. To see the first N rows, you can also pipe the data into head(). Note that head() comes from base R (the utils package), not from dplyr; the dplyr equivalent is slice_head().
+
+
# View first 100 rows
+linelist %>% head(100)
+
+# Show row 5 only
+linelist %>% filter(row_number() == 5)
+
+# View rows 2 through 20, and three specific columns (note no quotes necessary on column names)
+linelist %>%
+  filter(row_number() %in% 2:20) %>%
+  select(date_onset, outcome, age)
+
+
When indexing an object of class list, single brackets always return with class list, even if only a single object is returned. Double brackets, however, can be used to access a single element and return a different class than list. Brackets can also be written after one another, as demonstrated below.
# define demo list
+my_list <- list(
+  # First element in the list is a character vector
+  hospitals = c("Central", "Empire", "Santa Anna"),
+
+  # second element in the list is a data frame of addresses
+  addresses = data.frame(
+    street = c("145 Medical Way", "1048 Brown Ave", "999 El Camino"),
+    city   = c("Andover", "Hamilton", "El Paso")
+  )
+)
+
+
Here is how the list looks when printed to the console. See how there are two named elements:
+
+
hospitals, a character vector
+
+
addresses, a data frame of addresses
+
+
+
my_list
+
+
$hospitals
+[1] "Central" "Empire" "Santa Anna"
+
+$addresses
+ street city
+1 145 Medical Way Andover
+2 1048 Brown Ave Hamilton
+3 999 El Camino El Paso
+
+
+
Now we extract, using various methods:
+
+
my_list[1] # this returns the element in class "list" - the element name is still displayed
+
+
$hospitals
+[1] "Central" "Empire" "Santa Anna"
+
+
my_list[[1]] # this returns only the (unnamed) character vector
+
+
[1] "Central" "Empire" "Santa Anna"
+
+
my_list[["hospitals"]] # you can also index by name of the list element
+
+
[1] "Central" "Empire" "Santa Anna"
+
+
my_list[[1]][3] # this returns the third element of the "hospitals" character vector
+
+
[1] "Santa Anna"
+
+
my_list[[2]][1] # This returns the first column ("street") of the address data frame
+
+
street
+1 145 Medical Way
+2 1048 Brown Ave
+3 999 El Camino
+
+
+
+
+
+
Remove objects
+
You can remove individual objects from your R environment by putting the name in the rm() function (no quote marks):
+
+
rm(object_name)
+
+
You can remove all objects (clear your workspace) by running:
+
+
rm(list = ls(all = TRUE))
+
+
+
+
+
+
+
+
3.11 Piping (%>%)
+
Two general approaches to working with objects are:
+
+
Pipes/tidyverse - pipes send an object from function to function - emphasis is on the action, not the object.
+
+
Define intermediate objects - an object is re-defined again and again - emphasis is on the object.
+
+
+
+
Pipes
+
Simply explained, the pipe operator passes an intermediate output from one function to the next.
+You can think of it as saying “and then”. Many functions can be linked together with %>%.
+
+
Piping emphasizes a sequence of actions, not the object the actions are being performed on.
+
+
Pipes are best when a sequence of actions must be performed on one object.
+
+
Pipes can make code cleaner, easier to read, and more intuitive.
+
+
The pipe operator %>% was first introduced by the magrittr package, which is part of the tidyverse. R 4.1.0 introduced a native (base R) pipe, written |>. The two pipes behave the same way in most situations and can often be used interchangeably. However, there are a few key differences.
+
+
The %>% pipe allows you to pass the piped object to multiple arguments, by using the . placeholder more than once.
+
The %>% pipe lets you drop parentheses when calling a function with no other arguments (i.e. drop vs drop()).
+
The %>% pipe allows you to start a pipe with . to create a function in your linking of code.
+
+
For these reasons, we recommend the magrittr pipe, %>%, over the base R pipe, |>.
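The differences listed above can be seen in a short sketch. It assumes the magrittr package is installed and R is version 4.1.0 or later (for |>); the object names are invented for illustration.

```r
library(magrittr)  # provides %>%; the base pipe |> needs R 4.1.0 or later

values <- c(1, 4, 9)

# Both pipes pass the left-hand side as the first argument of the next function
total_magrittr <- values %>% sqrt() %>% sum()
total_base     <- values |> sqrt() |> sum()

# %>% lets you drop the parentheses when there are no other arguments
total_no_parens <- values %>% sqrt %>% sum

# %>% lets you start with . to create a reusable function
root_sum <- . %>% sqrt() %>% sum()
total_function <- root_sum(values)

# %>% lets the placeholder . appear in more than one argument
doubled_range <- 5 %>% seq(from = ., to = . * 2)

# The four totals are identical
c(total_magrittr, total_base, total_no_parens, total_function)
```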
+
To read more about the differences between base R and tidyverse (magrittr) pipes, see this blog post. For more information on the tidyverse approach, please see this style guide.
+
Here is a fake example for comparison, using fictional functions to “bake a cake”. First, the pipe method:
+
+
# A fake example of how to bake a cake using piping syntax
+
+cake <- flour %>%            # to define cake, start with flour, and then...
+  add(eggs) %>%              # add eggs
+  add(oil) %>%               # add oil
+  add(water) %>%             # add water
+  mix_together(              # mix together
+    utensil = spoon,
+    minutes = 2) %>%
+  bake(degrees = 350,        # bake
+       system = "fahrenheit",
+       minutes = 35) %>%
+  let_cool()                 # let it cool down
+
+
Note that just like other R commands, pipes can be used to just display the result, or to save/re-save an object, depending on whether the assignment operator <- is involved. See both below:
+
+
# Create or overwrite object, defining as aggregate counts by age category (not printed)
+linelist_summary <- linelist %>%
+count(age_cat)
+
+
+
# Print the table of counts in the console, but don't save it
+linelist %>%
+count(age_cat)
%<>%
+This is an “assignment pipe” from the magrittr package, which pipes an object forward and also re-defines the object. It must be the first pipe operator in the chain. It is shorthand. The below two commands are equivalent:
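A minimal sketch of the equivalence, assuming the magrittr package is loaded directly (library(tidyverse) alone does not attach %<>%). The vectors scores_a and scores_b are made up for illustration.

```r
library(magrittr)  # provides both %>% and %<>%

scores_a <- c(3, 1, 2)
scores_b <- c(3, 1, 2)

# Standard approach: pipe the object forward, then assign the result back with <-
scores_a <- scores_a %>% sort()

# Assignment pipe: pipes the object forward and re-defines it in one step
scores_b %<>% sort()

identical(scores_a, scores_b)  # TRUE
```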
This approach to changing objects/data frames may be better if:
+
+
You need to manipulate multiple objects
+
+
There are intermediate steps that are meaningful and deserve separate object names
+
+
Risks:
+
+
Creating new objects for each step means creating lots of objects. If you use the wrong one you might not realize it!
+
+
Naming all the objects can be confusing.
+
+
Errors may not be easily detectable.
+
+
Either name each intermediate object, or overwrite the original, or combine all the functions together. All come with their own risks.
+
Below is the same fake “cake” example as above, but using this style:
+
+
# a fake example of how to bake a cake using this method (defining intermediate objects)
+batter_1 <- add(flour, eggs)
+batter_2 <- add(batter_1, oil)
+batter_3 <- add(batter_2, water)
+
+batter_4 <- mix_together(object = batter_3, utensil = spoon, minutes = 2)
+
+cake <- bake(batter_4, degrees = 350, system = "fahrenheit", minutes = 35)
+
+cake <- let_cool(cake)
+
+
Combine all functions together - this is difficult to read:
+
+
# an example of combining/nesting multiple functions together - difficult to read
+cake <- let_cool(bake(mix_together(batter_3, utensil = spoon, minutes = 2), degrees = 350, system = "fahrenheit", minutes = 35))
+
+
+
+
+
+
3.12 Key operators and functions
+
This section details operators in R, such as:
+
+
Definitional operators.
+
+
Relational operators (less than, equal to, …).
+
+
Logical operators (and, or…).
+
+
Handling missing values.
+
+
Mathematical operators and functions (+, -, sum(), median(), …).
+
+
The %in% operator.
+
+
+
+
Assignment operators
+
<-
+
The basic assignment operator in R is <-, used as object_name <- value.
+This assignment operator can also be written as =. We advise using <- for general R use. We also advise surrounding such operators with spaces, for readability.
+
<<-
+
When writing functions, or when using R interactively with sourced scripts, you may need to use the assignment operator <<- (from base R). This operator is used to define an object in a higher ‘parent’ R Environment. See this online reference.
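As a small sketch of how <<- differs from <-, the made-up function below updates an object that lives outside the function. The names counter and add_one() are hypothetical.

```r
# An object defined outside the function
counter <- 0

add_one <- function() {
  # <<- looks for "counter" in the parent environment and re-defines it there;
  # a plain <- would only create a temporary copy inside the function
  counter <<- counter + 1
  invisible(counter)
}

add_one()
add_one()

counter  # the outside object has been changed twice
```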
+
%<>%
+
This is an “assignment pipe” from the magrittr package, which pipes an object forward and also re-defines the object. It must be the first pipe operator in the chain. It is shorthand, as shown below in two equivalent examples:
This is used to add data to phylogenetic trees with the ggtree package. See the page on Phylogenetic trees or this online resource book.
+
+
+
+
Relational and logical operators
+
Relational operators compare values and are often used when defining new variables and subsets of datasets. Here are the common relational operators in R:
+
+
+
+
Meaning
+
Operator
+
Example
+
Example Result
+
+
+
+
+
Equal to
+
==
+
"A" == "a"
+
FALSE (because R is case sensitive). Note that == (double equals) is different from = (single equals), which acts like the assignment operator <-
Logical operators, such as AND and OR, are often used to connect relational operators and create more complicated criteria. Complex statements might require parentheses ( ) for grouping and order of application.
+
+
+
+
+
+
+
+
Meaning
+
Operator
+
+
+
+
+
AND
+
&
+
+
+
OR
+
| (vertical bar)
+
+
+
Parentheses
+
( ) Used to group criteria together and clarify order of operations
+
+
+
+
For example, below, we have a linelist with two variables we want to use to create our case definition: rdt_result, a test result, and other_cases_in_home, which tells us whether there are other cases in the household. The command below uses the function case_when() to create the new variable case_def such that:
If the values of both rdt_result and other_cases_in_home are missing
+
NA (missing)
+
+
+
If the value in rdt_result is “Positive”
+
“Confirmed”
+
+
+
If the value in rdt_result is NOT “Positive” AND the value in other_cases_in_home is “Yes”
+
“Probable”
+
+
+
If none of the above criteria are met
+
“Suspected”
+
+
+
+
Note that R is case-sensitive, so “Positive” is different than “positive”.
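The criteria in the table above could be written with case_when() roughly as follows. This is a sketch on a tiny made-up data frame (linelist_demo), assuming dplyr is installed; it is not necessarily the exact command used in the handbook.

```r
library(dplyr)

# A small, made-up linelist with the two variables used in the case definition
linelist_demo <- data.frame(
  rdt_result          = c("Positive", "Negative", "Negative", NA),
  other_cases_in_home = c("No",       "Yes",      "No",       NA)
)

linelist_demo <- linelist_demo %>%
  mutate(case_def = case_when(
    # both values missing: case definition is missing
    is.na(rdt_result) & is.na(other_cases_in_home)          ~ NA_character_,
    # positive test result
    rdt_result == "Positive"                                ~ "Confirmed",
    # not positive, but other cases in the household
    rdt_result != "Positive" & other_cases_in_home == "Yes" ~ "Probable",
    # none of the above criteria are met
    TRUE                                                    ~ "Suspected"
  ))

linelist_demo
```

Criteria are evaluated from top to bottom, so each row receives the value of the first criterion it meets.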
+
+
+
+
Missing values
+
In R, missing values are represented by the special “reserved” value NA (capital letters N and A, not in quotation marks). If you import data that records missing data in another way (e.g. 99, “Missing”), you may want to re-code those values to NA. How to do this is addressed in the Import and export page.
+
To test whether a value is NA, use the special function is.na(), which returns TRUE or FALSE.
+
+
rdt_result <- c("Positive", "Suspected", "Positive", NA) # two positive cases, one suspected, and one unknown
+is.na(rdt_result) # Tests whether the value of rdt_result is NA
+
+
[1] FALSE FALSE FALSE TRUE
+
+
+
Read more about missing, infinite, NULL, and impossible values in the page on Missing data. Learn how to convert missing values when importing data in the page on Import and export.
+
+
+
+
Mathematics and statistics
+
All the operators and functions in this section are automatically available using base R, unless a package is named explicitly (e.g. janitor:: or scales::).
+
+
Mathematical operators
+
These are often used to perform addition, division, to create new columns, etc. Below are common mathematical operators in R. Whether you put spaces around the operators is not important.
+
+
+
+
Purpose
+
Example in R
+
+
+
+
+
addition
+
2 + 3
+
+
+
subtraction
+
2 - 3
+
+
+
multiplication
+
2 * 3
+
+
+
division
+
30 / 5
+
+
+
exponent
+
2^3
+
+
+
order of operations
+
( )
+
+
+
+
+
+
Mathematical functions
+
+
+
+
Purpose
+
Function
+
+
+
+
+
rounding
+
round(x, digits = n)
+
+
+
rounding
+
janitor::round_half_up(x, digits = n)
+
+
+
ceiling (round up)
+
ceiling(x)
+
+
+
floor (round down)
+
floor(x)
+
+
+
absolute value
+
abs(x)
+
+
+
square root
+
sqrt(x)
+
+
+
exponential (e to the power of x)
+
exp(x)
+
+
+
natural logarithm
+
log(x)
+
+
+
log base 10
+
log10(x)
+
+
+
log base 2
+
log2(x)
+
+
+
+
Note: for round(), the digits = argument specifies the number of decimal places. Use signif() to round to a number of significant figures.
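A quick sketch of the difference between rounding to decimal places and rounding to significant figures (the numbers are arbitrary):

```r
# round(): digits = number of decimal places
two_decimals <- round(3.14159, digits = 2)     # 3.14

# signif(): digits = number of significant figures
two_signif_small <- signif(3.14159, digits = 2)  # 3.1
two_signif_large <- signif(123456, digits = 2)   # 120000

c(two_decimals, two_signif_small, two_signif_large)
```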
+
+
+
Scientific notation
+
The likelihood of scientific notation being used depends on the value of the scipen option.
+
From the documentation of ?options: scipen is a penalty to be applied when deciding to print numeric values in fixed or exponential notation. Positive values bias towards fixed and negative towards scientific notation: fixed notation will be preferred unless it is more than ‘scipen’ digits wider.
+
If it is set to a low number (e.g. 0, the default), scientific notation is used whenever it gives a more compact result. To “turn off” scientific notation in your R session, set it to a very high number, for example:
+
+
# turn off scientific notation
+options(scipen = 999)
+
+
+
+
Rounding
+
DANGER: round() uses “banker’s rounding”, which rounds up from a .5 only if the upper number is even. Use round_half_up() from janitor to consistently round halves up to the nearest whole number. See this explanation.
+
+
# use the appropriate rounding function for your work
+round(c(2.5, 3.5))
+
+
[1] 2 4
+
+
janitor::round_half_up(c(2.5, 3.5))
+
+
[1] 3 4
+
+
+
For rounding from proportion to percentages, you can use the function percent() from the scales package.
+
+
scales::percent(c(0.25, 0.35), accuracy = 0.1)
+
+
[1] "25.0%" "35.0%"
+
+
+
+
+
Statistical functions
+
CAUTION: The functions below will by default include missing values in calculations. Missing values will result in an output of NA, unless the argument na.rm = TRUE is specified. This can be written shorthand as na.rm = T.
+
+
+
+
Objective
+
Function
+
+
+
+
+
mean (average)
+
mean(x, na.rm = T)
+
+
+
median
+
median(x, na.rm = T)
+
+
+
standard deviation
+
sd(x, na.rm = T)
+
+
+
quantiles*
+
quantile(x, probs)
+
+
+
sum
+
sum(x, na.rm = T)
+
+
+
minimum value
+
min(x, na.rm = T)
+
+
+
maximum value
+
max(x, na.rm = T)
+
+
+
range of numeric values
+
range(x, na.rm = T)
+
+
+
summary**
+
summary(x)
+
+
+
+
Notes:
+
+
*quantile(): x is the numeric vector to examine, and probs = is a numeric vector of probabilities between 0 and 1, e.g. c(0.5, 0.8, 0.85).
+
**summary(): gives a summary on a numeric vector including mean, median, and common percentiles.
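The effect of na.rm = TRUE, and the use of probs = in quantile(), can be seen with a small made-up vector (ages) that contains one missing value:

```r
# A made-up numeric vector with one missing value
ages <- c(1, 2, 3, 4, 5, NA)

mean_with_na    <- mean(ages)                 # NA, because of the missing value
mean_without_na <- mean(ages, na.rm = TRUE)   # 3

# Quartiles: probs = is a vector of probabilities between 0 and 1
# (quantile() stops with an error, rather than returning NA,
#  if missing values are present and na.rm = TRUE is not set)
age_quartiles <- quantile(ages, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
age_quartiles

# summary() handles missing values itself and reports how many there are
summary(ages)
```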
+
+
DANGER: If providing a vector of numbers to one of the above functions, be sure to wrap the numbers within c() .
+
+
# If supplying raw numbers to a function, wrap them in c()
+mean(1, 6, 12, 10, 5, 0) # !!! INCORRECT !!!
+
+
[1] 1
+
+
mean(c(1, 6, 12, 10, 5, 0)) # CORRECT
+
+
[1] 5.666667
+
+
+
+
+
Other useful functions
+
+
+
+
+
+
+
+
+
Objective
+
Function
+
Example
+
+
+
+
+
create a sequence
+
seq(from, to, by)
+
seq(1, 10, 2)
+
+
+
repeat x, n times
+
rep(x, ntimes)
+
rep(1:3, 2) or rep(c("a", "b", "c"), 3)
+
+
+
subdivide a numeric vector
+
cut(x, n)
+
cut(linelist$age, 5)
+
+
+
take a random sample
+
sample(x, size)
+
sample(linelist$id, size = 5, replace = TRUE)
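The functions in the table above can be tried without the linelist. The sketch below uses a small made-up vector (ages) in place of linelist$age and linelist$id.

```r
# Create a sequence from 1 to 10 in steps of 2
odd_numbers <- seq(1, 10, 2)    # 1 3 5 7 9

# Repeat a vector 2 times
repeated <- rep(1:3, 2)         # 1 2 3 1 2 3

# Subdivide a numeric vector into 3 intervals of equal width (returns a factor)
ages <- c(2, 15, 33, 48, 71)
age_groups <- cut(ages, 3)
table(age_groups)

# Take a random sample of 3 values, with replacement
set.seed(1)                     # makes the random sample reproducible
sampled <- sample(ages, size = 3, replace = TRUE)
sampled
```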
+
+
+
+
+
+
+
+
%in%
+
A very useful operator for matching values, and for quickly assessing if a value is within a vector or data frame.
+
+
my_vector <- c("a", "b", "c", "d")
+
+
+
"a" %in% my_vector
+
+
[1] TRUE
+
+
"h" %in% my_vector
+
+
[1] FALSE
+
+
+
To ask if a value is not %in% a vector, put an exclamation mark (!) in front of the logic statement:
+
+
# to negate, put an exclamation in front
+!"a" %in% my_vector
+
+
[1] FALSE
+
+
!"h" %in% my_vector
+
+
[1] TRUE
+
+
+
%in% is very useful when using the dplyr function case_when(). You can define a vector previously, and then reference it later. For example:
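A minimal sketch of this pattern, assuming dplyr is installed. The data frame survey, its column hospitalized, and the new column hospitalized_clean are invented for illustration; the vector affirmative holds the values that should all count as "yes".

```r
library(dplyr)

# Define the vector once...
affirmative <- c("1", "Yes", "YES", "yes", "y", "Y", "oui", "Oui", "Si")

# A small, made-up data frame with inconsistently recorded answers
survey <- data.frame(hospitalized = c("Yes", "no", "oui", NA, "0"))

# ...and reference it later inside case_when()
survey <- survey %>%
  mutate(hospitalized_clean = case_when(
    hospitalized %in% affirmative ~ "Hospitalized",
    TRUE                          ~ "Not hospitalized or unknown"
  ))

survey
```

Note that NA %in% affirmative evaluates to FALSE, so missing values fall through to the final criterion.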
Note: If you want to detect a partial string, perhaps using str_detect() from stringr, it will not accept a character vector like c("1", "Yes", "yes", "y"). Instead, it must be given a regular expression - one condensed string with OR bars, such as “1|Yes|yes|y”. For example, str_detect(hospitalized, "1|Yes|yes|y"). See the page on Characters and strings for more information.
+
You can convert a character vector to a single regular expression string with this command:
# condense to
+affirmative_str_search <- paste0(affirmative, collapse = "|") # option with base R
+affirmative_str_search <- str_c(affirmative, collapse = "|") # option with stringr package
+
+affirmative_str_search
+
+
[1] "1|Yes|YES|yes|y|Y|oui|Oui|Si"
+
+
+
+
+
+
+
+
+
3.13 Errors & warnings
+
This section explains:
+
+
The difference between errors and warnings.
+
+
General syntax tips for writing R code.
+
+
Code assists.
+
+
Common errors and warnings and troubleshooting tips can be found in the page on Errors and help.
+
+
+
Error versus Warning
+
When a command is run, the R Console may show you warning or error messages in red text.
+
+
A warning means that R has completed your command, but had to take additional steps or produced unusual output that you should be aware of.
+
An error means that R was not able to complete your command.
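The difference can be seen in a short base R sketch (the vector values_chr is made up). The first command finishes but raises a warning; the second cannot be completed, so try() is used here only to let the rest of the script keep running.

```r
values_chr <- c("1", "2", "three")

# WARNING: R completes the command, but "three" cannot be converted,
# so it becomes NA and R warns you about it
values_num <- as.numeric(values_chr)
values_num            # the object was still created: 1 2 NA

# ERROR: R cannot sum character values, so no result is produced
sum_attempt <- try(sum(values_chr), silent = TRUE)
inherits(sum_attempt, "try-error")   # TRUE: the command failed
```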
+
+
Look for clues:
+
+
The error/warning message will often include a line number for the problem.
+
If an object “is unknown” or “not found”, perhaps you spelled it incorrectly, forgot to call a package with library(), or forgot to re-run your script after making changes.
+
+
If all else fails, copy the error message into Google along with some key terms - chances are that someone else has worked through this already!
+
+
+
+
General syntax tips
+
A few things to remember when writing commands in R, to avoid errors and warnings:
+
+
Always close parentheses - tip: count the number of opening “(” and closing parentheses “)” for each code chunk.
+
Avoid spaces in column and object names. Use underscores ( _ ) or periods ( . ) instead.
+
Keep track of and remember to separate a function’s arguments with commas.
+
R is case-sensitive, meaning Variable_A is different from variable_A.
+
+
+
+
+
Code assists
+
Any script (RMarkdown or otherwise) will give clues when you have made a mistake. For example, if you forgot to write a comma where it is needed, or to close a parenthesis, RStudio will raise a flag on that line, on the left-hand side of the script, to warn you.