diff --git a/.Rhistory b/.Rhistory
index 626ee6c4..eb574f3a 100644
--- a/.Rhistory
+++ b/.Rhistory
@@ -1,4 +1,3 @@
-############
 learnr, # interactive tutorials in RStudio Tutorial pane
 swirl, # interactive tutorials in R console
 # project and file management
@@ -510,3 +509,4 @@ bookdown::render_book(
 output_format = 'bookdown::bs4_book',
 config_file = "_bookdown.yml")
 renv::status()
+here("data", "linelists", "linelist_raw.xlsx")
diff --git a/_quarto.yml b/_quarto.yml
index 0d1f5707..d31f50a7 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -66,7 +66,7 @@ book:
     - icon: twitter
       href: "https://twitter.com/appliedepi"
     - icon: linkedin
-      href: "https://www.linkedin.com/company/appliedepi/"
+      href: "https://www.linkedin.com/company/appliedepi"
 #    - icon: github
 #      menu:
 #        - text: Source Code
diff --git a/html_outputs/index.html b/html_outputs/index.html
index 5294aca4..5dffafc6 100644
--- a/html_outputs/index.html
+++ b/html_outputs/index.html
@@ -254,7 +254,7 @@ The Epidemiologist R Handbook
@@ -889,11 +884,9 @@

Import data

10.2 Unite, split, and arrange

This section covers:

@@ -959,7 +952,7 @@

Dynamic strings
str_glue("Data include {nrow(linelist)} cases and are current to {format(Sys.Date(), '%d %b %Y')}.")
-Data include 5888 cases and are current to 30 Sep 2024.
+Data include 5888 cases and are current to 18 Oct 2024.

An alternative format is to use placeholders within the brackets and define the code in separate arguments at the end of the str_glue() function, as below. This can improve code readability if the text is long.
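A hedged reconstruction of that placeholder style, built around the n_missing_onset fragment and the output shown below (the current_date and last_hospital placeholder names are illustrative assumptions):

str_glue(
  "Linelist as of {current_date}.\nLast case hospitalized on {last_hospital}.\n{n_missing_onset} cases are missing date of onset and not shown",
  current_date    = format(Sys.Date(), '%d %b %Y'),
  last_hospital   = format(max(linelist$date_hospitalisation, na.rm = TRUE), '%d %b %Y'),
  n_missing_onset = nrow(linelist %>% filter(is.na(date_onset))))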

@@ -970,7 +963,7 @@

n_missing_onset = nrow(linelist %>% filter(is.na(date_onset))) )

-Linelist as of 30 Sep 2024.
+Linelist as of 18 Oct 2024.
 Last case hospitalized on 30 Apr 2015.
 256 cases are missing date of onset and not shown
@@ -987,8 +980,8 @@


Use str_glue_data(), which is specially made for taking data from data frame rows:
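A minimal hedged sketch (case_id, age, and outcome are column names from the handbook linelist; the exact phrasing is an assumption):

# one glued string per row of the data frame
linelist %>% 
  head(3) %>%                # first three rows only, for display
  str_glue_data("Case {case_id} is {age} years old with outcome: {outcome}")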

@@ -1053,8 +1046,8 @@

Unite columns

Here is the example data frame:


Below, we unite the three symptom columns:
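A hedged sketch of that step with tidyr's unite() (the symptom column names and the separator are assumptions):

df <- df %>% 
  unite(
    col = "symptoms",                  # name of the new united column
    symptom_1, symptom_2, symptom_3,   # columns to unite (assumed names)
    sep = ", ",                        # separator between united values
    na.rm = TRUE)                      # drop missing values while uniting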

@@ -1168,8 +1161,8 @@

Split columns

Let’s say we have a simple data frame df (defined and united in the unite section) containing a case_ID column, one character column with many symptoms, and one outcome column. Our goal is to separate the symptoms column into many columns - each one containing one symptom.


Assuming the data are piped into separate(), first provide the column to be separated. Then provide into = as a vector c( ) containing the new column names, as shown below.
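A hedged sketch (the new column names sym_1 to sym_3 are illustrative; extra = "merge" keeps any surplus text in the last column):

df %>% 
  separate(
    symptoms,                              # column to separate
    into = c("sym_1", "sym_2", "sym_3"),   # new column names
    sep = ",",                             # split at commas
    extra = "merge")                       # surplus splits kept in the last column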

@@ -1433,20 +1426,17 @@

Extract by character position

Use str_sub() to return only a part of a string. The function takes three main arguments:

-1. the character vector(s).
-2. start position.
-3. end position.
+1. The character vector(s)
+2. Start position
+3. End position

A few notes on position numbers:

-• If a position number is positive, the position is counted starting from the left end of the string.
-• If a position number is negative, it is counted starting from the right end of the string.
-• Position numbers are inclusive.
-• Positions extending beyond the string will be truncated (removed).
+• If a position number is positive, the position is counted starting from the left end of the string
+• If a position number is negative, it is counted starting from the right end of the string
+• Position numbers are inclusive
+• Positions extending beyond the string will be truncated (removed)

Below are some examples applied to the string “pneumonia”:
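For instance, a few hedged examples (str_sub() is from stringr; the outputs in the comments follow directly from the rules above):

str_sub("pneumonia", 2, 4)    # positions 2 to 4, counted from the left: "neu"
str_sub("pneumonia", -3, -1)  # negative positions count from the right: "nia"
str_sub("pneumonia", 1, 50)   # end position beyond the string is truncated: "pneumonia"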

@@ -1859,13 +1849,10 @@

-• Character sets.
-• Meta characters.
-• Quantifiers.
-• Groups.
+• Character sets
+• Meta characters
+• Quantifiers
+• Groups

Character sets

Character sets are a way of listing options for a character match, within brackets. A match will be triggered if any of the characters within the brackets are found in the string. For example, to look for vowels one could use this character set: “[aeiou]”. Some other common character sets are:

    @@ -1948,15 +1935,11 @@

    will return instances of two capital A letters.
-• "A{2,4}" will return instances of between two and four capital A letters (do not put spaces!).
-• "A{2,}" will return instances of two or more capital A letters.
-• "A+" will return instances of one or more capital A letters (group extended until a different character is encountered).
-• Precede with an * asterisk to return zero or more matches (useful if you are not sure the pattern is present).
+• "A{2}" will return instances of two capital A letters
+• "A{2,4}" will return instances of between two and four capital A letters (do not put spaces!)
+• "A{2,}" will return instances of two or more capital A letters
+• "A+" will return instances of one or more capital A letters (group extended until a different character is encountered)
+• Precede with an * asterisk to return zero or more matches (useful if you are not sure the pattern is present)

Using the + plus symbol as a quantifier, the match will occur until a different character is encountered. For example, this expression will return all words (alpha characters): "[A-Za-z]+"
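A few hedged illustrations of these sets and quantifiers with stringr (outputs shown as comments):

str_extract_all("AAA BB A", "A{2,}")             # two or more capital As: "AAA"
str_extract_all("pneumonia", "[aeiou]")          # vowel character set: "e" "u" "o" "i" "a"
str_extract_all("outbreak of 2014", "[A-Za-z]+") # all words: "outbreak", "of"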

    @@ -2063,6 +2046,10 @@

    this cheatsheet

    Also see this tutorial.

+Additionally, the package RVerbalExpressions can offer an easy way to construct regular expressions.
+
+devtools::install_github("VerbalExpressions/RVerbalExpressions")
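As a hedged sketch of its chained style (the rx_*() helper names below follow the package README and should be treated as assumptions):

library(RVerbalExpressions)

# build a regex for "http" optionally followed by "s"
pattern <- rx() %>% 
  rx_find("http") %>% 
  rx_maybe("s")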

    @@ -2664,7 +2651,7 @@

diff --git a/html_outputs/new_pages/cleaning.html b/html_outputs/new_pages/cleaning.html
index 54786b3d..bbc38d8b 100644
--- a/html_outputs/new_pages/cleaning.html
+++ b/html_outputs/new_pages/cleaning.html
@@ -317,7 +317,7 @@ The Epidemiologist R Handbook

    @@ -1480,19 +1476,16 @@

    Other statistical software such as SAS and STATA use “labels” that co-exist as longer printed versions of the shorter column names. While R does offer the possibility of adding column labels to the data, this is not emphasized in most practice. To make column names “printer-friendly” for figures, one typically adjusts their display within the plotting commands that create the outputs (e.g. axis or legend titles of a plot, or column headers in a printed table - see the scales section of the ggplot tips page and Tables for presentation pages). If you want to assign column labels in the data, read more online here and here.

Because R column names are used very often, they must have “clean” syntax. We suggest the following:

-• Short names.
-• No spaces (replace with underscores _ ).
-• No unusual characters (&, #, <, >, …).
-• Similar style nomenclature (e.g. all date columns named like date_onset, date_report, date_death…).
+• Short names
+• No spaces (replace with underscores _ )
+• No unusual characters (&, #, <, >, …)
+• Similar style nomenclature (e.g. all date columns named like date_onset, date_report, date_death…)

The column names of linelist_raw are printed below using names() from base R. We can see that initially:

-• Some names contain spaces (e.g. infection date).
-• Different naming patterns are used for dates (date onset vs. infection date).
-• There must have been a merged header across the two last columns in the .xlsx. We know this because the name of two merged columns (“merged_header”) was assigned by R to the first column, and the second column was assigned a placeholder name “…28” (as it was then empty and is the 28th column).
+• Some names contain spaces (e.g. infection date)
+• Different naming patterns are used for dates (date onset vs. infection date)
+• There must have been a merged header across the two last columns in the .xlsx. We know this because the name of two merged columns (“merged_header”) was assigned by R to the first column, and the second column was assigned a placeholder name “…28” (as it was then empty and is the 28th column)
    names(linelist_raw)
    @@ -1511,15 +1504,11 @@

    Automatic cleaning

    The function clean_names() from the package janitor standardizes column names and makes them unique by doing the following:

-• Converts all names to consist of only underscores, numbers, and letters.
-• Accented characters are transliterated to ASCII (e.g. german o with umlaut becomes “o”, spanish “enye” becomes “n”).
-• Capitalization preference for the new column names can be specified using the case = argument (“snake” is default, alternatives include “sentence”, “title”, “small_camel”…).
-• You can specify specific name replacements by providing a vector to the replace = argument (e.g. replace = c(onset = "date_of_onset")).
-• Here is an online vignette.
+• Converts all names to consist of only underscores, numbers, and letters
+• Accented characters are transliterated to ASCII (e.g. german o with umlaut becomes “o”, spanish “enye” becomes “n”)
+• Capitalization preference for the new column names can be specified using the case = argument (“snake” is default, alternatives include “sentence”, “title”, “small_camel”…)
+• You can specify specific name replacements by providing a vector to the replace = argument (e.g. replace = c(onset = "date_of_onset"))
+• Here is an online vignette

    Below, the cleaning pipeline begins by using clean_names() on the raw linelist.
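A minimal sketch of that first step (linelist_raw as imported earlier in the handbook):

# standardize column names with janitor
linelist <- linelist_raw %>% 
  janitor::clean_names()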

    @@ -1606,11 +1595,9 @@

Transition to R: merged cells can be nice for human reading of data, but are not “tidy data” and cause many problems for machine reading of data. R cannot accommodate merged cells.

    Remind people doing data entry that human-readable data is not the same as machine-readable data. Strive to train users about the principles of tidy data. If at all possible, try to change procedures so that data arrive in a tidy format without merged cells.

-• Each variable must have its own column.
-• Each observation must have its own row.
-• Each value must have its own cell.
+• Each variable must have its own column
+• Each observation must have its own row
+• Each value must have its own cell

    When using rio’s import() function, the value in a merged cell will be assigned to the first cell and subsequent cells will be empty.

    One solution to deal with merged cells is to import the data with the function readWorkbook() from the package openxlsx. Set the argument fillMergedCells = TRUE. This gives the value in a merged cell to all cells within the merge range.
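A hedged sketch (the file path is illustrative):

# give the value of a merged cell to all cells in its merge range
linelist_raw <- openxlsx::readWorkbook("linelist_raw.xlsx", fillMergedCells = TRUE)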

    @@ -1683,35 +1670,35 @@


    Here are other “tidyselect” helper functions that also work within dplyr functions like select(), across(), and summarise():

-• everything() - all other columns not mentioned.
+• everything() - all other columns not mentioned
-• last_col() - the last column.
+• last_col() - the last column
-• where() - applies a function to all columns and selects those which are TRUE.
+• where() - applies a function to all columns and selects those which are TRUE
-• contains() - columns containing a character string.
+• contains() - columns containing a character string
-  • example: select(contains("time")).
+  • example: select(contains("time"))
-• starts_with() - matches to a specified prefix.
+• starts_with() - matches to a specified prefix
-  • example: select(starts_with("date_")).
+  • example: select(starts_with("date_"))
-• ends_with() - matches to a specified suffix.
+• ends_with() - matches to a specified suffix
-  • example: select(ends_with("_post")).
+  • example: select(ends_with("_post"))
-• matches() - to apply a regular expression (regex).
+• matches() - to apply a regular expression (regex)
-  • example: select(matches("[pt]al")).
+  • example: select(matches("[pt]al"))
-• num_range() - a numerical range like x01, x02, x03.
+• num_range() - a numerical range like x01, x02, x03
-• any_of() - matches IF column exists but returns no error if it is not found.
+• any_of() - matches IF column exists but returns no error if it is not found
-  • example: select(any_of(date_onset, date_death, cardiac_arrest)).
+  • example: select(any_of(date_onset, date_death, cardiac_arrest))

    In addition, use normal operators such as c() to list several columns, : for consecutive columns, ! for opposite, & for AND, and | for OR.
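A hedged sketch combining helpers and operators (the column names follow the handbook's linelist and are assumptions here):

linelist %>% 
  select(
    c(case_id, gender),                # several columns listed with c()
    date_onset:date_hospitalisation,   # consecutive columns with :
    contains("symptom"))               # a tidyselect helper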

    @@ -1901,8 +1888,8 @@

    New columns

    Review the new columns. For demonstration purposes, only the new columns and the columns used to create them are shown:


    TIP: A variation on mutate() is the function transmute(). This function adds a new column just like mutate(), but also drops/removes all other columns that you do not mention within its parentheses.
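A short hedged illustration of that difference (age_years is a column from the handbook's linelist):

# returns ONLY case_id and the new column age_months
linelist %>% 
  transmute(case_id, age_months = age_years * 12)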

    @@ -1995,13 +1982,11 @@


    across() functions

    You can read the documentation with ?across for details on how to provide functions to across(). A few summary points: there are several ways to specify the function(s) to perform on a column and you can even define your own functions:

-• You can provide the function name alone (e.g. mean or as.character).
-• You can provide the function in purrr-style (e.g. ~ mean(.x, na.rm = TRUE)) (see this page).
-• You can specify multiple functions by providing a list (e.g. list(mean = mean, n_miss = ~ sum(is.na(.x))).
-  • If you provide multiple functions, multiple transformed columns will be returned per input column, with unique names in the format col_fn. You can adjust how the new columns are named with the .names = argument using glue syntax (see page on Characters and strings) where {.col} and {.fn} are shorthand for the input column and function.
+• You can provide the function name alone (e.g. mean or as.character)
+• You can provide the function in purrr-style (e.g. ~ mean(.x, na.rm = TRUE)) (see this page)
+• You can specify multiple functions by providing a list (e.g. list(mean = mean, n_miss = ~ sum(is.na(.x)))
+  • If you provide multiple functions, multiple transformed columns will be returned per input column, with unique names in the format col_fn. You can adjust how the new columns are named with the .names = argument using glue syntax (see page on Characters and strings) where {.col} and {.fn} are shorthand for the input column and function

    Here are a few online resources on using across(): creator Hadley Wickham’s thoughts/rationale
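A hedged sketch of the list form described above (the numeric columns are assumptions from the handbook linelist):

linelist %>% 
  summarise(across(
    c(age_years, wt_kg, ht_cm),               # columns to summarise
    list(mean   = ~ mean(.x, na.rm = TRUE),   # yields age_years_mean, etc.
         n_miss = ~ sum(is.na(.x)))))         # yields age_years_n_miss, etc.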

    @@ -2118,12 +2103,10 @@


    8.8 Re-code values

    Here are a few scenarios where you need to re-code (change) values:

-• to edit one specific value (e.g. one date with an incorrect year or format).
-• to reconcile values not spelled the same.
-• to create a new column of categorical values.
-• to create a new column of numeric categories (e.g. age categories).
+• to edit one specific value (e.g. one date with an incorrect year or format)
+• to reconcile values not spelled the same
+• to create a new column of categorical values
+• to create a new column of numeric categories (e.g. age categories)

    Specific values

    @@ -2193,8 +2176,8 @@


    By logic

    Below we demonstrate how to re-code values in a column using logic and conditions:

-• Using replace(), ifelse() and if_else() for simple logic.
-• Using case_when() for more complex logic.
+• Using replace(), ifelse() and if_else() for simple logic
+• Using case_when() for more complex logic
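A hedged sketch of both approaches (the values and column names are illustrative):

linelist %>% 
  mutate(
    # simple logic: fix one mis-recorded value
    gender = replace(gender, case_id == "2195", "Female"),
    # more complex logic: conditions are evaluated in order, first match wins
    case_age_class = case_when(
      age_years < 18  ~ "child",
      age_years >= 18 ~ "adult",
      TRUE            ~ NA_character_))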
    @@ -2321,11 +2304,9 @@

    1. Create a cleaning dictionary with 3 columns:
-  • A “from” column (the incorrect value).
-  • A “to” column (the correct value).
-  • A column specifying the column for the changes to be applied (or “.global” to apply to all columns).
+  • A “from” column (the incorrect value)
+  • A “to” column (the correct value)
+  • A column specifying the column for the changes to be applied (or “.global” to apply to all columns)

    Note: .global dictionary entries will be overridden by column-specific dictionary entries.
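A hedged sketch of applying such a dictionary with matchmaker::match_df(), the function the handbook pairs with cleaning dictionaries (the cleaning_dict object and its column names are assumptions matching the list above):

linelist <- linelist %>% 
  matchmaker::match_df(
    dictionary = cleaning_dict,  # data frame with from / to / col columns
    from = "from",               # column of incorrect values
    to   = "to",                 # column of correct values
    by   = "col")                # column naming the target column (or ".global")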

    @@ -2360,8 +2341,8 @@


    Now scroll to the right to see how values have changed - particularly gender (lowercase to uppercase), and all the symptoms columns have been transformed from yes/no to 1/0.


    Note that your column names in the cleaning dictionary must correspond to the names at this point in your cleaning script. See this online reference for the linelist package for more details.

    @@ -2436,13 +2417,10 @@


    8.9 Numeric categories

    Here we describe some special approaches for creating categories from numerical columns. Common examples include age categories, groups of lab values, etc. Here we will discuss:

-• age_categories(), from the epikit package.
-• cut(), from base R.
-• case_when().
-• quantile breaks with quantile() and ntile().
+• age_categories(), from the epikit package
+• cut(), from base R
+• case_when()
+• quantile breaks with quantile() and ntile()

    Review distribution

    @@ -2576,13 +2554,11 @@


    cut()

    cut() is a base R alternative to age_categories(), but I think you will see why age_categories() was developed to simplify this process. Some notable differences from age_categories() are:

-• You do not need to install/load another package.
-• You can specify whether groups are open/closed on the right/left.
-• You must provide accurate labels yourself.
-• If you want 0 included in the lowest group you must specify this.
+• You do not need to install/load another package
+• You can specify whether groups are open/closed on the right/left
+• You must provide accurate labels yourself
+• If you want 0 included in the lowest group you must specify this

    The basic syntax within cut() is to first provide the numeric column to be cut (age_years), and then the breaks argument, which is a numeric vector c() of break points. Using cut(), the resulting column is an ordered factor.

By default, the categorization occurs so that the right/upper side is “open” and inclusive (and the left/lower side is “closed” or exclusive). This is the opposite behavior from the age_categories() function. The default labels use the notation “(A, B]”, which means A is not included but B is. Reverse this behavior by providing the right = TRUE argument.
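A hedged sketch (the break points are illustrative):

linelist <- linelist %>% 
  mutate(age_cat = cut(
    age_years,
    breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),  # illustrative break points
    include.lowest = TRUE))                         # include 0 in the lowest group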

    @@ -3290,7 +3266,7 @@

    linelist %>%
       rowwise() %>%
-  mutate(num_symptoms = sum(c(fever, chills, cough, aches, vomit) == "yes")) %>% 
+  mutate(num_symptoms = sum(c(fever, chills, cough, aches, vomit) == "yes", na.rm = T)) %>% 
       ungroup() %>% 
       select(fever, chills, cough, aches, vomit, num_symptoms) # for display
    @@ -3313,11 +3289,9 @@


    As you specify the column to evaluate, you may want to use the “tidyselect” helper functions described in the select() section of this page. You just have to make one adjustment (because you are not using them within a dplyr function like select() or summarise()).

    Put the column-specification criteria within the dplyr function c_across(). This is because c_across (documentation) is designed to work with rowwise() specifically. For example, the following code:

-• Applies rowwise() so the following operation (sum()) is applied within each row (not summing entire columns).
-• Creates new column num_NA_dates, defined for each row as the number of columns (with name containing “date”) for which is.na() evaluated to TRUE (they are missing data).
-• ungroup() to remove the effects of rowwise() for subsequent steps.
+• Applies rowwise() so the following operation (sum()) is applied within each row (not summing entire columns)
+• Creates new column num_NA_dates, defined for each row as the number of columns (with name containing “date”) for which is.na() evaluated to TRUE (they are missing data)
+• ungroup() to remove the effects of rowwise() for subsequent steps
linelist %>%
  rowwise() %>%
  mutate(num_NA_dates = sum(is.na(c_across(contains("date"))))) %>% 
  ungroup()
    @@ -3973,7 +3947,7 @@ 


    @@ -872,8 +873,8 @@


    View the new data. Note the two columns towards the right end - the pasted combined values, and the list.

    @@ -1564,7 +1565,7 @@

diff --git a/html_outputs/new_pages/combination_analysis_files/figure-html/unnamed-chunk-1-1.png b/html_outputs/new_pages/combination_analysis_files/figure-html/unnamed-chunk-1-1.png
index eba3e4ee..9877da62 100644
Binary files a/html_outputs/new_pages/combination_analysis_files/figure-html/unnamed-chunk-1-1.png and b/html_outputs/new_pages/combination_analysis_files/figure-html/unnamed-chunk-1-1.png differ
diff --git a/html_outputs/new_pages/data_table.html b/html_outputs/new_pages/data_table.html
index fb6526dc..6c3752f8 100644
--- a/html_outputs/new_pages/data_table.html
+++ b/html_outputs/new_pages/data_table.html
@@ -285,7 +285,7 @@ The Epidemiologist R Handbook

    @@ -833,9 +833,9 @@

    linelist[hospital %like% "Hospital"] #filter rows where the hospital variable contains “Hospital”
    @@ -891,9 +891,9 @@ 

    linelist[, .N, .(hospital)] #the number of cases by hospital
diff --git a/html_outputs/new_pages/data_used.html b/html_outputs/new_pages/data_used.html
index c059ba3d..8cd7cc5a 100644
--- a/html_outputs/new_pages/data_used.html
+++ b/html_outputs/new_pages/data_used.html
@@ -289,7 +289,7 @@ The Epidemiologist R Handbook

    GIS

    @@ -1584,7 +1584,7 @@

    Shiny

@@ -1438,23 +1456,23 @@

    https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

-# assign the current time to a column
-time_now <- Sys.time()
-time_now
+# assign the current time to a column
+time_now <- Sys.time()
+time_now

-[1] "2024-09-30 19:33:19 PDT"
-
-# use with_tz() to assign a new timezone to the column, while CHANGING the clock time
-time_london_real <- with_tz(time_now, "Europe/London")
-
-# use force_tz() to assign a new timezone to the column, while KEEPING the clock time
-time_london_local <- force_tz(time_now, "Europe/London")
-
-
-# note that as long as the computer that was used to run this code is NOT set to London time,
-# there will be a difference in the times 
-# (the number of hours difference from the computers time zone to london)
-time_london_real - time_london_local
+[1] "2024-10-14 17:00:55 PDT"
+
+# use with_tz() to assign a new timezone to the column, while CHANGING the clock time
+time_london_real <- with_tz(time_now, "Europe/London")
+
+# use force_tz() to assign a new timezone to the column, while KEEPING the clock time
+time_london_local <- force_tz(time_now, "Europe/London")
+
+
+# note that as long as the computer that was used to run this code is NOT set to London time,
+# there will be a difference in the times 
+# (the number of hours difference from the computers time zone to london)
+time_london_real - time_london_local

Time difference of 8 hours
    @@ -1468,8 +1486,8 @@


When using lag() or lead(), the order of rows in the dataframe is very important - pay attention to whether your dates/numbers are ascending or descending!

    @@ -1482,25 +1500,25 @@

-counts <- counts %>% 
-  mutate(cases_prev_wk = lag(cases_wk, n = 1))
+counts <- counts %>% 
+  mutate(cases_prev_wk = lag(cases_wk, n = 1))

    Next, create a new column which is the difference between the two cases columns:

-counts <- counts %>% 
-  mutate(cases_prev_wk = lag(cases_wk, n = 1),
-         case_diff = cases_wk - cases_prev_wk)
+counts <- counts %>% 
+  mutate(cases_prev_wk = lag(cases_wk, n = 1),
+         case_diff = cases_wk - cases_prev_wk)

    You can read more about lead() and lag() in the documentation here or by entering ?lag in your console.

    @@ -2107,7 +2125,7 @@

diff --git a/html_outputs/new_pages/deduplication.html b/html_outputs/new_pages/deduplication.html
index cedf57f8..c2c6842f 100644
--- a/html_outputs/new_pages/deduplication.html
+++ b/html_outputs/new_pages/deduplication.html
@@ -317,7 +317,7 @@ The Epidemiologist R Handbook

    You can also positively specify the columns to consider. Below, only rows that have the same values in the name and purpose columns are returned. Notice how “amrish” now has dupe_count equal to 3 to reflect his three “contact” encounters.
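A hedged sketch with janitor::get_dupes(), which produces the dupe_count column referenced above:

# rows that share the same name AND purpose, with a dupe_count column
obs %>% 
  janitor::get_dupes(name, purpose)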

    @@ -935,8 +934,8 @@

    Examine

    See the original data.

    @@ -959,8 +958,8 @@

    Keep only

    CAUTION: If using distinct() on grouped data, the function will apply to each group.

    @@ -976,8 +975,8 @@

    Keep only

    See the original data.

    @@ -1044,25 +1043,35 @@

 obs %>% 
-     slice(c(2,4))  # return rows 2 and 4
+     slice(c(2, 4)) # return rows 2 and 4

   recordID personID   name       date  time encounter purpose symptoms_ever
 1        1        1   adam 2020-01-01 09:00         1 contact          <NA>
 2        3        2 amrish 2020-01-02 14:20         1 contact            No

+obs %>% 
+     slice(c(2:4))  # return rows 2 through 4
+
+  recordID personID   name       date  time encounter purpose symptoms_ever
+1        1        1   adam 2020-01-01 09:00         1 contact          <NA>
+2        2        2 amrish 2020-01-02 14:20         1 contact            No
+3        3        2 amrish 2020-01-02 14:20         1 contact            No

    See the original data.

There are several variations. These should be provided with a column and a number of rows to return (to n =):

-• slice_min() and slice_max() keep only the row(s) with the minimum or maximum value(s) of the specified column. This also works to return the “min” and “max” of ordered factors.
-• slice_head() and slice_tail() - keep only the first or last row(s).
-• slice_sample() - keep only a random sample of the rows.
+• slice_min() and slice_max() keep only the row(s) with the minimum or maximum value(s) of the specified column. This also works to return the “min” and “max” of ordered factors
+• slice_head() and slice_tail() - keep only the first or last row(s)
+• slice_sample() - keep only a random sample of the rows
-obs %>% 
-     slice_max(encounter, n = 1)  # return rows with the largest encounter number
+obs %>% 
+     slice_max(encounter, n = 1)  # return rows with the largest encounter number
      recordID personID   name       date  time encounter purpose symptoms_ever
     1        5        2 amrish 2020-01-05 16:10         3    case           Yes
    @@ -1073,13 +1082,10 @@ 

TIP: When using slice_max() and slice_min(), be sure to specify/write the n = (e.g. n = 2, not just 2). Otherwise you may get the error “Error: … is not empty”.

    NOTE: You may encounter the function top_n(), which has been superseded by the slice functions.

    @@ -1094,16 +1100,16 @@


    CAUTION: If using arrange(), specify .by_group = TRUE to have the data arranged within each group.

    DANGER: If with_ties = FALSE, the first row of a tie is kept. This may be deceptive. See how for Mariah, she has two encounters on her latest date (6 Jan) and the first (earliest) one was kept. Likely, we want to keep her later encounter on that day. See how to “break” these ties in the next example.

-obs %>% 
-  group_by(name) %>%       # group the rows by 'name'
-  slice_max(date,          # keep row per group with maximum date value 
-            n = 1,         # keep only the single highest row 
-            with_ties = F) # if there's a tie (of date), take the first row
+obs %>% 
+  group_by(name) %>%       # group the rows by 'name'
+  slice_max(date,          # keep row per group with maximum date value 
+            n = 1,         # keep only the single highest row 
+            with_ties = F) # if there's a tie (of date), take the first row

    Above, for example we can see that only Amrish’s row on 5 Jan was kept, and only Brian’s row on 7 Jan was kept. See the original data.

    @@ -1111,20 +1117,20 @@


    Multiple slice statements can be run to “break ties”. In this case, if a person has multiple encounters on their latest date, the encounter with the latest time is kept (lubridate::hm() is used to convert the character times to a sortable time class).
    Note how now, the one row kept for “Mariah” on 6 Jan is encounter 3 from 08:32, not encounter 2 at 07:25.

-# Example of multiple slice statements to "break ties"
-obs %>%
-  group_by(name) %>%
-  
-  # FIRST - slice by latest date
-  slice_max(date, n = 1, with_ties = TRUE) %>% 
-  
-  # SECOND - if there is a tie, select row with latest time; ties prohibited
-  slice_max(lubridate::hm(time), n = 1, with_ties = FALSE)
+# Example of multiple slice statements to "break ties"
+obs %>%
+  group_by(name) %>%
+  
+  # FIRST - slice by latest date
+  slice_max(date, n = 1, with_ties = TRUE) %>% 
+  
+  # SECOND - if there is a tie, select row with latest time; ties prohibited
+  slice_max(lubridate::hm(time), n = 1, with_ties = FALSE)

    In the example above, it would also have been possible to slice by encounter number, but we showed the slice on date and time for example purposes.

    @@ -1141,28 +1147,28 @@

    Keep all
  • In the original data frame, mark rows as appropriate with case_when(), based on whether their record unique identifier (recordID in this example) is present in the reduced data frame.
-# 1. Define data frame of rows to keep for analysis
-obs_keep <- obs %>%
-  group_by(name) %>%
-  slice_max(encounter, 
-            n = 1, 
-            with_ties = FALSE) # keep only latest encounter per person
-
-
-# 2. Mark original data frame
-obs_marked <- obs %>%
-
-  # make new dup_record column
-  mutate(dup_record = case_when(
-    
-    # if record is in obs_keep data frame
-    recordID %in% obs_keep$recordID ~ "For analysis", 
-    
-    # all else marked as "Ignore" for analysis purposes
-    TRUE                            ~ "Ignore"))
-
-# print
-obs_marked
+# 1. Define data frame of rows to keep for analysis
+obs_keep <- obs %>%
+  group_by(name) %>%
+  slice_max(encounter, 
+            n = 1, 
+            with_ties = FALSE) # keep only latest encounter per person
+
+
+# 2. Mark original data frame
+obs_marked <- obs %>%
+
+  # make new dup_record column
+  mutate(dup_record = case_when(
+    
+    # if record is in obs_keep data frame
+    recordID %in% obs_keep$recordID ~ "For analysis", 
+    
+    # all else marked as "Ignore" for analysis purposes
+    TRUE                            ~ "Ignore"))
+
+# print
+obs_marked
       recordID personID    name       date  time encounter purpose symptoms_ever
     1         1        1    adam 2020-01-01 09:00         1 contact          <NA>
    @@ -1208,8 +1214,8 @@ 

    Keep all


    See the original data.

    @@ -1223,18 +1229,18 @@


    This involves the function rowSums() from base R. Also used is ., which within piping refers to the data frame at that point in the pipe (in this case, it is being subset with brackets []).

    Scroll to the right to see more rows

-# create a "key variable completeness" column
-# this is a *proportion* of the columns designated as "key_cols" that have non-missing values
-
-key_cols = c("personID", "name", "symptoms_ever")
-
-obs %>% 
-  mutate(key_completeness = rowSums(!is.na(.[,key_cols]))/length(key_cols)) 
+# create a "key variable completeness" column
+# this is a *proportion* of the columns designated as "key_cols" that have non-missing values
+
+key_cols = c("personID", "name", "symptoms_ever")
+
+obs %>% 
+  mutate(key_completeness = rowSums(!is.na(.[,key_cols]))/length(key_cols)) 

    See the original data.

    @@ -1245,9 +1251,8 @@


    15.4 Roll-up values

    This section describes:

-1. How to “roll-up” values from multiple rows into just one row, with some variations.
-2. Once you have “rolled-up” values, how to overwrite/prioritize the values in each cell.
+1. How to “roll-up” values from multiple rows into just one row, with some variations
+2. Once you have “rolled-up” values, how to overwrite/prioritize the values in each cell

    This tab uses the example dataset from the Preparation tab.

    @@ -1255,66 +1260,64 @@

    Roll-up values into one row

    The code example below uses group_by() and summarise() to group rows by person, and then paste together all unique values within the grouped rows. Thus, you get one summary row per person. A few notes:

-• A suffix is appended to all new columns (“_roll” in this example).
-• If you want to show only unique values per cell, then wrap the na.omit() with unique().
-• na.omit() removes NA values, but if this is not desired it can be removed from paste0(.x).
+• A suffix is appended to all new columns (“_roll” in this example)
+• If you want to show only unique values per cell, then wrap the na.omit() with unique()
+• na.omit() removes NA values, but if this is not desired it can be removed from paste0(.x)
-# "Roll-up" values into one row per group (per "personID") 
-cases_rolled <- obs %>% 
-  
-  # create groups by name
-  group_by(personID) %>% 
-  
-  # order the rows within each group (e.g. by date)
-  arrange(date, .by_group = TRUE) %>% 
-  
-  # For each column, paste together all values within the grouped rows, separated by ";"
-  summarise(
-    across(everything(),                           # apply to all columns
-           ~paste0(na.omit(.x), collapse = "; "))) # function is defined which combines non-NA values
+# "Roll-up" values into one row per group (per "personID") 
+cases_rolled <- obs %>% 
+  
+  # create groups by name
+  group_by(personID) %>% 
+  
+  # order the rows within each group (e.g. by date)
+  arrange(date, .by_group = TRUE) %>% 
+  
+  # For each column, paste together all values within the grouped rows, separated by ";"
+  summarise(
+    across(everything(),                           # apply to all columns
+           ~paste0(na.omit(.x), collapse = "; "))) # function is defined which combines non-NA values

    The result is one row per group (ID), with entries arranged by date and pasted together. Scroll to the left to see more rows


    See the original data.

    This variation shows unique values only:

-# Variation - show unique values only 
-cases_rolled <- obs %>% 
-  group_by(personID) %>% 
-  arrange(date, .by_group = TRUE) %>% 
-  summarise(
-    across(everything(),                                   # apply to all columns
-           ~paste0(unique(na.omit(.x)), collapse = "; "))) # function is defined which combines unique non-NA values
+# Variation - show unique values only 
+cases_rolled <- obs %>% 
+  group_by(personID) %>% 
+  arrange(date, .by_group = TRUE) %>% 
+  summarise(
+    across(everything(),                                   # apply to all columns
+           ~paste0(unique(na.omit(.x)), collapse = "; "))) # function is defined which combines unique non-NA values

    This variation appends a suffix to each column.
    In this case “_roll” to signify that it has been rolled:

-# Variation - suffix added to column names 
-cases_rolled <- obs %>% 
-  group_by(personID) %>% 
-  arrange(date, .by_group = TRUE) %>% 
-  summarise(
-    across(everything(),                
-           list(roll = ~paste0(na.omit(.x), collapse = "; ")))) # _roll is appended to column names
+# Variation - suffix added to column names 
+cases_rolled <- obs %>% 
+  group_by(personID) %>% 
+  arrange(date, .by_group = TRUE) %>% 
+  summarise(
+    across(everything(),                
+           list(roll = ~paste0(na.omit(.x), collapse = "; ")))) # _roll is appended to column names
    @@ -1323,25 +1326,25 @@


    Overwrite values/hierarchy

    If you then want to evaluate all of the rolled values, and keep only a specific value (e.g. “best” or “maximum” value), you can use mutate() across the desired columns, to implement case_when(), which uses str_detect() from the stringr package to sequentially look for string patterns and overwrite the cell content.

-# CLEAN CASES
-#############
-cases_clean <- cases_rolled %>% 
-    
-    # clean Yes-No-Unknown vars: replace text with "highest" value present in the string
-    mutate(across(c(contains("symptoms_ever")),                     # operates on specified columns (Y/N/U)
-             list(mod = ~case_when(                                 # adds suffix "_mod" to new cols; implements case_when()
-               
-               str_detect(.x, "Yes")       ~ "Yes",                 # if "Yes" is detected, then cell value converts to yes
-               str_detect(.x, "No")        ~ "No",                  # then, if "No" is detected, then cell value converts to no
-               str_detect(.x, "Unknown")   ~ "Unknown",             # then, if "Unknown" is detected, then cell value converts to Unknown
-               TRUE                        ~ as.character(.x)))),   # then, anything else is kept as is
-      .keep = "unused")                                             # old columns removed, leaving only _mod columns
+# CLEAN CASES
+#############
+cases_clean <- cases_rolled %>% 
+    
+    # clean Yes-No-Unknown vars: replace text with "highest" value present in the string
+    mutate(across(c(contains("symptoms_ever")),                     # operates on specified columns (Y/N/U)
+             list(mod = ~case_when(                                 # adds suffix "_mod" to new cols; implements case_when()
+               
+               str_detect(.x, "Yes")       ~ "Yes",                 # if "Yes" is detected, then cell value converts to yes
+               str_detect(.x, "No")        ~ "No",                  # then, if "No" is detected, then cell value converts to no
+               str_detect(.x, "Unknown")   ~ "Unknown",             # then, if "Unknown" is detected, then cell value converts to Unknown
+               TRUE                        ~ as.character(.x)))),   # then, anything else is kept as is
+      .keep = "unused")                                             # old columns removed, leaving only _mod columns

    Now you can see in the column symptoms_ever that if the person EVER said “Yes” to symptoms, then only “Yes” is displayed.


    See the original data.

    @@ -1954,7 +1957,7 @@

diff --git a/html_outputs/new_pages/diagrams.html b/html_outputs/new_pages/diagrams.html
index 8df5589e..5e758448 100644
--- a/html_outputs/new_pages/diagrams.html
+++ b/html_outputs/new_pages/diagrams.html
@@ -331,7 +331,7 @@ The Epidemiologist R Handbook


    An example with perhaps a bit more applied public health context:

    @@ -936,8 +931,8 @@

}
")
    @@ -1075,8 +1070,8 @@


    Sub-graph clusters

    @@ -1148,8 +1143,8 @@


    Node shapes

    @@ -1174,8 +1169,8 @@

saved_plot
    @@ -1221,8 +1216,8 @@

    Plotting

The dataset now looks like this:


    Now plot the Sankey diagram with geom_alluvium() and geom_stratum(). You can read more about each argument by running ?geom_alluvium and ?geom_stratum in the console.
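A hedged sketch of such a plot with the ggalluvial package (the dataset name, axis columns, and count column n are assumptions):

library(ggalluvial)

ggplot(data = linelist_agg,                     # assumed aggregated data
       aes(axis1 = gender, axis2 = outcome, y = n)) +
  geom_alluvium(aes(fill = gender)) +           # the flows between strata
  geom_stratum() +                              # the vertical boxes
  geom_text(stat = "stratum", aes(label = after_stat(stratum)))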

    @@ -1288,8 +1283,8 @@

    Here is the events dataset we begin with:

    @@ -1317,8 +1312,8 @@

# print
pp

    @@ -1931,7 +1926,7 @@

diff --git a/html_outputs/new_pages/editorial_style.html b/html_outputs/new_pages/editorial_style.html
index d2db1876..0c40bf3c 100644
--- a/html_outputs/new_pages/editorial_style.html
+++ b/html_outputs/new_pages/editorial_style.html
@@ -286,7 +286,7 @@ The Epidemiologist R Handbook
    @@ -1422,11 +1419,9 @@

    Since we also have population data by ADM3, we can add this information to the case_adm3 table created previously.

    We begin with the dataframe created in the previous step case_adm3, which is a summary table of each administrative unit and its number of cases.

-1. The population data sle_adm3_pop are joined using a left_join() from dplyr on the basis of common values across column admin3pcod in the case_adm3 dataframe, and column adm_pcode in the sle_adm3_pop dataframe. See the page on Joining data.
-2. select() is applied to the new dataframe, to keep only the useful columns - total is total population.
-3. Cases per 10,000 population is calculated as a new column with mutate().
+1. The population data sle_adm3_pop are joined using a left_join() from dplyr on the basis of common values across column admin3pcod in the case_adm3 dataframe, and column adm_pcode in the sle_adm3_pop dataframe. See the page on Joining data
+2. select() is applied to the new dataframe, to keep only the useful columns - total is total population
+3. Cases per 10,000 population is calculated as a new column with mutate()
    # Add population data and calculate cases per 10K population
    @@ -1443,15 +1438,15 @@ 

+ 2 SL040208 West III   213 210252 10.1 
+ 3 SL040207 West II    184 145109 12.7 
+ 4 SL040204 East II    122  99821 12.2 
+ 5 SL040203 East I      60  68284  8.79
+ 6 SL040201 Central I   51  69683  7.32
+ 7 SL040206 West I      48  60186  7.98
+ 8 SL040202 Central II  18  23874  7.54
+ 9 SL040205 East III    17 500134  0.34
+10 <NA>     <NA>         5     NA NA   

    Join this table with the ADM3 polygons shapefile for mapping.

    @@ -1553,14 +1548,14 @@

+2 Central I   51 (((-13.22646 8.489716, -13.22648 8.48955, -13.22644 8.48…
+3 East I      60 (((-13.2129 8.494033, -13.21076 8.494026, -13.21013 8.49…
+4 East II    122 (((-13.22653 8.491883, -13.22647 8.491853, -13.22642 8.4…
+5 Central II  18 (((-13.23154 8.491768, -13.23141 8.491566, -13.23144 8.4…
+6 West III   213 (((-13.28529 8.497354, -13.28456 8.496497, -13.28403 8.4…
+7 West I      48 (((-13.24677 8.493453, -13.24669 8.493285, -13.2464 8.49…
+8 West II    184 (((-13.25698 8.485518, -13.25685 8.485501, -13.25668 8.4…
+9 East III    17 (((-13.20465 8.485758, -13.20461 8.485698, -13.20449 8.4…

    To make a column chart of case counts by region, using ggplot2, we could then call geom_col() as follows:
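A minimal hedged sketch of that call (assuming the joined object case_adm3 has columns admin3name and cases):

ggplot(data = case_adm3, aes(x = admin3name, y = cases)) + 
  geom_col() + 
  labs(x = "Region", y = "Cases")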

    @@ -1621,123 +1616,108 @@

    OpenStreetMap

    Below we describe how to achieve a basemap for a ggplot2 map using OpenStreetMap features. Alternative methods include using ggmap which requires free registration with Google (details).

OpenStreetMap is a collaborative project to create a free editable map of the world. The underlying geolocation data (e.g. locations of cities, roads, natural features, airports, schools, hospitals etc) are considered the primary output of the project.

-First we load the OpenStreetMap package, from which we will get our basemap.
-
-Then, we create the object map, which we define using the function openmap() from OpenStreetMap package (documentation). We provide the following:
-
-• upperLeft and lowerRight Two coordinate pairs specifying the limits of the basemap tile:
-  • In this case we’ve put in the max and min from the linelist rows, so the map will respond dynamically to the data.
-• zoom = (if null it is determined automatically).
-• type = which type of basemap - we have listed several possibilities here and the code is currently using the first one ([1]) “osm”.
-• mergeTiles = we chose TRUE so the basetiles are all merged into one.
+To do this, first we load in the packages we’ll need. These are the maptiles package, which we will use to get the OpenStreetMap base layer, and the tidyterra package for plotting the maptiles object.
+
+The function get_tiles() from the maptiles package can accept a variety of different inputs. These include shapefiles, as an sf object, bbox objects and SpatExtent objects. For a full list, type ?get_tiles into your console.
+
+There are a number of different ways to customise the output of get_tiles(), including changing the map provider, and saving the output so it can be accessed offline (by specifying a folder location in the cachedir = argument).
+
+Here we are going to create a map using the coordinates of the area we are interested in. To provide these in a format that can be used by get_tiles() we will wrap the coordinates with the function ext() from the terra package, which should be loaded with tidyterra. Then we will use the function geom_spatraster_rgb() to display the map.
+
+Note: If you right click on Google Maps it will display the coordinates of the point.

 # load package
-pacman::p_load(OpenStreetMap)
-
-# Fit basemap by range of lat/long coordinates. Choose tile type
-map <- OpenStreetMap::openmap(
-  upperLeft = c(max(linelist$lat, na.rm=T), max(linelist$lon, na.rm=T)),   # limits of basemap tile
-  lowerRight = c(min(linelist$lat, na.rm=T), min(linelist$lon, na.rm=T)),
-  zoom = NULL,
-  type = c("osm", "stamen-toner", "stamen-terrain", "stamen-watercolor", "esri","esri-topo")[1])
-
-If we plot this basemap right now, using autoplot.OpenStreetMap() from OpenStreetMap package, you see that the units on the axes are not latitude/longitude coordinates. It is using a different coordinate system. To correctly display the case residences (which are stored in lat/long), this must be changed.
-
-autoplot.OpenStreetMap(map)
+pacman::p_load(
+  maptiles,
+  tidyterra
+)
+
+# The coordinate extent of the area we are looking at, by taking the range of longitude and latitudes
+# Values correspond as xmin, xmax, ymin, ymax
+coordinates <- c(min(linelist$lon),
+                 max(linelist$lon),
+                 min(linelist$lat),
+                 max(linelist$lat))
+
+# Get the basemap
+basemap <- get_tiles(terra::ext(coordinates),
+                     crop = T, project = T)
+
+# Plot the tile
+ggplot() + 
+  geom_spatraster_rgb(
+    data = basemap
+  )

-Thus, we want to convert the map to latitude/longitude with the openproj() function from OpenStreetMap package. We provide the basemap map and also provide the Coordinate Reference System (CRS) we want. We do this by providing the “proj.4” character string for the WGS 1984 projection, but you can provide the CRS in other ways as well. (see this page to better understand what a proj.4 string is).
-
-# Projection WGS84
-map_latlon <- openproj(map, projection = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
-
-Now when we create the plot we see that along the axes are latitude and longitude coordinates. The coordinate system has been converted. Now our cases will plot correctly if overlaid!
-
-# Plot map. Must use "autoplot" in order to work with ggplot
-autoplot.OpenStreetMap(map_latlon)

    See the tutorials here for more info.

    28.10 Contoured density heatmaps

    Below we describe how to achieve a contoured density heatmap of cases, over a basemap, beginning with a linelist (one row per case).

-1. Create basemap tile from OpenStreetMap, as described above.
-2. Plot the cases from linelist using the latitude and longitude columns.
-3. Convert the points to a density heatmap with stat_density_2d() from ggplot2.
+1. Create basemap tile, as described above
+2. Plot the cases from linelist using the latitude and longitude columns
+3. Convert the points to a density heatmap with stat_density_2d() from ggplot2

    When we have a basemap with lat/long coordinates, we can plot our cases on top using the lat/long coordinates of their residence.

    Building on the function autoplot.OpenStreetMap() to create the basemap, ggplot2 functions will easily add on top, as shown with geom_point() below:

-# Plot map. Must be autoplotted to work with ggplot
-autoplot.OpenStreetMap(map_latlon) +                 # begin with the basemap
-  geom_point(                                       # add xy points from linelist lon and lat columns 
-    data = linelist,                                
-    aes(x = lon, y = lat),
-    size = 1, 
-    alpha = 0.5,
-    show.legend = FALSE) +                          # drop legend entirely
-  labs(x = "Longitude",                             # titles & labels
-       y = "Latitude",
-       title = "Cumulative cases")
+# Plot map. Must be autoplotted to work with ggplot
+ggplot() +
+     geom_spatraster_rgb(
+          data = basemap
+     ) + 
+     geom_point(                                       # add xy points from linelist lon and lat columns 
+    data = linelist,                                
+    aes(x = lon, y = lat),
+    size = 1, 
+    alpha = 0.5,
+    show.legend = FALSE) +                          # drop legend entirely
+  labs(x = "Longitude",                             # titles & labels
+       y = "Latitude",
+       title = "Cumulative cases")

    The map above might be difficult to interpret, especially with the points overlapping. So you can instead plot a 2d density map using the ggplot2 function stat_density_2d(). You are still using the linelist lat/lon coordinates, but a 2D kernel density estimation is performed and the results are displayed with contour lines - like a topographical map. Read the full documentation here.

-# begin with the basemap
-autoplot.OpenStreetMap(map_latlon) +
-  
-  # add the density plot
-  ggplot2::stat_density_2d(
-        data = linelist,
-        aes(
-          x = lon,
-          y = lat,
-          fill = ..level..,
-          alpha = ..level..),
-        bins = 10,
-        geom = "polygon",
-        contour_var = "count",
-        show.legend = F) +                          
-  
-  # specify color scale
-  scale_fill_gradient(low = "black", high = "red") +
-  
-  # labels 
-  labs(x = "Longitude",
-       y = "Latitude",
-       title = "Distribution of cumulative cases")
+# begin with the basemap
+ggplot() +
+     geom_spatraster_rgb(
+          data = basemap
+     ) +
+  # add the density plot
+  ggplot2::stat_density_2d(
+        data = linelist,
+        aes(
+          x = lon,
+          y = lat,
+          fill = ..level..,
+          alpha = ..level..),
+        bins = 10,
+        geom = "polygon",
+        contour_var = "count",
+        show.legend = F) +                          
+  
+  # specify color scale
+  scale_fill_gradient(low = "black", high = "red") +
+  
+  # labels 
+  labs(x = "Longitude",
+       y = "Latitude",
+       title = "Distribution of cumulative cases")

    @@ -1748,55 +1728,54 @@

    Time series

    The density heatmap above shows cumulative cases. We can examine the outbreak over time and space by faceting the heatmap based on the month of symptom onset, as derived from the linelist.

    We begin in the linelist, creating a new column with the Year and Month of onset. The format() function from base R changes how a date is displayed. In this case we want “YYYY-MM”.

-# Extract month of onset
-linelist <- linelist %>% 
-  mutate(date_onset_ym = format(date_onset, "%Y-%m"))
-
-# Examine the values 
-table(linelist$date_onset_ym, useNA = "always")
+# Extract month of onset
+linelist <- linelist %>% 
+  mutate(date_onset_ym = format(date_onset, "%Y-%m"))
+
+# Examine the values 
+table(linelist$date_onset_ym, useNA = "always")

-2014-05 2014-06 2014-07 2014-08 2014-09 2014-10 2014-11 2014-12 2015-01 2015-02 
-     15      25      37      83     186     193     117     112      67      52 
-2015-03 2015-04    <NA> 
-     42      29      42 
+2014-04 2014-05 2014-06 2014-07 2014-08 2014-09 2014-10 2014-11 2014-12 2015-01 
+      2      14      14      43      72     184     176     148      94      76 
+2015-02 2015-03 2015-04    <NA> 
+     48      50      35      44 

    Now, we simply introduce facetting via ggplot2 to the density heatmap. facet_wrap() is applied, using the new column as rows. We set the number of facet columns to 3 for clarity.

-# packages
-pacman::p_load(OpenStreetMap, tidyverse)
-
-# begin with the basemap
-autoplot.OpenStreetMap(map_latlon) +
-  
-  # add the density plot
-  ggplot2::stat_density_2d(
-        data = linelist,
-        aes(
-          x = lon,
-          y = lat,
-          fill = ..level..,
-          alpha = ..level..),
-        bins = 10,
-        geom = "polygon",
-        contour_var = "count",
-        show.legend = F) +                          
-  
-  # specify color scale
-  scale_fill_gradient(low = "black", high = "red") +
-  
-  # labels 
-  labs(x = "Longitude",
-       y = "Latitude",
-       title = "Distribution of cumulative cases over time") +
-  
-  # facet the plot by month-year of onset
-  facet_wrap(~ date_onset_ym, ncol = 4)               
+# begin with the basemap
+ggplot() +
+     geom_spatraster_rgb(
+          data = basemap
+     ) +  
+  # add the density plot
+  ggplot2::stat_density_2d(
+        data = linelist,
+        aes(
+          x = lon,
+          y = lat,
+          fill = ..level..,
+          alpha = ..level..),
+        bins = 10,
+        geom = "polygon",
+        contour_var = "count",
+        show.legend = F) +                          
+  
+  # specify color scale
+  scale_fill_gradient(low = "black", high = "red") +
+  
+  # labels 
+  labs(x = "Longitude",
+       y = "Latitude",
+       title = "Distribution of cumulative cases over time") +
+  
+  # facet the plot by month-year of onset
+  facet_wrap(~ date_onset_ym, ncol = 4)               

    @@ -1812,11 +1791,11 @@

    Spatial r

    Before we can calculate any spatial statistics, we need to specify the relationships between features in our data. There are many ways to conceptualize spatial relationships, but a simple and commonly-applicable model to use is that of adjacency - specifically, that we expect a geographic relationship between areas that share a border or “neighbour” one another.

    We can quantify adjacency relationships between administrative region polygons in the sle_adm3 data we have been using with the spdep package. We will specify queen contiguity, which means that regions will be neighbors if they share at least one point along their borders. The alternative would be rook contiguity, which requires that regions share an edge - in our case, with irregular polygons, the distinction is trivial, but in some cases the choice between queen and rook can be influential.

    -
    sle_nb <- spdep::poly2nb(sle_adm3_dat, queen = T) # create neighbors 
    -sle_adjmat <- spdep::nb2mat(sle_nb)    # create matrix summarizing neighbor relationships
    -sle_listw <- spdep::nb2listw(sle_nb)   # create listw (list of weights) object -- we will need this later
    -
    -sle_nb
    +
    sle_nb <- spdep::poly2nb(sle_adm3_dat, queen = T) # create neighbors 
    +sle_adjmat <- spdep::nb2mat(sle_nb)    # create matrix summarizing neighbor relationships
    +sle_listw <- spdep::nb2listw(sle_nb)   # create listw (list of weights) object -- we will need this later
    +
    +sle_nb
    Neighbour list object:
     Number of regions: 9 
    @@ -1824,7 +1803,7 @@ Spatial r

    Percentage nonzero weights: 37.03704 
    Average number of links: 3.333333 

    -
    round(sle_adjmat, digits = 2)
    +
    round(sle_adjmat, digits = 2)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
     1 0.00 0.20 0.00 0.20 0.00  0.2 0.00 0.20 0.20
    @@ -1843,20 +1822,20 @@ 

    Spatial r

    The matrix printed above shows the relationships between the 9 regions in our sle_adm3 data. A score of 0 indicates two regions are not neighbors, while any value other than 0 indicates a neighbor relationship. The values in the matrix are scaled so that each region has a total row weight of 1.

    A better way to visualize these neighbor relationships is by plotting them:

    -
    plot(sle_adm3_dat$geometry) +                                           # plot region boundaries
    -  spdep::plot.nb(sle_nb,as(sle_adm3_dat, 'Spatial'), col = 'grey', add = T) # add neighbor relationships
    +
    plot(sle_adm3_dat$geometry) +                                           # plot region boundaries
    +  spdep::plot.nb(sle_nb,as(sle_adm3_dat, 'Spatial'), col = 'grey', add = T) # add neighbor relationships
    -

    +

    We have used an adjacency approach to identify neighboring polygons; the neighbors we identified are also sometimes called contiguity-based neighbors. But this is just one way of choosing which regions are expected to have a geographic relationship. The most common alternative approaches for identifying geographic relationships generate distance-based neighbors; briefly, these are (a short sketch follows the list):

    -• K-nearest neighbors - Based on the distance between centroids (the geographically-weighted center of each polygon region), select the n closest regions as neighbors. A maximum-distance proximity threshold may also be specified. In spdep, you can use knearneigh() (see documentation).
    -• Distance threshold neighbors - Select all neighbors within a distance threshold. In spdep, these neighbor relationships can be identified using dnearneigh() (see documentation).
    +• K-nearest neighbors - Based on the distance between centroids (the geographically-weighted center of each polygon region), select the n closest regions as neighbors. A maximum-distance proximity threshold may also be specified. In spdep, you can use knearneigh() (see documentation)
    +• Distance threshold neighbors - Select all neighbors within a distance threshold. In spdep, these neighbor relationships can be identified using dnearneigh() (see documentation)
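A minimal sketch of both approaches, assuming the sle_adm3_dat object used above and, for dnearneigh(), coordinates in meters (adjust d2 to your CRS units):

pacman::p_load(sf, spdep)

centroids <- sf::st_centroid(sf::st_geometry(sle_adm3_dat))   # polygon centroids

# K-nearest neighbors: link each region to its 3 closest centroids
sle_knn <- spdep::knn2nb(spdep::knearneigh(centroids, k = 3))

# Distance threshold neighbors: all centroids within 50 km
sle_dnn <- spdep::dnearneigh(centroids, d1 = 0, d2 = 50000)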

    @@ -1865,10 +1844,10 @@

    Spatial

    Moran’s I - This is a global summary statistic of the correlation between the value of a variable in one region, and the values of the same variable in neighboring regions. The Moran’s I statistic typically ranges from -1 to 1. A value of 0 indicates no pattern of spatial correlation, while values closer to 1 or -1 indicate stronger spatial autocorrelation (similar values close together) or spatial dispersion (dissimilar values close together), respectively.

    For an example, we will calculate a Moran’s I statistic to quantify the spatial autocorrelation in Ebola cases we mapped earlier (remember, this is a subset of cases from the simulated epidemic linelist dataframe). The spdep package has a function, moran.test, that can do this calculation for us:

    -
    moran_i <-spdep::moran.test(sle_adm3_dat$cases,    # numeric vector with variable of interest
    -                            listw = sle_listw)       # listw object summarizing neighbor relationships
    -
    -moran_i                                            # print results of Moran's I test
    +
    moran_i <-spdep::moran.test(sle_adm3_dat$cases,    # numeric vector with variable of interest
    +                            listw = sle_listw)       # listw object summarizing neighbor relationships
    +
    +moran_i                                            # print results of Moran's I test
    
         Moran I test under randomisation
    @@ -1876,38 +1855,38 @@ Spatial

    data:  sle_adm3_dat$cases  
    weights: sle_listw    

    -Moran I statistic standard deviate = 1.5687, p-value = 0.05836
    +Moran I statistic standard deviate = 1.3956, p-value = 0.08142
    alternative hypothesis: greater
    sample estimates:
    Moran I statistic       Expectation          Variance 
    -       0.19807329       -0.12500000        0.04241628 
    +       0.16638757       -0.12500000        0.04359253 
    -The output from the moran.test() function shows us a Moran I statistic of 0.2. This indicates the presence of spatial autocorrelation in our data - specifically, that regions with similar numbers of Ebola cases are likely to be close together. The p-value provided by moran.test() is generated by comparison to the expectation under null hypothesis of no spatial autocorrelation, and can be used if you need to report the results of a formal hypothesis test.

    +The output from the moran.test() function shows us a Moran I statistic of 0.17. This indicates the presence of spatial autocorrelation in our data - specifically, that regions with similar numbers of Ebola cases are likely to be close together. The p-value provided by moran.test() is generated by comparison to the expectation under the null hypothesis of no spatial autocorrelation, and can be used if you need to report the results of a formal hypothesis test.

    Local Moran’s I - We can decompose the (global) Moran’s I statistic calculated above to identify localized spatial autocorrelation; that is, to identify specific clusters in our data. This statistic, which is sometimes called a Local Indicator of Spatial Association (LISA) statistic, summarizes the extent of spatial autocorrelation around each individual region. It can be useful for finding “hot” and “cold” spots on the map.

    To show an example, we can calculate and map Local Moran’s I for the Ebola case counts used above, with the localmoran() function from spdep:

    -
    # calculate local Moran's I
    -local_moran <- spdep::localmoran(                  
    -  sle_adm3_dat$cases,                              # variable of interest
    -  listw = sle_listw                                  # listw object with neighbor weights
    -)
    -
    -# join results to sf data
    -sle_adm3_dat<- cbind(sle_adm3_dat, local_moran)    
    -
    -# plot map
    -ggplot(data = sle_adm3_dat) +
    -  geom_sf(aes(fill = Ii)) +
    -  theme_bw() +
    -  scale_fill_gradient2(low = "#2c7bb6", mid = "#ffffbf", high = "#d7191c",
    -                       name = "Local Moran's I") +
    -  labs(title = "Local Moran's I statistic for Ebola cases",
    -       subtitle = "Admin level 3 regions, Sierra Leone")
    +
    # calculate local Moran's I
    +local_moran <- spdep::localmoran(                  
    +  sle_adm3_dat$cases,                              # variable of interest
    +  listw = sle_listw                                  # listw object with neighbor weights
    +)
    +
    +# join results to sf data
    +sle_adm3_dat<- cbind(sle_adm3_dat, local_moran)    
    +
    +# plot map
    +ggplot(data = sle_adm3_dat) +
    +  geom_sf(aes(fill = Ii)) +
    +  theme_bw() +
    +  scale_fill_gradient2(low = "#2c7bb6", mid = "#ffffbf", high = "#d7191c",
    +                       name = "Local Moran's I") +
    +  labs(title = "Local Moran's I statistic for Ebola cases",
    +       subtitle = "Admin level 3 regions, Sierra Leone")
    -

    +

    @@ -1915,27 +1894,27 @@

    Spatial

    **Getis-Ord Gi*** - This is another statistic that is commonly used for hotspot analysis; in large part, the popularity of this statistic relates to its use in the Hot Spot Analysis tool in ArcGIS. It is based on the assumption that, typically, the difference in a variable’s value between neighboring regions should follow a normal distribution. It uses a z-score approach to identify regions that have significantly higher (hot spot) or significantly lower (cold spot) values of a specified variable, compared to their neighbors.

    We can calculate and map the Gi* statistic using the localG() function from spdep:

    -
    # Perform local G analysis
    -getis_ord <- spdep::localG(
    -  sle_adm3_dat$cases,
    -  sle_listw
    -)
    -
    -# join results to sf data
    -sle_adm3_dat$getis_ord <- as.numeric(getis_ord)
    -
    -# plot map
    -ggplot(data=sle_adm3_dat) +
    -  geom_sf(aes(fill = getis_ord)) +
    -  theme_bw() +
    -  scale_fill_gradient2(low="#2c7bb6", mid = "#ffffbf", high = "#d7191c",
    -                       name = "Gi*") +
    -  labs(title = "Getis-Ord Gi* statistic for Ebola cases",
    -       subtitle = "Admin level 3 regions, Sierra Leone")
    +
    # Perform local G analysis
    +getis_ord <- spdep::localG(
    +  sle_adm3_dat$cases,
    +  sle_listw
    +)
    +
    +# join results to sf data
    +sle_adm3_dat$getis_ord <- as.numeric(getis_ord)
    +
    +# plot map
    +ggplot(data=sle_adm3_dat) +
    +  geom_sf(aes(fill = getis_ord)) +
    +  theme_bw() +
    +  scale_fill_gradient2(low="#2c7bb6", mid = "#ffffbf", high = "#d7191c",
    +                       name = "Gi*") +
    +  labs(title = "Getis-Ord Gi* statistic for Ebola cases",
    +       subtitle = "Admin level 3 regions, Sierra Leone")
    -

    +

    @@ -1944,36 +1923,36 @@

    Spatial

    Lee’s L test - This is a statistical test for bivariate spatial correlation. It allows you to test whether the spatial pattern for a given variable x is similar to the spatial pattern of another variable, y, that is hypothesized to be related spatially to x.

    To give an example, let’s test whether the spatial pattern of Ebola cases from the simulated epidemic is correlated with the spatial pattern of population. To start, we need to have a population variable in our sle_adm3 data. We can use the total variable from the sle_adm3_pop dataframe that we loaded earlier.

    -
    sle_adm3_dat <- sle_adm3_dat %>% 
    -  rename(population = total)                          # rename 'total' to 'population'
    +
    sle_adm3_dat <- sle_adm3_dat %>% 
    +  rename(population = total)                          # rename 'total' to 'population'

    We can quickly visualize the spatial patterns of the two variables side by side, to see whether they look similar:

    -
    tmap_mode("plot")
    -
    -cases_map <- tm_shape(sle_adm3_dat) + 
    -     tm_polygons("cases") + tm_layout(main.title = "Cases")
    -pop_map <- tm_shape(sle_adm3_dat) + tm_polygons("population") + 
    -     tm_layout(main.title = "Population")
    -
    -tmap_arrange(cases_map, pop_map, ncol = 2)   # arrange into 2x1 facets
    +
    tmap_mode("plot")
    +
    +cases_map <- tm_shape(sle_adm3_dat) + 
    +     tm_polygons("cases") + tm_layout(main.title = "Cases")
    +pop_map <- tm_shape(sle_adm3_dat) + tm_polygons("population") + 
    +     tm_layout(main.title = "Population")
    +
    +tmap_arrange(cases_map, pop_map, ncol = 2)   # arrange into 2x1 facets
    -

    +

    Visually, the patterns seem dissimilar. We can use the lee.test() function in spdep to test statistically whether the pattern of spatial autocorrelation in the two variables is related. The L statistic will be close to 0 if there is no correlation between the patterns, close to 1 if there is a strong positive correlation (i.e. the patterns are similar), and close to -1 if there is a strong negative correlation (i.e. the patterns are inverse).

    -
    lee_test <- spdep::lee.test(
    -  x = sle_adm3_dat$cases,          # variable 1 to compare
    -  y = sle_adm3_dat$population,     # variable 2 to compare
    -  listw = sle_listw                # listw object with neighbor weights
    -)
    -
    -lee_test
    +
    lee_test <- spdep::lee.test(
    +  x = sle_adm3_dat$cases,          # variable 1 to compare
    +  y = sle_adm3_dat$population,     # variable 2 to compare
    +  listw = sle_listw                # listw object with neighbor weights
    +)
    +
    +lee_test
    
         Lee's L statistic randomisation
    @@ -1981,14 +1960,14 @@ Spatial

    data:  sle_adm3_dat$cases , sle_adm3_dat$population 
    weights: sle_listw  

    -Lee's L statistic standard deviate = -0.84013, p-value = 0.7996
    +Lee's L statistic standard deviate = -0.90934, p-value = 0.8184
    alternative hypothesis: greater
    sample estimates:
    Lee's L statistic       Expectation          Variance 
    -      -0.14182377       -0.04850528        0.01233807 
    +      -0.14925304       -0.04823804        0.01234005 
    -The output above shows that the Lee's L statistic for our two variables was -0.14, which indicates weak negative correlation. This confirms our visual assessment that the pattern of cases and population are not related to one another, and provides evidence that the spatial pattern of cases is not strictly a result of population density in high-risk areas.

    +The output above shows that the Lee's L statistic for our two variables was -0.15, which indicates weak negative correlation. This confirms our visual assessment that the patterns of cases and population are not related to one another, and provides evidence that the spatial pattern of cases is not strictly a result of population density in high-risk areas.

    Lee’s L statistic can be useful for making these kinds of inferences about the relationship between spatially distributed variables; however, to describe the nature of the relationship between two variables in more detail, or to adjust for confounding, spatial regression techniques will be needed. These are described briefly in the following section.

    @@ -2006,7 +1985,7 @@

    Spatial regr
    -

    +

    @@ -2626,7 +2605,7 @@

    - +
    + @@ -1157,7 +1157,7 @@

    Add counts

    Add totals

    -To easily add total sum rows or columns after using tally() or count(), see the janitor section of the Descriptive tables page. This package offers functions like adorn_totals() and adorn_percentages() to add totals and convert to show percentages. Below is a brief example:

    +To easily add total sum rows or columns after using tally() or count(), see the janitor section of the Descriptive tables page. This package offers functions like adorn_totals() and adorn_percentages() to add totals and convert to show percentages. Below is a brief example:

    linelist %>%                                  # case linelist
       tabyl(age_cat, gender) %>%                  # cross-tabulate counts of two columns
    @@ -1183,7 +1183,7 @@ 

    Add totals

    Total 2,807 (100.0%) 2,803 (100.0%) 278 (100.0%)
    -To add more complex totals rows that involve summary statistics other than sums, see this section of the Descriptive Tables page.

    +To add more complex totals rows that involve summary statistics other than sums, see this section of the Descriptive Tables page.

    @@ -1201,8 +1201,8 @@

    Lineli
    -
    - +
    +

    Below we add the complete() command to ensure every day in the range is represented.
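A minimal sketch of this step, assuming a data frame of daily counts with columns date and n (names illustrative):

pacman::p_load(dplyr, tidyr)

daily_counts <- daily_counts %>% 
  complete(                                      # add a row for every day in the range
    date = seq.Date(min(date, na.rm = TRUE),
                    max(date, na.rm = TRUE),
                    by = "day"),
    fill = list(n = 0))                          # days with no rows get a count of 0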

    @@ -1219,8 +1219,8 @@

    Lineli
    -
    - +
    +

    @@ -1243,8 +1243,8 @@

    Linel

    Here are the first 50 rows of the resulting data frame:

    -
    - +
    +
    @@ -1266,8 +1266,8 @@

    Line
    -
    - +
    +
    @@ -2028,7 +2028,7 @@

    - +
    +
    @@ -862,8 +860,8 @@

    Make case
    -
    - +
    +

    @@ -877,8 +875,8 @@

    Make
    -
    - +
    +

    Next, we use joins to procure the ages of the infectors. This is not simple, because in the linelist the infectors’ ages are not listed as such. We achieve this result by joining the case linelist to the infectors. We begin with the infectors, and left_join() (add) the case linelist, such that the infector id column in the left-side “baseline” data frame joins to the case_id column in the right-side linelist data frame.
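A minimal sketch of that first join, with the infectors data frame and its infector column assumed from the description above:

pacman::p_load(dplyr)

infector_ages <- infectors %>%                  # "baseline" data frame of infectors
  left_join(
    linelist %>% select(case_id, age_cat),      # ages taken from the case linelist
    by = c("infector" = "case_id"))             # infector id matched to case_id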

    @@ -893,8 +891,8 @@

    Make
    -
    - +
    +

    Then, we combine the cases and their ages with the infectors and their ages. Each of these data frames has the column infector, so it is used for the join. The first rows are displayed below:
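As a sketch, again with assumed object names:

ages_combined <- cases_ages %>%                 # hypothetical: cases with their ages
  left_join(infector_ages, by = "infector")     # add the infectors and their ages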

    @@ -914,8 +912,8 @@

    Make
    -
    - +
    +

    Below, a simple cross-tabulation of counts between the case and infector age groups. Labels added for clarity.
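A sketch of such a cross-tabulation, assuming the combined data frame from above with age-category columns for the case and the infector (column names illustrative):

table(cases     = ages_combined$age_cat,
      infectors = ages_combined$age_cat_infector)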

    @@ -943,8 +941,8 @@

    Make
    -
    - +
    +

    Now we do the same, but apply prop.table() from base R to the table so instead of counts we get proportions of the total. The first 50 rows are shown below.
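For example, wrapping the same table (names as assumed above):

prop.table(table(ages_combined$age_cat,
                 ages_combined$age_cat_infector))   # proportions of the grand total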

    @@ -955,8 +953,8 @@

    Make
    -
    - +
    +
    @@ -965,8 +963,7 @@

    Make

    Create heat plot

    Now, finally, we can create the heat plot with the ggplot2 package, using the geom_tile() function; a sketch follows the list below. See the ggplot tips page to learn more extensively about color/fill scales, especially the scale_fill_gradient() function.

    -• In the aesthetics aes() of geom_tile() set the x and y as the case age and infector age.
    +• In the aesthetics aes() of geom_tile() set the x and y as the case age and infector age
    • Also in aes() set the argument fill = to the Freq column - this is the value that will be converted to a tile color.
    • Set a scale color with scale_fill_gradient() - you can specify the high/low colors.

    @@ -1015,8 +1012,8 @@
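A sketch following those points, assuming a long-format data frame (e.g. from as.data.frame() applied to the proportions table) with columns age_case, age_infector, and Freq (all names illustrative):

pacman::p_load(ggplot2)

ggplot(data = age_long) +
  geom_tile(aes(x = age_case, y = age_infector, fill = Freq)) +
  scale_fill_gradient(low = "blue", high = "orange") +    # specify high/low colors
  labs(x = "Case age", y = "Infector age", fill = "Proportion")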

      Data prepara
      -
      - +
      +
      @@ -1035,15 +1032,11 @@

      Aggrega

    • The function summarise() creates new columns reflecting summary statistics per facility-week group:
    -• Number of days per week (7 - a static value).
    -• Number of reports received from the facility-week (could be more than 7!).
    -• Sum of malaria cases reported by the facility-week (just for interest).
    -• Number of unique days in the facility-week for which there is data reported.
    -• Percent of the 7 days per facility-week for which data was reported.
    +• Number of days per week (7 - a static value)
    +• Number of reports received from the facility-week (could be more than 7!)
    +• Sum of malaria cases reported by the facility-week (just for interest)
    +• Number of unique days in the facility-week for which there is data reported
    +• Percent of the 7 days per facility-week for which data was reported
    • The data frame is joined with right_join() to a comprehensive list of all possible facility-week combinations, to make the dataset complete. The matrix of all possible combinations is created by applying expand() to those two columns of the data frame as it is at that moment in the pipe chain (represented by .). Because a right_join() is used, all rows in the expand() data frame are kept, and added to agg_weeks if necessary. These new rows appear with NA (missing) summarized values.
    @@ -1070,8 +1063,8 @@

      Aggrega

      The new week column can be seen on the far right of the data frame

      -
      - +
      +

      Now we group the data into facility-weeks and summarise them to produce statistics per facility-week. See the page on Descriptive tables for tips. The grouping itself doesn’t change the data frame, but it impacts how the subsequent summary statistics are calculated.
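A sketch of this grouping and summary, with the statistics from the list above (object and column names assumed):

pacman::p_load(dplyr)

agg_weeks <- facility_data %>%                         # hypothetical daily reporting data
  group_by(location_name, week) %>% 
  summarise(
    n_days_potential = 7,                              # days per week (a static value)
    n_reports        = n(),                            # reports received that week
    malaria_tot      = sum(malaria_tot, na.rm = TRUE), # sum of reported cases
    n_days_reported  = length(unique(data_date)),      # unique days with data
    p_days_reported  = 100 * n_days_reported / 7)      # % of the 7 days reported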

      @@ -1094,8 +1087,8 @@

      Aggrega
      -
      - +
      +

      Finally, we run the command below to ensure that ALL possible facility-weeks are present in the data, even if they were missing before.
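A minimal sketch of that completion step, as described in the overview above (names assumed):

agg_weeks <- agg_weeks %>% 
  right_join(
    tidyr::expand(., location_name, week),     # every possible facility-week combination
    by = c("location_name", "week"))           # new combinations appear with NA values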

      @@ -1108,8 +1101,8 @@

      Aggrega

      Here is expanded_weeks, with 180 rows:

      -
      - +
      +

      Before running this code, agg_weeks contains 107 rows.

      @@ -1130,17 +1123,12 @@

      Aggrega

      Create heat plot

      The ggplot() is made using geom_tile() from the ggplot2 package (a sketch follows the list):

    -• Weeks on the x-axis is transformed to dates, allowing use of scale_x_date().
    -• location_name on the y-axis will show all facility names.
    -• The fill is p_days_reported, the performance for that facility-week (numeric).
    -• scale_fill_gradient() is used on the numeric fill, specifying colors for high, low, and NA.
    -• scale_x_date() is used on the x-axis specifying labels every 2 weeks and their format.
    -• Display themes and labels can be adjusted as necessary.
    +• Weeks on the x-axis is transformed to dates, allowing use of scale_x_date()
    +• location_name on the y-axis will show all facility names
    +• The fill is p_days_reported, the performance for that facility-week (numeric)
    +• scale_fill_gradient() is used on the numeric fill, specifying colors for high, low, and NA
    +• scale_x_date() is used on the x-axis specifying labels every 2 weeks and their format
    +• Display themes and labels can be adjusted as necessary
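A sketch assembling those pieces (column names assumed from the steps above):

pacman::p_load(ggplot2)

ggplot(data = agg_weeks) +
  geom_tile(aes(x = week, y = location_name, fill = p_days_reported)) +
  scale_fill_gradient(low = "red", high = "darkgreen", na.value = "grey80") +
  scale_x_date(date_breaks = "2 weeks", date_labels = "%d %b") +
  labs(x = "Week", y = "Facility", fill = "% of days\nreported")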
      @@ -1235,8 +1223,8 @@ Ordered y-axis

      See the data frame below:

      -
      - +
      +

      Now use a column from the above data frame (facility_order$location_name) to be the order of the factor levels of location_name in the data frame agg_weeks:

      @@ -1245,10 +1233,16 @@ Ordered y-axis

      pacman::p_load(forcats)

      # create factor and define levels manually
      -agg_weeks <- agg_weeks %>% 
      -  mutate(location_name = fct_relevel(
      -    location_name, facility_order$location_name)
      -    )
      +numerical_order <- gsub("Facility ", "", facility_order$location_name) %>% 
      +  as.numeric() %>% 
      +  sort()
      +
      +facilities_in_order <- str_c("Facility ", numerical_order, sep = "")
      +
      +agg_weeks <- agg_weeks %>% 
      +  mutate(location_name = fct_relevel(
      +    location_name, facilities_in_order)
      +    )

      And now the data are re-plotted, with location_name being an ordered factor:

      @@ -1975,7 +1969,7 @@

      var lightboxQuarto = GLightbox({"loop":false,"descPosition":"bottom","selector":".lightbox","openEffect":"zoom","closeEffect":"zoom"}); (function() { let previousOnload = window.onload; window.onload = () => { diff --git a/html_outputs/new_pages/heatmaps_files/figure-html/unnamed-chunk-36-1.png b/html_outputs/new_pages/heatmaps_files/figure-html/unnamed-chunk-36-1.png index 168ca59f..8a4364a4 100644 Binary files a/html_outputs/new_pages/heatmaps_files/figure-html/unnamed-chunk-36-1.png and b/html_outputs/new_pages/heatmaps_files/figure-html/unnamed-chunk-36-1.png differ diff --git a/html_outputs/new_pages/heatmaps_files/figure-html/unnamed-chunk-37-1.png b/html_outputs/new_pages/heatmaps_files/figure-html/unnamed-chunk-37-1.png index 95c42cdf..af72c647 100644 Binary files a/html_outputs/new_pages/heatmaps_files/figure-html/unnamed-chunk-37-1.png and b/html_outputs/new_pages/heatmaps_files/figure-html/unnamed-chunk-37-1.png differ diff --git a/html_outputs/new_pages/help.html b/html_outputs/new_pages/help.html index f5531305..dd86c56c 100644 --- a/html_outputs/new_pages/help.html +++ b/html_outputs/new_pages/help.html @@ -7,7 +7,7 @@ -48  Getting help – The Epidemiologist R Handbook +49  Getting help – The Epidemiologist R Handbook @@ -1883,8 +1883,8 @@

      gtsummary

      -
      -

      19.2.1 Cross-tabulation

      +
      +

      19.1.1 Cross-tabulation

      The gtsummary package also allows us to quickly and easily create tables of counts. This can be useful for quickly summarising the data, and putting it in context with the regression we have carried out.

      #Carry out our regression
      @@ -1907,23 +1907,23 @@ 

      univ_tab), tab_spanner = c("Summary", "Univariate regression"))

      -
      - @@ -2573,8 +2573,8 @@

      -
      -

      19.3 Stratified

      +
      +

      19.2 Stratified

      Here we define stratified regression as the process of carrying out separate regression analyses on different “groups” of data.

      Sometimes in your analysis, you will want to investigate whether or not there are different relationships between an outcome and variables by different strata. This could be, for example, a difference by gender, age group, or source of infection.

      To do this, you will want to split your dataset into the strata of interest. For example, creating two separate datasets of gender == "f" and gender == "m", would be done by:

      @@ -2588,8 +2588,8 @@

      dplyr::select(explanatory_vars, outcome) ## select variables of interest
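A minimal sketch of the split, mirroring the select() shown above:

f_data <- linelist %>% 
  filter(gender == "f") %>% 
  dplyr::select(explanatory_vars, outcome)    ## select variables of interest

m_data <- linelist %>% 
  filter(gender == "m") %>% 
  dplyr::select(explanatory_vars, outcome)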

      Once this has been done, you can carry out your regression in either base R or gtsummary.

      -
      -

      19.3.1 base R

      +
      +

      19.2.1 base R

      To carry this out in base R, you run two different regressions, one for where gender == "f" and gender == "m".

      #Run model for f
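A sketch of the two models, using an illustrative formula (the actual explanatory variables are not shown here):

f_model <- glm(outcome ~ age_cat, family = "binomial", data = f_data)
m_model <- glm(outcome ~ age_cat, family = "binomial", data = m_data)

summary(f_model)   # inspect each model separately
summary(m_model)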
      @@ -2618,8 +2618,8 @@ 

      -
      -

      19.3.2 gtsummary

      +
      +

      19.2.2 gtsummary

      The same approach is repeated using gtsummary; however, it is easier to produce publication-ready tables with gtsummary and to compare the two tables with the function tbl_merge().

      #Run model for f
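A sketch using the stratified models above; tbl_regression() builds each table and tbl_merge() places them side by side:

pacman::p_load(gtsummary)

f_table <- tbl_regression(f_model, exponentiate = TRUE)   # table for gender == "f"
m_table <- tbl_regression(m_model, exponentiate = TRUE)   # table for gender == "m"

f_and_m_table <- tbl_merge(
  tbls        = list(f_table, m_table),
  tab_spanner = c("**Female**", "**Male**"))              # spanning headers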
      @@ -2653,23 +2653,23 @@ 

      #Print f_and_m_table

      -
      - @@ -3177,8 +3177,8 @@

      -
      -

      19.4 Multivariable

      +
      +

      19.3 Multivariable

      For multivariable analysis, we again present two approaches:

      • glm() and tidy().
        @@ -3278,8 +3278,8 @@

        Building the

        Here is what the resulting data frame looks like:

        -
        - +
        +
        @@ -3298,23 +3298,23 @@

        Combine
        mv_tab
        -
      @@ -3883,23 +3883,23 @@ Combine

        tbls = list(univ_tab, mv_tab),                           # combine
        tab_spanner = c("**Univariate**", "**Multivariable**"))  # set header names

        -
        - @@ -4618,8 +4618,8 @@

        Combine with

      -
      -

      19.5 Forest plot

      +
      +

      19.4 Forest plot

      This section shows how to produce a plot with the outputs of your regression. There are two options, you can build a plot yourself using ggplot2 or use a meta-package called easystats (a package that includes many packages).

      See the page on ggplot basics if you are unfamiliar with the ggplot2 plotting package.

      @@ -4690,8 +4690,8 @@

      easy

      -
      -

      19.6 Model performance

      +
      +

      19.5 Model performance

      Once you have built your regression models, you may want to assess how well the model has fit the data. There are many different approaches to do this, and many different metrics with which to assess your model fit, and how it compares with other model formulations. How you assess your model fit will depend on your model, the data, and the context in which you are conducting your work.

      While there are many different functions, and many different packages, to assess model fit, one package that nicely combines several different metrics and approaches into a single source is the performance package. This package allows you to assess model assumptions (such as linearity, homogeneity, highlight outliers, etc.) and check how well the model performs (Akaike Information Criterion values, R2, RMSE, etc) with a few simple functions.

      Unfortunately, we are unable to use this package with gtsummary, but it readily accepts objects generated by other packages such as stats, lmerMod and tidymodels. Here we will demonstrate its application using the function glm() for a multivariable regression. To do this we can use the function performance() to assess model fit, and compare_performance() to compare the two models.
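A brief sketch of those functions (the model objects here are illustrative):

pacman::p_load(performance)

performance::performance(model1)                    # fit metrics: AIC, R2, RMSE, etc.
performance::compare_performance(model1, model2)    # compare metrics across models
performance::check_model(model1)                    # visual checks of model assumptions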

      @@ -4746,8 +4746,8 @@

      For further reading on the performance package, and the model tests you can carry out, see their github.

      -
      -

      19.7 Resources

      +
      +

      19.6 Resources

      The content of this page was informed by these resources and vignettes online:

      Linear regression in R

      gtsummary

      diff --git a/html_outputs/new_pages/rmarkdown.html b/html_outputs/new_pages/rmarkdown.html index 9668dc42..b345a20f 100644 --- a/html_outputs/new_pages/rmarkdown.html +++ b/html_outputs/new_pages/rmarkdown.html @@ -289,7 +289,7 @@ The Epidemiologist R Handbook

  • 40.4 File structure
      @@ -843,9 +844,8 @@

      Installation

      To create an R Markdown output, you need to have the following installed:

      -• The rmarkdown package (knitr will also be installed automatically).
      -• Pandoc, which should come installed with RStudio. If you are not using RStudio, you can download Pandoc here: http://pandoc.org.
      +• The rmarkdown package (knitr will also be installed automatically)
      +• Pandoc, which should come installed with RStudio. If you are not using RStudio, you can download Pandoc here.
      • If you want to generate PDF output (a bit trickier), you will need to install LaTeX. For R Markdown users who have not installed LaTeX before, we recommend that you install TinyTeX. You can use the following commands:
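The commands typically recommended for this are:

install.packages("tinytex")     # install the tinytex R package
tinytex::install_tinytex()      # install the TinyTeX distribution itself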
      @@ -929,7 +929,7 @@

      YAML metadata

      The YAML should begin with metadata for the document. The order of these primary YAML parameters (not indented) does not matter. For example:

      title: "My document"
       author: "Me"
      -date: "2024-10-01"
      +date: "2024-10-18"

      You can use R code in YAML values by writing it as in-line code (preceded by r, within back-ticks) and also within quotes (see above example for date:).

      In the image above, because we clicked that our default output would be an html file, we can see that the YAML says output: html_document. However we can also change this to say powerpoint_presentation or word_document or even pdf_document.

      @@ -945,9 +945,9 @@

      New lines

      Case

      Surround your normal text with these characters to change how it appears in the output.

      -• Underscores (_text_) or single asterisk (*text*) to italicise.
      -• Double asterisks (**text**) for bold text.
      -• Back-ticks (`text`) to display text as code.
      +• Underscores (_text_) or single asterisk (*text*) to italicise
      +• Double asterisks (**text**) for bold text
      +• Back-ticks (`text`) to display text as code

      The actual appearance of the font can be set by using specific templates (specified in the YAML metadata; see example tabs).

      @@ -999,28 +999,21 @@

      Code chunks

      You can create a new chunk by typing it out yourself, by using the keyboard shortcut “Ctrl + Alt + i” (or Cmd + Shift + r in Mac), or by clicking the green ‘insert a new code chunk’ icon at the top of your script editor.

      Some notes about the contents of the curly brackets { }:

      -• They start with ‘r’ to indicate that the language name within the chunk is R.
      -• After the r you can optionally write a chunk “name” – these are not necessary but can help you organise your work. Note that if you name your chunks, you should ALWAYS use unique names or else R will complain when you try to render.
      +• They start with ‘r’ to indicate that the language name within the chunk is R
      +• After the r you can optionally write a chunk “name” – these are not necessary but can help you organise your work. Note that if you name your chunks, you should ALWAYS use unique names or else R will complain when you try to render
      • The curly brackets can include other options too, written as tag=value (an example chunk header follows this list), such as:
      -• eval = FALSE to not run the R code.
      -• echo = FALSE to not print the chunk’s R source code in the output document.
      -• warning = FALSE to not print warnings produced by the R code.
      -• message = FALSE to not print any messages produced by the R code.
      -• include = either TRUE/FALSE whether to include chunk outputs (e.g. plots) in the document.
      -• out.width = and out.height = - provide in style out.width = "75%".
      -• fig.align = "center" adjust how a figure is aligned across the page.
      -• fig.show='hold' if your chunk prints multiple figures and you want them printed next to each other (pair with out.width = c("33%", "67%")). Can also set as fig.show='asis' to show them below the code that generates them, 'hide' to hide, or 'animate' to concatenate multiple into an animation.
      -• A chunk header must be written in one line.
      -• Try to avoid periods, underscores, and spaces. Use hyphens ( - ) instead if you need a separator.
      +• eval = FALSE to not run the R code
      +• echo = FALSE to not print the chunk’s R source code in the output document
      +• warning = FALSE to not print warnings produced by the R code
      +• message = FALSE to not print any messages produced by the R code
      +• include = either TRUE/FALSE whether to include chunk outputs (e.g. plots) in the document
      +• out.width = and out.height = - provide in style out.width = "75%"
      +• fig.align = "center" adjust how a figure is aligned across the page
      +• fig.show='hold' if your chunk prints multiple figures and you want them printed next to each other (pair with out.width = c("33%", "67%")). Can also set as fig.show='asis' to show them below the code that generates them, 'hide' to hide, or 'animate' to concatenate multiple into an animation
      +• A chunk header must be written in one line
      +• Try to avoid periods, underscores, and spaces. Use hyphens ( - ) instead if you need a separator
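For example, a chunk header combining several of these options (chunk name and values illustrative):

```{r import-data, echo=FALSE, warning=FALSE, message=FALSE, out.width="75%", fig.align="center"}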

      Read more extensively about the knitr options here.

      Some of the above options can be configured with point-and-click using the setting buttons at the top right of the chunk. Here, you can specify which parts of the chunk you want the rendered document to include, namely the code, the outputs, and the warnings. This will come out as written preferences within the curly brackets, e.g. echo=FALSE if you specify you want to ‘Show output only’.

      @@ -1129,6 +1122,16 @@

      Tabbed sections

      You can add an additional option .tabset-pills after .tabset to give the tabs themselves a “pilled” appearance. Be aware that when viewing the tabbed HTML output, the Ctrl+f search functionality will only search “active” tabs, not hidden tabs.

      + +
      +

      remedy

      +

      remedy is an addin for RStudio which helps with writing R Markdown scripts. It provides a user interface and a series of keyboard shortcuts to format your text.

      +

      This package is installed directly from GitHub.

      +
      +
      remotes::install_github("ThinkR-open/remedy")
      +
      +

      Once installed, the package does not need to be re-loaded. It will automatically load when you start RStudio.

      +

      For a full list of features and updates, please see https://thinkr-open.github.io/remedy/

      @@ -1151,25 +1154,20 @@

      Self-contain

      Everything you need to run the R markdown is imported or created within the Rmd file, including all the code chunks and package loading. This “self-contained” approach is appropriate when you do not need to do much data processing (e.g. it brings in a clean or semi-clean data file) and the rendering of the R Markdown will not take too long.

      In this scenario, one logical organization of the R Markdown script might be:

      -1. Set global knitr options.
      -2. Load packages.
      -3. Import data.
      -4. Process data.
      -5. Produce outputs (tables, plots, etc.).
      -6. Save outputs, if applicable (.csv, .png, etc.).
      +1. Set global knitr options
      +2. Load packages
      +3. Import data
      +4. Process data
      +5. Produce outputs (tables, plots, etc.)
      +6. Save outputs, if applicable (.csv, .png, etc.)

      Source other files

      One variation of the “self-contained” approach is to have R Markdown code chunks “source” (run) other R scripts. This can make your R Markdown script less cluttered, simpler, and easier to organize. It can also help if you want to display final figures at the beginning of the report. In this approach, the final R Markdown script simply combines pre-processed outputs into a document.

      One way to do this is by providing the R scripts (file path and name with extension) to the base R command source().

      -
      source("your-script.R", local = knitr::knit_global())
      -# or sys.source("your-script.R", envir = knitr::knit_global())
      +
      source("your-script.R", local = knitr::knit_global())
      +# or sys.source("your-script.R", envir = knitr::knit_global())

      Note that when using source() within the R Markdown, the external files will still be run during the course of rendering your Rmd file. Therefore, each script is run every time you render the report. Thus, having these source() commands within the R Markdown does not speed up your run time, nor does it greatly assist with debugging, as errors produced will still be printed when producing the R Markdown.

      An alternative is to utilize the child = knitr option.

      @@ -1221,16 +1219,15 @@

      Runfile

      For instance, you can load the packages, load and clean the data, and even create the graphs of interest prior to render(). These steps can occur in the R script, or in other scripts that are sourced. As long as these commands occur in the same RStudio session and objects are saved to the environment, the objects can then be called within the Rmd content. Then the R markdown itself will only be used for the final step - to produce the output with all the pre-processed objects. This is much easier to debug if something goes wrong.

      This approach is helpful for the following reasons:

      -• More informative error messages - these messages will be generated from the R script, not the R Markdown. R Markdown errors tend to tell you which chunk had a problem, but will not tell you which line.
      -• If applicable, you can run long processing steps in advance of the render() command - they will run only once.
      +• More informative error messages - these messages will be generated from the R script, not the R Markdown. R Markdown errors tend to tell you which chunk had a problem, but will not tell you which line
      +• If applicable, you can run long processing steps in advance of the render() command - they will run only once

      In the example below, we have a separate R script in which we pre-process a data object into the R Environment and then render the “create_output.Rmd” using render().

      -
      data <- import("datafile.csv") %>%       # Load data and save to environment
      -  select(age, hospital, weight)          # Select limited columns
      -
      -rmarkdown::render(input = "create_output.Rmd")   # Create Rmd file
      +
      data <- import("datafile.csv") %>%       # Load data and save to environment
      +  select(age, hospital, weight)          # Select limited columns
      +
      +rmarkdown::render(input = "create_output.Rmd")   # Render the Rmd file
      @@ -1266,16 +1263,13 @@

      Option 1:

      Option 2: render() command

      Another way to produce your R Markdown output is to run the render() function (from the rmarkdown package). You must execute this command outside the R Markdown script - so either in a separate R script (often called a “run file”), or as a stand-alone command in the R Console.

      -
      rmarkdown::render(input = "my_report.Rmd")
      +
      rmarkdown::render(input = "my_report.Rmd")

      As with “knit”, the default settings will save the Rmd output to the same folder as the Rmd script, with the same file name (aside from the file extension). For instance “my_report.Rmd” when knitted will create “my_report.docx” if you are knitting to a word document. However, by using render() you have the option to use different settings. render() can accept arguments including:

      -• output_format = This is the output format to convert to (e.g. "html_document", "pdf_document", "word_document", or "all"). You can also specify this in the YAML inside the R Markdown script.
      -• output_file = This is the name of the output file (and file path). This can be created via R functions like here() or str_glue() as demonstrated below.
      -• output_dir = This is an output directory (folder) to save the file. This allows you to chose an alternative other than the directory the Rmd file is saved to.
      +• output_format = This is the output format to convert to (e.g. "html_document", "pdf_document", "word_document", or "all"). You can also specify this in the YAML inside the R Markdown script
      +• output_file = This is the name of the output file (and file path). This can be created via R functions like here() or str_glue() as demonstrated below
      +• output_dir = This is an output directory (folder) to save the file. This allows you to choose an alternative other than the directory the Rmd file is saved to
      • output_options = You can provide a list of options that will override those in the script YAML (e.g. )
      • output_yaml = You can provide a path to a .yml file that contains YAML specifications
      • @@ -1285,9 +1279,9 @@

        Option

      As one example, to improve version control, the following command will save the output file within an ‘outputs’ sub-folder, with the current date in the file name. To create the file name, the function str_glue() from the stringr package is used to ‘glue’ together static strings (written plainly) with dynamic R code (written in curly brackets). For instance if it is April 10th 2021, the file name from below will be “Report_2021-04-10.docx”. See the page on Characters and strings for more details on str_glue().

      -
      rmarkdown::render(
      -  input = "create_output.Rmd",
      -  output_file = stringr::str_glue("outputs/Report_{Sys.Date()}.docx")) 
      +
      rmarkdown::render(
      +  input = "create_output.Rmd",
      +  output_file = stringr::str_glue("outputs/Report_{Sys.Date()}.docx")) 

      As the file renders, the RStudio Console will show you the rendering progress up to 100%, and a final message to indicate that the rendering is complete.

      @@ -1309,13 +1303,13 @@

      Setting para

      Option 1: Set parameters within YAML

      Edit the YAML to include a params: option, with indented statements for each parameter you want to define. In this example we create parameters date and hospital, for which we specify values. These values are subject to change each time the report is run. If you use the “Knit” button to produce the output, the parameters will have these default values. Likewise, if you use render() the parameters will have these default values unless otherwise specified in the render() command.

      -
      ---
      -title: Surveillance report
      -output: html_document
      -params:
      - date: 2021-04-10
      - hospital: Central Hospital
      ----
      +
      ---
      +title: Surveillance report
      +output: html_document
      +params:
      + date: 2021-04-10
      + hospital: Central Hospital
      +---

      In the background, these parameter values are contained within a read-only list called params. Thus, you can insert the parameter values in R code as you would another R object/value in your environment. Simply type params$ followed by the parameter name. For example params$hospital to represent the hospital name (“Central Hospital” by default).

      Note that parameters can also hold values true or false, and so these can be included in your knitr options for a R chunk. For example, you can set {r, eval=params$run} instead of {r, eval=FALSE}, and now whether the chunk runs or not depends on the value of a parameter run:.

      Note that for parameters that are dates, they will be input as a string. So for params$date to be interpreted in R code it will likely need to be wrapped with as.Date() or a similar function to convert to class Date.
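For example (parameter name as defined above, object name illustrative):

report_date <- as.Date(params$date)   # convert the string parameter to class Date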

      @@ -1326,10 +1320,10 @@

      -
      rmarkdown::render(
      -  input = "surveillance_report.Rmd",  
      -  output_file = stringr::str_glue("outputs/Report_{Sys.Date()}.docx"),
      -  params = list(date = "2021-04-10", hospital  = "Central Hospital"))
      +
      rmarkdown::render(
      +  input = "surveillance_report.Rmd",  
      +  output_file = stringr::str_glue("outputs/Report_{Sys.Date()}.docx"),
      +  params = list(date = "2021-04-10", hospital  = "Central Hospital"))

      @@ -1347,33 +1341,30 @@

      , as demonstrated below.

      -
      rmarkdown::render(
      -  input = "surveillance_report.Rmd",  
      -  output_file = stringr::str_glue("outputs/Report_{Sys.Date()}.docx"),
      -  params = “ask”)
      +
      rmarkdown::render(
      +  input = "surveillance_report.Rmd",  
      +  output_file = stringr::str_glue("outputs/Report_{Sys.Date()}.docx"),
      +  params = "ask")

      However, typing values into this pop-up window is subject to error and spelling mistakes. You may prefer to add restrictions to the values that can be entered through drop-down menus. You can do this by adding in the YAML several specifications for each params: entry.

      -• label: is the title for that particular drop-down menu.
      -• value: is the default (starting) value.
      -• input: set to select for drop-down menu.
      -• choices: provide the eligible values in the drop-down menu.
      +• label: is the title for that particular drop-down menu
      +• value: is the default (starting) value
      +• input: set to select for drop-down menu
      +• choices: provide the eligible values in the drop-down menu

      Below, these specifications are written for the hospital parameter.

      -
      ---
      -title: Surveillance report
      -output: html_document
      -params:
      - date: 2021-04-10
      - hospital: 
      -  label: “Town:”
      -  value: Central Hospital
      -  input: select
      -  choices: [Central Hospital, Military Hospital, Port Hospital, St. Mark's Maternity Hospital (SMMH)]
      ----
      +
      ---
      +title: Surveillance report
      +output: html_document
      +params:
      + date: 2021-04-10
      + hospital: 
      +  label: "Town:"
      +  value: Central Hospital
      +  input: select
      +  choices: [Central Hospital, Military Hospital, Port Hospital, St. Mark's Maternity Hospital (SMMH)]
      +---

      When knitting (either via the ‘knit with parameters’ button or by render()), the pop-up window will have drop-down options to select from.

      @@ -1415,14 +1406,14 @@

      If you are rendering an R Markdown file with render() from a separate script, you can achieve the effect of parameterization without using the params: functionality.

      For instance, in the R script that contains the render() command, you can simply define hospital and date as two R objects (values) before the render() command. In the R Markdown, you would not need to have a params: section in the YAML, and we would refer to the date object rather than params$date and hospital rather than params$hospital.

      -
      # This is a R script that is separate from the R Markdown
      -
      -# define R objects
      -hospital <- "Central Hospital"
      -date <- "2021-04-10"
      -
      -# Render the R markdown
      -rmarkdown::render(input = "create_output.Rmd") 
      +
      # This is a R script that is separate from the R Markdown
      +
      +# define R objects
      +hospital <- "Central Hospital"
      +date <- "2021-04-10"
      +
      +# Render the R markdown
      +rmarkdown::render(input = "create_output.Rmd") 

      Following this approach means you cannot “knit with parameters”, use the GUI, or include knitting options within the parameters. However it allows for simpler code, which may be advantageous.

      @@ -1433,10 +1424,10 @@

      We may want to run a report multiple times, varying the input parameters, to produce a report for each jurisdictions/unit. This can be done using tools for iteration, which are explained in detail in the page on Iteration, loops, and lists. Options include the purrr package, or use of a for loop as explained below.

      Below, we use a simple for loop to generate a surveillance report for all hospitals of interest. This is done with one command (instead of manually changing the hospital parameter one-at-a-time). The command to render the reports must exist in a separate script outside the report Rmd. This script will also contain the objects to “loop through”: today’s date, and a vector of hospital names.

      -
      hospitals <- c("Central Hospital",
      -                "Military Hospital", 
      -                "Port Hospital",
      -                "St. Mark's Maternity Hospital (SMMH)") 
      +
      hospitals <- c("Central Hospital",
      +                "Military Hospital", 
      +                "Port Hospital",
      +                "St. Mark's Maternity Hospital (SMMH)") 

      We then feed these values one-at-a-time into the render() command using a loop, which runs the command once for each value in the hospitals vector. The letter i represents the index position (1 through 4) of the hospital currently being used in that iteration, such that hospitals[1] would be “Central Hospital”. This information is supplied in two places in the render() command:

        @@ -1445,12 +1436,12 @@

        To params = such that the Rmd uses the hospital name internally whenever the params$hospital value is called (e.g. to filter the dataset to the particular hospital only). In this example, four files would be created - one for each hospital.

      -
      for(i in 1:length(hospitals)){
      -  rmarkdown::render(
      -    input = "surveillance_report.Rmd",
      -    output_file = str_glue("output/Report_{hospitals[i]}_{Sys.Date()}.docx"),
      -    params = list(hospital  = hospitals[i]))
      -}       
      +
      for(i in 1:length(hospitals)){
      +  rmarkdown::render(
      +    input = "surveillance_report.Rmd",
      +    output_file = str_glue("output/Report_{hospitals[i]}_{Sys.Date()}.docx"),
      +    params = list(hospital  = hospitals[i]))
      +}       
      @@ -1493,8 +1484,8 @@

      Powerpoint

      Unfortunately, editing powerpoint files is slightly less flexible:

      • A first level header (# Header 1) will automatically become the title of a new slide,
      -• A ## Header 2 text will not come up as a subtitle but text within the slide’s main textbox (unless you find a way to maniuplate the Master view).
      -• Outputted plots and tables will automatically go into new slides. You will need to combine them, for instance the the patchwork function to combine ggplots, so that they show up on the same page. See this blog post about using the patchwork package to put multiple images on one slide.
      +• A ## Header 2 text will not come up as a subtitle but text within the slide’s main textbox (unless you find a way to manipulate the Master view)
      +• Outputted plots and tables will automatically go into new slides. You will need to combine them, for instance with the patchwork package to combine ggplots, so that they show up on the same page. See this blog post about using the patchwork package to put multiple images on one slide

      See the officer package for a tool to work more in-depth with powerpoint presentations.

      @@ -1502,17 +1493,17 @@

      Powerpoint

      Integrating templates into the YAML

      Once a template is prepared, the detail of this can be added in the YAML of the Rmd underneath the ‘output’ line and underneath where the document type is specified (which goes to a separate line itself). Note reference_doc can be used for powerpoint slide templates.

      It is easiest to save the template in the same folder as the Rmd file (as in the example below), or in a subfolder within it.

      -
      ---
      -title: Surveillance report
      -output: 
      - word_document:
      -  reference_docx: "template.docx"
      -params:
      - date: 2021-04-10
      - hospital: Central Hospital
      -template:
      - 
      ----
      +
      ---
      +title: Surveillance report
      +output: 
      + word_document:
      +  reference_docx: "template.docx"
      +params:
      + date: 2021-04-10
      + hospital: Central Hospital
      +template:
      + 
      +---

      Formatting HTML files

      @@ -1523,16 +1514,16 @@

      Formattin
    • Highlight: Configuring this changes the look of highlighted text (e.g. code within chunks that are shown). Supported styles include default, tango, pygments, kate, monochrome, espresso, zenburn, haddock, breezedark, and textmate.

    Here is an example of how to integrate the above options into the YAML.

    -
    ---
    -title: "HTML example"
    -output:
    -  html_document:
    -    toc: true
    -    toc_float: true
    -    theme: cerulean
    -    highlight: kate
    -    
    ----
    +
    ---
    +title: "HTML example"
    +output:
    +  html_document:
    +    toc: true
    +    toc_float: true
    +    theme: cerulean
    +    highlight: kate
    +    
    +---

    Below are two examples of HTML outputs which both have floating tables of contents, but different theme and highlight styles selected:

    @@ -1569,14 +1560,12 @@

    HTML widgets

    HTML widgets for R are a special class of R packages that enable increased interactivity by utilizing JavaScript libraries. You can embed them in HTML R Markdown outputs.

    Some common examples of these widgets include:

    -• Plotly (used in this handbook page and in the Interative plots page).
    -• visNetwork (used in the Transmission Chains page of this handbook).
    -• Leaflet (used in the GIS Basics page of this handbook).
    -• dygraphs (useful for interactively showing time series data).
    -• DT (datatable()) (used to show dynamic tables with filter, sort, etc.).
    +• Plotly (used in this handbook page and in the Interactive plots page)
    +• visNetwork (used in the Transmission Chains page of this handbook)
    +• Leaflet (used in the GIS Basics page of this handbook)
    +• dygraphs (useful for interactively showing time series data)
    +• DT (datatable()) (used to show dynamic tables with filter, sort, etc.)

    The ggplotly() function from plotly is particularly easy to use. See the Interactive plots page.

    @@ -2184,7 +2173,7 @@

    - +
    +
    @@ -861,8 +859,8 @@

    Load popul

    -
    - +
    +
    @@ -872,15 +870,15 @@

    Load death co

    Deaths in Country A

    -
    - +
    +

    Deaths in Country B

    -
    - +
    +
    @@ -911,8 +909,8 @@

    Cl

    The combined population data now look like this (click through to see countries A and B):

    -
    - +
    +

    And now we perform similar operations on the two deaths datasets.

    @@ -928,8 +926,8 @@

    Cl

    The deaths data now look like this, and contain data from both countries:

    -
    - +
    +

    We now join the deaths and population data based on common columns Country, age_cat5, and Sex. This adds the column Deaths.

    @@ -957,8 +955,8 @@

    Cl
    -
    - +
    +

    CAUTION: If you have few deaths per stratum, consider using 10-, or 15-year categories, instead of 5-year categories for age.

    @@ -972,8 +970,8 @@

    Load
    -
    - +
    +
    @@ -990,7 +988,7 @@ Clea

       age_cat5 = str_replace_all(age_cat5, "plus", ""),   # remove "plus"
       age_cat5 = str_replace_all(age_cat5, " ", "")) %>%  # remove " " space
    -  rename(pop = WorldStandardPopulation)               # change col name to "pop"
    +  rename(pop = WorldStandardPopulation)

    CAUTION: If you try to use str_replace_all() to remove a plus symbol, it won’t work because it is a special symbol. “Escape” the specialness by putting two back slashes in front, as in str_replace_all(column, "\\+", "").

    @@ -1005,8 +1003,8 @@

    Create dataset wit

    This complete dataset looks like this:

    -
    - +
    +
    diff --git a/html_outputs/new_pages/stat_tests.html b/html_outputs/new_pages/stat_tests.html index 656d638c..b5efa37b 100644 --- a/html_outputs/new_pages/stat_tests.html +++ b/html_outputs/new_pages/stat_tests.html @@ -317,7 +317,7 @@ The Epidemiologist R Handbook