Merge branch 'master' of https://github.com/moderndive/moderndive_book

moderndive · Aug 28, 2019 · 381ea1b
2 parents 7f3427b + 2adae6a
commit 381ea1b
Showing 88 changed files with 1,171 additions and 832 deletions.
diff --git a/01-getting-started.Rmd b/01-getting-started.Rmd
diff --git a/02-visualization.Rmd b/02-visualization.Rmd
diff --git a/03-wrangling.Rmd b/03-wrangling.Rmd
diff --git a/04-tidy.Rmd b/04-tidy.Rmd
@@ -105,7 +105,7 @@ Let's read in the exact same data, but this time from an Excel file saved on you
 At this point you should see a screen pop-up like in Figure \@ref(fig:read-excel). After clicking on the "Import" \index{RStudio!import data} button on the bottom right of Figure \@ref(fig:read-excel), RStudio will save this spreadsheet's data in a data frame called `dem_score` and display its contents in the spreadsheet viewer. 
 
 ```{r read-excel, echo=FALSE, fig.cap="Importing an Excel file to R."}
-include_graphics("images/read_excel.png")
+include_graphics("images/rstudio_screenshots/read_excel.png")
 ```
 
 Furthermore, note the "Code Preview" block in the bottom right of Figure \@ref(fig:read-excel). You can copy and paste this code to reload your data again later automatically, instead of repeating this manual point-and-click process.
@@ -116,7 +116,7 @@ Furthermore, note the "Code Preview" block in the bottom right of Figure \@ref(f
 
 ## Tidy data {#tidy-data-ex}
 
-Let's now switch gears and learn about the concept of "tidy" data format with a motivating example from the `fivethirtyeight` package. The `fivethirtyeight` package [@R-fivethirtyeight] provides access to the data sets used in many articles published by data journalism website [FiveThirtyEight.com](https://fivethirtyeight.com/). For a complete list of all `r nrow(data(package = "fivethirtyeight")[[3]])` data sets included in the `fivethirtyeight` package, check out the package webpage by going to <https://fivethirtyeight-r.netlify.com/articles/fivethirtyeight.html>.\index{R packages!fivethirtyeight}
+Let's now switch gears and learn about the concept of "tidy" data format with a motivating example from the `fivethirtyeight` package. The `fivethirtyeight` package [@R-fivethirtyeight] provides access to the datasets used in many articles published by data journalism website [FiveThirtyEight.com](https://fivethirtyeight.com/). For a complete list of all `r nrow(data(package = "fivethirtyeight")[[3]])` data sets included in the `fivethirtyeight` package, check out the package webpage by going to <https://fivethirtyeight-r.netlify.com/articles/fivethirtyeight.html>.\index{R packages!fivethirtyeight}
 
 Let's focus our attention on the `drinks` data frame:
 
@@ -147,15 +147,23 @@ Let's now ask ourselves a question: "Using the `drinks_smaller` data frame, how
 ```{r drinks-smaller, fig.cap="Comparing alcohol consumption in 4 countries.", fig.height=3.5, echo=FALSE}
 drinks_smaller_tidy <- drinks_smaller %>% 
   gather(type, servings, -country)
-ggplot(drinks_smaller_tidy, aes(x=country, y=servings, fill=type)) +
+drinks_smaller_tidy_plot <- ggplot(
+    drinks_smaller_tidy, 
+    aes(x = country, y = servings, fill = type)
+    ) +
   geom_col(position = "dodge") +
   labs(x = "country", y = "servings")
+if(knitr::is_html_output()){
+  drinks_smaller_tidy_plot
+} else {
+  drinks_smaller_tidy_plot + scale_fill_grey()
+}
 ```
 
 Let's break down the Grammar of Graphics we introduced in Section \@ref(grammarofgraphics):
 
 1. The categorical variable `country` with four levels (China, Italy, Saudi Arabia, USA) would have to be mapped to the `x`-position of the bars.
-1. The numerical variable `servings` would have to be mapped to the `y`-position of the bars, in other words the height of the bars.
+1. The numerical variable `servings` would have to be mapped to the `y`-position of the bars (the height of the bars).
 1. The categorical variable `type` with three levels (beer, spirit, wine) would have to be mapped to the `fill` color of the bars.
 
 Observe however that `drinks_smaller` has three separate variables `beer`, `spirit`, and `wine`. In order to use the `ggplot()` function to recreate the barplot in Figure \@ref(fig:drinks-smaller) however, we need a *single variable* `type` with three possible values: `beer`, `spirit`, and `wine`.  We could then map this `type` variable to the `fill` aesthetic of our plot.  In other words, to recreate the barplot in Figure \@ref(fig:drinks-smaller), our data frame would have to look like this:
@@ -195,13 +203,13 @@ What does it mean for your data to be "tidy"? While "tidy" has a clear English m
 > 2. Each observation forms a row.
 > 3. Each type of observational unit forms a table.
 
-```{r tidyfig, echo=FALSE, fig.cap="Tidy data graphic from R for Data Science."}
-knitr::include_graphics("images/tidy-1.png")
+```{r tidyfig, echo=FALSE, fig.cap="Tidy data graphic from R for Data Science.", purl=FALSE}
+knitr::include_graphics("images/r4ds/tidy-1.png")
 ```
 
 For example, say you have the following table of stock prices in Table \@ref(tab:non-tidy-stocks):
 
-```{r non-tidy-stocks, echo=FALSE}
+```{r non-tidy-stocks, echo=FALSE, purl=FALSE}
 stocks <- tibble(
   Date = as.Date('2009-01-01') + 0:4,
   `Boeing stock price` = paste("$", c("173.55", "172.61", "173.86", "170.77", "174.29"), sep = ""),
@@ -216,10 +224,10 @@ stocks %>%
     booktabs = TRUE
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
-                latex_options = c("HOLD_position"))
+                latex_options = c("hold_position"))
 ```
 
-Although the data are neatly organized in a rectangular spreadsheet-type format, they do not follow the definition of data in "tidy" format. Wile there are three variables corresponding to three unique pieces of information (date, stock name, and stock price), there are not three columns. In "tidy" data format each variable should be its own column, as shown in Table \@ref(tab:tidy-stocks). Notice that both tables present the same information, but in different formats. 
+Although the data are neatly organized in a rectangular spreadsheet-type format, they do not follow the definition of data in "tidy" format. While there are three variables corresponding to three unique pieces of information (date, stock name, and stock price), there are not three columns. In "tidy" data format each variable should be its own column, as shown in Table \@ref(tab:tidy-stocks). Notice that both tables present the same information, but in different formats. 
 
 ```{r tidy-stocks, echo=FALSE}
 stocks_tidy <- stocks %>% 
@@ -236,7 +244,7 @@ stocks_tidy %>%
     booktabs = TRUE
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
-                latex_options = c("HOLD_position"))
+                latex_options = c("hold_position"))
 ```
 
 Now we have the requisite three columns `Date`, `Stock Name`, and `Stock Price`. On the other hand, consider the data in Table \@ref(tab:tidy-stocks-2).
@@ -255,26 +263,30 @@ stocks %>%
     booktabs = TRUE
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16), 
-                latex_options = c("HOLD_position"))
+                latex_options = c("hold_position"))
 ```
 
 In this case, even though the variable "Boeing Price" occurs just like in our non-"tidy" data in Table \@ref(tab:non-tidy-stocks), the data *is* "tidy" since there are three variables corresponding to three unique pieces of information: Date, Boeing stock price, and the weather that particular day.
 
 ```{block, type='learncheck'}
+\vspace{-0.25in}
 **_Learning check_**
+\vspace{-0.25in}
 ```
 
 **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are common characteristics of "tidy" data frames?
 
 **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What makes "tidy" data frames useful for organizing data?
 
 ```{block, type='learncheck', purl=FALSE}
+\vspace{-0.25in}
+\vspace{-0.25in}
 ```
 
 
 ### Converting to "tidy" data
 
-In this book so far, you've only seen data frames that were already in "tidy" format. Furthermore for the rest of this book, you'll mostly only see data frames that are already in "tidy" format as well. This is not always the case however with all datasets in the world. If your original data frame is in wide i.e. non-"tidy" format and you would like to use the `ggplot2` or `dplyr` packages, you will first have to convert it "tidy" format using the \index{tidyr!gather()} `gather()` function in the `tidyr` \index{R packages!tidyr} package [@R-tidyr]. 
+In this book so far, you've only seen data frames that were already in "tidy" format. Furthermore, for the rest of this book, you'll mostly only see data frames that are already in "tidy" format as well. This is not always the case however with all datasets in the world. If your original data frame is in wide i.e. non-"tidy" format and you would like to use the `ggplot2` or `dplyr` packages, you will first have to convert it "tidy" format using the \index{tidyr!gather()} `gather()` function in the `tidyr` \index{R packages!tidyr} package [@R-tidyr]. 
 
 Going back to our `drinks_smaller` data frame from earlier:
 
@@ -308,17 +320,28 @@ Note that the third argument now specifies which columns we want to tidy `c(beer
 
 With our `drinks_smaller_tidy` "tidy" formatted data frame, we can now produce the barplot you saw in Figure  \@ref(fig:drinks-smaller) using `geom_col()`. Recall from Section \@ref(geombar) on barplots that we use `geom_col()` and not `geom_bar()`, since we would like to map the "pre-counted" `servings` variable to the `y`-aesthetic of the bars.
 
-```{r drinks-smaller-tidy-barplot, fig.cap="Comparing alcohol consumption in 4 countries.", fig.height=3.5}
+```{r eval=FALSE}
 ggplot(drinks_smaller_tidy, 
        aes(x = country, y = servings, fill = type)) +
   geom_col(position = "dodge")
 ```
 
+
+```{r drinks-smaller-tidy-barplot, echo=FALSE, fig.cap="Comparing alcohol consumption in 4 countries.", fig.height=3.5}
+if(knitr::is_html_output()){
+  drinks_smaller_tidy_plot
+} else {
+  drinks_smaller_tidy_plot + scale_fill_grey()
+}
+```
+
 Converting "wide" format data to "tidy" format often confuses new R users. The only way to learn to get comfortable with the `gather()` function is with practice, practice, and more practice. For example, run `?gather` and look at the examples in the bottom of the help file. We'll show another example of using `gather()` to convert a "wide" formatted data frame to "tidy" format in Section \@ref(case-study-tidy). For other examples of converting a dataset into "tidy" format, check out the different functions available for data tidying and a case study using data from the World Health Organization in [R for Data Science](http://r4ds.had.co.nz/tidy-data.html) [@rds2016].
 
 
 ```{block, type='learncheck'}
+\vspace{-0.25in}
 **_Learning check_**
+\vspace{-0.25in}
 ```
 
 **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Take a look the `airline_safety` data frame included in the `fivethirtyeight` data package. Run the following:
@@ -338,6 +361,8 @@ airline_safety_smaller
 This data frame is not in "tidy" format. How would you convert this data frame to be in "tidy" format, in particular so that it has a variable `incident_type_years` indicating the incident type/year and a variable `count` of the counts?
 
 ```{block, type='learncheck', purl=FALSE}
+\vspace{-0.25in}
+\vspace{-0.25in}
 ```
 
 
@@ -420,7 +445,9 @@ ggplot(guat_dem_tidy, aes(x = year, y = democracy_score)) +
 
 
 ```{block lc-tidying, type='learncheck', purl=FALSE}
+\vspace{-0.25in}
 **_Learning check_**
+\vspace{-0.25in}
 ```
 
 **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**  Convert the `dem_score` data frame into
@@ -429,6 +456,8 @@ a tidy data frame and assign the name of `dem_score_tidy` to the resulting long-
 **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**  Read in the life expectancy data stored at <https://moderndive.com/data/le_mess.csv> and convert it to a tidy data frame. 
 
 ```{block, type='learncheck', purl=FALSE}
+\vspace{-0.25in}
+\vspace{-0.25in}
 ```
 
 
@@ -464,7 +493,7 @@ library(stringr)
 library(forcats)
 ```
 
-You've seen the first 4 of the these packages: `ggplot2` for data visualization, `dplyr` for data wrangling, `tidyr` for converting data to "tidy" format, and `readr` for importing spreadsheet data into R. The remaining packages (`purrr`, `tibble`, `stringr`, and `forcats`) are left for a more advanced book; check out [R for Data Science](http://r4ds.had.co.nz/) to learn about these packages.
+You've seen the first 4 of these packages: `ggplot2` for data visualization, `dplyr` for data wrangling, `tidyr` for converting data to "tidy" format, and `readr` for importing spreadsheet data into R. The remaining packages (`purrr`, `tibble`, `stringr`, and `forcats`) are left for a more advanced book; check out [R for Data Science](http://r4ds.had.co.nz/) to learn about these packages.
 
 For the remainder of this book, we'll start every chapter by running `library(tidyverse)`, instead of loading the various component packages individually. The `tidyverse` "umbrella" package gets its name from the fact that all the functions in all its packages are designed to have common inputs and outputs: data frames are in "tidy" format. This standardization of input and output data frames makes transitions between different functions in the different packages as seamless as possible. For more information, check out the [tidyverse.org](https://www.tidyverse.org/) webpage for the package.
 
@@ -482,14 +511,16 @@ generate_r_file_link("04-tidy.R")
 
 If you want to learn more about using the `readr` \index{R packages!readr!cheatsheet} and `tidyr` \index{R packages!tidyr!cheatsheet} package, we suggest you that you check out RStudio's "Data Import Cheat Sheet."
 
-You can access these cheatsheets by going to the RStudio Menu Bar -> Help -> Cheatsheets -> "Browse Cheatsheets" -> Scroll down the page to the "Data Import Cheat Sheet". The first page of this cheatsheet has information on using the `readr` package to import data while the second page has information on using the `tidyr` package to "tidy" data. You can see previews of both cheatsheets in Figures \@ref(fig:import-cheatsheet) and \@ref(fig:tidyr-cheatsheet).
+You can access these cheatsheets by going to the RStudio Menu Bar -> Help -> Cheatsheets -> "Browse Cheatsheets" -> Scroll down the page to the "Data Import Cheat Sheet". The first page of this cheatsheet has information on using the `readr` package to import data while the second page has information on using the `tidyr` package to "tidy" data. `r if(knitr::is_html_output()) "You can see a preview of both cheatsheets in the figures below."`
 
 ```{r import-cheatsheet, echo=FALSE, fig.cap="Data Import cheatsheet (first page): readr package.", out.width="66%"}
-include_graphics("images/data-import-1.png")
+if(knitr::is_html_output())
+  include_graphics("images/cheatsheets/data-import-1.png")
 ```
 
 ```{r tidyr-cheatsheet, echo=FALSE, fig.cap="Data Import cheatsheet (second page): tidyr package.", out.width="66%"}
-include_graphics("images/data-import-2.png")
+if(knitr::is_html_output())
+  include_graphics("images/cheatsheets/data-import-2.png")
 ```
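
For readers of this diff who want to try the wide-to-"tidy" conversion that the chapter text in `04-tidy.Rmd` describes, here is a minimal sketch. It assumes the `tibble` and `tidyr` packages are installed. The data frame `drinks_toy` and all of its values are hypothetical placeholders, not the contents of the `drinks` data in the `fivethirtyeight` package. Only the `gather()` call mirrors the one shown in the chapter.

```r
library(tibble)
library(tidyr)

# Hypothetical "wide" data frame: one column per drink type.
# The countries and serving counts are illustrative placeholders only.
drinks_toy <- tibble(
  country = c("A", "B"),
  beer    = c(10, 20),
  spirit  = c(30, 40),
  wine    = c(50, 60)
)

# Gather every column except `country` into key/value pairs:
# `type` receives the former column names (beer, spirit, wine),
# `servings` receives the former cell values.
drinks_toy_tidy <- gather(drinks_toy, key = type, value = servings, -country)

drinks_toy_tidy
```

The result has one row per country and drink type pair, which is the shape needed to map `type` to the `fill` aesthetic in `ggplot()`. Recent versions of `tidyr` mark `gather()` as superseded by `pivot_longer()`, but `gather()` remains available and matches the code in this commit.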