ismayc committed Aug 28, 2019
2 parents 7f3427b + 2adae6a commit 381ea1b
Showing 88 changed files with 1,171 additions and 832 deletions.
55 changes: 40 additions & 15 deletions 01-getting-started.Rmd

Large diffs are not rendered by default.

187 changes: 145 additions & 42 deletions 02-visualization.Rmd

Large diffs are not rendered by default.

71 changes: 53 additions & 18 deletions 03-wrangling.Rmd

Large diffs are not rendered by default.

65 changes: 48 additions & 17 deletions 04-tidy.Rmd
@@ -105,7 +105,7 @@ Let's read in the exact same data, but this time from an Excel file saved on you
At this point you should see a screen pop-up like in Figure \@ref(fig:read-excel). After clicking on the "Import" \index{RStudio!import data} button on the bottom right of Figure \@ref(fig:read-excel), RStudio will save this spreadsheet's data in a data frame called `dem_score` and display its contents in the spreadsheet viewer.

```{r read-excel, echo=FALSE, fig.cap="Importing an Excel file to R."}
include_graphics("images/read_excel.png")
include_graphics("images/rstudio_screenshots/read_excel.png")
```

Furthermore, note the "Code Preview" block in the bottom right of Figure \@ref(fig:read-excel). You can copy and paste this code to reload your data again later automatically, instead of repeating this manual point-and-click process.
@@ -116,7 +116,7 @@ Furthermore, note the "Code Preview" block in the bottom right of Figure \@ref(f

## Tidy data {#tidy-data-ex}

Let's now switch gears and learn about the concept of "tidy" data format with a motivating example from the `fivethirtyeight` package. The `fivethirtyeight` package [@R-fivethirtyeight] provides access to the data sets used in many articles published by data journalism website [FiveThirtyEight.com](https://fivethirtyeight.com/). For a complete list of all `r nrow(data(package = "fivethirtyeight")[[3]])` data sets included in the `fivethirtyeight` package, check out the package webpage by going to <https://fivethirtyeight-r.netlify.com/articles/fivethirtyeight.html>.\index{R packages!fivethirtyeight}
Let's now switch gears and learn about the concept of "tidy" data format with a motivating example from the `fivethirtyeight` package. The `fivethirtyeight` package [@R-fivethirtyeight] provides access to the datasets used in many articles published by data journalism website [FiveThirtyEight.com](https://fivethirtyeight.com/). For a complete list of all `r nrow(data(package = "fivethirtyeight")[[3]])` data sets included in the `fivethirtyeight` package, check out the package webpage by going to <https://fivethirtyeight-r.netlify.com/articles/fivethirtyeight.html>.\index{R packages!fivethirtyeight}

Let's focus our attention on the `drinks` data frame:

@@ -147,15 +147,23 @@ Let's now ask ourselves a question: "Using the `drinks_smaller` data frame, how
```{r drinks-smaller, fig.cap="Comparing alcohol consumption in 4 countries.", fig.height=3.5, echo=FALSE}
drinks_smaller_tidy <- drinks_smaller %>%
gather(type, servings, -country)
ggplot(drinks_smaller_tidy, aes(x=country, y=servings, fill=type)) +
drinks_smaller_tidy_plot <- ggplot(
drinks_smaller_tidy,
aes(x = country, y = servings, fill = type)
) +
geom_col(position = "dodge") +
labs(x = "country", y = "servings")
if(knitr::is_html_output()){
drinks_smaller_tidy_plot
} else {
drinks_smaller_tidy_plot + scale_fill_grey()
}
```

Let's break down the Grammar of Graphics we introduced in Section \@ref(grammarofgraphics):

1. The categorical variable `country` with four levels (China, Italy, Saudi Arabia, USA) would have to be mapped to the `x`-position of the bars.
1. The numerical variable `servings` would have to be mapped to the `y`-position of the bars, in other words the height of the bars.
1. The numerical variable `servings` would have to be mapped to the `y`-position of the bars (the height of the bars).
1. The categorical variable `type` with three levels (beer, spirit, wine) would have to be mapped to the `fill` color of the bars.

Observe, however, that `drinks_smaller` has three separate variables `beer`, `spirit`, and `wine`. In order to use the `ggplot()` function to recreate the barplot in Figure \@ref(fig:drinks-smaller), we need a *single variable* `type` with three possible values: `beer`, `spirit`, and `wine`. We could then map this `type` variable to the `fill` aesthetic of our plot. In other words, to recreate the barplot in Figure \@ref(fig:drinks-smaller), our data frame would have to look like this:
@@ -195,13 +203,13 @@ What does it mean for your data to be "tidy"? While "tidy" has a clear English m
> 2. Each observation forms a row.
> 3. Each type of observational unit forms a table.
```{r tidyfig, echo=FALSE, fig.cap="Tidy data graphic from R for Data Science."}
knitr::include_graphics("images/tidy-1.png")
```{r tidyfig, echo=FALSE, fig.cap="Tidy data graphic from R for Data Science.", purl=FALSE}
knitr::include_graphics("images/r4ds/tidy-1.png")
```

For example, say you have the following table of stock prices in Table \@ref(tab:non-tidy-stocks):

```{r non-tidy-stocks, echo=FALSE}
```{r non-tidy-stocks, echo=FALSE, purl=FALSE}
stocks <- tibble(
Date = as.Date('2009-01-01') + 0:4,
`Boeing stock price` = paste("$", c("173.55", "172.61", "173.86", "170.77", "174.29"), sep = ""),
@@ -216,10 +224,10 @@ stocks %>%
booktabs = TRUE
) %>%
kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
latex_options = c("HOLD_position"))
latex_options = c("hold_position"))
```

Although the data are neatly organized in a rectangular spreadsheet-type format, they do not follow the definition of data in "tidy" format. Wile there are three variables corresponding to three unique pieces of information (date, stock name, and stock price), there are not three columns. In "tidy" data format each variable should be its own column, as shown in Table \@ref(tab:tidy-stocks). Notice that both tables present the same information, but in different formats.
Although the data are neatly organized in a rectangular spreadsheet-type format, they do not follow the definition of data in "tidy" format. While there are three variables corresponding to three unique pieces of information (date, stock name, and stock price), there are not three columns. In "tidy" data format each variable should be its own column, as shown in Table \@ref(tab:tidy-stocks). Notice that both tables present the same information, but in different formats.

```{r tidy-stocks, echo=FALSE}
stocks_tidy <- stocks %>%
@@ -236,7 +244,7 @@ stocks_tidy %>%
booktabs = TRUE
) %>%
kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
latex_options = c("HOLD_position"))
latex_options = c("hold_position"))
```
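
The reshaping from one column per stock to one row per date and stock can be sketched as follows. This is a minimal illustration and not the chapter's own chunk: the `stocks_wide` data frame, the `Stock A`/`Stock B` column names, and all prices are made up for the example.

```r
library(tibble)
library(tidyr)

# Hypothetical wide table: one column per stock (illustrative values only)
stocks_wide <- tibble(
  Date = as.Date("2009-01-01") + 0:1,
  `Stock A` = c(10, 11),
  `Stock B` = c(20, 21)
)

# Stack every column except Date into name/value pairs
stocks_long <- gather(
  stocks_wide,
  key = "Stock Name",
  value = "Stock Price",
  -Date
)

stocks_long
```

Each of the three variables (date, stock name, stock price) now has its own column, and each row holds a single price observation.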

Now we have the requisite three columns `Date`, `Stock Name`, and `Stock Price`. On the other hand, consider the data in Table \@ref(tab:tidy-stocks-2).
@@ -255,26 +263,30 @@ stocks %>%
booktabs = TRUE
) %>%
kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
latex_options = c("HOLD_position"))
latex_options = c("hold_position"))
```

In this case, even though the variable "Boeing Price" occurs just like in our non-"tidy" data in Table \@ref(tab:non-tidy-stocks), the data *is* "tidy" since there are three variables corresponding to three unique pieces of information: Date, Boeing stock price, and the weather that particular day.

```{block, type='learncheck'}
\vspace{-0.25in}
**_Learning check_**
\vspace{-0.25in}
```

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are common characteristics of "tidy" data frames?

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What makes "tidy" data frames useful for organizing data?

```{block, type='learncheck', purl=FALSE}
\vspace{-0.25in}
\vspace{-0.25in}
```


### Converting to "tidy" data

In this book so far, you've only seen data frames that were already in "tidy" format. Furthermore for the rest of this book, you'll mostly only see data frames that are already in "tidy" format as well. This is not always the case however with all datasets in the world. If your original data frame is in wide i.e. non-"tidy" format and you would like to use the `ggplot2` or `dplyr` packages, you will first have to convert it "tidy" format using the \index{tidyr!gather()} `gather()` function in the `tidyr` \index{R packages!tidyr} package [@R-tidyr].
In this book so far, you've only seen data frames that were already in "tidy" format. Furthermore, for the rest of this book, you'll mostly only see data frames that are already in "tidy" format as well. This is not always the case however with all datasets in the world. If your original data frame is in wide (i.e., non-"tidy") format and you would like to use the `ggplot2` or `dplyr` packages, you will first have to convert it to "tidy" format using the \index{tidyr!gather()} `gather()` function in the `tidyr` \index{R packages!tidyr} package [@R-tidyr].

Going back to our `drinks_smaller` data frame from earlier:

@@ -308,17 +320,28 @@ Note that the third argument now specifies which columns we want to tidy `c(beer

With our `drinks_smaller_tidy` "tidy" formatted data frame, we can now produce the barplot you saw in Figure \@ref(fig:drinks-smaller) using `geom_col()`. Recall from Section \@ref(geombar) on barplots that we use `geom_col()` and not `geom_bar()`, since we would like to map the "pre-counted" `servings` variable to the `y`-aesthetic of the bars.

```{r drinks-smaller-tidy-barplot, fig.cap="Comparing alcohol consumption in 4 countries.", fig.height=3.5}
```{r eval=FALSE}
ggplot(drinks_smaller_tidy,
aes(x = country, y = servings, fill = type)) +
geom_col(position = "dodge")
```


```{r drinks-smaller-tidy-barplot, echo=FALSE, fig.cap="Comparing alcohol consumption in 4 countries.", fig.height=3.5}
if(knitr::is_html_output()){
drinks_smaller_tidy_plot
} else {
drinks_smaller_tidy_plot + scale_fill_grey()
}
```

Converting "wide" format data to "tidy" format often confuses new R users. The only way to learn to get comfortable with the `gather()` function is with practice, practice, and more practice. For example, run `?gather` and look at the examples in the bottom of the help file. We'll show another example of using `gather()` to convert a "wide" formatted data frame to "tidy" format in Section \@ref(case-study-tidy). For other examples of converting a dataset into "tidy" format, check out the different functions available for data tidying and a case study using data from the World Health Organization in [R for Data Science](http://r4ds.had.co.nz/tidy-data.html) [@rds2016].
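
For extra practice, here is a small sketch on a made-up data frame. The name `toy_wide` and its values are purely illustrative. It also shows `pivot_longer()`, which is available in `tidyr` 1.0.0 and later as an alternative interface to the same reshaping; treat that part as an aside that assumes a recent `tidyr` installation.

```r
library(tibble)
library(tidyr)

# Hypothetical wide data frame (illustrative values only)
toy_wide <- tibble(
  country = c("A", "B"),
  beer = c(10, 20),
  wine = c(30, 40)
)

# gather(): stack all columns except country into type/servings pairs
toy_gather <- gather(toy_wide, key = type, value = servings, -country)

# pivot_longer(): same reshaping, assuming tidyr >= 1.0.0 is installed
toy_pivot <- pivot_longer(
  toy_wide,
  cols = -country,
  names_to = "type",
  values_to = "servings"
)

toy_gather
toy_pivot
```

Both results contain the same four country/type/servings combinations. Only the row order differs: `gather()` stacks the data column by column, while `pivot_longer()` works through the original rows one at a time.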


```{block, type='learncheck'}
\vspace{-0.25in}
**_Learning check_**
\vspace{-0.25in}
```

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Take a look at the `airline_safety` data frame included in the `fivethirtyeight` data package. Run the following:
@@ -338,6 +361,8 @@ airline_safety_smaller
This data frame is not in "tidy" format. How would you convert this data frame to be in "tidy" format, in particular so that it has a variable `incident_type_years` indicating the incident type/year and a variable `count` of the counts?

```{block, type='learncheck', purl=FALSE}
\vspace{-0.25in}
\vspace{-0.25in}
```


@@ -420,7 +445,9 @@ ggplot(guat_dem_tidy, aes(x = year, y = democracy_score)) +


```{block lc-tidying, type='learncheck', purl=FALSE}
\vspace{-0.25in}
**_Learning check_**
\vspace{-0.25in}
```

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Convert the `dem_score` data frame into
@@ -429,6 +456,8 @@ a tidy data frame and assign the name of `dem_score_tidy` to the resulting long-
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Read in the life expectancy data stored at <https://moderndive.com/data/le_mess.csv> and convert it to a tidy data frame.

```{block, type='learncheck', purl=FALSE}
\vspace{-0.25in}
\vspace{-0.25in}
```


@@ -464,7 +493,7 @@ library(stringr)
library(forcats)
```

You've seen the first 4 of the these packages: `ggplot2` for data visualization, `dplyr` for data wrangling, `tidyr` for converting data to "tidy" format, and `readr` for importing spreadsheet data into R. The remaining packages (`purrr`, `tibble`, `stringr`, and `forcats`) are left for a more advanced book; check out [R for Data Science](http://r4ds.had.co.nz/) to learn about these packages.
You've seen the first 4 of these packages: `ggplot2` for data visualization, `dplyr` for data wrangling, `tidyr` for converting data to "tidy" format, and `readr` for importing spreadsheet data into R. The remaining packages (`purrr`, `tibble`, `stringr`, and `forcats`) are left for a more advanced book; check out [R for Data Science](http://r4ds.had.co.nz/) to learn about these packages.

For the remainder of this book, we'll start every chapter by running `library(tidyverse)`, instead of loading the various component packages individually. The `tidyverse` "umbrella" package gets its name from the fact that all the functions in all its packages are designed to have common inputs and outputs: data frames are in "tidy" format. This standardization of input and output data frames makes transitions between different functions in the different packages as seamless as possible. For more information, check out the [tidyverse.org](https://www.tidyverse.org/) webpage for the package.
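
As a rough illustration of that seamlessness, the sketch below passes one made-up data frame through `tibble`, `tidyr`, `dplyr`, and `ggplot2` functions after a single `library(tidyverse)` call. The `toy_wide` data and its values are hypothetical and used only for this example.

```r
library(tidyverse)

# Hypothetical wide data frame (illustrative values only)
toy_wide <- tibble(
  country = c("A", "B"),
  beer = c(10, 20),
  wine = c(30, 40)
)

toy_summary <- toy_wide %>%
  gather(key = type, value = servings, -country) %>%  # tidyr: wide to tidy
  group_by(type) %>%                                  # dplyr: group rows by drink type
  summarize(mean_servings = mean(servings))           # dplyr: one row per type

# ggplot2: the summarized data frame goes straight into a plot
ggplot(toy_summary, aes(x = type, y = mean_servings)) +
  geom_col()
```

Every step takes a data frame as input and returns a data frame, which is why the output of one package's function can be piped directly into the next.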

@@ -482,14 +511,16 @@ generate_r_file_link("04-tidy.R")

If you want to learn more about using the `readr` \index{R packages!readr!cheatsheet} and `tidyr` \index{R packages!tidyr!cheatsheet} packages, we suggest that you check out RStudio's "Data Import Cheat Sheet."

You can access these cheatsheets by going to the RStudio Menu Bar -> Help -> Cheatsheets -> "Browse Cheatsheets" -> Scroll down the page to the "Data Import Cheat Sheet". The first page of this cheatsheet has information on using the `readr` package to import data while the second page has information on using the `tidyr` package to "tidy" data. You can see previews of both cheatsheets in Figures \@ref(fig:import-cheatsheet) and \@ref(fig:tidyr-cheatsheet).
You can access these cheatsheets by going to the RStudio Menu Bar -> Help -> Cheatsheets -> "Browse Cheatsheets" -> Scroll down the page to the "Data Import Cheat Sheet". The first page of this cheatsheet has information on using the `readr` package to import data while the second page has information on using the `tidyr` package to "tidy" data. `r if(knitr::is_html_output()) "You can see a preview of both cheatsheets in the figures below."`

```{r import-cheatsheet, echo=FALSE, fig.cap="Data Import cheatsheet (first page): readr package.", out.width="66%"}
include_graphics("images/data-import-1.png")
if(knitr::is_html_output())
include_graphics("images/cheatsheets/data-import-1.png")
```

```{r tidyr-cheatsheet, echo=FALSE, fig.cap="Data Import cheatsheet (second page): tidyr package.", out.width="66%"}
include_graphics("images/data-import-2.png")
if(knitr::is_html_output())
include_graphics("images/cheatsheets/data-import-2.png")
```

