---
output: html_document
---
# Databases to Documents with RMarkdown
Instructor: Ian Carroll
```{r include = FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
```
## Review
The workflow we want to create begins with data, moves through processing, analysis and visualization, and ends with content for a final report or perhaps the complete report itself. We use `git` and GitHub, which facilitate collaboration and maintain project integrity, to manage the creation of this workflow.
But how do we tie it all together to make the workflow ... work?
## Objective
1. Use RMarkdown and `knitr` to contain the whole workflow in one document.
1. Connect the workflow to a database rather than CSV files.
1. Recognize how modularization helps make workflows successful.
## RMarkdown and `knitr`
RMarkdown, like R itself, is both a language and an interpreter. The language component is a set of special characters and structural rules to incorporate in a plain text document that serve as formatting instructions. The interpreter reads the instructions along with any other text to generate a formatted document.
The `knitr` package will execute everything you've marked as R code within an RMarkdown document, and optionally include the results in the formatted document.
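As a minimal sketch of what that means, the chunk below hands a tiny, made-up document to `knitr::knit()` as text and prints the Markdown that comes back. The document content and the `fence` helper variable are illustrative assumptions, not part of this lesson's files.

```{r eval = FALSE}
library(knitr)

fence <- strrep("`", 3)  # three backticks, built here to avoid nesting fences

# A hypothetical document: one sentence and one R chunk.
rmd_lines <- c(
  "The sum of the values is computed below.",
  "",
  paste0(fence, "{r}"),
  "sum(c(4, 5, 6))",
  fence
)

# knit() runs the chunk and returns Markdown with the code and its result.
md <- knit(text = rmd_lines, quiet = TRUE)
cat(md)
```

The returned text should contain both the original code and a line holding its result, which is the "optionally include results" part at work.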
## Seeing is believing
```{r eval = TRUE}
vals <- c(4, 5, 6)
data <- data.frame(counts = vals)
data
```
Take a look at `lesson-4.Rmd` -- no output is written in the source file; `knitr` generated it when the document was rendered.
## RMarkdown Formatting
The document begins with a header section between `---` lines that assigns values to configuration variables:

    ---
    output: html_document
    incremental: true
    ---
***
One or more `#` symbols at the start of a line mark a heading; the more symbols, the smaller the heading:

    # The Biggest Heading
    ## The Next Biggest Heading
    ### Etc ...

Verbatim blocks, like these two syntax examples, are produced by indenting text 4 spaces.
***
A code chunk is fenced with ` ```{r} ` (three backtick characters followed by `{r}`) above the code and ` ``` ` below it.
> \`\`\`{r eval = TRUE, echo = FALSE}
> vals <- c(4, 5, 6)
> \`\`\`
Code chunk options, such as `eval = TRUE` and `echo = FALSE` above, are specified in a comma-separated list inside the braces.
## Code chunks
A value defined in one code chunk, for example
```{r eval = FALSE}
...
```
is accessible in another:
```{r eval = FALSE}
...
```
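One way to check this behaviour outside the editor is sketched below. It knits a hypothetical two-chunk document passed as text and prints the result; the variable `vals` and the surrounding prose are invented for the illustration.

```{r eval = FALSE}
library(knitr)

fence <- strrep("`", 3)  # three backticks, built here to avoid nesting fences

doc <- c(
  paste0(fence, "{r}"),
  "vals <- c(4, 5, 6)",   # the first chunk defines `vals`
  fence,
  "",
  "Some prose between the chunks.",
  "",
  paste0(fence, "{r}"),
  "mean(vals)",           # the second chunk reuses it
  fence
)

# All chunks of one document are evaluated in the same environment.
out <- knit(text = doc, quiet = TRUE, envir = new.env())
cat(out)
```

Because both chunks run in one environment, the second chunk can see `vals` without redefining it.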
## Exercise
Create a code chunk that reads the "surveys.csv" table into a variable named "surveys", but is essentially invisible in the formatted document.
...
## Create your formatted document
Choose the "Knit HTML" option at the top of the editor.
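The same step can probably be scripted with `rmarkdown::render()`. The sketch below writes a small, hypothetical report to a temporary file and renders it, guarded by `pandoc_available()` because the HTML conversion needs pandoc (RStudio normally bundles it).

```{r eval = FALSE}
library(rmarkdown)

fence <- strrep("`", 3)  # three backticks, built here to avoid nesting fences

# Write a small, hypothetical report to a temporary file.
rmd <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "output: html_document",
  "---",
  "",
  "# Demo",
  "",
  paste0(fence, "{r}"),
  "summary(c(4, 5, 6))",
  fence
), rmd)

# render() knits the chunks, then converts the result to HTML with pandoc.
if (pandoc_available()) {
  html <- render(rmd, quiet = TRUE)
  print(file.exists(html))
}
```

For this lesson, the equivalent call would presumably be `rmarkdown::render("lesson-4.Rmd")` run from the directory that holds the file.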
## Databases
The `dplyr` package includes `src_*` commands for connecting to three types of databases: SQLite, MySQL and PostgreSQL.
### Features of a database
- There is a standardized language, SQL, for scripting interactions with data.
- MySQL and PostgreSQL are server based and multi-user.
- A database can hold more data than fits in your computer's memory.
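To make the first point concrete, here is a minimal sketch that sends a SQL statement to an in-memory SQLite database through the `DBI` and `RSQLite` packages. The toy `surveys` rows are invented for the illustration and are not the Portal data.

```{r eval = FALSE}
library(DBI)

# An in-memory SQLite database, used only for this illustration.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table; the real data is not reproduced here.
dbWriteTable(con, "surveys", data.frame(
  species_id = c("DM", "DM", "PF"),
  year       = c(1990, 1991, 1990),
  weight     = c(40, 42, 7)
))

# Count the 1990 records per species with plain SQL.
res <- dbGetQuery(con, "SELECT species_id, COUNT(*) AS n
                        FROM surveys
                        WHERE year = 1990
                        GROUP BY species_id
                        ORDER BY species_id")
res

dbDisconnect(con)
```

A query this simple should run unchanged against MySQL or PostgreSQL; only the driver passed to `dbConnect()` would differ.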
## Connecting to a database
```{r eval = TRUE}
library(dplyr)
db <- src_sqlite("data/portal.sqlite")
```
## Accessing tables
```{r eval = FALSE}
surveys <- tbl(db, 'surveys')
...
```
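A table obtained with `tbl()` is a reference to data that stays in the database until you ask for it. The sketch below shows this on a hypothetical in-memory table, connecting through `DBI::dbConnect()` because, as far as we know, recent `dplyr` releases deprecate the `src_*` helpers in favour of that route. It assumes the `dbplyr` and `RSQLite` packages are installed.

```{r eval = FALSE}
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table standing in for the real surveys table.
DBI::dbWriteTable(con, "surveys", data.frame(
  species_id = c("DM", "DM", "PF"),
  year       = c(1990, 1991, 1990),
  weight     = c(40, 42, 7)
))

surveys_tbl <- tbl(con, "surveys")           # a reference; no rows fetched yet
recent <- filter(surveys_tbl, year == 1990)  # still only a query description

show_query(recent)            # the SQL dplyr would send to the database
recent_df <- collect(recent)  # now the matching rows are pulled into R
recent_df

DBI::dbDisconnect(con)
```

Deferring `collect()` until the end lets the database do the filtering, which matters when the table is larger than memory.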
## Tidy databases
Recall the principles of tidy data:
- each variable forms a column
- each observation forms a row
- each type of observational unit forms a table
The last principle encapsulates a core approach to database design: try to never duplicate data about the same observation across rows in the "wrong" table.
***
A join combines information from two tables by matching the values of the variables they share. We use this feature to add information (from the *species* table) pertaining to each species listed in the *counts_1990_winter* summary.
```{r eval = FALSE}
counts_1990_winter <- filter(surveys, year == 1990) %>%
...
...
...
inner_join(counts_1990_winter, species)
```
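The matching behaviour of `inner_join()` can be seen on two small, made-up data frames. The sketch below is not the elided pipeline above; `counts`, `species`, and their values are stand-ins chosen for illustration.

```{r eval = FALSE}
library(dplyr)

# Hypothetical stand-ins for the summary and the species table.
counts <- data.frame(
  species_id = c("DM", "PF", "XX"),
  count      = c(12, 3, 1)
)
species <- data.frame(
  species_id = c("DM", "PF"),
  genus      = c("Dipodomys", "Perognathus"),
  taxa       = c("Rodent", "Rodent")
)

# Rows are matched on the shared column; "XX" has no match and is dropped.
joined <- inner_join(counts, species, by = "species_id")
joined
```

Naming the key with `by =` is optional when the shared columns have the same name, but it makes the match explicit.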
## When **un-**tidy data is okay
Untidy data is often needed for analysis, but it's not a good way to store data. Instead, join the tables only when needed.
```{r eval = FALSE}
library(ggplot2)
surveys_1990_winter <- filter(surveys, year == 1990) %>%
select(-year) %>%
...
...
...
ggplot(data = ...,
aes(x = genus)) +
geom_boxplot(...)
```
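The join-then-plot pattern can be sketched end to end on toy data. Everything below (`surveys_toy`, `species_toy`, and the weights) is invented for illustration and does not fill in the elided steps above.

```{r eval = FALSE}
library(dplyr)
library(ggplot2)

# Hypothetical individual records and a species lookup table.
surveys_toy <- data.frame(
  species_id = c("DM", "DM", "DM", "PF", "PF", "PF"),
  weight     = c(40, 44, 41, 7, 8, 6)
)
species_toy <- data.frame(
  species_id = c("DM", "PF"),
  genus      = c("Dipodomys", "Perognathus")
)

# Join only at analysis time, then hand the untidy result to ggplot.
p <- inner_join(surveys_toy, species_toy, by = "species_id") %>%
  ggplot(aes(x = genus, y = weight)) +
  geom_boxplot()
p
```

The stored tables stay tidy; the wide, joined data frame exists only for as long as the plot needs it.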
## Exercise
Create a bar plot for the abundance of species in the rodent taxa.
...
## Exercise
The third table, plots, gives details on the plot type. Create a code chunk that shows box plots for the weight of rodents in 1990 across the types of plot.
...