---
output: html_document
---

# Databases to Documents with RMarkdown

Instructor: Ian Carroll

```{r include = FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
```

## Review

The workflow we want to create begins with data, moves through processing, analysis and visualization, and ends with content for a final report or perhaps the complete report itself. We use `git` and GitHub, which facilitate collaboration and maintain project integrity, to manage the creation of this workflow.

But how do we tie it all together to make the workflow ... work?

## Objectives

1. Use RMarkdown and `knitr` to contain the whole workflow in one document.
1. Connect the workflow to a database rather than CSV files.
1. Recognize how modularization helps make workflows successful.
   
## RMarkdown and `knitr`

RMarkdown, like R itself, is both a language and an interpreter. The language component is a set of special characters and structural rules to incorporate in a plain text document that serve as formatting instructions. The interpreter reads the instructions along with any other text to generate a formatted document.

The `knitr` package will execute everything you've marked as R code within an RMarkdown document, and optionally include the results in the formatted document.

## Seeing is believing

```{r eval = TRUE}
vals <- c(4, 5, 6)
data <- data.frame(counts = vals)
data
```

Take a look at `lesson-4.Rmd` -- there's no output written here.

## RMarkdown Formatting

The document begins with a header section between `---` lines that assigns values to configuration variables:

    ---
    output: html_document
    incremental: true
    ---

***

One or more `#` symbols at the start of a line mark a heading. Each additional `#` moves the heading one level down, so more symbols mean a smaller heading.

    # The Biggest Heading
    ## The Next Biggest Heading
    ### Etc ...

Indented code blocks, like these two syntax examples, are produced with text indented 4 spaces.

***

A code chunk is fenced with ` ```{r} ` (three backtick characters followed by `{r}`) above the code and ` ``` ` below it.

> \`\`\`{r eval = TRUE, echo = FALSE}  
> vals <- c(4, 5, 6)  
> \`\`\`

Code chunk options, such as `eval = TRUE` and `echo = FALSE` above, are specified in a comma-separated list.
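
Options given in a chunk header override the document-wide defaults. This document sets its defaults in its first chunk with `knitr::opts_chunk$set()`. The sketch below repeats that call and reads the values back with `opts_chunk$get()`, assuming only that the `knitr` package is installed.

```{r eval = FALSE}
library(knitr)

# Document-wide defaults, the same values this lesson sets in its first chunk
opts_chunk$set(echo = TRUE, eval = FALSE)

# Read the current defaults back
opts_chunk$get("echo")
opts_chunk$get("eval")
```

A chunk header such as `{r eval = TRUE}` then changes the setting for that chunk only.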

## Code chunks

A value defined in one code chunk, e.g.

```{r eval = FALSE}
    ...
```

is accessible in another:

```{r eval = FALSE}
...
```
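
One way to convince yourself that chunks share a single R session is to knit a tiny two-chunk document held in a character vector. This is only a sketch: the names `fence`, `doc` and `out` are made up for illustration, and it relies on `knitr::knit()` accepting the document through its `text` argument.

```{r eval = FALSE}
library(knitr)

# Build the chunk fences from pieces so they are not mistaken for real fences
fence <- strrep("`", 3)

# A two-chunk document, one element per line (hypothetical content)
doc <- c(
  paste0(fence, "{r}"),
  "vals <- c(4, 5, 6)",   # first chunk defines `vals`
  fence,
  "",
  paste0(fence, "{r}"),
  "sum(vals)",            # second chunk uses it
  fence
)

# Knit in a fresh environment so `vals` does not leak into the workspace
out <- knit(text = doc, quiet = TRUE, envir = new.env())
cat(out)
```

The second chunk can compute `sum(vals)` only because the first chunk ran before it in the same environment.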

## Exercise

Create a code chunk that reads the "surveys.csv" table into a variable named `surveys`, but is essentially invisible in the knitted document.

    ...

## Create your formatted document

Choose the "Knit HTML" option at the top of the editor.
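
The same step can be run from the R console, which helps when a workflow has to be scripted. The sketch below assumes the `rmarkdown` package and pandoc are available (RStudio bundles pandoc). It knits a small stand-in file rather than this lesson, and the names `rmd_file` and `html_file` are made up for illustration.

```{r eval = FALSE}
library(rmarkdown)

# Write a tiny stand-in document to a temporary file (hypothetical content)
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(
  c("---", "output: html_document", "---", "", "# A heading", "", "Some text."),
  rmd_file
)

# render() needs pandoc; skip quietly if it is missing
html_file <- NULL
if (pandoc_available()) {
  html_file <- render(rmd_file, quiet = TRUE)
}
html_file
```

For this lesson the call would be `render("lesson-4.Rmd")`, run from the directory that holds the file.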

## Databases

The `dplyr` package includes `src_*` functions for connecting to three types of databases: SQLite, MySQL and PostgreSQL.

### Features of a database

- There is a standardized language, SQL, for scripting interactions with data.
- MySQL and PostgreSQL are server based and multi-user.
- A database can hold more data than your computer can hold in memory.

## Connecting to a database

```{r eval = TRUE}
library(dplyr)
db <- src_sqlite("data/portal.sqlite")
```

## Accessing tables

```{r eval = FALSE}
surveys <- tbl(db, 'surveys')
...
```
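
To try the connect-then-`tbl()` pattern without the Portal file, an in-memory SQLite database works as a stand-in. This sketch assumes the `DBI`, `RSQLite` and `dbplyr` packages are installed alongside `dplyr`. The table `toy_surveys` and its values are invented for illustration. As far as we know, recent `dplyr` releases steer users toward `DBI::dbConnect()` in place of the `src_*` functions, so that form is used here.

```{r eval = FALSE}
library(dplyr)
library(DBI)

# Connect to a throwaway in-memory database (not the lesson's portal.sqlite)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Invented toy data, copied into the database as a table
toy <- data.frame(species_id = c("DM", "DO", "DM"),
                  weight = c(40, 50, 44))
dbWriteTable(con, "toy_surveys", toy)

# tbl() returns a reference to the table; the data stays in the database
toy_tbl <- tbl(con, "toy_surveys")

# dplyr verbs are translated to SQL ...
heavy_query <- filter(toy_tbl, weight > 42)
show_query(heavy_query)

# ... and collect() runs the query and brings the result into R
heavy <- collect(heavy_query)
heavy

dbDisconnect(con)
```

Only the rows that pass the filter travel from the database into R, which is what lets a database hold more data than fits in memory.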

## Tidy databases

Recall the principles of tidy data:

- each variable forms a column
- each observation forms a row
- each type of observational unit forms a table

The last principle encapsulates a core approach to database design: try to never duplicate data about the same observation across rows in the "wrong" table.

***

A join combines information from two tables by matching the values of variables they share. We use a join to add information (from the *species* table) pertaining to each species listed in the *counts_1990_winter* summary.

```{r eval = FALSE}
counts_1990_winter <- filter(surveys, year == 1990) %>%
  ...
  ...
  ...

inner_join(counts_1990_winter, species)
```
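
The mechanics of `inner_join()` are easier to see on a small scale. The two data frames below, `toy_counts` and `toy_species`, are invented for illustration and stand in for the summary and the *species* table. Naming the shared column with `by` is optional, but it makes the matching explicit.

```{r eval = FALSE}
library(dplyr)

# Invented toy data: one row per species counted
toy_counts <- data.frame(species_id = c("DM", "DO", "PP"),
                         count = c(12, 5, 7))

# Invented toy data: one row per known species, including one never counted
toy_species <- data.frame(species_id = c("DM", "DO", "PP", "NL"),
                          genus = c("Dipodomys", "Dipodomys",
                                    "Chaetodipus", "Neotoma"))

# Keep only rows whose species_id appears in both tables
joined <- inner_join(toy_counts, toy_species, by = "species_id")
joined
```

The "NL" row has no match in `toy_counts`, so an inner join drops it. Each remaining row gains the `genus` column.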

## When **un-**tidy data is okay

Untidy data is often needed for analysis, but it's not a good way to store data. Instead, join the tables only when needed.

```{r eval = FALSE}
library(ggplot2)
surveys_1990_winter <- filter(surveys, year == 1990) %>%
  select(-year) %>%
  ...
  ...
  ...

ggplot(data = ...,
       aes(x = genus)) +
  geom_boxplot(...)
```
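
Joining just before plotting might look like the following sketch. It uses invented toy data frames (`toy_surveys`, `toy_species`) rather than the Portal tables, so the values carry no meaning. The point is the shape of the pipeline: keep the tables separate in storage, and build the untidy, plot-ready table only where it is needed.

```{r eval = FALSE}
library(dplyr)
library(ggplot2)

# Invented toy data: individual weights, identified only by species code
toy_surveys <- data.frame(species_id = c("DM", "DM", "DO", "DO", "PP", "PP"),
                          weight = c(40, 44, 50, 52, 15, 17))

# Invented toy data: genus for each species code
toy_species <- data.frame(species_id = c("DM", "DO", "PP"),
                          genus = c("Dipodomys", "Dipodomys", "Chaetodipus"))

# Join on demand: genus is now repeated on every row, which is fine for plotting
toy_joined <- inner_join(toy_surveys, toy_species, by = "species_id")

# One box per genus, showing the spread of weights
p <- ggplot(data = toy_joined, aes(x = genus, y = weight)) +
  geom_boxplot()
p
```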

## Exercise

Create a bar plot for the abundance of species in the rodent taxa.

    ...

## Exercise

The third table, *plots*, gives details on the plot type. Create a code chunk that shows box plots for the weight of rodents in 1990 across the types of plot.

    ...