slides.Rmd

---
title: '<font color="1C63AF"> Applied R Munich: <br> “A Grammar of Data Manipulation” --– <br>  Eine Einführung in das Paket dplyr </font> '
author: "Philipp J. Rösch"
date: "26.10.2015"
output: 
  slidy_presentation:
    highlight: haddock
    font_adjustment: +3
    smart: true
    footer: "Applied R Munich, 26. Oktober 2015: dplyr"
---

##

Folien und Tutorial sind online auf [https://github.com/lmu-applied-r/2015-dplyr](https://github.com/lmu-applied-r/2015-dplyr).

## Dank!

Die Folien dieser Präsentation basieren auf Kevin Markhams Code von [www.dataschool.io](http://dataschool.io)

## Data Cleaning

"A New York Times article [...] discovers the 80-20 rule: that 80% of a typical data science project is sourcing cleaning and preparing the data, while the remaining 20% is actual data analysis." - [Revolution Analytics Blog](http://blog.revolutionanalytics.com/2014/08/data-cleaning-is-a-critical-part-of-the-data-science-process.html)

"In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data." - [BigDataBorat](https://twitter.com/bigdataborat/status/306596352991830016)

## Was kann dplyr?

* Großartig zur Datenexploration und -manipulation
* Intuitiv zu schreiben und einfach zu lesen (vor allem wegen der “Chaining”-Syntax)
* Schneller als base R Befehle

Von wem?

* Hadely Wickham (ggplot2, plyr, reshape2, lubridate, stringr, httr, roxygen2, testthat, devtools, lineprof, staticdocs)
* Romain Francois 

## [Cheat Sheet](https://www.rstudio.com/resources/cheatsheets/)

![alt text](https://www.rstudio.com/wp-content/uploads/2015/03/data-wrangling-cheatsheet.png "RStudio Cheat Sheet dplyr")


## Was kann dplyr?

Aufgabe | Verb 
--- | --- 
Wähle Zeilen | __`filter`__ (+ Window Functions), `distinct`, `sample_n`, `sample_frac`, `slice`, `top_n` 
Wähle Variablen | __`select`__ (+ Helper Functions)
Bearbeite Variablen | __`arrange`__, `rename`
Erstelle Variablen | __`mutate`__, `mutate_each`, __`transmute`__ (+ Window Functions)
Fasse Daten zusammen | __`summarise`__, `summarise_each` (+ Summary Functions)
Gruppiere Daten | __`group_by`__

Es werden beinahe alle dplyr-Funktionen in den Folien erwähnt.

## dplyr laden und Beispielsdaten 

* dplyr überlagert ein paar base functions 
* Falls plyr verwendet werden soll, lade plyr zuerst (dplyr: 20-1000x schneller)
* `hflights` sind Abflugsdaten von zwei Flughäfen in Houston aus dem Jahr 2011

```{r}
# load packages
suppressMessages(library(dplyr))
library(hflights)
```

## Exkurs [magrittr](https://github.com/smbache/magrittr): "Chaining" oder "Pipelining"

* magrittr wird automatisch mit dplyr geladen
* Chaining erhöht Lesbarkeit, vor allem bei der Verwendung von vielen Befehlen
* Befehle können hintereinander geschrieben werden und werden durch den Operator `%>%` ("then") verknüpft


```{r eval=FALSE}
# f(x, y) is the same as
x %>% f(y) 

# f(x, y, z ) is the same as
y %>% f(x, ., z) 
```


## Exkurs [magrittr](https://github.com/smbache/magrittr): "Chaining" oder "Pipelining"

```{r results='hide'}
# create two vectors
x1 <- 1:5; x2 <- 2:6
    
# calculate Euclidian distance between them
sqrt(sum((x1-x2)^2))

# chaining method
(x1-x2)^2 %>% 
  sum() %>% 
  sqrt()
```

Bei Chaining können Klammern ohne Argumente weggelassen werden.

## dpylr und Beispieldaten

```{r}
# explore data
data(hflights)
head(hflights, n = 4)
```

## Local Data Frame

`tbl_df` erstellt ein "Local Data Frame" also einen Wrapper, der Data Frames in der Konsole "schön" ausgibt

```{r}
# convert to local data frame and print
flights <- tbl_df(hflights) %>% print(n = 3)
```

<!---
## Printing
```{r eval=TRUE}
flights %>% print(n = 5)
```

* Default ist `print(n = 10, width = Inf)` 
* Optionen über `options(dplyr.print_min = 10, dplyr.width = Inf)` verstellbar
--->

## Structure Summary

```{r results='hide'}
# base R approach to view the structure of an object
str(iris)
```

```{r}
# dplyr approach: better formatting, and adapts to your screen width
glimpse(iris)
```


## Row Names

dplyr unterstützt keine Row Names

```{r, eval=FALSE}
# add rownames
mtcars_new <- mtcars %>% 
  add_rownames("car_names") %>%
  tbl_df() 
```

<!---
## Allgemeines zu dplyr

* Bei base R muss Data Frame-Name wiederholt werden
* dplyr-Methode einfach zu schreiben und zu lesen
* Befehlsstruktur (für alle dplyr-Verben):
    * Erstes Argument: Data Frame-Name
    * Ausgabewert: Data Frame
    * Ohne Zuweisung wird Data Frame nicht verändert

--->

## Auswählen von Zeilen

Verb | Aufgabe
--- | ---
__`filter(flights, Month < 3)`__ | filtert aus Zeilen
`distinct(flights)` | entfernt Duplikate 
`slice(flights, 11:20)` | wählt 11. bis 20. Observation
`sample_n(flights, 10)` | zieht random n Observationen ohne Zurücklegen 
`sample_frac(flights, .2)` | zieht 20% der Observationen ohne Zurücklegen
__`top_n(flights, 10)`__ | wählt Top 10-Werte, gegeben Gruppierung und Sortierung 

## Auswählen von Zeilen: `filter` mit AND

```{r results='hide'}
# base R approach to view all flights on January 1
flights[flights$Month==1 & flights$DayofMonth==1, ]
```

```{r}
# dplyr approach
# note: you can use comma or ampersand to represent AND condition
filter(flights, Month==1, DayofMonth==1) %>% print(n = 3) # filter(flights, Month==1 & DayofMonth==1)
```

## Auswählen von Zeilen: `filter` mit AND

```{r results='hide'}
# alternatively with chaining
flights %>% 
  filter(Month==1, DayofMonth==1) %>%
  print(n = 3) 
```

Mögliche logische Operationen: 
`<`, `>`, `<=`, `>=`, `==`, `!=`, `%in%`, `is.na`, `!is.na`, `&`, `|`, `xor`, `any`, `all`

## Auswählen von Zeilen: `filter` mit OR und IN

```{r}
# use pipe for OR condition
flights %>% filter(UniqueCarrier=="AA" | UniqueCarrier=="UA") %>% print(n = 5)
```

```{r results='hide'}
# you can also use %in% operator
flights %>% filter(UniqueCarrier %in% c("AA", "UA"))
```

## Auswählen von Variablen: `select`

```{r results='hide'}
# base R approach to select DepTime, ArrTime, and FlightNum columns
flights[ , c("DepTime", "ArrTime", "FlightNum")]
```

```{r}
# dplyr approach
flights %>% select(DepTime, ArrTime, FlightNum) %>% print(n = 5)
```


## Auswählen von Variablen: `select` und seine Helper Functions 

* `Year:DayofMonth`
* `contains("Taxi")`
* `starts_with("Taxi")`
* `ends_with("Delay")`
* `matches(".Dela.")`
* `num_range("x", 1:5, width = 2)` 
* `-TaxiIn`
* `one_of(vector)`   
* `everything()`

## Auswählen von Variablen: `select` mit Helper Function

```{r}
flights %>% select(Year:DayofMonth, contains("Taxi"), -TaxiIn, ends_with("Delay")) %>% print(n = 3)
```

## Sortieren von Variablen: `arrange`

```{r results='hide'}
# base R approach to select UniqueCarrier and DepDelay columns and sort by DepDelay
head(hflights[order(hflights$DepDelay), c("UniqueCarrier", "DepDelay")], n = 3)
```

```{r}
# dplyr approach
flights %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(DepDelay) %>%
    print(n = 3)
```

## Sortieren von Variablen: `arrange`

```{r}
# use `desc` for descending
flights %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(desc(DepDelay)) %>%
    print(n = 3)

```

## Erstellen neuer Variablen: `mutate` 

```{r results='hide'}
# base R approach to create a new variable Distance in km. 
flights$Distance_km <- flights$Distance * 1.60934
flights[ , c("Distance", "Distance_km")]
```

```{r}
# dplyr approach (prints the new variable but does not store it)
flights %>%
    select(Distance) %>%
    mutate(Distance_km = Distance * 1.60934) %>%
    print(n = 3)

# store the new variable
flights <- flights %>% mutate(Distance_km = Distance * 1.60934)
```

## Erstellen neuer Variablen: `transmute`

Nur die neue Variable wird ausgegeben (Achtung bei Zuweisungen)

```{r}
flights %>% transmute(Distance_km = Distance * 1.60934)
```

## Daten zusammenfassen: `summarise`

* Summary Function nimmt n Observationen und gibt einen Wert zurück
* Vor allem nützlich bei Daten, die zuvor gruppiert wurden
* __`group_by`__ erstellt die Gruppen auf denen die `summarise`-Operation ausgeführt wird 

<br>

Summary Functions:

* __`n()`__, __`n_distinct()`__ 
* `first()`, `last()`, `nth()` 
* `min()`, `max()`
* __`mean()`__, `median()` 
* `var()`, `sd()`
    
    
## Daten zusammenfassen: `summarise`

```{r results='hide'}
# base R approaches to calculate the average arrival delay to each destination
head(with(data = flights, expr = tapply(X = ArrDelay, INDEX = Dest, FUN = mean, na.rm = TRUE)))
head(aggregate(ArrDelay ~ Dest, flights, mean))
```

```{r}
# dplyr approach: create a table grouped by Dest, then summarise each group by taking the mean of ArrDelay
flights %>%
    group_by(Dest) %>%
    summarise(avg_delay = mean(ArrDelay, na.rm = TRUE)) %>%
    print(n = 3)
```

## Daten zusammenfassen: Summary Function `n()` 

`n()` zählt Zeilen in einer Gruppe

```{r}
# for each day of the year, count the total number of flights and sort in descending order
flights %>%
    group_by(Month, DayofMonth) %>%
    summarise(flight_count = n()) %>%
    arrange(desc(flight_count)) %>%
    print(n = 3)
```


## Daten zusammenfassen: Summary Function `n()` 

Vereinfachbar mit `tally(sort = TRUE)` anstelle von `summarise(flight_count = n()) %>% arrange(desc(flight_count))`.

```{r}
# for each day of the year, count the total number of flights and sort in descending order
flights %>%
    group_by(Month, DayofMonth) %>%
    tally(sort = TRUE) %>%
    print(n = 3)
```


## Daten zusammenfassen: Summary Function `n_distinct(vector)`

`n_distinct(vector)` zählt die Anzahl der einmalig auftretenden Zeilen in einem Vektor

```{r}
# for each destination, count the total number of flights and 
# the number of distinct planes that flew there
flights %>%
    group_by(Dest) %>%
    summarise(flight_count = n(), plane_count = n_distinct(TailNum)) %>%
    print(n = 3)
```

<!---
## Daten zusammenfassen: Summary Function `summarise_each`

Wendet die Summary Function auf mehrere Variablen gleichzeitig an 

```{r}
# for each carrier, calculate the percentage of flights cancelled or diverted
flights %>%
    group_by(UniqueCarrier) %>%
    summarise_each(funs(mean), Cancelled, Diverted)  %>%
    print(n = 2)
```

## Daten zusammenfassen: Summary Function `summarise_each`

```{r}
# for each carrier, calculate the minimum and maximum arrival and departure delays
flights %>%
    group_by(UniqueCarrier) %>%
    summarise_each(funs(min(., na.rm = TRUE), max(., na.rm = TRUE)), matches("Delay")) %>%
    print(n = 2)
```
--->
## Grouping ohne `summarise`

```{r}
# for each destination, show the number of cancelled and not cancelled flights
flights %>%
    group_by(Dest) %>%
    select(Cancelled) %>%
    table() %>%
    head()
```

## Window Functions

* Summary Function (wie `mean`) nimmt n Zeilen und gibt einen Wert zurück
* [Window Function](http://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html) nimmt n Zeilen und gibt n Werte zurück
* Speziell für `filter` und `mutate`

<br> 

| | |
| --- | --- | 
| Ranking and Ordering | __`min_rank`__, `dense_rank`, `percent_rank`, `row_number` |
| Offset Functions | `lead` and __`lag`__ |
| Kumulative Aggregate | `cummean`, `cumall`, `cumany`, `cumsum`, `cummax`, `cummin`, `cumprod` |

Außerdem gibt es noch
`between`, `ntile`, `cume_dist`, ...


## Window Functions: `min_rank`

```{r}
# for each carrier, calculate which two days of the year they had their longest departure delays
flights %>%
    group_by(UniqueCarrier) %>%
    select(Month, DayofMonth, DepDelay) %>%
    filter(min_rank(desc(DepDelay)) <= 2) %>%
    arrange(UniqueCarrier, desc(DepDelay)) %>%
    print(n = 4)
```

Vereinfachbar mit `top_n(2)` anstelle von `filter(min_rank(desc(DepDelay)) <= 2)`


## Window Functions: `lag` 

```{r}
# for each month, calculate the number of flights and the change from the previous month
flights %>%
    group_by(Month) %>%
    summarise(flight_count = n()) %>%
    mutate(change = flight_count - lag(flight_count)) %>%
    print(n = 5)
```
<!---
```{r results='hide'}
# rewrite more simply with the `tally` function
flights %>%
    group_by(Month) %>%
    tally() %>%
    mutate(change = n - lag(n))
```
--->

## Datenbanken

* dplyr kann sich mit Datenbanken verbinden und Tables in Data Frames laden 
<!---* Verwendet die gleiche Syntax für Local Data Frames und Datenbanken--->
* Unterstützt SQLite, MySQL and PostgreSQL und Googles BigQuery
* Kann nur SELECT Statements erzeugen
* Mehr Informationen in der [Datenbank Vignette](http://cran.r-project.org/web/packages/dplyr/vignettes/databases.html)

```{r eval=FALSE}
# connect to an PostgreSQL database containing the hflights data
my_db <- src_postgres(dbname = "testdb", host = "localhost", user = "my_name")

# connect to the "hflights" table in that database
flights_tbl <- tbl(my_db, "hflights")

# identical query using the database
flights_tbl %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(desc(DepDelay))
```

<!---
## Datenbanken: SQL-Befehle

```{r eval=FALSE}
# send SQL commands to the database
tbl(my_db, sql('SELECT "UniqueCarrier", "DepDelay" FROM hflights ORDER BY "DepDelay" DESC'))
```

## Datenbanken: SQL-Befehle

dplyr kann das SQL-Statement und den Query Execution Plan zurückgeben

```{r eval=FALSE}
# ask dplyr for the SQL commands
flights_tbl %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(desc(DepDelay)) %>%
    explain() 
```
--->

## Weitere Operationen mit `do`

```{r}
m <- flights %>%
  group_by(UniqueCarrier) %>%
  do(mod = lm(ArrDelay ~ ArrTime, data = .))
```
```{r}
m$mod[1]
```


## Funktionen, die nicht behandelt wurden

* Neben `tbl_df` gibt es auch `tbl_dt`, `tbl_sql` und `tbl_cube`.
* [Joins](http://image.slidesharecdn.com/gbdc-r-04-150117002911-conversion-gate02/95/r-data-wrangling-predicting-nfl-with-elo-like-nate-silver-538-14-638.jpg?cb=1421535048): `full_join`, `inner_join`, `left_join`, `semi_join`, `anti_join` 
* Datensatzoperationen: `union`, `intersect`, `setdiff`, `setequal` 
* Datensätze verbinden: `bind`, `bind_cols`, `bind_rows`
* `ungroup`
* Innerhalb eigener Funktionen (Standard Evaluation): `filter_`, `select_`, ... 
* `data_frame`

## Quellen

* Zwei gute Tutorials auf YouTube von Kevin Markhams: [Tutorial 1](https://www.youtube.com/watch?v=jWjqLW-u3hc), [Tutorial 2](https://www.youtube.com/watch?v=2mh1PqfsXVI)
* [Offizielles dplyr Reference Manual und Vignettes auf CRAN](http://cran.r-project.org/web/packages/dplyr/index.html)
* [Webinar über dplyr (and ggvis) von Hadley Wickham (2014)](http://pages.rstudio.net/Webinar-Series-Recording-Essential-Tools-for-R.html) und die zugehörigen [Slides](https://github.com/rstudio/webinars/tree/master/2014-01)
* [dplyr Tutorial von Hadley Wickham](https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a) bei der useR! 2014
* [dplyr GitHub Repo](https://github.com/hadley/dplyr) und [Release Liste](https://github.com/hadley/dplyr/releases)