---
title: "tidyverse"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## The Tidyverse approach to Data Analysis
<https://www.tidyverse.org/>
<https://cran.r-project.org/web/packages/tidyverse/vignettes/manifesto.html>
The `tidyverse` is a family of R packages promoted by RStudio with the aim of standardizing the common steps of a data science workflow.
## Packages included
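- `readr` for reading rectangular data such as CSV files
- `tibble` for a modern version of the data frame
- `dplyr` for data manipulation
- `tidyr` for tidying data
- `forcats` for working with factors
- `ggplot2` for plotting data
- `purrr` for functional programming
- `stringr` for string manipulation
- `magrittr`, which provides the pipe operator `%>%`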
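As a minimal sketch, assuming the `tidyverse` meta-package is installed, the core packages can be attached with a single call. `tidyverse_packages()` lists every package that belongs to the family, including those that are not attached by default. The vector name `listed` is only illustrative.

```{r}
library(tidyverse)   # attaches the core packages (dplyr, ggplot2, tidyr, ...)

# All packages that are part of the tidyverse, attached or not
pkgs <- tidyverse_packages()
pkgs

# Check that the packages named in the list above belong to the family
listed <- c("readr", "tibble", "dplyr", "tidyr",
            "forcats", "ggplot2", "purrr", "stringr")
listed %in% pkgs
```

Individual packages can still be loaded one at a time, for example `library(dplyr)`, when the full set is not needed.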
### The pipe operator in the magrittr and dplyr packages
The pipe operator `%>%` was first introduced in the `magrittr` package and is now an essential part of the tidyverse; `dplyr` re-exports it. It chains a sequence of function calls and is read from left to right: `%>%` takes the value on its left-hand side and passes it as the first argument to the function on its right-hand side.
It comes in handy when applying a sequence of functions to a data frame.
```{r, echo=FALSE}
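library(magrittr)

# Compute the square root of the elements of a vector
print("Compute the square root of the elements of a vector")
x <- 0:5
x
sqrt(x)

# The same computation with the pipe: x %>% sqrt() is sqrt(x)
" Using the pipe %>% ..." %>% print()
x %>% sqrt()

# Piped and direct calls to summary() give the same result
x %>% summary()
summary(x)

library(dplyr)

# Keep only the columns mpg, cyl and hp
mtcars %>% select(mpg, cyl, hp)
# Then keep the cars with more than 4 cylinders and summarize each column
mtcars %>% select(mpg, cyl, hp) %>% filter(cyl > 4) %>% summary()
# Store the first six rows of the filtered data
mtcars_2 <- mtcars %>% select(mpg, cyl, hp) %>% filter(cyl > 4) %>% head()
mtcars_2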
```
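To make the left-to-right reading concrete, the sketch below compares a nested call with its piped equivalent. By default `%>%` inserts the left-hand side as the first argument of the function on the right, so `x %>% round(1)` is evaluated as `round(x, 1)`. The variable names are illustrative only.

```{r}
library(magrittr)

x <- c(1, 4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(x)), 1)

# The piped version is read from left to right
piped <- x %>% sqrt() %>% mean() %>% round(1)

# Both forms produce the same value
identical(nested, piped)
```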
## dplyr
`dplyr` is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

- `mutate()` adds new variables that are functions of existing variables.
- `select()` picks variables based on their names.
- `filter()` picks cases based on their values.
- `summarise()` reduces multiple values down to a single summary.
- `arrange()` changes the ordering of the rows.
- `group_by()` lets you perform any of these operations "by group".
### Examples
`glimpse` gives a compact overview of a data frame: one line per column, with its type and first few values.
```{r}
glimpse(mtcars)
```
`select` is used for subsetting variables (columns).
```{r}
select(mtcars, mpg)
select(mtcars, mpg:disp, -cyl)  # columns mpg to disp, except cyl
```
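Columns can also be chosen by pattern rather than by exact name. The following sketch uses the selection helpers `starts_with()` and `contains()`, plus `everything()` to reorder columns. The patterns and the result names (`d_cols`, `ar_cols`, `hp_first`) are only for illustration.

```{r}
library(dplyr)

# Columns whose names start with "d"
d_cols <- select(mtcars, starts_with("d"))
names(d_cols)

# Columns whose names contain "ar"
ar_cols <- select(mtcars, contains("ar"))
names(ar_cols)

# Move hp to the front and keep every other column after it
hp_first <- select(mtcars, hp, everything())
names(hp_first)
```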
`mutate` adds new columns to a dataset, computed from existing ones.
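A short sketch of `mutate` on `mtcars` follows. The new column names `kpl` and `hp_per_wt` and the rounded conversion factor are illustrative choices, not part of the dataset.

```{r}
library(dplyr)

mtcars_new <- mtcars %>%
  select(mpg, hp, wt) %>%
  mutate(
    kpl = mpg * 0.425,      # miles per US gallon to km per litre (approximate)
    hp_per_wt = hp / wt     # horsepower per 1000 lbs of weight
  )
head(mtcars_new)
```

The original columns are kept, and the new ones are appended to the right.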
`filter` selects cases (rows) based on their values.
```{r}
mtcars %>% filter(hp > 100)
```
`group_by` is used to group data by one or more columns. It is usually followed by `summarize`, which then returns one row per group.
```{r}
mtcars %>% filter(hp > 100) %>% group_by(cyl) %>% summarize(avg_mpg = mean(mpg))
```
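`summarize` can compute several statistics per group in one call. The sketch below, with illustrative column names, adds a group size with `n()` alongside two means.

```{r}
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(
    n_cars = n(),          # number of rows in each group
    avg_mpg = mean(mpg),   # mean miles per gallon per group
    avg_hp = mean(hp)      # mean horsepower per group
  )
by_cyl
```

The result has one row per distinct value of `cyl`.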
`arrange` is used to sort cases in ascending or descending order. Ascending is the default; wrap a column in `desc()` to sort it in descending order.
```{r}
mtcars %>% filter(hp > 100) %>% group_by(cyl) %>% summarize(avg_mpg = mean(mpg)) %>% arrange(desc(cyl))
```
## Tibble
"Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e. converting character vectors to factors)." <https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html>
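A minimal sketch of working with tibbles, assuming the `tibble` package is installed. It creates a tibble directly, converts `mtcars`, and shows one behavioural difference from base data frames: subsetting a single column with `[` still returns a tibble rather than a vector. The object and column names are illustrative.

```{r}
library(tibble)

# Build a tibble directly; character columns stay character
tb <- tibble(name = c("a", "b", "c"), value = 1:3)
class(tb$name)

# Convert an existing data frame, keeping the row names as a column
mt <- as_tibble(mtcars, rownames = "model")
mt

# Single-column subsetting: tibble in, tibble out
class(mt[, "mpg"])
# The same operation on a base data frame drops to a plain vector
class(mtcars[, "mpg"])
```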
## Criticisms of this approach
- <https://towardsdatascience.com/a-thousand-gadgets-my-thoughts-on-the-r-tidyverse-2441d8504433>
- <https://github.com/matloff/TidyverseSkeptic/blob/master/READMEFull.md>
- <https://www.r-bloggers.com/2019/12/why-i-dont-use-the-tidyverse/>
- Jared P. Lander, *R for Everyone: Advanced Analytics and Graphics*, 2nd ed., Addison-Wesley, 2017.