## Data Wrangling Part 2
![](images/wrangler.png){width=1in}
![](images/tidyR.png){width=1in}
![](images/dplyr.png){width=1.5in}
The video below offers a brief overview of the content covered in this section of the tutorial. Feel free to watch the video and follow along or simply work through the tutorial.
<!-- ![](https://youtu.be/8vqQljFcMsw) -->
<iframe width="560" height="315" src="https://www.youtube.com/embed/8vqQljFcMsw" frameborder="0" allowfullscreen></iframe>
### Notes on tidy R
![](images/tidyR.png){width=1in}
Keep it tidy
If you are following this tutorial by running the code on your local machine (recommended) then it may make sense to check your R version by running the following code in your R console:
```{r example-version, exercise=TRUE}
version
```
At the time of writing this I am using R version 4.0.4 `Lost Library Book` [@R-base]. If you do not have this version or something newer it may make sense to update so that you can follow along without pesky version issues.
The easiest way to get libraries for today is to install the whole tidyverse [@R-tidyverse] by typing `install.packages("tidyverse")` in the R console and then running `library(tidyverse)`:
```
install.packages("tidyverse")
library(tidyverse)
```
If you save your work to an .R file (recommended) be sure to annotate any code that you do not intend to run each time with the `#` symbol. You should only need to install tidyverse once and should be sure to either change that line of code to `#install.packages("tidyverse")` or remove it from your script.
Read more style tips in the [tidyverse style guide](http://style.tidyverse.org/) [@R-tidyverse].
### Notes on tidy R browseVignettes
![](images/tidyR.png){width=1in}
Keep it tidy
Get many examples and details about the tidyverse by running the following code in the R console: `browseVignettes(package = "tidyverse")`. Nearly every R library has a collection of vignettes that walk through examples and show, often in explicit detail, the authors' intended use of the library.
### The tidy tools manifesto
In this tutorial we will be following the basic ideas behind the tidyverse.
![](images/tidy_paper.jpg){width=5in}
Read the [full tidyverse manifesto here](https://cran.r-project.org/web/packages/tidyverse/vignettes/manifesto.html).
### Notes on R: tidyR process
![](images/tidyR.png){width=1in}
Keep it tidy
![](images/tidyR_process.png){width=4in}
- Good coding style is like correct punctuation:
- withoutitthingsarehardtoread
- When your data is tidy, each column is a variable, and each row is an observation
- Consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions
![](images/wrangler.png){width=2in}
Read more style tips in the [tidyverse style guide](http://style.tidyverse.org/) [@R-tidyverse].
### Notes on R: Tidy Data
Three things make a dataset tidy:
- Each variable with its own column.
- Each observation with its own row.
- Each value with its own cell.
![](images/tidydata.png){width=5in}
Read more about this in [Wickham's paper](https://www.jstatsoft.org/v59/i10/paper) in the Journal of Statistical Software.
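To make the three rules concrete, here is a minimal sketch that reshapes a small untidy table into tidy form with `pivot_longer` from `tidyr`. The table, its column names and its values are invented for illustration and are not part of the course data.

```r
library(tidyr)

# Hypothetical untidy table: the years are spread across columns,
# so the variable 'year' is hidden in the column names
untidy <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(15, 25),
  check.names = FALSE
)

# Gather every column except 'country' into two variables:
# 'year' (from the old column names) and 'cases' (from the cell values)
tidy <- pivot_longer(untidy,
                     cols = -country,
                     names_to = "year",
                     values_to = "cases")
tidy
```

After reshaping, each variable (`country`, `year`, `cases`) has its own column and each country-year observation has its own row.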
### Wrangling: transform
- Once you have __tidy__ data, a common first step is to __transform__ it
- narrowing in on observations of interest
- creating new variables that are functions of existing variables
- calculating a set of summary statistics
![](images/Wrangling_Data.png){width=3in}
www.codeastar.com/data-wrangling/
### Wrangling: dplyr arguments
Format of __dplyr__
![](images/dplyr.png){width=1in}
Arguments start with a data frame
- __select__: return a subset of the columns
- __filter__: extract a subset of rows
- __rename__: rename variables
- __mutate__: add new variables and columns or transform
- __group_by__: split data into groups
- __summarize__: generate tables of summary statistics
https://dplyr.tidyverse.org/
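The shared format of these functions can be sketched on a small made-up data frame (the `toy` object and its columns are assumptions for illustration only). The first argument is always the data frame, and the result is a new data frame, so the original stays unchanged unless you assign over it.

```r
library(dplyr)

# Hypothetical example data
toy <- data.frame(name = c("Ana", "Ben", "Cleo"),
                  age = c(31, 25, 40))

# The data frame comes first, followed by what to do with it
older <- filter(toy, age > 30)

is.data.frame(older)  # the result is again a data frame
nrow(toy)             # the original data still has all of its rows
```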
### Getting your data in R
Load data
![](images/R_logo.png){width=1in}
The data we will use for this course is [on Github](https://raw.githubusercontent.com/CWWhitney/teaching_R/master/participants_data.csv) and you can save it as a .csv to your local folder.
- Load the data using the `read.csv` function
```
# Use this on your machine
participants_data <- read.csv("participants_data.csv")
```
Learn more about what this function does by typing `?read.csv` in the R console.
You can also read the data directly from the GitHub repository with the `read_csv` function from the `readr` library [@R-readr] and the `url` function from base R. If you prefer to work with a local copy, use the 'save as' option for the webpage so that it is stored as a `comma separated values` (.csv) file on your machine.
```{r wrangle2-git-data, exercise=TRUE}
library(readr)
urlfile = "https://raw.githubusercontent.com/CWWhitney/teaching_R/master/participants_data.csv"
participants_data <- read_csv(url(urlfile))
```
- Keep your data in the same folder structure as .RProj
- at or below the level of .RProj
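The two reading functions mentioned above can be compared without an internet connection. The sketch below writes a small made-up table to a temporary file (the `toy` data and the temporary path are assumptions for illustration) and reads it back both ways.

```r
library(readr)

# Hypothetical example data written to a temporary .csv file
toy <- data.frame(name = c("Ana", "Ben"), age = c(31, 25))
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Base R returns a plain data frame
base_version <- read.csv(csv_path)
class(base_version)

# readr returns a tibble, which is a data frame with a more compact print method
readr_version <- read_csv(csv_path)
class(readr_version)
```

Both objects work with the `dplyr` functions used in the rest of this tutorial.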
### Looking at the data
- View the full data in the console (use the `View` function to open it from the RStudio 'Environment')
```{r wrangle2-names_participants-data, exercise=TRUE}
participants_data
```
- Look at the top rows of the data with the `head` function. The default of the `head` function is to show 6 rows. This can be changed with the `n` argument.
```{r wrangle2-head-participants-data, exercise=TRUE}
# Change the number of rows displayed to 7
head(participants_data,
n = 4)
```
```{r wrangle2-head-participants-data-hint-1}
# use the ?head option to learn the details of the function
?head
```
```{r wrangle2-head-participants-data-hint-2}
# look at the 'Arguments' section for the 'n' argument
?head
```
```{r wrangle2-head-participants-data-hint-3}
# The 'n' argument should be changed from 'n = 4' to 'n = 7'
head(participants_data,
n = 7)
```
```{r wrangle2-head-participants-data-solution}
head(participants_data,
n = 7)
```
- Check the names of the variables in the data with the `names` function
```{r wrangle2-names-participants-data, exercise=TRUE}
names(participants_data)
```
- Look at the structure of the data with the `str` function
```{r wrangle2-str-participants-data, exercise=TRUE}
str(participants_data)
```
- Call a particular variable in your data with `$`
```{r wrangle2-call-variable, exercise=TRUE}
# Change the variable to gender
participants_data$age
```
```{r wrangle2-call-variable-solution}
participants_data$gender
```
Follow these steps to see the result of the rest of the transformations we perform with `tidyverse`.
### Wrangling: dplyr library
Using __dplyr__ ![](images/dplyr.png){width=1in}
Load the dplyr library by running `library(dplyr)` in the R console. Do the same for the other libraries we need today, `library(tidyr)` and `library(magrittr)` [@R-tidyr, @R-magrittr, @R-dplyr].
Inspiration for many of the following materials comes from Roger Peng's [dplyr tutorial](https://genomicsclass.github.io/book/pages/dplyr_tutorial).
![](images/github.png){width=1in}
Read more about the [dplyr library](https://dplyr.tidyverse.org/) [@R-dplyr].
### Wrangling: dplyr::select aca_work_set
Subsetting
![](images/dplyr.png){width=1in}
__Select__
Create a subset of the data with the `select` function:
```{r wrangle2-select-aca-parents, exercise=TRUE}
# Change the selection to batch and age
select(participants_data,
academic_parents,
working_hours_per_day)
```
```{r wrangle2-select-aca-parents-solution}
select(participants_data,
batch,
age)
```
https://dplyr.tidyverse.org/
### Wrangling: dplyr::select non_aca_work_filter
Subsetting ![](images/dplyr.png){width=1in}
__Select__
Try creating a subset of the data with the `select` function:
```{r wrangle2-select-non-aca-parents, exercise=TRUE}
# Change the selection
# without batch and age
select(participants_data,
-academic_parents,
-working_hours_per_day)
```
```{r wrangle2-select-non-aca-parents-solution}
select(participants_data,
-batch,
-age)
```
https://dplyr.tidyverse.org/
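The two ways of selecting shown above (naming the columns to keep, or naming the columns to drop with `-`) can be tried on a small made-up data frame. The sketch also uses the `starts_with` helper, which picks columns by the beginning of their names. The `toy` data and the `km_to_station` column are hypothetical.

```r
library(dplyr)

# Hypothetical example data
toy <- data.frame(batch = c(2019, 2020),
                  age = c(31, 25),
                  km_home_to_office = c(3, 12),
                  km_to_station = c(1, 4))

# Keep only the named columns
kept <- select(toy, batch, age)
names(kept)

# Keep everything except the named columns
dropped <- select(toy, -batch, -age)
names(dropped)

# Keep all columns whose names start with "km"
km_columns <- select(toy, starts_with("km"))
names(km_columns)
```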
### Wrangling: dplyr::filter work_filter
Subsetting ![](images/dplyr.png){width=1in}
__Filter__
Try creating a subset of the data with the `filter` function:
```{r wrangle2-filter-work, exercise=TRUE}
# Change the selection to
# those who work more than 5 hours a day
filter(participants_data,
working_hours_per_day >10)
```
```{r wrangle2-filter-work-solution}
filter(participants_data,
working_hours_per_day >5)
```
https://dplyr.tidyverse.org/
### Wrangling: dplyr::filter work_name_filter
Subsetting ![](images/dplyr.png){width=1in}
__Filter__
Create a subset of the data with multiple options in the `filter` function:
```{r wrangle2-filter-work-name, exercise=TRUE}
# Change the filter to those who
# work more than 5 hours a day and
# names are longer than three letters
filter(participants_data,
working_hours_per_day >10 &
letters_in_first_name >6)
```
```{r wrangle2-filter-work-name-solution}
filter(participants_data,
working_hours_per_day >5 &
letters_in_first_name >3)
```
https://dplyr.tidyverse.org/
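When combining conditions in `filter`, it may help to see how `&` (and) differs from `|` (or). The sketch below uses a small made-up data frame whose column names mirror the course data. Separating conditions with a comma behaves like `&`.

```r
library(dplyr)

# Hypothetical example data
toy <- data.frame(working_hours_per_day = c(4, 8, 11, 6),
                  letters_in_first_name = c(5, 7, 2, 4))

# Both conditions must hold
both <- filter(toy,
               working_hours_per_day > 5 &
                 letters_in_first_name > 3)

# A comma between conditions is treated the same as '&'
both_comma <- filter(toy,
                     working_hours_per_day > 5,
                     letters_in_first_name > 3)

# At least one condition must hold
either <- filter(toy,
                 working_hours_per_day > 5 |
                   letters_in_first_name > 3)

nrow(both)
nrow(either)
```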
### Wrangling: dplyr::rename name_length
__Rename__ ![](images/dplyr.png){width=1in}
Change the names of the variables in the data with the `rename` function:
```{r wrangle2-rename-letters, exercise=TRUE}
# Rename the variable km_home_to_office as commute
rename(participants_data,
name_length = letters_in_first_name)
```
```{r wrangle2-rename-letters-solution}
rename(participants_data,
commute = km_home_to_office)
```
https://dplyr.tidyverse.org/
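Two details of `rename` are easy to miss: the new name goes on the left of `=`, and the function returns a renamed copy instead of changing the original. A minimal sketch on made-up data (the `toy` object is an assumption for illustration):

```r
library(dplyr)

# Hypothetical example data
toy <- data.frame(km_home_to_office = c(3, 12),
                  age = c(31, 25))

# Pattern: new_name = old_name
renamed <- rename(toy, commute = km_home_to_office)

names(renamed)  # the copy has the new name
names(toy)      # the original keeps the old name
```

To keep the new name for later steps, assign the result to an object as done here with `renamed`.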
### Wrangling: dplyr::mutate
__Mutate__ ![](images/dplyr.png){width=1in}
```{r wrangle2-rename-work, exercise=TRUE}
# Mutate a new column named age_mean that is a function of the age multiplied by the mean of all ages in the group
mutate(participants_data,
labor_mean = working_hours_per_day*
mean(working_hours_per_day))
```
```{r wrangle2-rename-work-solution}
mutate(participants_data,
age_mean = age*
mean(age))
```
https://dplyr.tidyverse.org/
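In the example above, `mean()` returns a single number that is then combined with every row of the column. A small sketch on made-up values shows the arithmetic (the `toy` data and the new column names are assumptions for illustration):

```r
library(dplyr)

# Hypothetical example data: the mean of 'hours' is 4
toy <- data.frame(hours = c(2, 4, 6))

result <- mutate(toy,
                 # each value multiplied by the overall mean
                 hours_times_mean = hours * mean(hours),
                 # each value expressed as a distance from the overall mean
                 hours_centered = hours - mean(hours))
result
```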
### Wrangling: dplyr::mutate
__Mutate__ ![](images/dplyr.png){width=1in}
Create a commute category with the `mutate` function:
```{r wrangle2-mutate-commute, exercise=TRUE}
# Mutate new column named response_speed
# populated by 'slow' if it took you
# more than a day to answer my email and
# 'fast' for others
mutate(participants_data,
commute =
ifelse(km_home_to_office > 10,
"commuter", "local"))
```
```{r wrangle2-mutate-commute-solution}
mutate(participants_data,
response_speed =
ifelse(days_to_email_response > 1,
"slow", "fast"))
```
https://dplyr.tidyverse.org/
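`ifelse` works well for two categories. If you need more than two, one option is `case_when` from `dplyr`, which checks the conditions in order and uses the first one that is true. The sketch below uses made-up distances and category thresholds that are not part of the course material.

```r
library(dplyr)

# Hypothetical example data
toy <- data.frame(km_home_to_office = c(2, 8, 25))

result <- mutate(toy,
                 commute = case_when(
                   km_home_to_office <= 5  ~ "local",
                   km_home_to_office <= 20 ~ "commuter",
                   # fallback for all remaining rows
                   TRUE                    ~ "long-distance"))
result
```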
### Wrangling: dplyr::summarize
__Summarize__
![](images/dplyr.png){width=1in}
Get a summary of selected variables with `summarize`
```{r wrangle2-summar-commute, exercise=TRUE}
# Create a summary of the participants_data
# with the mean number of siblings
# and median years of study
summarize(participants_data,
mean(years_of_study),
median(letters_in_first_name))
```
```{r wrangle2-summar-commute-solution}
summarize(participants_data,
mean(number_of_siblings),
median(years_of_study))
```
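The summaries above get automatic column names such as `mean(years_of_study)`. Naming each summary inside `summarize` makes the result easier to use in later steps. A minimal sketch on made-up data (the `toy` object and the summary names are assumptions for illustration):

```r
library(dplyr)

# Hypothetical example data
toy <- data.frame(number_of_siblings = c(0, 1, 2, 5),
                  years_of_study = c(3, 4, 6, 7))

# Pattern: summary_name = summary_function(variable)
result <- summarize(toy,
                    mean_siblings = mean(number_of_siblings),
                    median_years = median(years_of_study))
result
```

The result is a data frame with one row and one column per named summary.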
### Wrangling: magrittr use
__Pipeline %>%__
- Do all of the previous steps with a `magrittr` pipeline (`%>%`). Use the `group_by` function to get these results for comparison between groups.
```{r wrangle2-pipe-long, exercise=TRUE}
# Use the magrittr pipe to summarize
# the mean days to email response,
# median letters in first name,
# and maximum years of study by gender
participants_data %>%
group_by(research_continent) %>%
summarize(mean(days_to_email_response),
median(letters_in_first_name),
max(years_of_study))
```
```{r wrangle2-pipe-long-solution}
participants_data %>%
group_by(gender) %>%
summarize(mean(days_to_email_response),
median(letters_in_first_name),
max(years_of_study))
```
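The pipe passes the result on its left as the first argument of the function on its right, so a pipeline is another way of writing nested function calls. The sketch below writes the same grouped summary both ways on made-up data (the `toy` object is an assumption for illustration).

```r
library(dplyr)

# Hypothetical example data
toy <- data.frame(gender = c("F", "M", "F", "M"),
                  days_to_email_response = c(1, 3, 2, 5))

# Nested calls: read from the inside out
nested <- summarize(group_by(toy, gender),
                    mean_days = mean(days_to_email_response))

# Pipeline: read from top to bottom
piped <- toy %>%
  group_by(gender) %>%
  summarize(mean_days = mean(days_to_email_response))

all.equal(nested, piped)
```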
Now use the `mutate` function to create a new grouping variable and the `group_by` function to get these results for comparisons between groups.
```{r wrangle2-pipe-long2, exercise=TRUE}
# Use the magrittr pipe to create a new column
# called commute, where those who travel
# more than 10km to get to the office
# are called "commuter" and others are "local".
# Summarize the mean days to email response,
# median letters in first name,
# and maximum years of study.
participants_data %>%
mutate(response_speed = ifelse(
days_to_email_response > 1,
"slow", "fast")) %>%
group_by(response_speed) %>%
summarize(mean(number_of_siblings),
median(years_of_study),
max(letters_in_first_name))
```
```{r wrangle2-pipe-long2-solution}
participants_data %>%
mutate(commute = ifelse(
km_home_to_office > 10,
"commuter", "local")) %>%
group_by(commute) %>%
summarize(mean(days_to_email_response),
median(letters_in_first_name),
max(years_of_study))
```
### purrr: Apply a function to each element of a vector
![](images/purrr.jpg){width=4in}
We will use the `purrr` library to run a regression [@R-purrr]. Run the code `library(purrr)` in your local R console to load the library.
![](images/purrr.jpg){width=1in}
Now we will use the `purrr` library for a simple linear regression [@R-purrr]. Note that when using base R functions with the `magrittr` pipeline we use '.' to refer to the data. The functions `split` and `lm` are from base R and stats [@R-base].
Use purrr to solve: split a data frame into pieces, fit a model to each piece, compute the summary, then extract the R^2.
```{r wrangle2-purr-regression, exercise=TRUE}
# Split the data frame by batch,
# fit a linear model formula
# (days to email response as dependent
# and working hours as independent)
# to each batch, compute the summary,
# then extract the R^2.
participants_data %>%
split(.$gender) %>%
map(~
lm(number_of_publications ~
number_of_siblings,
data = .)) %>%
map(summary) %>%
map_dbl("r.squared")
```
```{r wrangle2-purr-regression-solution}
participants_data %>%
split(.$batch) %>% # from base R
map(~
lm(days_to_email_response ~
working_hours_per_day,
data = .)) %>%
map(summary) %>%
map_dbl("r.squared")
```
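To see what each stage of this pipeline produces, the same four steps can be written without the pipe and stored as separate objects. This sketch uses the built-in `mtcars` data instead of the course data, and the object names are chosen for illustration only.

```r
library(purrr)

# 1. Split the data frame into a list with one data frame per cylinder count
pieces <- split(mtcars, mtcars$cyl)

# 2. Fit the same linear model to each piece
#    (an ordinary function is used here in place of the '~' shorthand)
models <- map(pieces, function(piece) lm(mpg ~ wt, data = piece))

# 3. Compute the model summary for each fit
summaries <- map(models, summary)

# 4. Pull the element named "r.squared" out of each summary
#    and return the values as a named numeric vector
r_squared <- map_dbl(summaries, "r.squared")
r_squared
```

In the piped version, `~ lm(..., data = .)` is shorthand for the function in step 2, with `.` standing for the current piece.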
Learn more about purrr from [the tidyverse](https://purrr.tidyverse.org/) and from [varianceexplained](http://varianceexplained.org/r/teach-tidyverse/).
[Check out the purrr cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/purrr.pdf)
![](images/tidyR.png){width=1in}
![](images/dplyr.png){width=1in}
![](images/magrittr.png){width=1in}
### Test your new skills
**Your turn to perform**
Up to this point the code has been provided for you to work on. Now it is time for you to apply your newfound skills. Please work through the wrangling tasks we just went through. Use the `diamonds` data and write the steps in both long format (i.e. assigning each step to an object) and short format (i.e. with the magrittr pipeline):
- select: carat and price
- filter: only where carat is > 0.5
- rename: rename price as cost
- mutate: create a variable with 'expensive' if greater than mean of cost and 'cheap' otherwise
- group_by: split into cheap and expensive
- summarize: give some summary statistics of your choice
The diamonds data is built in with the `ggplot2` library. It is already available in your R environment. Look at the help file with `?diamonds` to learn more about it.
```{r wrangle2-final, exercise=TRUE}
```
```{r wrangle2-final-solution}
diamonds %>%
# - select: carat and price
select(carat, price) %>%
# - filter: only where carat is > 0.5
filter(carat > 0.5) %>%
# - rename: rename price as cost
rename(cost = price) %>%
# - mutate: create a variable 'cheap_expensive' with 'expensive' if greater than mean of cost and 'cheap' otherwise
mutate(cheap_expensive = ifelse(
cost > mean(cost),
"expensive", "cheap")) %>%
# - group_by: split into cheap and expensive
group_by(cheap_expensive) %>%
# - summarize: give some summary statistics of your choice
summarize(mean(cost), mean(carat))
```
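The solution above is the short format. One possible long format, with each step assigned to its own object, could look like the sketch below. The object names and the choice of summary statistics are assumptions, and on a local machine `ggplot2` needs to be loaded to make `diamonds` available.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

# select: carat and price
diamonds_selected <- select(diamonds, carat, price)

# filter: only where carat is > 0.5
diamonds_filtered <- filter(diamonds_selected, carat > 0.5)

# rename: price as cost
diamonds_renamed <- rename(diamonds_filtered, cost = price)

# mutate: label each diamond relative to the mean cost
diamonds_labelled <- mutate(diamonds_renamed,
                            cheap_expensive = ifelse(
                              cost > mean(cost),
                              "expensive", "cheap"))

# group_by: split into cheap and expensive
diamonds_grouped <- group_by(diamonds_labelled, cheap_expensive)

# summarize: mean cost and mean carat per group
diamonds_summary <- summarize(diamonds_grouped,
                              mean_cost = mean(cost),
                              mean_carat = mean(carat))
diamonds_summary
```

The long format makes it easy to inspect the intermediate objects, while the pipeline avoids creating objects you may not need again.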