# Data Analysis with dplyr
<iframe width="560" height="315" src="https://www.youtube.com/embed/gr1Refk2kyE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<iframe width="560" height="315" src="https://www.youtube.com/embed/-Y1sINe5D3I" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
## Instructions
This tutorial introduces how to manipulate data in R and turn raw data into an analysis-ready form. It will guide you through the basic and most useful functions in the dplyr package. Follow the step-by-step instructions as well as the examples that demonstrate how each function works.
Accompanying this tutorial is **a short [Google quiz](https://forms.gle/17WnX6paHx5U14jx6)** for your own self-assessment. The instructions in this tutorial will clearly indicate when you should answer which question.
## Learning Objectives
* Be able to load the nhanesA and dplyr packages and understand what a dataframe is.
* Be able to use rename() and select() to rename and select variables (columns) in a dataframe.
* Be able to use filter() to subset a dataframe based on conditions.
* Be able to use arrange() to re-order the rows in a dataframe.
* Be able to use mutate() and transmute() to add new variables that are computed from existing variables in a dataframe.
* Be able to use summarize() to get summary statistics from a dataframe; Be able to perform grouping with summarize(), filter(), and mutate().
* Be able to use the pipe operator to re-write multiple operations in a more readable way.
* Be able to check for missing values and deal with them when using the above functions.
## Set up
### Install and load packages
First, we need to install and load the dplyr package as well as bring in our National Health and Nutrition Examination Survey (NHANES) dataset. This particular tutorial uses data from 2013-2014.
For more information about NHANES, you can visit the NHANES website. It is recommended that you explore the site to familiarize yourself with the data we will be using throughout this tutorial.
```{r}
# install.packages("dplyr")
library(dplyr)
```
```{r}
# install.packages("nhanesA")
library(nhanesA)
```
### Set working directory
We want to set the working directory using `setwd()` - you may need to change the working directory to match your hard drive set-up.
```{r}
# setwd('/kaggle/working')
```
### Import dataset
This particular tutorial uses the Demographics dataset and Blood pressure dataset from NHANES dataset (2013-2014).
The Demographics dataset contains a large number of records of demographic information, such as gender, race, and annual income, for each participant. The Blood Pressure dataset is part of the Examination data and contains a large number of measurements related to blood pressure.
More information on the imported dataset can be found here:
* [Demographics data](https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&CycleBeginYear=2013)
* [Examination data](https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Examination&CycleBeginYear=2013)
* [Complete variable dictionary](https://wwwn.cdc.gov/Nchs/Nhanes/Search/variablelist.aspx?Component=Demographics&CycleBeginYear=2013)
* [DEMO_H Code book](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm#Codebook) (Demographic Variables)
* [BPX_H Code book](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/BPX_H.htm#Codebook) (Blood Pressure)
```{r}
demo <- nhanes('DEMO_H')
bpx <- nhanes('BPX_H')
```
#### Functions debunked {-}
`nhanes()` is the function we use to import our NHANES data and save it in a dataframe. What is a dataframe? We'll introduce it below. The arguments are as follows:
nhanes('Name of Dataset')
**For example:** `nhanes('DEMO_H')`
### Explore our dataset
Recall the basic functions we learned in tutorial 1; it is good practice to explore our datasets before doing any analysis.
We can check the dimension of the datasets:
```{r}
dim(demo)
dim(bpx)
```
We can also check their first few rows:
```{r}
head(demo, 3)
head(bpx, 3)
```
It does not hurt to also check the last few rows of the datasets:
```{r}
tail(demo, 3)
tail(bpx, 3)
```
#### is.na()
In the real world, missing values are unavoidable. They can arise for many reasons, such as nonresponse or incorrect data collection, and they happen all the time. It is always a good idea to check for missing values before proceeding.
In a dataframe, missing values show up as *NA*, which is how R represents missing data.
Note: Be careful when manipulating real datasets! Missing values may be recorded in other ways, for example as *infinity* or as other sentinel numbers. One way to deal with such values is to convert them into NAs and then treat them as regular NAs. The details will not be discussed in this tutorial but will likely be covered in later tutorials.
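For illustration, here is a minimal sketch of that idea (using a toy data frame with a hypothetical `income` column and a made-up sentinel code of 9999, not the NHANES data) that converts the sentinel values to NA with dplyr's `na_if()`:
```{r}
# Toy example: 9999 is a made-up "refused to answer" code.
toy <- data.frame(id = 1:4, income = c(25000, 9999, 40000, 9999))
# na_if() replaces every value equal to 9999 with NA.
toy <- mutate(toy, income = na_if(income, 9999))
toy
```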
To check if there are any missing values (NAs) in a dataframe, use `is.na()`:
```{r}
head(is.na(demo),5)
```
To find the total number of missing values in a dataframe:
```{r}
sum(is.na(demo))
```
To check if there are any missing values (NAs) in a column, for example gender:
```{r}
head(is.na(demo['RIAGENDR']),5)
```
To find the total number of missing values in a column:
```{r}
sum(is.na(demo['RIAGENDR']))
```
To find the number of missing values for each column, use the summary() function:
```{r, results="hide"}
summary(demo)
```
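Alternatively, since `is.na()` returns a logical matrix over the whole dataframe, `colSums()` can add up the TRUEs to give a per-column count of missing values; a brief sketch:
```{r}
# Number of NAs in each column of demo (showing the first few columns only).
head(colSums(is.na(demo)))
```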
If a cell contains a missing value, is.na() returns **TRUE**; if it does not, it returns **FALSE**.
For example, (1,1) (1st row, 1st column) is FALSE, meaning that cell is not missing; (1,6) (1st row, 6th column) is TRUE, meaning that cell is NA.
[**is.na()**](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/NA) is the function we use to inspect any missing values in a dataframe - it is housed in the base package.
The arguments are as follows:
is.na(
> NAME OF DATAFRAME
)
**For example:** `is.na(demo)`
#### **DO QUESTIONS 1-3 OF THE QUIZ NOW** {-}
* What package does the function `nhanes()` belong to?
* Which code below is the correct one to print the last 10 rows of a dataframe called temp?
* Suppose a cell contains NA. What is the output after performing is.na() on it?
## Dataset Preparation
Now that we've successfully imported the two datasets, we're ready to do further data manipulation. But before we proceed, there are three issues with the datasets:
* The two datasets are **huge**. Recall the dimensions we checked above: the demo dataset has 10175 rows and 47 columns, and the bpx dataset has 9813 rows and 23 columns.
+ For demonstration purposes, this tutorial will only focus on parts of the two datasets.
+ For the demo dataset, we want to keep the following variables as our primary interests only:
+ *SEQN* (Respondent sequence number)
+ *RIAGENDR* (Gender)
+ *RIDAGEYR* (Age in years at screening)
+ *RIDRETH3* (Race)
+ *DMDEDUC2* (Education level - Adults 20+)
+ For the bpx dataset, we want to keep the following variables as our primary interests only:
+ *SEQN* (Respondent sequence number)
+ *PEASCST1* (Blood Pressure Status)
+ *PEASCTM1* (Blood Pressure Time in Seconds)
+ *BPXSY1* (Systolic: Blood pres (1st rdg) mm Hg)
+ *BPXDI1* (Diastolic: Blood pres (1st rdg) mm Hg)
+ For the demo dataset, we will keep only the first 10 rows; for the bpx dataset, only the first 5 rows.
* There is something odd about the demo dataset. For example, if you print demo, you will see that *RIAGENDR* (gender) is coded as 1 and 2. For ease of future use, we want to translate this 1 and 2 into Male and Female.
* The variable names are long and ambiguous. For ease of further use, we want to rename the variables so that they are easier to understand.
+ For the demo dataframe, we want to rename the variables in this way:
+ SEQN -> id
+ RIAGENDR -> gender
+ RIDAGEYR -> age
+ RIDRETH3 -> race
+ DMDEDUC2 -> edu
+ For the bpx dataframe, we want to rename the variables in this way:
+ SEQN -> id
+ PEASCST1 -> bp_status
+ PEASCTM1 -> bpt_sec
+ BPXSY1 -> systolic
+ BPXDI1 -> diastolic
+ **Note:** there are many valid ways of renaming, as long as the new names are straightforward. We will use the naming above throughout this tutorial for the sake of consistency.
With the three issues listed above, here's an example of how to resolve them in the demo dataset:
First, we translate the coded values into readable labels using the `nhanesTranslate()` function:
```{r}
demo_translate <- nhanesTranslate('DEMO_H',
c('SEQN', # Respondent sequence number
'RIAGENDR', # Gender
'RIDAGEYR', # Age in years at screening
'RIDRETH3', # Race
'DMDEDUC2'), # Education level - Adults 20+
data = demo)
head(demo_translate,5)
```
Second, we keep only the variables we're interested in:
```{r}
new_demo <- select(demo_translate,
c(SEQN,
RIAGENDR,
RIDAGEYR,
RIDRETH3,
DMDEDUC2))
head(new_demo,5)
```
Third, we want to rename the variables using the `rename()` function:
```{r}
new_demo <- rename(new_demo,
id = SEQN, # Respondent sequence number
gender = RIAGENDR, # Gender
age = RIDAGEYR, # Age in years at screening
race = RIDRETH3, # Race/Hispanic origin
edu = DMDEDUC2, # Education level - Adults 20+
)
head(new_demo,5)
```
Last, we want to keep the top 10 rows using the `head()` function and save the dataframe as `final_demo`:
```{r}
final_demo <- head(new_demo,10)
final_demo
```
[**rename()**](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/select) is the function we use to rename the variables and return all variables - it is housed in the dplyr package. The arguments are as follows:
rename(**Name of Data Frame**, **New Variable Name** = **Old Variable Name**)
**For example:** `rename(demo, id = SEQN)`
[**select()**](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/select) is the function we use to return only the variables we want - it is housed in the dplyr package. The arguments are as follows:
select(**Name of Data Frame**, **Variables (can be written as a vector)**)
**For example:** `select(demo, c(id,gender,age))`
**For example:** `select(demo, id:age)`
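dplyr also provides selection helpers such as `starts_with()` and `contains()`, which select columns by matching their names; a rough sketch on the raw `demo` data (which columns are matched depends on the dataset):
```{r}
# Keep the id column plus every column whose name starts with "RID".
head(select(demo, SEQN, starts_with("RID")), 3)
```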
### [Try it yourself 5.1][5.1] {-}
In the bpx dataframe, keep the following variables and top 5 rows only:
+ SEQN
+ PEASCST1
+ PEASCTM1
+ BPXSY1
+ BPXDI1
### [Try it yourself 5.2][5.2] {-}
Rename the variables in the following way:
+ SEQN -> id
+ PEASCST1 -> bp_status
+ PEASCTM1 -> bpt_sec
+ BPXSY1 -> systolic
+ BPXDI1 -> diastolic
#### **DO QUESTIONS 4 & 5 OF THE QUIZ NOW** {-}
* Given the following code: `select(dataset, A, B)`. What is the purpose of the code?
* Which code below is the correct one to rename column A to B in a dataframe called dataset?
## Filter
Sometimes we want to focus on a subset of the dataset that satisfies certain conditions. In this case, we need to filter the dataset based on the values of its variables.
For example, we use the following code to keep the observations where the participant was 40 years old at screening.
```{r}
filter(final_demo, age == 40)
```
We may also be interested in the observations where the participant was 40 **or** 41 years old at screening:
```{r}
filter(final_demo, age == 40 | age == 41)
```
Another way of writing the code above is to use the `%in%` operator:
`x %in% y` is TRUE when **the value of x equals one of the values in y**.
```{r}
filter(final_demo, age %in% c(40, 41))
```
#### Functions debunked {-}
[**filter()**](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/filter) is the function we use to subset the dataset based on their values and conditions - it is housed in the dplyr package. The arguments are as follows:
filter(**Name of Dataset**, **Condition 1**, **Condition 2**, **...**, **Condition n**)
**For example:** `filter(new_demo, gender == 'Male')`
**Note:**
1. Pay attention to the difference between `=` and `==`
> `=` is the assignment operator
> `==` is the equality comparison operator
Other comparison operators are `<`, `<=`, `>`, `>=`, and `!=`.
2. Logical operators
Multiple conditions can be combined using logical operators. Common logical operators are:
> `and` is `&`
> `or` is `|`
> `not` is `!`
**For example:** `filter(new_demo, age == 40 | age == 41)`
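As a quick sketch of combining conditions with `&` (recall that in `final_demo` the gender codes have been translated to labels such as "Male"):
```{r}
# Keep rows where the participant is male AND at least 40 years old.
filter(final_demo, gender == "Male" & age >= 40)
```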
#### Missing values
Note that there are no missing values in the age column (you can check this with `is.na(new_demo['age'])`).
But what if we did have missing values - how does filter() treat them?
filter() only keeps the rows where the condition is **TRUE**; it drops the rows where the condition evaluates to **FALSE or NA**.
If you want to keep the NAs, you need to add an explicit `is.na(VARIABLE_NAME)` condition and combine it with the other conditions using `|`.
**For example:** `filter(new_demo, is.na(income) | income <= 50)`
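Since our `final_demo` dataframe has no `income` column, here is a runnable sketch of the same idea using a column it does have; without the `is.na()` term, rows where `edu` is missing would be dropped:
```{r}
# Keep rows where the education level is missing OR the participant is over 40.
filter(final_demo, is.na(edu) | age > 40)
```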
### [Try it yourself 5.3][5.3] {-}
In the demo dataframe, find all the records that:
a. The participant is male
b. The participant is male and older than 50 years old
c. The education level is missing
#### **DO QUESTIONS 6-8 OF THE QUIZ NOW** {-}
Fill in the blank to answer the 'Try it yourself' sections:
* filter(final_demo, gender `___`)
* filter(final_demo, gender == 'Male' `___` age > 50)
* filter(final_demo, `___`)
## Re-order the Rows
If we want to re-order the observations by certain variables, we can use the arrange() function in R.
First, let's look at the final_demo dataframe again:
```{r}
final_demo
```
We notice that the column "age" doesn't follow any order. We use the following code to re-order the dataframe by age:
```{r}
arrange(final_demo, age)
```
If we want to change it to descending order, use desc():
```{r}
arrange(final_demo, desc(age))
```
What if we want to change the order by multiple columns?
We can simply list multiple columns in the arguments! This is also useful when there are ties in the values of one column: the subsequent columns are used to break the ties.
For example, re-order the observations by age, then by id:
```{r}
arrange(final_demo,age,id)
```
#### Functions debunked {-}
[**arrange()**](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/arrange) is the function we use to change the order of the observations by columns - it is housed in the dplyr package. The arguments are as follows:
arrange(
> NAME OF DATAFRAME,
> COLUMN 1
> COLUMN 2
> ...
> COLUMN n
)
**For example:** `arrange(new_demo, gender)`
Use `desc(COLUMN_NAME)` to reorder the dataframe by *COLUMN_NAME* in descending order.
#### Missing values
What if the column we sort by contains missing values - how does arrange() deal with them? **All NAs are kept, but placed at the end of the result.**
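A tiny illustration on a toy data frame (not the NHANES data):
```{r}
# NAs in the sorting column are placed last, whether sorting ascending or descending.
toy_sort <- data.frame(x = c(2, NA, 1, 3))
arrange(toy_sort, x)
arrange(toy_sort, desc(x))
```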
### [Try it yourself 5.4][5.4] {-}
Re-order the rows in the bpx dataset by Blood Pressure Time in Seconds (bpt_sec) in descending order:
#### **DO QUESTION 9 OF THE QUIZ NOW** {-}
Fill in the blank to answer the 'Try it yourself' section:
* `___`(final_bpx,`___`)
## Add new variables
Suppose we want to gain more information about our observations by constructing a new variable from the variables we already have. The mutate() and transmute() functions help us add new variables to a dataframe.
For example, we want to add a new variable called born_year to the demo dataframe, calculated as born_year = 2021 - age:
```{r}
mutate(final_demo,
born_year = 2021 - age
)
```
If we want to keep the added variables only, use transmute() instead:
```{r}
transmute(final_demo,
born_year = 2021 - age
)
```
#### Functions debunked {-}
[**mutate()**](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/mutate) is the function we use to create new variables based on variables we have and add them to the original dataframe - it is housed in the dplyr package. The arguments are as follows:
mutate(
> NAME OF DATAFRAME,
> NEW_VARIABLE = FUNCTION OF EXISTING VARIABLES
> ...
)
**For example:** `mutate(new_demo, rescale_income = income/2)`
[**transmute()**](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/mutate) is the function we use to create new variables based on variables we have and only keep the added variables - it is housed in the dplyr package. The arguments are as follows:
transmute(
> NAME OF DATAFRAME,
> NEW_VARIABLE = FUNCTION OF EXISTING VARIABLES
> ...
)
**For example:** `transmute(new_demo, rescale_income = income/2)`
You can find more details on how to use creation functions to build new variables in [Chapter 3 of R for Data Science](https://learning.oreilly.com/library/view/r-for-data/9781491910382/ch03.html#transform).
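mutate() can also create several variables in one call, and later definitions may refer to variables created earlier in the same call; a brief sketch:
```{r}
# born_year is computed first and then reused to derive decade_born.
mutate(final_demo,
       born_year = 2021 - age,
       decade_born = floor(born_year / 10) * 10)
```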
### [Try it yourself 5.5][5.5] {-}
a. Create a new variable called **rescale_bpt_sec** that records the Blood Pressure Time in minutes. Keep both the original and the new variables.
b. Create **rescale_bpt_sec** in the same way as above and **keep only the new variable**.
Note: Try to avoid using select().
#### **DO QUESTION 10 OF THE QUIZ NOW** {-}
Fill in the blank to answer both parts of the 'Try it yourself' sections.
* `___`(final_bpx, rescale_bpt_sec = bpt_sec/60)
## Summary Statistics and group_by
Summary information about the whole dataframe, such as a mean, is very useful for data analysis. However, things can get more complex when combining several aggregations and groups. We'll start with the simple case: finding a summary statistic for the whole dataset.
For example, find the average age in the whole demo dataframe:
```{r}
summarize(final_demo,average_age = mean(age,na.rm = TRUE))
```
It gets more complex when we change the unit of analysis to groups, i.e., when we compute summary statistics grouped by one or more variables. To do so, we first convert the dataframe into a grouped dataframe using group_by() and then apply summarize() to the grouped dataframe.
For example, find the average age in the demo dataframe per gender:
```{r}
by_gender <- group_by(final_demo,gender)
summarize(by_gender, average_age = mean(age, na.rm = TRUE))
```
#### Functions debunked {-}
[**summarize()**](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/summarise) is the function we use to compute the summary statistics for the whole dataframe - it is housed in the dplyr package. The arguments are as follows:
summarize(**Name of Data Frame**, **Name of Summary Statistic** = **Function()**)
**For example:** `summarize(new_demo, mean(age,na.rm = TRUE))`
[**group_by()**](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/group_by) is the function we use to create a grouped dataframe grouping by one or more variables - it is housed in the dplyr package. The arguments are as follows:
group_by(**Name of Data Frame**, **Name of Variable**)
**For example:** `group_by(new_demo,age)`
**Note:** if the grouping variable contains missing values, group_by() treats NA as its own group.
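For example, here is a short sketch counting records per education level; if `edu` contains any NAs, they appear as their own group in the result:
```{r}
# Each distinct edu value, including NA if present, becomes one row of the summary.
by_edu <- group_by(final_demo, edu)
summarize(by_edu, count = n())
```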
### [Try it yourself 5.6][5.6] {-}
Find the average age in the demo dataframe per education level
#### **DO QUESTION 11 OF THE QUIZ NOW** {-}
Fill in the blanks to answer the 'Try it yourself' section:
a. by_edu <- group_by(`___`)
b. summarize(`___`, average_age = `___`(age, na.rm = TRUE))
### Group_by() extension
group_by() is also useful in conjunction with filter() and mutate().
For example, return the observations belonging to education groups that have more than 2 records.
```{r}
by_edu <-group_by(final_demo,edu)
filter(by_edu,n() > 2)
```
Here's another example: use group_by() with mutate() to compute the difference between each participant's age and the mean age of their gender group.
```{r}
by_gender <-group_by(final_demo,gender)
mutate(by_gender,
diff_age = age - mean(age, na.rm = T))
```
#### Functions debunked {-}
[**n()**](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/n) is the function we use to count the number of observations in the current group - it is housed in the dplyr package. It can only be used inside summarize(), filter(), and mutate(), and it takes **no** arguments.
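As a small sketch, n() can also be used inside mutate() to attach each group's size to every row of that group:
```{r}
# group_size holds the number of records in the row's gender group.
by_gender <- group_by(final_demo, gender)
mutate(by_gender, group_size = n())
```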
### [Try it yourself 5.7][5.7] {-}
Return the observations belonging to gender groups that have more than 3 records.
### [Try it yourself 5.8][5.8] {-}
Compute the difference between each participant's age and the mean age of their education-level group.
#### **DO QUESTIONS 12 & 13 OF THE QUIZ NOW** {-}
Fill in the blanks to answer the 'Try it yourself' sections:
a. by_gender <- group_by(final_demo, gender)
* filter(by_gender, `___`)
b. by_edu <- group_by(final_demo, edu)
* mutate(by_edu, `___`)
## Pipe
You may have noticed that when doing multi-step operations, we need to assign the output to a new variable every time a step is done. This becomes more annoying as the number of steps grows. The pipe operator [**%>%**](https://www.rdocumentation.org/packages/magrittr/versions/2.0.1/topics/magrittr-package) - which is housed in the magrittr package - saves us from creating many unnecessary variables: it takes the output of one function and passes it as input to the following function.
Here's an example without the pipe: we want to first keep the observations in the demo dataframe with an age greater than 40 and then create a new variable called born_year, calculated as 2021 - age.
We need to do this in two steps. The first step is to filter the dataframe and save the output as a new variable *temp*:
```{r}
temp <- filter(final_demo, age > 40)
temp
```
The second step is to create the new variable:
```{r}
temp <- mutate(temp, born_year = 2021 - age)
temp
```
The intermediate variable *temp* can be avoided by using the pipe operator:
```{r}
final_demo %>%
filter(age > 40) %>%
mutate(born_year = 2021 - age)
```
### [Try it yourself 5.9][5.9] {-}
Re-write the following code using the pipe operator:
```{r}
# temp <- filter(final_bpx, systolic > 120)
# temp <- mutate(temp, bpt_min = bpt_sec/60)
# temp
```
#### **DO QUESTION 14 OF THE QUIZ NOW** {-}
Fill in the blanks to answer the 'Try it yourself' section:
* `___` %>%
* filter(`___`,systolic > 120) %>%
* mutate(`___`,bpt_min = bpt_sec/60)
## Summary of dealing with missing values
Dealing with missing values can be complex. You need to be careful because different functions deal with missing values in different ways.
filter(): excludes missing values. If you do want to retain missing values, use **is.na()** explicitly.
arrange(): retains missing values and sorts them at the end
select(): retains missing values
mutate(): retains missing values
summarize(): retains missing values. If you want to remove missing values, set **na.rm = TRUE** in aggregation functions.
group_by(): treats missing values as a group
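Putting these rules together, here is a brief sketch of one piped workflow on `final_demo`: the explicit `is.na()` condition keeps rows with a missing education level, group_by() then puts those rows in their own NA group, and `na.rm = TRUE` makes the mean ignore any missing ages:
```{r}
final_demo %>%
  filter(is.na(edu) | age >= 18) %>%   # keep rows even when edu is missing
  group_by(edu) %>%                    # NA (if any) forms its own group
  summarize(average_age = mean(age, na.rm = TRUE))
```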
## Alternatives to nhanesTranslate()
### `case_when()`
`case_when()` is another useful function that will aid our data analysis. It is typically used together with `mutate()` and changes the actual values of a variable. Here is an example of how `case_when()` translates values (a sketch that combines it with `mutate()` follows the list below):
```{r}
x <- c(97, 36, 55, 50, 49, 65, 46, 87, 100)
```
```{r}
case_when(x < 50 ~ "Fail",
x > 65 ~ "Pass",
TRUE ~ "Retake recommended")
```
In the example above, we have a vector of student grades. Using `case_when()`, we have translated each grade into "Fail", "Pass", or "Retake recommended":
* **Fail** if students receive a grade of less than 50,
* **Pass** if students receive a grade of greater than 65,
* **Retake recommended** if students receive a grade between 50 and 65.
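As promised above, here is a minimal sketch (toy data, not NHANES) of the same grading rule applied inside `mutate()`, so the translated result is stored as a new column of a data frame:
```{r}
# Reuse the grade vector x from above inside a small data frame.
grades <- data.frame(student = 1:9, score = x)
mutate(grades,
       result = case_when(score < 50 ~ "Fail",
                          score > 65 ~ "Pass",
                          TRUE ~ "Retake recommended"))
```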
## Translating NHANES using `case_when()`
Another general method for us to translate our NHANES data into conventional language is to use `mutate()` and `case_when()`! However, this method requires us to know what each numerical value of each variable means. For example, if we want to translate the RIDRETH3 values, then we need to check the [Codebook and Frequencies](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm) for the following translations:
* 1: "Mexican American"
* 2: "Other Hispanic"
* 3: "Non-Hispanic White"
* 4: "Non-Hispanic Black"
* 6: "Non-Hispanic Asian"
* 7: "Other Race - Including Multi-Racial"
* .: "Missing"
Knowing this, we can use `mutate()` and `case_when()` like so:
```{r}
translated_demo <- demo %>%
mutate(Race = case_when(
RIDRETH3 == 1 ~ "Mexican American",
RIDRETH3 == 2 ~ "Other Hispanic",
RIDRETH3 == 3 ~ "Non-Hispanic White",
RIDRETH3 == 4 ~ "Non-Hispanic Black",
RIDRETH3 == 6 ~ "Non-Hispanic Asian",
RIDRETH3 == 7 ~ "Other Race - Including Multi-Racial",
RIDRETH3 == "." ~ "Missing"
)) %>%
select(ID = SEQN, Race)
```
```{r}
head(translated_demo, 10)
```
But what if we want fewer categories or want to combine some of the categories? What do we do then?
`case_when()` is still the way to go! In this case, our code should look like this:
```{r}
less_categories <- demo %>%
mutate(Race = case_when(
RIDRETH3 == 1 ~ "Hispanic",
RIDRETH3 == 2 ~ "Hispanic",
RIDRETH3 == 3 ~ "White",
RIDRETH3 == 4 ~ "Black",
RIDRETH3 == 6 ~ "Asian",
RIDRETH3 == 7 ~ "Other",
RIDRETH3 == "." ~ "Missing"
)) %>%
select(ID = SEQN, Race)
```
```{r}
head(less_categories, 10)
```
#### Functions debunked {-}
**[case_when()](https://dplyr.tidyverse.org/reference/case_when.html)** is a function that we use to change or translate the values of our variables into something else that is still meaningful. The arguments are as follows:
case_when(**A Variable** == **Value as it is written in the original dataset** ~ **"New/Translated Value"**
)
Note that if none of the cases match, then R will automatically regard it as a missing (NA) value.
**For example:** `case_when(x < 50 ~ "Fail", x > 65 ~ "Pass", TRUE ~ "Retake recommended")`
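A quick sketch of that fall-through behaviour: without the final `TRUE ~ ...` catch-all, grades between 50 and 65 match no condition and become NA:
```{r}
# 55, 50, and 65 satisfy neither condition, so they come back as NA.
case_when(x < 50 ~ "Fail",
          x > 65 ~ "Pass")
```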
## `recode()` from dplyr
Another alternative is `recode()` from dplyr. This function works in a similar way to `case_when()`, except you only need to name the variable once, like so:
```{r}
# demo_translate2 <- demo %>%
# mutate(Race = dplyr::recode(RIDRETH3,
# `1` = "Hispanic",
# `2` = "Hispanic",
# `3` = "White",
# `4` = "Black",
# `6` = "Asian",
# `7` = "Other",
# .default = "Missing"
# )) %>%
# select(ID = SEQN, Race)
```
```{r}
# head(demo_translate2, 10)
```
## `recode()` from car
Another option is to use `recode()` from the car package. This function is mostly used for translating numerical data into more meaningful character data or strings.
```{r}
library(car)
```
While looking at the code below, take note of the two types of quotation marks, single ('') and double (""), and of the semi-colon (;) used to separate the recoding rules. Note also how we can use `1:2` instead of listing 1 and 2 separately, as we did with the earlier functions.
```{r}
demo_translate3 <- demo %>%
mutate(Race = car::recode(RIDRETH3,
"1:2 = 'Hispanic';
3 = 'White';
4 = 'Black';
6 = 'Asian';
7 = 'Other';
else = 'Missing'"
)) %>%
select(ID = SEQN, Race)
```
```{r}
head(demo_translate3, 10)
```
## Summary and Takeaways
By the end of this tutorial, you should be familiar with the dplyr package and be able to do basic data wrangling by yourself.
For more study materials on the dplyr package, check out this [textbook](https://learning.oreilly.com/library/view/r-for-data/9781491910382/ch03.html#transform).