# Data Summary with tableone
<iframe width="560" height="315" src="https://www.youtube.com/embed/NvRjjMxTUyU" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
## Instructions
In this tutorial, we will explore how to summarize all the variables of a dataset in one single table. We will familiarize ourselves with the R package tableone and its associated functions. This tutorial will show you how to be more efficient in analyzing data in R.
Accompanying this tutorial is **a short [Google quiz](https://forms.gle/vDvUJhHDAuKHVNt98)** for your own self-assessment. The instructions of this tutorial will clearly indicate when you should answer which question.
## Learning Objectives
* Understand the basics of the tableone package and its applications.
* Efficiently summarize whole datasets into one single table.
* Be familiar with the function `CreateTableOne()` and a few of its basic arguments.
* Know how to tell tableone which variables are continuous and which variables are categorical.
* Be familiar with different `print()` arguments to customize a tableone.
## Set Up
For this tutorial, the main package that we will be working with is the **tableone** package. We will also need the dplyr package for a few basic functions and data from the nhanesA package. Let's go ahead and load them in our session!
```{r}
#install.packages("tableone")
library(tableone)
#install.packages("dplyr")
library(dplyr)
#install.packages("nhanesA")
library(nhanesA)
```
Alright, so we are going back to the NHANES dataset for this tutorial. Let's, once again, download the "DEMO_H" dataset and save it in an object called "demo_original".
```{r, cache=TRUE}
demo_original <- nhanes("DEMO_H")
```
Just a reminder to everyone that this is what our raw dataset looks like.
```{r}
head(demo_original)
```
As we can see, the data is quite overwhelming! Let's only select a few familiar variables to make the summary a bit more manageable and comprehensible.
```{r}
demo <- select(demo_original,
c("RIAGENDR", # Gender
"RIDAGEYR", # Age
"RIDRETH3", # Race
"DMDEDUC2") # Education
)
```
```{r}
head(demo)
```
Awesome, our data is looking much better now!
We have learned how to analyze it with dplyr and visualize it with ggplot. But in this tutorial, we are going to learn how to summarize the data in this large dataset into one simple table.
## What is tableone?
[tableone](http://rstudio-pubs-static.s3.amazonaws.com/13321_da314633db924dc78986a850813a50d5.html) is an R package that helps us construct "Table 1", or the baseline table that we see in biomedical research papers. This package gives us access to a lot of useful data summary functions that we can use to summarize both categorical and continuous data. In addition, we can also identify normal and nonnormal variables so that R can summarize them more appropriately.
tableone is unique in that it is very simple and easy to use. A single function can do a tremendous amount of data summary, as we will see in the later sections of this tutorial.
#### DO QUESTIONS 1 & 2 OF THE QUIZ NOW {-}
* tableone is part of the tidyverse core. (True or False)
* What sort of data can tableone summarize? (Select all that apply)
## Creating a tableone
### CreateTableOne
The simplest way to use tableone is to call the function `CreateTableOne()` with our dataset passed to the `data` argument inside the `()` like so:
```{r}
CreateTableOne(data = demo)
```
As we can see in the output above, this function has cleanly summarized all of our data into one table. It gives us how many records there are in the dataset (n), as well as the mean and standard deviation of all of our variables!
It looks pretty neat right now, but recall that the variables **RIAGENDR (Gender), RIDRETH3 (Race), and DMDEDUC2 (Education)** are all categorical! Only RIDAGEYR (Age) is continuous. So it does not make any sense to have a mean for these three variables at all.
But do not worry at all! There are actually several ways that we can solve this problem:
1. The first solution is to use `nhanesTranslate()`, which replaces the numeric codes with category labels so that tableone treats these variables as categorical, and
2. The second solution is to use the `factorVars` argument in `CreateTableOne()` to identify categorical variables.
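A quick way to see why this happens is to check how R stores each column. The sketch below uses a small made-up data frame (the name `toy` and its values are assumptions for illustration, not NHANES data):

```{r}
# Toy stand-in for the coded NHANES columns (values are made up)
toy <- data.frame(
  Gender    = c(1, 2, 2, 1, 2),       # numeric codes, not labels
  Age       = c(34, 51, 27, 68, 45),  # a true measurement
  Education = c(3, 5, 4, 1, 4)        # numeric codes, not labels
)

# Every column is stored as a number, so a summary function cannot tell
# a coded category apart from a continuous measurement
col_classes <- sapply(toy, class)
col_classes

# R will happily compute a "mean gender", even though it is meaningless
mean(toy$Gender)
```

Because all three columns are numeric, any function that summarizes by type will report a mean and standard deviation for each of them unless we tell it otherwise.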
### Solution 1: nhanesTranslate & CreateTableOne
First, let's translate all of our variables using the `nhanesTranslate()` function that we have learned in previous tutorials like so.
```{r}
demo_translate <- nhanesTranslate("DEMO_H",
c("RIAGENDR",
"RIDAGEYR",
"RIDRETH3",
"DMDEDUC2"),
data = demo)
```
After that, for ease of communication, let's also change the column names to something that we can all understand.
```{r}
names(demo_translate) <- c("Gender", "Age", "Race", "Education")
```
### [Try it yourself 8.1][8.1] {-}
**Challenge**: Why do you think we need to change the names of our variables AFTER we translate them?
**Hint**: Think about the `data = demo` argument in `nhanesTranslate()`
Now, this is what our dataset should look like. Look familiar?
```{r}
head(demo_translate)
```
This table should look exactly like the one that you have seen in previous tutorials! The only difference here is that, **in this tutorial, we are using and summarizing the ENTIRE dataset!** We will not be scaling down to only analyzing or visualizing the first or last few rows!
Now if we use the `CreateTableOne()` function again but on our new `demo_translate` object, we should be able to see a quite different table.
```{r}
(tab_nhanes <- CreateTableOne(data = demo_translate))
```
The count of records (n) is still there and we are still provided with the mean and standard deviation of participants' age. However, instead of a single mean and standard deviation for gender, race, and education, we now have all of the categories of these variables fleshed out. In addition, we are also given the count and percentage of each category!
You may have also noticed that "Female" is the only gender that is shown in this table. This is because this variable only has two levels: Female and Male. For this reason, we can infer the count and percentage of the other category just based on the one that tableone gives us. There is a way that we can force tableone to show all categories of a variable. We will cover this in a later section of this tutorial.
#### DO QUESTIONS 3 & 4 OF THE QUIZ NOW {-}
* What kind of information is summarized when the data is continuous?
* What kind of information is summarized when the data is categorical?
### Solution 2: Identify Numerical Categorical Data
Before we hop to this second solution, again, let's rename all of our variables to something more comprehensible so that everything is easier to understand. In this subsection, however, we will be renaming our `demo` dataset, instead of the `demo_translate` dataset that we renamed earlier.
```{r}
names(demo) <- c("Gender", "Age", "Race", "Education")
```
Okay, now we are ready to go! Note that this second solution is more transferrable and will work for datasets that do not come from NHANES.
The second way that we can help tableone know which variable is categorical is by telling it directly using the argument `factorVars`. `factorVars` is especially useful for identifying numerical categorical data like the ones that we have.
Coupled with `factorVars` is also `vars`. `vars` is used to select which variables we want to keep in our tableone. Combining what we have learned about `CreateTableOne()` so far with `factorVars` and `vars`, this is what our function with clearly identified numerical categorical data should look like:
```{r}
CreateTableOne(data = demo,
vars = c("Gender", "Age", "Race", "Education"),
factorVars = c("Gender", "Race", "Education")
)
```
As we can see, the tableone that we just created should look similar to the table that we created above. The only difference is that because we did not use `nhanesTranslate()`, all of the categories in our categorical variables are shown as numbers. This will not be an issue if we know which number corresponds to which gender, race, or education level of the participants. Other than that, the counts and percentages of these categorical variables should be identical.
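If we do know what the codes mean, we can also attach the labels ourselves with base R's `factor()`. Here is a minimal sketch on a made-up vector of codes (the data frame `toy` is an assumption for illustration; in NHANES, RIAGENDR uses 1 for Male and 2 for Female):

```{r}
# Toy column of gender codes (values are made up)
toy <- data.frame(Gender = c(1, 2, 2, 1, 2))

# Attach labels to the codes: 1 = Male, 2 = Female
toy$Gender <- factor(toy$Gender,
                     levels = c(1, 2),
                     labels = c("Male", "Female"))

# The column is now categorical and is counted by label
gender_counts <- table(toy$Gender)
gender_counts
```

A data frame labelled this way can be passed to `CreateTableOne()` without `factorVars`, because factor columns are already treated as categorical.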
If the amount of vectors `c()` and strings in the code above is a bit confusing and hard on our eyes, we can also define `factorVars` and `vars` before inputting them into `CreateTableOne()` like so:
```{r}
vars <- c("Gender", "Age", "Race", "Education")
```
```{r}
factorVars <- c("Gender", "Race", "Education")
```
```{r}
CreateTableOne(data = demo,
vars = vars,
factorVars = factorVars
)
```
We should be able to see that both tables in this subsection are identical!
### [Try it yourself 8.2][8.2] {-}
Create a tableone without the `vars` argument. What do you see?
Do you think the `vars` argument is necessary in our case? If not, in what situation(s) do you think it would be necessary?
## Other Arguments to Customize tableone
There are other arguments of `CreateTableOne()` and `print()` that we can use to customize and adjust our tableone!
### Show All Levels
Recall how our Gender variable only shows the "Female" category. If we want both categories "Female" and "Male" to be shown, we can add `showAllLevels = TRUE` to our `print()` function like so:
```{r}
print(tab_nhanes,
showAllLevels = TRUE)
```
Another way that we can show both Male and Female is to use `cramVars`. But this argument only works on 2-level variables (i.e. variables with only 2 categories) because all categories will be placed in the same row.
```{r}
print(tab_nhanes,
cramVars = "Gender")
```
#### DO QUESTION 5 OF THE QUIZ NOW {-}
* What is the difference between `showAllLevels` and `cramVars`?
### Nonnormal
Right now, our tableones assume that the data of all of our continuous variables are normal, but what if our data is not normal?
If we know that some or all of our continuous variables are not normal, we can tell R this by using the `nonnormal` argument of `print()`. For example, if our Age variable is nonnormal, then:
```{r}
print(tab_nhanes,
showAllLevels = TRUE,
nonnormal = "Age"
)
```
In the table above, we can see that instead of the usual mean and standard deviation, we are provided with the median and interquartile range (IQR) for our nonnormal Age variable!
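To get a feel for the two kinds of summaries, we can compute them by hand in base R. This is only a sketch on a made-up, right-skewed vector (the name `age` and its values are assumptions, and using the default `quantile()` settings to mirror tableone's IQR is also an assumption):

```{r}
# Made-up, right-skewed ages (not NHANES data)
age <- c(2, 5, 9, 14, 21, 33, 47, 58, 80)

# Summary used for normal variables: mean and standard deviation
mean(age)
sd(age)

# Summary used for nonnormal variables: median and the
# 25th and 75th percentiles (the bounds of the interquartile range)
age_median <- median(age)
age_iqr    <- quantile(age, probs = c(0.25, 0.75))
age_median
age_iqr
```

When the data are skewed, the mean is pulled toward the long tail while the median is not, which is why the median and IQR are preferred for nonnormal variables.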
### [Try it yourself 8.3][8.3] {-}
How do you know if a variable is nonnormal? Try calling the function `summary()` on the tableone object (for example, `summary(tab_nhanes)`) and look at the number under skew. How do you decide if something is normal or nonnormal? Is the decision to make “Age” nonnormal accurate?
#### DO QUESTION 6 OF THE QUIZ NOW {-}
* The decision to make “Age” nonnormal is accurate. (True or False)
### Show Categorical or Continuous Variables Only
We also have the option to create tableones with only categorical or only continuous variables.
```{r}
## Categorical variables only
tab_nhanes$CatTable
```
```{r}
## Continuous variables only
print(tab_nhanes$ContTable, nonnormal = "Age")
```
### Strata
In a way, strata is like the function `group_by()` in dplyr or facets in ggplot. It groups data together into groups or "strata" and then summarizes each group individually.
Note that while `showAllLevels` and `nonnormal` are arguments of the function `print()`, `strata` is an argument of the function `CreateTableOne()`.
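For intuition, here is roughly what stratifying a continuous variable looks like when written by hand with dplyr. This is a sketch on a made-up data frame (`toy` and its values are assumptions for illustration), not what tableone runs internally:

```{r}
library(dplyr)

# Made-up data (not NHANES)
toy <- data.frame(
  Gender = c("Male", "Female", "Female", "Male", "Female", "Male"),
  Age    = c(34, 51, 27, 68, 45, 40)
)

# One row of summary statistics per group, similar in spirit to one
# column per stratum in a stratified tableone
by_gender <- toy %>%
  group_by(Gender) %>%
  summarise(n = n(), mean_age = mean(Age), sd_age = sd(Age))
by_gender
```

The `strata` argument does this kind of grouping for every variable in the table at once, and adds the group comparison tests on top.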
For example, if we want to separate our data summary by Gender, we would need to write a code like so:
```{r}
strata <- CreateTableOne(data = demo_translate,
vars = c("Age", "Race", "Education"), ## Note that Gender is not included because we already have strata = Gender
factorVars = c("Race","Education"), ## Again, Gender is not included because it is in the strata argument
strata = "Gender"
)
```
```{r}
print(strata,
nonnormal = "Age")
```
Let's unpack this table together. Firstly, we have the usual mean and standard deviation OR median and IQR for each category of each variable. Except now, we can see that all of the variables and their categories are summarized by or stratified by Gender.
Second, the table now has two extra columns: p (the p-value) and test. Depending on the width of the output, these columns may wrap below the main table. They appear whenever we have stratified our data into two or more groups for comparison. The default test for categorical variables is `chisq.test()`, and the default for continuous variables is `oneway.test()` with equal variances assumed (regular ANOVA). For continuous variables that we mark as nonnormal in `print()`, tableone uses `kruskal.test()` instead and flags this with the word "nonnorm" under "test", as it does for Age in the table above.
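These three tests are ordinary base R functions, so we can also run them directly. The sketch below uses a made-up data frame (`toy`, its values, and the two education labels are all assumptions for illustration), so the p-values mean nothing beyond showing the mechanics:

```{r}
# Made-up data: 20 males and 20 females (not NHANES)
toy <- data.frame(
  Gender    = factor(rep(c("Male", "Female"), each = 20)),
  Age       = c(seq(20, 58, by = 2), seq(25, 63, by = 2)),
  Education = rep(c("College", "No college", "College", "No college"),
                  times = c(12, 8, 7, 13))
)

# Continuous variable, treated as normal: one-way ANOVA with equal variances
anova_p <- oneway.test(Age ~ Gender, data = toy, var.equal = TRUE)$p.value

# Continuous variable, treated as nonnormal: Kruskal-Wallis rank sum test
kruskal_p <- kruskal.test(Age ~ Gender, data = toy)$p.value

# Categorical variable: chi-squared test on the cross-tabulation
chisq_p <- chisq.test(table(toy$Education, toy$Gender))$p.value

c(anova = anova_p, kruskal = kruskal_p, chisq = chisq_p)
```

Running the tests by hand like this can be a useful sanity check when a p-value in a tableone looks surprising.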
### [Try it yourself 8.4][8.4] {-}
Create a tableone using the `demo_translate` dataset. Keep all variables and stratify the data using “Age”. What do you see? Do you think this is a helpful tableone?
#### DO QUESTION 7 OF THE QUIZ NOW {-}
* Which of the following is the least appropriate to stratify our dataset by?
## Export tableone
Finally, let's export our tableone!
Recall that we can use the function `write.csv()` to export data from R to a csv file. But before we can use this function, we first need to save the table into an object using `print()` like so:
```{r}
tab_csv <- print(strata,
nonnormal = "Age",
printToggle = FALSE)
```
#### DO QUESTION 8 OF THE QUIZ NOW {-}
* What does the argument `printToggle = FALSE` do?
Now we can use our `write.csv()` function like normal.
```{r}
dir.create("data", showWarnings = FALSE) # make sure the output folder exists
write.csv(tab_csv, file = "data/NHANES_Summary.csv")
```
Tada! Now our table is saved as a csv file in the data folder of our working directory!
```{r}
dir("data")
```
```
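If we want to double check an export, we can read the file back into R. The sketch below does a round trip with a tiny made-up table written to a temporary file (the object names, the file name, and the numbers in the table are all assumptions for illustration):

```{r}
# A tiny made-up table shaped like a printed tableone: text cells,
# with the variable names stored as row names
toy_tab <- matrix(c("40", "41.0 (12.5)"),
                  ncol = 1,
                  dimnames = list(c("n", "Age (mean (SD))"), "Overall"))

# Write to a temporary file so nothing in the project is overwritten
out_file <- file.path(tempdir(), "toy_summary.csv")
write.csv(toy_tab, file = out_file)

# Read it back, using the first column as row names
toy_back <- read.csv(out_file, row.names = 1)
toy_back
```

The same `read.csv(..., row.names = 1)` call can be pointed at the exported NHANES summary file to check that it looks the way we expect.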
#### DO QUESTIONS 9 & 10 OF THE QUIZ NOW {-}
* Which of the following arguments can be nested in `CreateTableOne()`?
* Which of the following arguments can be nested in `print()`?
## Alternatives to tableone
Data summary is one of the many applications that R specializes in. That said, there are multiple other R packages that also do data summary aside from tableone. We will not go over any of these packages in detail, but know that each package has its own strengths and so is best used in different situations.
Here are the other data summary packages and their main data summary functions:
### base R
In base R, we have `summary()` and `by()`:
```{r}
summary(demo_translate)
```
```{r}
by(demo_translate, demo_translate$Gender, summary)
```
### Hmisc
In Hmisc, we have `describe()`:
```{r}
#install.packages("Hmisc")
library(Hmisc)
```
```{r}
describe(demo_translate)
```
### psych
In psych, we have `describe()` and `describeBy()`. Note how the categorical variables are marked with an asterisk (*).
```{r}
#install.packages("psych")
library(psych)
```
```{r}
describe(demo_translate)
```
```{r}
describeBy(demo_translate, demo_translate$Gender)
```
### desctable
In desctable, we have `desctable()`:
```{r}
# install.packages("desctable")
library(desctable)
```
```{r}
desctable(demo_translate)
```
### skimr
In skimr, we have `skim()`:
```{r}
#install.packages("skimr")
library(skimr)
```
```{r}
skim(demo_translate)
```
## Summary and Takeaways
Congratulations on finishing tutorial 7 on Data Summary with tableone! After this tutorial, you should be familiar with the R package tableone as well as the function `CreateTableOne()`. In addition, you should also be familiar with the different arguments of `print()` to customize your own tableone.
There are a lot more powerful functions in the tableone package. You are free to explore them on your own using [this document](https://cran.r-project.org/web/packages/tableone/vignettes/introduction.html).