-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path01-practical_data_dict.Rmd
234 lines (172 loc) · 8.5 KB
/
01-practical_data_dict.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
# Data dictionaries {#data-dict}
## Objectives
Learn to:
- prepare an OCA-style data dictionary, starting from either a raw dataset or a ODK/Kobo dictionary
- compare datasets and corresponding data dictionaries to ensure completeness and validity
## Background
Documenting the variables in a dataset is a crucial part of data management and
ensures that a dataset is interpretable by researchers who were not directly
involved in study design or data collection. Whereas data files will generally
contain codenames for variables (e.g. `patagegrp` for “Patient’s age group”),
and sometimes also coded data values, the data dictionary describes each
variable and the set of possible values in plain language that is more broadly
interpretable.
Data collection platforms such as Kobo, REDCap, and OpenClinica have their own
specialized data dictionary format. To facilitate dataset and data dictionary
validation, the OCA Data Sharing Platform uses its own standardized dictionary
format, defined in the next section.
## Dictionary format
Every dataset shared or archived on the OCA platform must have an accompanying
data dictionary that includes, at a minimum, the fields described below. Each
variable in the dataset (i.e. each column) must have a dictionary entry for each
of the required fields.
| Required field | Description | Example entry |
| -------------- | ----------- | ------------- |
| variable_name | Variable name (i.e. exact column name within the corresponding dataset) | "sample_type" |
| short_label | Short phrase describing the variable in words | "Type of laboratory sample collected" |
| type | Variable type (options: "Numeric", "Date", "Time", "Datetime", "Coded list", or "Free text") | "Coded list" |
| choices | The list of options (pairs of codes and labels) corresponding to a variable of type "Coded list". | "1, Blood \| 2, Nasal swab \| 3, Throat swab \| 4, Other" |
| origin | Was the variable a part of the original data collection instrument (option "original"), or was it later derived (option "derived") | "original" |
| status | Is the variable shared (option "shared"), or withheld to reduce disclosure risk (option "withheld") ? | "shared" |
## The `datadict` package
The R package [`datadict`](https://github.com/epicentre-msf/datadict) contains a
variety of functions to aid in the preparation of an OCA-style data dictionary:
- `dict_from_data()`: prepare a dictionary template from a raw dataset
- `dict_from_odk()`: prepare a dictionary template from an ODK/Kobo dictionary
- `dict_from_redcap()`: prepare a dictionary template from a REDCap dictionary
- `valid_dict()`: verify that a dictionary is consistent with the OCA format
- `valid_data()`: verify that a dataset corresponds to its associated data dictionary
Note that dictionary templates produces by the `dict_from_` functions may still
require further processing by the user (e.g. with additional R scripts, or by
hand in Excel).
## Exercises
This repository includes an example dataset based on a mortality survey, and
corresponding ODK data dictionary (see section [1.1 Setup](#setup) for data
download links). Load the dataset and data dictionary using the example code
below (note you may need to modify the file path, depending on how you setup
your project), and work through the following exercises using functions from the
`datadict` package where possible.
```{r, message=FALSE}
library(rio)
library(here)
# import dataset
dat <- rio::import(here("data/mortality_survey_simple_data.xlsx"), setclass = "tbl")
# import ODK dictionary (note the main dictionary and multiple-choice options are in separate sheets)
odk_survey <- rio::import(here("data/mortality_survey_simple_kobo.xlsx"), sheet = "survey", setclass = "tbl")
odk_choices <- rio::import(here("data/mortality_survey_simple_kobo.xlsx"), sheet = "choices", setclass = "tbl")
```
### Exercise 1
With the `datadict` package we can prepare a dictionary template *either* from
the raw dataset (using function `dict_from_data()`) or from the ODK dictionary
(using function `dict_from_odk()`). Try both approaches separately and compare
the resulting dictionary templates. What are some differences? Can you guess why
these differences are occurring?
<details style="color:#b30000">
<summary><b>See proposed answer</b></summary>
```{r}
dict_dat <- datadict::dict_from_data(dat)
dict_odk <- datadict::dict_from_odk(odk_survey, odk_choices)
if (!dir.exists(here("output"))) dir.create(here("output"))
rio::export(dict_dat, here("output/ex1_dict_dat.xlsx"))
rio::export(dict_odk, here("output/ex1_dict_odk.xlsx"))
```
- Function `dict_from_data()` doesn't populate column `short_label`, whereas
`dict_from_odk()` does
- Variable types in column `type` are different. Function `dict_from_data()`
guesses only 'Free text' and 'Coded list', whereas `dict_from_odk()` also
returns 'Date' and 'Numeric'
</details>
### Exercise 2
When producing a dictionary template using `dict_from_data()`, the variable type
is determined by the class of the original column (e.g. character, numeric,
Date). The column classes that are read in by e.g. `rio::import()` might not
always correspond to the variable types that we have in mind (e.g. numbers,
dates, and times are sometimes read in as class "character"). Where necessary,
transform the columns of `dat` using functions like `as.numeric()` or
`as.Date()` and then produce another dictionary template using
`dict_from_data()`. What are the differences that remain between this dictionary
template and the template derived from the ODK dictionary?
<details style="color:#b30000">
<summary><b>See proposed answer</b></summary>
```{r}
vars_date <- c(
"date",
"date_arrived",
"date_departed",
"date_born",
"date_died"
)
vars_numeric <- c(
"cluster",
"age_months",
"age_years",
"muac"
)
for (j in vars_date) {
dat[[j]] <- as.Date(dat[[j]])
}
for (j in vars_numeric) {
dat[[j]] <- as.numeric(dat[[j]])
}
dict_dat <- datadict::dict_from_data(dat)
rio::export(dict_dat, here("output/ex2_dict_dat.xlsx"))
```
- Function `dict_from_data()` classifies some variables as "Coded list" that
`dict_from_odk()` classifies as "Free text" (`source_water_other`,
`ilness_other`, `cause_death_other`)
- Various differences in column `choices` (order varies, some entries missing
with `dict_from_data()`, labels same but values differ)
- Function `dict_from_odk()` retains extra columns from ODK data dictionary
(e.g. `constrain`, `relevant`, etc.)
</details>
### Exercise 3
Examine the options for the 'Coded list' type variables in the dictionary
produced in exercise 2, and compare these to the corresponding options produced
by `dict_from_odk()`. Why does the dictionary produced by `dict_from_data()`
have fewer 'Coded list' options for some variables? Does this matter in terms of
data sharing?
Hint: check out the function `datadict::coded_options()` to extract a long-form
table of Coded list variables and corresponding options from a dictionary.
<details style="color:#b30000">
<summary><b>See proposed answer</b></summary>
```{r}
opts_dat <- datadict::coded_options(dict_dat)
opts_odk <- datadict::coded_options(dict_odk)
rio::export(opts_dat, here("output/ex3_opts_dat.xlsx"))
rio::export(opts_odk, here("output/ex3_opts_odk.xlsx"))
```
- Function `dict_from_data()` only includes options that appear in the dataset,
whereas `dict_from_odk()` includes all options from original ODK dictionary
</details>
### Exercise 4
Use the function `valid_dict()` to check that the dictionary you produced in
Exercise 2 complies with the OCA standard. Assuming it does, edit the dictionary
so that it fails at least two of the checks implemented by `valid_dict()`.
<details style="color:#b30000">
<summary><b>See proposed answer</b></summary>
```{r, error=TRUE}
datadict::valid_dict(dict_dat)
dict_dat_fail1 <- dict_dat
dict_dat_fail1$variable_name[4] <- NA_character_
datadict::valid_dict(dict_dat_fail1)
dict_dat_fail2 <- dict_dat
dict_dat_fail2$type <- "Number"
datadict::valid_dict(dict_dat_fail2)
```
</details>
### Exercise 5
Use the function `valid_data()` to check for consistency between the dataset and
dictionary produced in Exercise 2. Assuming all checks pass, edit the dataset so
that it fails at least two of the checks implemented by `valid_data()`.
<details style="color:#b30000">
<summary><b>See proposed answer</b></summary>
```{r}
datadict::valid_data(dat, dict_dat)
# add nonvalid date
dat$date <- as.character(dat$date)
dat$date[4] <- "July ?, 2021"
# remove column present in dictionary
dat$oedema <- NULL
datadict::valid_data(dat, dict_dat)
```
</details>