---
title: "Assignment_0"
author: "Thomas Steinthal"
date: "2022-09-08"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
pacman::p_load(tidyverse,
janitor)
```
Load the three data sets after downloading them from Dropbox and saving them in your working directory:
* Demographic data for the participants: https://www.dropbox.com/s/w15pou9wstgc8fe/demo_train.csv?dl=0
* Length of utterance data: https://www.dropbox.com/s/usyauqm37a76of6/LU_train.csv?dl=0
* Word data: https://www.dropbox.com/s/8ng1civpl2aux58/token_train.csv?dl=0
```{r}
demo<-read_csv("demo_train.csv")
LU<-read_csv("LU_train.csv")
token<-read_csv("token_train.csv")
```
Explore the 3 data sets (e.g. visualize them, summarize them, etc.). You will see that the data is messy, since the psychologist collected the demographic data, the linguist analyzed the length of utterance in May 2014 and the fumbling jack-of-all-trades analyzed the words several months later.
In particular:
- the same variables might have different names (e.g. participant and visit identifiers)
- the same variables might report the values in different ways (e.g. participant and visit IDs)
Welcome to the real world of messy data :-)
Before being able to combine the data sets we need to make sure the relevant variables have the same names and the same kind of values.
```{r}
demo
```
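As a rough sketch of what a first look at a data set could involve, the block below uses a small made-up tibble. The name `toy_demo` and all its values are invented for illustration and are not taken from the real files; the same calls should work on `demo`, `LU` and `token`.

```r
library(dplyr)

# Hypothetical stand-in for one of the data sets; all values are invented
toy_demo <- tibble(
  Child.ID = c("Adam", "Adam", "B.C."),
  Visit    = c(1, 2, 1),
  ADOS     = c(14, NA, 0)
)

names(toy_demo)                     # which variables are there, and how are they spelled?
glimpse(toy_demo)                   # column types and the first few values
summary(toy_demo)                   # ranges for numeric columns
n_nas <- colSums(is.na(toy_demo))   # number of missing values per column
count(toy_demo, Child.ID)           # number of rows (visits) per child
```

Checking names, types and missing values first makes it easier to spot which identifiers need harmonizing before a merge.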
So:
2a. Identify which variable names do not match (that is, are spelled differently) and find a way to transform variable names.
Pay particular attention to the variables indicating participant and visit.
Tip: look through the chapter on data transformation in R for data science (http://r4ds.had.co.nz). Alternatively you can look into the package dplyr (part of tidyverse), or google "how to rename variables in R". Or check the janitor R package. There are always multiple ways of solving any problem and no absolute best method.
```{r}
#ID-rename
demo<-rename(demo, ID = Child.ID)
LU<-rename(LU, ID = SUBJ)
token<-rename(token, ID = SUBJ)
#visit-rename
LU<-rename(LU, Visit = VISIT)
token<-rename(token, Visit = VISIT)
```
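One possible way to spot mismatched names before renaming is to compare the name vectors directly. This is only a sketch in base R; the vectors `demo_names` and `lu_names` are hypothetical stand-ins for `names(demo)` and `names(LU)`.

```r
# Hypothetical column names standing in for names(demo) and names(LU)
demo_names <- c("Child.ID", "Visit", "Age")
lu_names   <- c("SUBJ", "VISIT", "MOT_MLU")

shared    <- intersect(demo_names, lu_names)   # names that already match exactly
only_demo <- setdiff(demo_names, lu_names)     # names found only in the first set

# A case-insensitive comparison catches pairs such as Visit / VISIT
case_only <- intersect(tolower(demo_names), tolower(lu_names))
```

Here nothing matches exactly, but the case-insensitive check reveals that the visit column differs only in capitalization, while the participant identifier (`Child.ID` vs. `SUBJ`) has to be matched by hand.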
2b. Find a way to homogenize the way "visit" is reported (visit1 vs. 1).
Tip: The stringr package is what you need. str_extract() will allow you to extract only the digit (number) from a string, by using the regular expression \\d.
```{r}
demo$Visit<-str_extract(demo$Visit, "\\d")
LU$Visit<-str_extract(LU$Visit, "\\d")
token$Visit<-str_extract(token$Visit, "\\d")
```
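To illustrate how `str_extract()` behaves with `\\d`, here is a small sketch on invented strings (the vector `visits` is made up for the example). Note that `\\d` matches a single digit, so it would truncate visit numbers of 10 or more; `\\d+` is a possible alternative in that case.

```r
library(stringr)

visits <- c("visit1", "Visit2", "3")            # invented examples of mixed formats

one_digit  <- str_extract(visits, "\\d")        # first digit only, returned as character
all_digits <- str_extract("visit12", "\\d+")    # "\\d+" keeps multi-digit numbers together
visit_num  <- as.integer(one_digit)             # convert to integer if a numeric visit is wanted
```

The extracted values are character strings, so the visit columns must have the same type in all data sets (all character or all integer) before merging.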
2c. We also need to make a small adjustment to the content of the Child.ID column in the demographic data. Within this column, names that are not abbreviations do not end with "." (e.g. Adam), whereas they do in the other two data sets (e.g. Adam.). If the content of the two variables isn't identical, the rows will not be merged.
A neat way to solve the problem is simply to remove all "." in all data sets.
Tip: stringr is helpful again. Look up str_replace_all
Tip: You can either have one line of code for each child name that is to be changed (easier, more typing) or specify the pattern that you want to match (more complicated: look up "regular expressions", but less typing)
```{r}
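# Remove every literal "." from the IDs; fixed() stops "." acting as a regex wildcard
demo$ID <- str_replace_all(demo$ID, fixed("."), "")
LU$ID <- str_replace_all(LU$ID, fixed("."), "")
token$ID <- str_replace_all(token$ID, fixed("."), "")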
```
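A common pitfall with this step is that an unescaped "." is a regular expression that matches any character. The sketch below, on invented IDs, contrasts the two patterns; `ids` is a made-up vector, not the real data.

```r
library(stringr)

ids <- c("Adam", "Adam.", "A.B.")   # invented IDs for illustration

# "." as a regex matches ANY character, so every character is deleted
wrong <- str_replace_all(ids, ".", "")

# fixed(".") (or the escaped regex "\\.") matches only literal full stops
right <- str_replace_all(ids, fixed("."), "")
```

With the literal match, "Adam" and "Adam." both become "Adam", which is what the merge needs.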
2d. Now that the nitty-gritty details of the different data sets are fixed, we want to make a subset of each data set containing only the variables that we wish to use in the final data set.
For this we use the tidyverse package dplyr, which contains the function select().
The variables we need are:
* Child.ID,
* Visit,
* Diagnosis,
* Ethnicity,
* Gender,
* Age,
* ADOS,
* MullenRaw,
* ExpressiveLangRaw,
* Socialization
* MOT_MLU,
* CHI_MLU,
* types_MOT,
* types_CHI,
* tokens_MOT,
* tokens_CHI.
Most variables should make sense; here are the less intuitive ones:
* ADOS (Autism Diagnostic Observation Schedule) indicates the severity of the autistic symptoms (the higher the score, the worse the symptoms). Ref: https://link.springer.com/article/10.1023/A:1005592401947
* MLU stands for mean length of utterance (usually a proxy for syntactic complexity)
* types stands for unique words (e.g. even if "doggie" is used 100 times it only counts for 1)
* tokens stands for the total number of words (if "doggie" is used 100 times it counts for 100)
* MullenRaw indicates non verbal IQ, as measured by Mullen Scales of Early Learning (MSEL https://link.springer.com/referenceworkentry/10.1007%2F978-1-4419-1698-3_596)
* ExpressiveLangRaw indicates verbal IQ, as measured by MSEL
* Socialization indicates social interaction skills and social responsiveness, as measured by Vineland (https://cloudfront.ualberta.ca/-/media/ualberta/faculties-and-programs/centres-institutes/community-university-partnership/resources/tools---assessment/vinelandjune-2012.pdf)
Feel free to rename the variables into something you can remember (i.e. nonVerbalIQ, verbalIQ)
```{r}
```
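A minimal sketch of how `select()` could be used for this step, shown on an invented tibble rather than the real data. `toy_demo`, its values and the `Notes` column are assumptions made for the example; `nonVerbalIQ` and `verbalIQ` are the names suggested in the text.

```r
library(dplyr)

# Hypothetical miniature of the demographic data; all values are invented
toy_demo <- tibble(
  ID                = c("Adam", "Adam"),
  Visit             = c("1", "2"),
  MullenRaw         = c(28, 30),
  ExpressiveLangRaw = c(14, 18),
  Notes             = c("x", "y")   # a column we do not need
)

# select() keeps only the listed columns; new_name = old_name renames on the way
toy_sub <- select(toy_demo, ID, Visit,
                  nonVerbalIQ = MullenRaw,
                  verbalIQ    = ExpressiveLangRaw)
```

The same pattern would be repeated for each of the three data sets, listing only the variables each one contributes.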
2e. Finally we are ready to merge all the data sets into just one.
Some things to pay attention to:
* make sure to check that the merge has included all relevant data (e.g. by comparing the number of rows)
* make sure to understand whether (and if so why) there are NAs in the data set (e.g. some measures were not taken at all visits, some recordings were lost or permission to use was withdrawn)
```{r}
```
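One way to merge and then check the result is sketched below on two invented tables (`toy_demo` and `toy_lu` are hypothetical). It uses dplyr's `full_join()`, which is one of several possible join choices, and `anti_join()` to list rows that found no partner.

```r
library(dplyr)

# Hypothetical miniature tables; all values are invented
toy_demo <- tibble(ID    = c("Adam", "Adam", "Bo"),
                   Visit = c("1", "2", "1"),
                   Age   = c(20, 24, 19))
toy_lu   <- tibble(ID      = c("Adam", "Bo"),
                   Visit   = c("1", "1"),
                   CHI_MLU = c(1.2, 1.5))

# full_join() keeps every ID/Visit combination found in either table
merged <- full_join(toy_demo, toy_lu, by = c("ID", "Visit"))
nrow(merged)   # compare with nrow(toy_demo) and nrow(toy_lu)

# anti_join() lists the rows of the first table with no match in the second,
# which helps explain where NAs in the merged data come from
unmatched <- anti_join(toy_demo, toy_lu, by = c("ID", "Visit"))
```

In this toy case the merged table has one row per child and visit, and the single unmatched row (Adam, visit 2) is exactly the row where `CHI_MLU` is NA.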
2f. Only using clinical measures from Visit 1
In order for our models to be useful, we want to minimize the need to actually test children as they develop. In other words, we would like to be able to understand and predict the children's linguistic development after only having tested them once. Therefore we need to make sure that our ADOS, MullenRaw, ExpressiveLangRaw and Socialization variables are reporting (for all visits) only the scores from visit 1.
A possible way to do so:
* create a new data set with only visit 1, child id and the 4 relevant clinical variables to be merged with the old dataset
* rename the clinical variables (e.g. ADOS to ADOS1) and remove the visit (so that the new clinical variables are reported for all 6 visits)
* merge the new data set with the old
```{r}
```
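The three steps listed above can be sketched as follows, on an invented data set with a single clinical variable. `toy` and its values are hypothetical, and `ADOS1` follows the naming suggested in the text; the real data would need the same treatment for all four clinical variables.

```r
library(dplyr)

# Hypothetical merged data; all values are invented
toy <- tibble(ID      = c("Adam", "Adam", "Bo", "Bo"),
              Visit   = c("1", "2", "1", "2"),
              ADOS    = c(14, NA, 0, NA),
              CHI_MLU = c(1.2, 1.6, 1.5, 2.0))

# Steps 1 and 2: keep visit 1 only, rename the clinical variable, drop Visit
visit1 <- toy %>%
  filter(Visit == "1") %>%
  select(ID, ADOS1 = ADOS)

# Step 3: merge back by child only, so the visit 1 score is repeated on every visit
toy_v1 <- left_join(toy, visit1, by = "ID")
```

Because `Visit` is dropped from `visit1`, the join matches on the child alone and each child's visit 1 score fills all of that child's rows.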
2g. Final touches
Now we want to
* anonymise our participants (they are real children!).
* make sure the variables have sensible values. E.g. right now gender is marked 1 and 2, but in two weeks you will not be able to remember which gender was connected to which number, so change the values from 1 and 2 to Female and Male in the gender variable (calling Female F would create issues, since F is also used for FALSE). For the same reason, you should also change the values of Diagnosis from A and B to ASD (autism spectrum disorder) and TD (typically developing). Tip: Try taking a look at ifelse(), or google "how to rename levels in R".
* Save the data set to a CSV file. Hint: look into write.csv()
```{r}
```
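A possible sketch of these final touches on an invented data set. `toy`, its values and the output file name are hypothetical. The mapping 1 = Female, 2 = Male is an assumption read off the order in the text and should be checked against the data documentation before use.

```r
library(dplyr)

# Hypothetical data; all values are invented
toy <- tibble(ID        = c("Adam", "Adam", "Bo"),
              Gender    = c(1, 1, 2),
              Diagnosis = c("A", "A", "B"))

clean <- toy %>%
  mutate(
    # Replace names with integer codes (assigned in alphabetical order of the names)
    ID        = as.integer(factor(ID)),
    # ASSUMPTION: 1 = Female, 2 = Male; verify before applying to the real data
    Gender    = ifelse(Gender == 1, "Female", "Male"),
    Diagnosis = ifelse(Diagnosis == "A", "ASD", "TD")
  )

# Written to a temporary folder here; the file name is a placeholder
out_file <- file.path(tempdir(), "toy_clean.csv")
write.csv(clean, out_file, row.names = FALSE)
```

Integer codes derived from alphabetical order are only a light form of anonymisation; shuffling the codes would make them harder to trace back to the names.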