-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtutorial.Rmd
503 lines (340 loc) · 13.4 KB
/
tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
---
title: "Tutorial"
author: "Alexandre Piche"
date: '2017-03-06'
output:
pdf_document:
toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# McGill wins $84 million grant for neuroscience
Great opportunities for undergraduate students with a background in neuro and knowledge of statistics and statistical software!
https://www.mcgill.ca/newsroom/channels/news/mcgill-wins-84-million-grant-neuroscience-262441
# To Do Before the Tutorial
## Download R
https://cran.r-project.org/
## Download R studio
https://www.rstudio.com/products/rstudio/download/
## Opening RStudio for the First Time
On the right hand side there is the console. It is where we are going to communicate with R by submitting our instructions.
On the left hand side you have the Environment and the History in the top panel. The Environment lists all the variables that you currently have in your work space (i.e. that you can call in the console). History registers all the operations you have sent to R. You can browse it to see your previous commands in the console.
## New Project
I encourage creating a new project for the course as File -> New Project... -> New Directory -> Empty Project -> Directory Name -> R_Tutorial and browse to choose the folder where you want the project to be. It will be easier to manage your project and dataset.
## Rmarkdown
You may export your code and graphs easily to MS word or pdf following these instructions.
File -> New File -> R markdown -> select the output format to Word -> Ok -> Knit
Note that what is written in MS Word is not connected to the console, and vice-versa. You will need to write the command in both places if you want to keep their workspaces the same.
To insert code in your file click on "Insert" at the top of the top left panel.
## Required Packages
We need to install certain packages for today's tutorial. It can be done this way:
```{r}
# install.packages(c("tidyr", "reshape2"))
```
# Tutorial
## Use R for Basic Operations
Typying in the console we can use R as a basic calculator.
```{r}
1+2
4/2
5*(1+2)
5^2
```
We can also assign value to variable,
```{r}
my_variable <- 7
```
and access it by calling the variable name in the console.
```{r}
my_variable
```
## Basic Data Structure
### Vectors
We can create vectors using the "c" command, where "c" stands for combine.
```{r}
x <- c(6,7,8,9,10)
x
x[1]
x[c(2,4)]
x*2
```
```{r}
x <- 1:7
x
```
Vectors in R are more general than their mathematical equivalent. They can hold different type of data e.g. string or level.
```{r}
y <- c("hello", "world")
y
```
However, all elements will be coerce to the same type, they are coerce to string in the following case.
```{r}
y <- c(1, "hello")
y
```
### Dataframe
Usually, we will have to load our own datasets in R using the "Import Dataset" command in the Environment panel, more on that later. For simplicity, we will now experiment on a build-in dataset in R studio named "mtcars". Using the command "head(mtcars)" let us access the first 6 records of the dataset and their names.
```{r}
dim(mtcars)
head(mtcars)
```
We can use vectors to assess different rows and columns.
```{r}
mtcars[1:4, c(1,4)]
```
Let say that we are interested in the median miles per gallon (mpg) in this dataset. We can use the command "median(mtcars\$mpg)", where "mtcars" is the dataset and "mpg" is the column of interest that we accessed with the dollar sign operator "\$".
```{r}
mtcars$mpg
median(mtcars$mpg)
mean(mtcars$mpg)
var(mtcars$mpg)
```
To have more information on a function, we can use the question mark before it e.g. "?mean".
## Data Cleaning
Most of the analysis consist of data cleaning. Here is an example of some manipulation that you might be asked to do.
# Attribute Description
-- -----------------------------------------
1. mpg - Miles/(US) gallon
2. cyl - Number of cylinders
3. disp - Displacement (cu.in.)
4. hp - Gross horsepower
5. drat - Rear axle ratio
6. wt - Weight (1000 lbs)
7. qsec - 1/4 mile time
8. vs - V/S
9. am - Transmission (0 = automatic, 1 = manual)
10. gear - Number of forward gears
11. carb - Number of carburetors
```{r}
str(mtcars)
```
Should the number of cylinders and the weight be both considered as numeric variable? Weight is a continuous variable while the number of cylinders is categorical e.g. 4,6 or 8 cylinders.
```{r}
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
str(mtcars)
```
It might be annoying to have to remember that 0 stands for automatic and 1 manual transmission. Let us fix it.
```{r}
levels(mtcars$am) <- c("automatic", "manual")
str(mtcars)
```
## Exploring the Data
### Graphing with ggplot2
#### Exploring the Variability in a Dataset
Suppose being interested in the distribution of mpg. The first argument is the dataset name, the second is "aes" (standing for aesthetic) which that the dependent and indepdent variables.
We then add a layer to the plot using the "+" sign e.g. the density:
```{r}
# To use the ggplot2 library we have to call it with the library command
# One could think of the command "install.packages" as buying shoes and "library" as putting them on.
library(ggplot2)
ggplot(mtcars, aes(mpg)) + geom_density(fill="lightblue")
```
The number of cylinders in a car might impact its consumption. Let us stratify the densities by cylinders. Note the transparency of the plot using the parameter alpha.
```{r}
ggplot(mtcars, aes(mpg, fill=cyl)) + geom_density(alpha=.5)
```
It might be useful to be able to see every observation on the graph. We can do so by adding the layer: geom_jitter.
```{r}
p <- ggplot(mtcars, aes(cyl, mpg, fill=cyl, color=cyl)) + geom_violin() + coord_flip()
p
p + geom_jitter()
```
#### Relationship Across Variables
Using the regression, we can clearly see that the heavier a vehicule is the worst Miles per gallon it is going to have.
```{r}
ggplot(mtcars, aes(wt, mpg)) + stat_smooth(method="lm")
```
Furthermore, we can divide the regression by the number of cylinders. We can see that the negative correlation is still present, but is not the same for every category.
```{r}
p <- ggplot(mtcars, aes(wt, mpg, color=cyl)) + stat_smooth(method="lm")
p
p + geom_point()
```
#### Complex Patterns
We can stratify the data to observe more complex patterns.
```{r}
p <- ggplot(mtcars, aes(hp, mpg, color=am, shape=am)) + geom_point()
p <- p + theme(legend.position="top")
p
p + facet_grid(gear~cyl)
```
### dplyr
The package dplyr is very fast for large datasets. It has main functions/verbs to work with data.table that we are going to explore here. The functions are filter, select, mutate, arrange, and summarise.
```{r, message=F}
library(dplyr)
```
#### Group by and Summarise
```{r}
summarise(mtcars, mean_mpg=mean(mpg), mean_wt=mean(wt))
temp <- group_by(mtcars, cyl)
temp1 <- summarise(temp, mean_mpg=mean(mpg), mean_wt=mean(wt))
temp1
# one could alternatively chain the operation using the pipe operation %>%
# mtcars %>% group_by(cyl) %>% summarise(mean_mpg=mean(mpg), mean_wt=mean(wt))
```
#### Select
```{r}
mtcars %>% select(mpg, cyl) %>% head()
```
#### Arrange
```{r}
mtcars %>% select(cyl,am,wt) %>% arrange(cyl,am,wt) %>% head()
```
#### Mutate
Let's define heavy as more than 3000 lbs.
```{r}
temp1 <- mtcars %>% mutate(heavy=factor(ifelse(wt < 3, "Light", "Heavy")))
str(temp1)
```
#### Filter
```{r}
temp1 %>% filter(heavy=="Heavy") %>% head()
```
## Real World Problems
### Parkinsons
We will try to predict the UPDRS score of the patient given their age, gender and different measures.
#### Attribute Information:
# Attribute Description
-- -----------------------------------------
1. subject# - Integer that uniquely identifies each subject
2. age - Subject age
3. sex - Subject gender '0' - male, '1' - female
4. test_time - Time since recruitment into the trial.
The integer part is the number of days since recruitment.
5. motor_UPDRS - Clinician's motor UPDRS score, linearly interpolated
6. total_UPDRS - Clinician's total UPDRS score, linearly interpolated
7. Jitter(%),Jitter(Abs),Jitter:RAP,Jitter:PPQ5,Jitter:DDP -
Several measures of variation in fundamental frequency
8. Shimmer,Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,Shimmer:APQ11,Shimmer:DDA -
Several measures of variation in amplitude
9. NHR,HNR - Two measures of ratio of noise to tonal components in the voice
10. RPDE - A nonlinear dynamical complexity measure
11. DFA - Signal fractal scaling exponent
12. PPE - A nonlinear measure of fundamental frequency variation
#### data.table
The data.table package provides a very efficient back end for dplyr. I recommend using it if you are working with large datasets. Particularly "fread" to read in datasets.
```{r, message=F}
library(data.table)
```
```{r}
# A Tsanas, MA Little, PE McSharry, LO Ramig (2009)
# 'Accurate telemonitoring of Parkinson.s disease progression by non-invasive speech tests',
# url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/
# parkinsons/telemonitoring/parkinsons_updrs.data"
#parkinson_dat <- fread(url)
#names(parkinson_dat)[1] <- "subject"
#write.csv(parkinson_dat, file="parkinson_dat.csv")
parkinson_dat <- fread("parkinson_dat.csv")
str(parkinson_dat)
```
```{r}
parkinson_dat$sex <- factor(parkinson_dat$sex)
levels(parkinson_dat$sex) <- c("male", "female")
```
Inspecting the distribution of subjects based on gender.
```{r}
parkinson_dat %>% distinct(subject, .keep_all = TRUE) %>% group_by(sex) %>%
summarise(count=n(), mean_age=mean(age))
```
Inspecting the distribution of older subjects based on gender.
```{r}
parkinson_dat_old <- parkinson_dat %>% filter(age >= 65)
parkinson_dat_old %>% distinct(subject, .keep_all = TRUE) %>% group_by(sex) %>%
summarise(count=n(), mean_age=mean(age))
```
Inspecting the score distribution based on gender.
```{r}
library(ggplot2)
ggplot(parkinson_dat, aes(sex, motor_UPDRS, fill=sex)) + geom_boxplot()
```
```{r}
new_parkinson_dat <- parkinson_dat %>% group_by(subject) %>%
summarise(mean_motor_UPDRS = mean(motor_UPDRS), age=mean(age))
new_parkinson_dat
```
Looking at the score distribution for people over/under 65 years old.
```{r}
new_parkinson_dat$old <- new_parkinson_dat$age >= 65
ggplot(new_parkinson_dat, aes(old, mean_motor_UPDRS, fill=old)) +
geom_boxplot() #+ ggtitle("my title") + xlab("x lab") + ylab("y lab")
```
Looking at the score distribution of the first 6 subjects and filling the boxplot base don their age.
```{r}
parkinson_dat_sub <- subset(parkinson_dat, subject<=6)
ggplot(parkinson_dat_sub, aes("", total_UPDRS, fill=age)) + geom_boxplot() +
facet_wrap(~subject, ncol = 3)
```
#### Advanced ggplot
```{r}
parkinson_dat_select <- parkinson_dat %>% select(total_UPDRS,PPE,DFA)
head(parkinson_dat_select)
library(reshape2)
parkinson_dat_melt <- melt(parkinson_dat_select, id="total_UPDRS")
head(parkinson_dat_melt)
ggplot(parkinson_dat_melt, aes(value, total_UPDRS)) + stat_smooth(method = "lm") +
geom_point(alpha=0.15) + facet_wrap(~variable, scales="free",ncol=1)
```
#### Regression
We can do a linear regression to predict the UPDR score from the age, sex, PPE, and DFA variables.
```{r}
mymodel <- lm(total_UPDRS~age+sex+PPE+DFA+0, parkinson_dat) # + 0 is fot no intercept, coherent with the paper from which the data comes from
summary(mymodel)
```
### Breast Cancer
We will classify the tumor into benign and malignant based on their thickness.
#### Attribute Information:
# Attribute Domain
-- -----------------------------------------
1. Sample code number id number
2. Clump Thickness 1 - 10
3. Uniformity of Cell Size 1 - 10
4. Uniformity of Cell Shape 1 - 10
5. Marginal Adhesion 1 - 10
6. Single Epithelial Cell Size 1 - 10
7. Bare Nuclei 1 - 10
8. Bland Chromatin 1 - 10
9. Normal Nucleoli 1 - 10
10. Mitoses 1 - 10
11. Class: (2 for benign, 4 for malignant)
```{r}
#breast_cancer <- fread("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data")
#names(breast_cancer) <- c("id_number", "clump_thickness", "cell_size", "cell_shape", "marginal_adhesion", "single_epithelial", "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitoses", "class")
#write.csv(breast_cancer, file="breast_cancer.csv")
breast_cancer <- fread("breast_cancer.csv")
str(breast_cancer)
```
```{r}
breast_cancer$class <- factor(breast_cancer$class)
levels(breast_cancer$class) <- c("benign", "malignant")
str(breast_cancer)
```
Let us see how many tumor of each class is present in our dataset
```{r}
breast_cancer %>% group_by(class) %>% summarise(count=n())
```
We can look at the distribution of the clumb thickness using a barplot.
```{r}
ggplot(breast_cancer, aes(clump_thickness)) + geom_bar()
```
Using ggplot to inspect the distribution of the clump's thickness given the tumor's class.
```{r}
ggplot(breast_cancer, aes(class, clump_thickness, fill=class)) + geom_boxplot()
```
#### Classification
Let us use a generalize linear model to classify the tumor into malignant and benign.
```{r}
model <- glm(class~clump_thickness, family = "binomial", data=breast_cancer)
summary(model)
```
# Ressources
## Datacamp
https://www.datacamp.com/
## R for Data Science
http://r4ds.had.co.nz/
## Advanced R
http://adv-r.had.co.nz/