---
title: "Data Science Machine Learning"
output:
  bookdown::epub_book:
    number_sections: true
  word_document: default
  pdf_document:
    latex_engine: xelatex
  html_document: default
urlcolor: blue
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
The textbook for the Data Science course series is [freely available online](https://rafalab.github.io/dsbook/){target="_blank"}.
# Learning Objectives
* The basics of machine learning
* How to perform cross-validation to avoid overtraining
* Several popular machine learning algorithms
* How to build a recommendation system
* What regularization is and why it is useful
## Course Overview
There are six major sections in this course: introduction to machine learning; machine learning basics; linear regression for prediction, smoothing, and working with matrices; distance, knn, cross validation, and generative models; classification with more than two classes and the caret package; and model fitting and recommendation systems.
### Introduction to Machine Learning
In this section, you'll be introduced to some of the terminology and concepts you'll need going forward.
### Machine Learning Basics
In this section, you'll learn how to start building a machine learning algorithm using training and test data sets and the importance of conditional probabilities for machine learning.
### Linear Regression for Prediction, Smoothing, and Working with Matrices
In this section, you'll learn why linear regression is a useful baseline approach but is often insufficiently flexible for more complex analyses, how to smooth noisy data, and how to use matrices for machine learning.
### Distance, Knn, Cross Validation, and Generative Models
In this section, you'll learn different types of discriminative and generative approaches for machine learning algorithms.
### Classification with More than Two Classes and the Caret Package
In this section, you'll learn how to overcome the curse of dimensionality using methods that adapt to higher dimensions and how to use the caret package to implement many different machine learning algorithms.
### Model Fitting and Recommendation Systems
In this section, you'll learn how to apply the machine learning algorithms you have learned.
# Section 1 - Introduction to Machine Learning Overview
In the **Introduction to Machine Learning** section, you will be introduced to machine learning.
After completing this section, you will be able to:
* Explain the difference between the **outcome** and the **features**.
* Explain when to use **classification** and when to use **prediction**.
* Explain the importance of **prevalence**.
* Explain the difference between **sensitivity** and **specificity**.
This section has one part: **introduction to machine learning**.
## Notation
There is a link to the relevant section of the textbook: [Notation](https://rafalab.github.io/dsbook/introduction-to-machine-learning.html#notation-1){target="_blank"}
**Key points**
* $X_1, ..., X_p$ denote the features, $Y$ denotes the outcomes, and $\hat{Y}$ denotes the predictions.
* Machine learning prediction tasks can be divided into **categorical** and **continuous** outcomes. We refer to these as **classification** and **prediction**, respectively.
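As a toy illustration of this notation (the values below are made up, not from any course dataset), the features for a set of observations are typically stored as a matrix and the outcomes as a vector:
```{r}
# hypothetical feature matrix: five observations, two features X_1 and X_2
X <- matrix(c(5.1, 3.5,
              4.9, 3.0,
              4.7, 3.2,
              4.6, 3.1,
              5.0, 3.6),
            ncol = 2, byrow = TRUE)
# a categorical outcome Y for each observation
Y <- factor(c("a", "b", "a", "a", "b"))
dim(X) # 5 observations, 2 features
```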
## An Example
There is a link to the relevant section of the textbook: [An Example](https://rafalab.github.io/dsbook/introduction-to-machine-learning.html#an-example){target="_blank"}
**Key points**
* $Y_i$ denotes the outcome for observation (index) $i$.
* We use boldface for $\mathbf{X_i}$ to distinguish the vector of predictors from the individual predictors $X_{i,1}, ..., X_{i,784}$.
* When referring to an arbitrary set of features and outcomes, we drop the index i and use $Y$ and bold $\mathbf{X}$.
* Uppercase is used to refer to variables because we think of predictors as random variables.
* Lowercase is used to denote observed values. For example, $\mathbf{X} = \mathbf{x}$.
## Comprehension Check - Introduction to Machine Learning
1. True or False: A key feature of machine learning is that the algorithms are built with data.
- [X] A. True
- [ ] B. False
2. True or False: In machine learning, we build algorithms that take feature values (X) and train a model using known outcomes (Y) that is then used to predict outcomes when presented with features without known outcomes.
- [X] A. True
- [ ] B. False
# Section 2 - Machine Learning Basics Overview
In the **Machine Learning Basics** section, you will learn the basics of machine learning.
After completing this section, you will be able to:
* Start to use the **caret** package.
* Construct and interpret a **confusion matrix**.
* Use **conditional probabilities** in the context of machine learning.
This section has two parts: **basics of evaluating machine learning algorithms** and **conditional probabilities**.
## Caret package, training and test sets, and overall accuracy
There is a link to the relevant sections of the textbook: [Training and test sets](https://rafalab.github.io/dsbook/introduction-to-machine-learning.html#training-and-test-sets){target="_blank"} and [Overall accuracy](https://rafalab.github.io/dsbook/introduction-to-machine-learning.html#overall-accuracy){target="_blank"}
**Key points**
* Note: the ```set.seed()``` function is used to obtain reproducible results. If you have R 3.6 or later, please use the ```sample.kind = "Rounding"``` argument whenever you set the seed for this course.
* To mimic the ultimate evaluation process, we randomly split our data into two — a training set and a test set — and act as if we don’t know the outcome of the test set. We develop algorithms using only the training set; the test set is used only for evaluation.
* The ```createDataPartition()``` function from the **caret** package can be used to generate indexes for randomly splitting data.
* Note: contrary to what the documentation says, this course will use the argument p as the percentage of data that goes to testing. The indexes made from ```createDataPartition()``` should be used to create the test set. Indexes should be created on the outcome and not a predictor.
* The simplest evaluation metric for categorical outcomes is overall accuracy: the proportion of cases that were correctly predicted in the test set.
*Code*
```{r}
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(caret)) install.packages("caret")
if(!require(dslabs)) install.packages("dslabs")
library(tidyverse)
library(caret)
library(dslabs)
data(heights)
# define the outcome and predictors
y <- heights$sex
x <- heights$height
# generate training and test sets
set.seed(2, sample.kind = "Rounding") # if using R 3.5 or earlier, remove the sample.kind argument
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
test_set <- heights[test_index, ]
train_set <- heights[-test_index, ]
# guess the outcome
y_hat <- sample(c("Male", "Female"), length(test_index), replace = TRUE)
# use factors with the same levels as the outcome so caret functions work later
y_hat <- sample(c("Male", "Female"), length(test_index), replace = TRUE) %>%
  factor(levels = levels(test_set$sex))
# compute the accuracy of guessing
mean(y_hat == test_set$sex)
# compare the distribution of heights by sex
heights %>% group_by(sex) %>% summarize(mean(height), sd(height))
# predict Male if height is above 62 inches (roughly 2 SD below the average male height)
y_hat <- ifelse(x > 62, "Male", "Female") %>% factor(levels = levels(test_set$sex))
mean(y == y_hat)
# examine the accuracy of 10 cutoffs
cutoff <- seq(61, 70)
accuracy <- map_dbl(cutoff, function(x){
y_hat <- ifelse(train_set$height > x, "Male", "Female") %>%
factor(levels = levels(test_set$sex))
mean(y_hat == train_set$sex)
})
data.frame(cutoff, accuracy) %>%
ggplot(aes(cutoff, accuracy)) +
geom_point() +
geom_line()
max(accuracy)
best_cutoff <- cutoff[which.max(accuracy)]
best_cutoff
y_hat <- ifelse(test_set$height > best_cutoff, "Male", "Female") %>%
  factor(levels = levels(test_set$sex))
mean(y_hat == test_set$sex)
```
## Comprehension Check - Basics of Evaluating Machine Learning Algorithms
1. For each of the following, indicate whether the outcome is continuous or categorical.
* Digit reader - categorical
* Height - continuous
* Spam filter - categorical
* Stock prices - continuous
* Sex - categorical
2. How many features are available to us for prediction in the ```mnist``` digits dataset?
You can download the ```mnist``` dataset using the ```read_mnist()``` function from the **dslabs** package.
```{r}
mnist <- read_mnist()
ncol(mnist$train$images)
```
## Confusion matrix
There is a link to the relevant section of the textbook: [Confusion Matrix](https://rafalab.github.io/dsbook/introduction-to-machine-learning.html#the-confusion-matrix){target="_blank"}
**Key points**
* Overall accuracy can sometimes be a deceptive measure because of unbalanced classes.
* A general improvement to using overall accuracy is to study sensitivity and specificity separately. **Sensitivity**, also known as the true positive rate or recall, is the proportion of actual positive outcomes correctly identified as such. **Specificity**, also known as the true negative rate, is the proportion of actual negative outcomes that are correctly identified as such.
* A confusion matrix tabulates each combination of prediction and actual value. You can create a confusion matrix in R using the ```table()``` function or the ```confusionMatrix()``` function from the **caret** package.
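As a minimal numeric sketch (the counts below are hypothetical, not from the heights data), sensitivity and specificity can also be computed directly from the four cells of a confusion matrix:
```{r}
# hypothetical counts, treating "Male" as the positive class
tp <- 50 # predicted Male, actually Male (true positives)
fn <- 10 # predicted Female, actually Male (false negatives)
fp <- 5  # predicted Male, actually Female (false positives)
tn <- 35 # predicted Female, actually Female (true negatives)
tp / (tp + fn) # sensitivity (true positive rate, recall)
tn / (tn + fp) # specificity (true negative rate)
```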
*Code*
```{r}
# tabulate each combination of prediction and actual value
table(predicted = y_hat, actual = test_set$sex)
test_set %>%
mutate(y_hat = y_hat) %>%
group_by(sex) %>%
summarize(accuracy = mean(y_hat == sex))
prev <- mean(y == "Male")
confusionMatrix(data = y_hat, reference = test_set$sex)
```
## Balanced accuracy and F1 score
There is a link to the relevant section of the textbook: [Balanced accuracy and F1 Score](https://rafalab.github.io/dsbook/introduction-to-machine-learning.html#balanced-accuracy-and-f_1-score){target="_blank"}
**Key points**
* For optimization purposes, it is sometimes more useful to have a one-number summary than to study specificity and sensitivity separately. One preferred metric is **balanced accuracy**. Because specificity and sensitivity are rates, it is more appropriate to compute their *harmonic* average. In fact, the **F1-score**, a widely used one-number summary, is the harmonic average of precision and recall.
* Depending on the context, some types of errors are more costly than others. The **F1-score** can be adapted to weigh specificity and sensitivity differently.
* You can compute the **F1-score** using the ```F_meas()``` function in the **caret** package.
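As a quick numeric sketch (the precision and recall values below are made up for illustration), the F1-score is the harmonic mean of precision and recall:
```{r}
# hypothetical precision and recall values
prec <- 0.8
rec <- 0.6
# F1: the harmonic mean of precision and recall
2 / (1 / prec + 1 / rec) # about 0.686
```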
*Code*
```{r}
# maximize F-score
cutoff <- seq(61, 70)
F_1 <- map_dbl(cutoff, function(x){
y_hat <- ifelse(train_set$height > x, "Male", "Female") %>%
factor(levels = levels(test_set$sex))
F_meas(data = y_hat, reference = factor(train_set$sex))
})
data.frame(cutoff, F_1) %>%
ggplot(aes(cutoff, F_1)) +
geom_point() +
geom_line()
max(F_1)
best_cutoff <- cutoff[which.max(F_1)]
best_cutoff
y_hat <- ifelse(test_set$height > best_cutoff, "Male", "Female") %>%
factor(levels = levels(test_set$sex))
sensitivity(data = y_hat, reference = test_set$sex)
specificity(data = y_hat, reference = test_set$sex)
```
## Prevalence matters in practice
There is a link to the relevant section of the textbook: [Prevalence matters in practice](https://rafalab.github.io/dsbook/introduction-to-machine-learning.html#prevalence-matters-in-practice){target="_blank"}
**Key points**
* A machine learning algorithm with very high sensitivity and specificity may not be useful in practice when prevalence is close to either 0 or 1. For example, if you develop an algorithm for disease diagnosis with very high sensitivity, but the prevalence of the disease is low, then the precision of your algorithm (the probability of actually having the disease given a positive test) will still be low, as Bayes' theorem shows.
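To make this concrete, here is a minimal sketch using Bayes' theorem with hypothetical numbers (the sensitivity, specificity, and prevalence below are assumptions chosen for illustration, not course values): even a test with 99% sensitivity and 99% specificity has low precision when prevalence is 0.5%.
```{r}
sens <- 0.99  # hypothetical P(test+ | disease)
spec <- 0.99  # hypothetical P(test- | healthy)
prev <- 0.005 # hypothetical P(disease)
# precision = P(disease | test+) by Bayes' theorem
sens * prev / (sens * prev + (1 - spec) * (1 - prev)) # about 0.33
```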
## ROC and precision-recall curves
There is a link to the relevant section of the textbook: [ROC and precision-recall curves](https://rafalab.github.io/dsbook/introduction-to-machine-learning.html#roc-and-precision-recall-curves){target="_blank"}
**Key points**
* A very common approach to evaluating methods is to compare them graphically by plotting sensitivity and specificity across all possible cutoffs. A widely used plot that does this is the **receiver operating characteristic (ROC) curve**. The ROC curve plots sensitivity (the true positive rate, TPR) versus 1 minus specificity, the false positive rate (FPR).
* However, ROC curves have one weakness: neither of the plotted measures depends on prevalence. In cases in which prevalence matters, we may instead make a **precision-recall plot**, which follows a similar idea to the ROC curve.
*Code*
Note: your results and plots may be slightly different.
```{r}
p <- 0.9
n <- length(test_index)
y_hat <- sample(c("Male", "Female"), n, replace = TRUE, prob=c(p, 1-p)) %>%
factor(levels = levels(test_set$sex))
mean(y_hat == test_set$sex)
# ROC curve
probs <- seq(0, 1, length.out = 10)
guessing <- map_df(probs, function(p){
y_hat <-
sample(c("Male", "Female"), n, replace = TRUE, prob=c(p, 1-p)) %>%
factor(levels = c("Female", "Male"))
list(method = "Guessing",
FPR = 1 - specificity(y_hat, test_set$sex),
TPR = sensitivity(y_hat, test_set$sex))
})
guessing %>% qplot(FPR, TPR, data =., xlab = "1 - Specificity", ylab = "Sensitivity")
cutoffs <- c(50, seq(60, 75), 80)
height_cutoff <- map_df(cutoffs, function(x){
y_hat <- ifelse(test_set$height > x, "Male", "Female") %>%
factor(levels = c("Female", "Male"))
list(method = "Height cutoff",
FPR = 1-specificity(y_hat, test_set$sex),
TPR = sensitivity(y_hat, test_set$sex))
})
# plot both curves together
bind_rows(guessing, height_cutoff) %>%
ggplot(aes(FPR, TPR, color = method)) +
geom_line() +
geom_point() +
xlab("1 - Specificity") +
ylab("Sensitivity")
if(!require(ggrepel)) install.packages("ggrepel")
library(ggrepel)
map_df(cutoffs, function(x){
y_hat <- ifelse(test_set$height > x, "Male", "Female") %>%
factor(levels = c("Female", "Male"))
list(method = "Height cutoff",
cutoff = x,
FPR = 1-specificity(y_hat, test_set$sex),
TPR = sensitivity(y_hat, test_set$sex))
}) %>%
ggplot(aes(FPR, TPR, label = cutoff)) +
geom_line() +
geom_point() +
geom_text_repel(nudge_x = 0.01, nudge_y = -0.01)
# plot precision against recall
guessing <- map_df(probs, function(p){
y_hat <- sample(c("Male", "Female"), length(test_index),
replace = TRUE, prob=c(p, 1-p)) %>%
factor(levels = c("Female", "Male"))
list(method = "Guess",
recall = sensitivity(y_hat, test_set$sex),
precision = precision(y_hat, test_set$sex))
})
height_cutoff <- map_df(cutoffs, function(x){
y_hat <- ifelse(test_set$height > x, "Male", "Female") %>%
factor(levels = c("Female", "Male"))
list(method = "Height cutoff",
recall = sensitivity(y_hat, test_set$sex),
precision = precision(y_hat, test_set$sex))
})
bind_rows(guessing, height_cutoff) %>%
ggplot(aes(recall, precision, color = method)) +
geom_line() +
geom_point()
guessing <- map_df(probs, function(p){
y_hat <- sample(c("Male", "Female"), length(test_index), replace = TRUE,
prob=c(p, 1-p)) %>%
factor(levels = c("Male", "Female"))
list(method = "Guess",
recall = sensitivity(y_hat, relevel(test_set$sex, "Male", "Female")),
precision = precision(y_hat, relevel(test_set$sex, "Male", "Female")))
})
height_cutoff <- map_df(cutoffs, function(x){
y_hat <- ifelse(test_set$height > x, "Male", "Female") %>%
factor(levels = c("Male", "Female"))
list(method = "Height cutoff",
recall = sensitivity(y_hat, relevel(test_set$sex, "Male", "Female")),
precision = precision(y_hat, relevel(test_set$sex, "Male", "Female")))
})
bind_rows(guessing, height_cutoff) %>%
ggplot(aes(recall, precision, color = method)) +
geom_line() +
geom_point()
```
## Comprehension Check - Practice with Machine Learning, Part 1
The following questions all ask you to work with the dataset described below.
The ```reported_heights``` and ```heights``` datasets were collected from three classes taught in the Departments of Computer Science and Biostatistics, as well as remotely through the Extension School. The Biostatistics class was taught in 2016 along with an online version offered by the Extension School. On 2016-01-25 at 8:15 AM, during one of the lectures, the instructors asked students to fill in the sex and height questionnaire that populated the ```reported_heights``` dataset. The online students filled out the survey during the next few days, after the lecture was posted online. We can use this insight to define a variable, which we will call ```type```, to denote the type of student: ```inclass``` or ```online```.
The code below sets up the dataset for you to analyze in the following exercises:
```{r}
if(!require(dplyr)) install.packages("dplyr")
if(!require(lubridate)) install.packages("lubridate")
library(dplyr)
library(lubridate)
data(reported_heights)
dat <- mutate(reported_heights, date_time = ymd_hms(time_stamp)) %>%
filter(date_time >= make_date(2016, 01, 25) & date_time < make_date(2016, 02, 1)) %>%
mutate(type = ifelse(day(date_time) == 25 & hour(date_time) == 8 & between(minute(date_time), 15, 30), "inclass","online")) %>%
select(sex, type)
y <- factor(dat$sex, c("Female", "Male"))
x <- dat$type
```
1. The ```type``` column of ```dat``` indicates whether students took classes in person ("inclass") or online ("online"). What proportion of the inclass group is female? What proportion of the online group is female?
Enter your answer as a percentage or decimal (e.g., "50%" or "0.50") to at least the hundredths place.
```{r}
dat %>% group_by(type) %>% summarize(prop_female = mean(sex == "Female"))
```
2. In the course videos, height cutoffs were used to predict sex. Instead of height, use the ```type``` variable to predict sex. Assume that for each class type the students are either all male or all female, based on the most prevalent sex in each class type you calculated in Q1. Report the accuracy of your prediction of sex based on type. You do not need to split the data into training and test sets.
Enter your accuracy as a percentage or decimal (e.g., "50%" or "0.50") to at least the hundredths place.
```{r}
y_hat <- ifelse(x == "online", "Male", "Female") %>%
factor(levels = levels(y))
mean(y_hat==y)
```
3. Write a line of code using the ```table()``` function to show the confusion matrix between ```y_hat``` and ```y```. Use the **exact** format ```function(a, b)``` for your answer and do not name the columns and rows. Your answer should have exactly one space.
```{r}
table(y_hat, y)
```
4. What is the sensitivity of this prediction? You can use the ```sensitivity()``` function from the **caret** package. Enter your answer as a percentage or decimal (e.g., "50%" or "0.50") to at least the hundredths place.
```{r}
sensitivity(y_hat, y)
```
5. What is the specificity of this prediction? You can use the ```specificity()``` function from the **caret** package. Enter your answer as a percentage or decimal (e.g., "50%" or "0.50") to at least the hundredths place.
```{r}
specificity(y_hat, y)
```
6. What is the prevalence (% of females) in the ```dat``` dataset defined above? Enter your answer as a percentage or decimal (e.g., "50%" or "0.50") to at least the hundredths place.
```{r}
mean(y == "Female")
```
## Comprehension Check - Practice with Machine Learning, Part 2
We will practice building a machine learning algorithm using a new dataset, ```iris```, that provides multiple predictors we can use for training. To start, we will remove the ```setosa``` species and focus on the ```versicolor``` and ```virginica``` iris species using the following code:
```{r}
data(iris)
iris <- iris[-which(iris$Species=='setosa'),]
y <- iris$Species
```
The following questions all involve work with this dataset.
7. First let us create an even split of the data into ```train``` and ```test``` partitions using ```createDataPartition()``` from the **caret** package. The code with a missing line is given below:
```{r, include=TRUE, eval=FALSE}
# set.seed(2) # if using R 3.5 or earlier
set.seed(2, sample.kind="Rounding") # if using R 3.6 or later
# line of code
test <- iris[test_index,]
train <- iris[-test_index,]
```
Which code should be used in place of # line of code above?
- [ ] A. test_index <- createDataPartition(y,times=1,p=0.5)
- [ ] B. test_index <- sample(2,length(y),replace=FALSE)
- [X] C. test_index <- createDataPartition(y,times=1,p=0.5,list=FALSE)
- [ ] D. test_index <- rep(1,length(y))
```{r}
# set.seed(2) # if using R 3.5 or earlier
set.seed(2, sample.kind="Rounding") # if using R 3.6 or later
test_index <- createDataPartition(y,times=1,p=0.5,list=FALSE)
test <- iris[test_index,]
train <- iris[-test_index,]
```
8. Next we will figure out the single feature in the dataset that yields the greatest overall accuracy when predicting species. You can use the code from the introduction and from Q7 to start your analysis.
Using only the ```train``` iris dataset, for each feature, perform a simple search to find the cutoff that produces the highest accuracy, predicting virginica if greater than the cutoff and versicolor otherwise. Use the ```seq``` function over the range of each feature by intervals of 0.1 for this search.
Which feature produces the highest accuracy?
```{r}
foo <- function(x){
rangedValues <- seq(range(x)[1],range(x)[2],by=0.1)
sapply(rangedValues,function(i){
y_hat <- ifelse(x>i,'virginica','versicolor')
mean(y_hat==train$Species)
})
}
predictions <- apply(train[,-5], 2, foo) # a list: each feature's range has a different length
sapply(predictions, max) # best training accuracy per feature
```
- [ ] A. Sepal.Length
- [ ] B. Sepal.Width
- [X] C. Petal.Length
- [ ] D. Petal.Width
9. For the feature selected in Q8, use the smart cutoff value from the training data to calculate overall accuracy in the test data. What is the overall accuracy?
```{r}
predictions <- foo(train[,3])
rangedValues <- seq(range(train[,3])[1],range(train[,3])[2],by=0.1)
cutoffs <-rangedValues[which(predictions==max(predictions))]
y_hat <- ifelse(test[,3]>cutoffs[1],'virginica','versicolor')
mean(y_hat==test$Species)
```
10. Notice that we had an overall accuracy greater than 96% in the training data, but the overall accuracy was lower in the test data. This often happens when we overtrain. In fact, it could be the case that a single feature is not the best choice. For example, a combination of features might be optimal. Using a single feature and optimizing the cutoff as we did on our training data can lead to overfitting.
Given that we know the test data, we can treat it like we did our training data to see if the same feature with a different cutoff will optimize our predictions.
Which feature best optimizes our overall accuracy?
```{r}
foo <- function(x){
rangedValues <- seq(range(x)[1],range(x)[2],by=0.1)
sapply(rangedValues,function(i){
y_hat <- ifelse(x>i,'virginica','versicolor')
mean(y_hat==test$Species)
})
}
predictions <- apply(test[,-5],2,foo)
sapply(predictions,max)
```
- [ ] A. Sepal.Length
- [ ] B. Sepal.Width
- [ ] C. Petal.Length
- [X] D. Petal.Width
11. Now we will perform some exploratory data analysis on the data.
Notice that ```Petal.Length``` and ```Petal.Width``` in combination could potentially be more informative than either feature alone.
Optimize the cutoffs for ```Petal.Length``` and ```Petal.Width``` separately in the train dataset by using the ```seq``` function with increments of 0.1. Then, report the overall accuracy when applied to the test dataset by creating a rule that predicts virginica if ```Petal.Length``` is greater than the length cutoff OR ```Petal.Width``` is greater than the width cutoff, and versicolor otherwise.
What is the overall accuracy for the test data now?
```{r}
data(iris)
iris <- iris[-which(iris$Species=='setosa'),]
y <- iris$Species
plot(iris,pch=21,bg=iris$Species)
# set.seed(2) # if using R 3.5 or earlier
set.seed(2, sample.kind="Rounding") # if using R 3.6 or later
test_index <- createDataPartition(y,times=1,p=0.5,list=FALSE)
test <- iris[test_index,]
train <- iris[-test_index,]
petalLengthRange <- seq(range(train$Petal.Length)[1],range(train$Petal.Length)[2],by=0.1)
petalWidthRange <- seq(range(train$Petal.Width)[1],range(train$Petal.Width)[2],by=0.1)
length_predictions <- sapply(petalLengthRange,function(i){
y_hat <- ifelse(train$Petal.Length>i,'virginica','versicolor')
mean(y_hat==train$Species)
})
length_cutoff <- petalLengthRange[which.max(length_predictions)] # 4.7
width_predictions <- sapply(petalWidthRange,function(i){
y_hat <- ifelse(train$Petal.Width>i,'virginica','versicolor')
mean(y_hat==train$Species)
})
width_cutoff <- petalWidthRange[which.max(width_predictions)] # 1.5
y_hat <- ifelse(test$Petal.Length>length_cutoff | test$Petal.Width>width_cutoff,'virginica','versicolor')
mean(y_hat==test$Species)
```
## Conditional probabilities
There is a link to the relevant section of the textbook: [Conditional probabilities](https://rafalab.github.io/dsbook/introduction-to-machine-learning.html#conditional-probabilities-1){target="_blank"}
**Key points**
* Conditional probabilities for each class:
$p_{k}(x) = Pr(Y = k|X = x),\,\text{for}\,k = 1, ..., K$
* In machine learning, this is referred to as **Bayes' Rule**. This is a theoretical rule because in practice we don't know $p(x)$. Having a good estimate of $p(x)$ will suffice for us to build optimal prediction models, since we can control the balance between specificity and sensitivity however we wish. In fact, estimating these conditional probabilities can be thought of as the main challenge of machine learning.
## Conditional expectations and loss function
There is a link to the relevant sections of the textbook: [Conditional expectations](https://rafalab.github.io/dsbook/introduction-to-machine-learning.html#conditional-expectations){target="_blank"} and [Loss functions](https://rafalab.github.io/dsbook/introduction-to-machine-learning.html#conditional-expectation-minimizes-squared-loss-function){target="_blank"}
**Key points**
* Due to the connection between **conditional probabilities** and **conditional expectations**:
$p_{k}(x) = Pr(Y = k|X = x),\,\text{for}\,k = 1, ..., K$
we often only use the expectation to denote both the conditional probability and conditional expectation.
* For continuous outcomes, we define a loss function to evaluate the model. The most commonly used one is **MSE (Mean Squared Error)**. The reason why we care about the conditional expectation in machine learning is that the expected value minimizes the MSE:
$\hat{Y} = E(Y|X = x)\, \text{minimizes}\, E\{(\hat{Y} - Y)^2|X=x\}$
Due to this property, a succinct description of the main task of machine learning is that we use data to estimate the conditional expectation $f(\mathbf{x}) = E(Y|\mathbf{X} = \mathbf{x})$ for any set of features $\mathbf{x} = (x_1, \ldots, x_p)$. **The main way in which competing machine learning algorithms differ is in their approach to estimating this expectation**.
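As a quick numeric check of this property (the outcome values below are made up), we can verify over a grid of constant predictions that the squared loss is minimized at the mean:
```{r}
y_toy <- c(2, 3, 7) # hypothetical outcomes
c_grid <- seq(0, 10, by = 0.1) # candidate constant predictions
mse <- sapply(c_grid, function(c) mean((y_toy - c)^2))
c_grid[which.min(mse)] # 4, which equals mean(y_toy)
```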
## Comprehension Check - Conditional Probabilities, Part 1
1. In a previous module, we covered Bayes' theorem and the Bayesian paradigm. Conditional probabilities are a fundamental part of this previously covered rule.
$P(A|B) = P(B|A)\frac{P(A)}{P(B)}$
We first review a simple example to go over conditional probabilities.
Assume a patient comes into the doctor’s office to test whether they have a particular disease.
* The test is positive 85% of the time when tested on a patient with the disease (high sensitivity): $P(\text{test} + | \text{disease}) = 0.85$
* The test is negative 90% of the time when tested on a healthy patient (high specificity): $P(\text{test} - | \text{healthy}) = 0.90$
* The disease is prevalent in about 2% of the community: $P(\text{disease}) = 0.02$
Using Bayes' theorem, calculate the probability that you have the disease if the test is positive.
$P(\text{disease} | \text{test}+) = P(\text{test}+ | \text{disease}) \times \frac{P(\text{disease})}{P(\text{test}+)} = \frac{P(\text{test}+ | \text{disease})P(\text{disease})}{P(\text{test}+ | \text{disease})P(\text{disease})+P(\text{test}+ | \text{healthy})P(\text{healthy})} = \frac{0.85 \times 0.02}{0.85 \times 0.02 + 0.1 \times 0.98} = 0.1478261$
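As a quick sanity check, the exact arithmetic above can be reproduced in R before turning to the simulation in the next questions:
```{r}
p_pos_given_disease <- 0.85 # sensitivity
p_neg_given_healthy <- 0.90 # specificity
p_disease <- 0.02           # prevalence
# total probability of a positive test
p_pos <- p_pos_given_disease * p_disease +
  (1 - p_neg_given_healthy) * (1 - p_disease)
p_pos_given_disease * p_disease / p_pos # 0.1478261
```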
The following 4 questions (Q2-Q5) all relate to implementing this calculation using R.
We have a hypothetical population of 1 million individuals with the following conditional probabilities as described below:
* The test is positive 85% of the time when tested on a patient with the disease (high sensitivity): $P(\text{test} + | \text{disease}) = 0.85$
* The test is negative 90% of the time when tested on a healthy patient (high specificity): $P(\text{test} - | \text{healthy}) = 0.90$
* The disease is prevalent in about 2% of the community: $P(\text{disease}) = 0.02$
Here is some sample code to get you started:
```{r}
# set.seed(1) # if using R 3.5 or earlier
set.seed(1, sample.kind = "Rounding") # if using R 3.6 or later
disease <- sample(c(0,1), size=1e6, replace=TRUE, prob=c(0.98,0.02))
test <- rep(NA, 1e6)
test[disease==0] <- sample(c(0,1), size=sum(disease==0), replace=TRUE, prob=c(0.90,0.10))
test[disease==1] <- sample(c(0,1), size=sum(disease==1), replace=TRUE, prob=c(0.15, 0.85))
```
2. What is the probability that a test is positive?
```{r}
mean(test)
```
3. What is the probability that an individual has the disease if the test is negative?
```{r}
mean(disease[test==0])
```
4. What is the probability that you have the disease if the test is positive? Remember: calculate the conditional probability of having the disease given a positive test.
```{r}
mean(disease[test==1]==1)
```
5. Compare the prevalence of disease in people who test positive to the overall prevalence of disease.
If a patient's test is positive, by how many times does that increase their risk of having the disease? First calculate the probability of having the disease given a positive test, then divide by the probability of having the disease.
```{r}
mean(disease[test==1]==1)/mean(disease==1)
```
## Comprehension Check - Conditional Probabilities, Part 2
6. We are now going to write code to compute conditional probabilities for being male in the heights dataset. Round the heights to the closest inch. Plot the estimated conditional probability $P(x) = \mbox{Pr}(\mbox{Male} | \mbox{height}=x)$.
Part of the code is provided here:
```{r, include=TRUE, eval=FALSE}
data("heights")
# MISSING CODE
qplot(height, p, data =.)
```
Which of the following blocks of code can be used to replace **# MISSING CODE** to make the correct plot?
- [ ] A.
```{r, include=TRUE, eval=FALSE}
heights %>%
group_by(height) %>%
summarize(p = mean(sex == "Male")) %>%
```
- [ ] B.
```{r, include=TRUE, eval=FALSE}
heights %>%
mutate(height = round(height)) %>%
group_by(height) %>%
summarize(p = mean(sex == "Female")) %>%
```
- [ ] C.
```{r, include=TRUE, eval=FALSE}
heights %>%
mutate(height = round(height)) %>%
summarize(p = mean(sex == "Male")) %>%
```
- [X] D.
```{r, include=TRUE, eval=FALSE}
heights %>%
mutate(height = round(height)) %>%
group_by(height) %>%
summarize(p = mean(sex == "Male")) %>%
```
```{r}
data("heights")
heights %>%
mutate(height = round(height)) %>%
group_by(height) %>%
summarize(p = mean(sex == "Male")) %>%
qplot(height, p, data =.)
```
7. In the plot we just made in Q6 we see high variability for low values of height. This is because we have few data points in those strata. This time use the quantiles $0.1, 0.2, \dots, 0.9$ and the ```cut()``` function to ensure each group has the same number of points. Note that for any numeric vector ```x```, you can create groups based on quantiles like this: ```cut(x, quantile(x, seq(0, 1, 0.1)), include.lowest = TRUE)```.
Part of the code is provided here:
```{r, include=TRUE, eval=FALSE}
ps <- seq(0, 1, 0.1)
heights %>%
# MISSING CODE
group_by(g) %>%
summarize(p = mean(sex == "Male"), height = mean(height)) %>%
qplot(height, p, data =.)
```
Which of the following lines of code can be used to replace **# MISSING CODE** to make the correct plot?
- [ ] A.
```{r, include=TRUE, eval=FALSE}
mutate(g = cut(male, quantile(height, ps), include.lowest = TRUE)) %>%
```
- [X] B.
```{r, include=TRUE, eval=FALSE}
mutate(g = cut(height, quantile(height, ps), include.lowest = TRUE)) %>%
```
- [ ] C.
```{r, include=TRUE, eval=FALSE}
mutate(g = cut(female, quantile(height, ps), include.lowest = TRUE)) %>%
```
- [ ] D.
```{r, include=TRUE, eval=FALSE}
mutate(g = cut(height, quantile(height, ps))) %>%
```
```{r}
ps <- seq(0, 1, 0.1)
heights %>%
mutate(g = cut(height, quantile(height, ps), include.lowest = TRUE)) %>%
group_by(g) %>%
summarize(p = mean(sex == "Male"), height = mean(height)) %>%
qplot(height, p, data =.)
```
8. You can generate data from a bivariate normal distribution using the **MASS** package with the following code:
```{r}
if(!require(MASS)) install.packages("MASS")
Sigma <- 9*matrix(c(1,0.5,0.5,1), 2, 2)
dat <- MASS::mvrnorm(n = 10000, c(69, 69), Sigma) %>%
data.frame() %>% setNames(c("x", "y"))
```
And you can make a quick plot using ```plot(dat)```.
```{r}
plot(dat)
```
Using an approach similar to that used in the previous exercise, let's estimate the conditional expectations and make a plot. Part of the code has again been provided for you:
```{r, include=TRUE, eval=FALSE}
ps <- seq(0, 1, 0.1)
dat %>%
# MISSING CODE
qplot(x, y, data =.)
```
Which of the following blocks of code can be used to replace **# MISSING CODE** to make the correct plot?
- [X] A.
```{r, include=TRUE, eval=FALSE}
mutate(g = cut(x, quantile(x, ps), include.lowest = TRUE)) %>%
group_by(g) %>%
summarize(y = mean(y), x = mean(x)) %>%
```
- [ ] B.
```{r, include=TRUE, eval=FALSE}
mutate(g = cut(x, quantile(x, ps))) %>%
group_by(g) %>%
summarize(y = mean(y), x = mean(x)) %>%
```
- [ ] C.
```{r, include=TRUE, eval=FALSE}
mutate(g = cut(x, quantile(x, ps), include.lowest = TRUE)) %>%
summarize(y = mean(y), x = mean(x)) %>%
```
- [ ] D.
```{r, include=TRUE, eval=FALSE}
mutate(g = cut(x, quantile(x, ps), include.lowest = TRUE)) %>%
group_by(g) %>%
summarize(y =(y), x =(x)) %>%
```
```{r}
ps <- seq(0, 1, 0.1)
dat %>%
mutate(g = cut(x, quantile(x, ps), include.lowest = TRUE)) %>%
group_by(g) %>%
summarize(y = mean(y), x = mean(x)) %>%
qplot(x, y, data =.)
```
# Section 3 - Linear Regression for Prediction, Smoothing, and Working with Matrices Overview
In the **Linear Regression for Prediction, Smoothing, and Working with Matrices** section, you will learn why linear regression is a useful baseline approach but is often insufficiently flexible for more complex analyses, how to smooth noisy data, and how to use matrices for machine learning.
After completing this section, you will be able to:
* Use **linear regression for prediction** as a baseline approach.
* Use **logistic regression** for categorical data.
* Detect trends in noisy data using **smoothing** (also known as **curve fitting** or **low pass filtering**).
* Convert predictors to **matrices** and outcomes to **vectors** when all predictors are numeric (or can be converted to numerics in a meaningful way).
* Perform basic **matrix algebra** calculations.
This section has three parts: **linear regression for prediction**, **smoothing**, and **working with matrices**.
## Linear Regression for Prediction
There is a link to the relevant section of the textbook: [Linear regression for prediction](https://rafalab.github.io/dsbook/examples-of-algorithms.html#linear-regression){target="_blank"}
**Key points**
* Linear regression can be considered a machine learning algorithm. Although it can be too rigid to be useful, it works rather well for some challenges. It also serves as a baseline approach: if you can’t beat it with a more complex approach, you probably want to stick to linear regression.
*Code*
Note: the seed was not set before ```createDataPartition``` so your results may be different.
```{r}
if(!require(HistData)) install.packages("HistData")
library(HistData)
galton_heights <- GaltonFamilies %>%
filter(childNum == 1 & gender == "male") %>%
dplyr::select(father, childHeight) %>%
rename(son = childHeight)
y <- galton_heights$son
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
train_set <- galton_heights %>% slice(-test_index)
test_set <- galton_heights %>% slice(test_index)
avg <- mean(train_set$son)
avg
mean((avg - test_set$son)^2)
# fit linear regression model
fit <- lm(son ~ father, data = train_set)
fit$coef
y_hat <- fit$coef[1] + fit$coef[2]*test_set$father
mean((y_hat - test_set$son)^2)
```
## Predict Function
There is a link to the relevant section of the textbook: [Predict function](https://rafalab.github.io/dsbook/examples-of-algorithms.html#the-predict-function){target="_blank"}
**Key points**
* The ```predict()``` function takes a fitted object from functions such as ```lm()``` or ```glm()``` and a data frame with the new predictors for which to predict. We can use predict like this:
```{r, include=TRUE, eval=FALSE}
y_hat <- predict(fit, test_set)
```
* ```predict()``` is a generic function in R that calls other functions depending on what kind of object it receives. To learn about the specifics, you can read the help files using code like this:
```{r, include=TRUE, eval=FALSE}
?predict.lm # or ?predict.glm
```
*Code*
```{r}
y_hat <- predict(fit, test_set)
mean((y_hat - test_set$son)^2)
# read help files
?predict.lm
?predict.glm
```
## Comprehension Check - Linear Regression
1. Create a data set using the following code:
```{r}
# set.seed(1) # if using R 3.5 or earlier
set.seed(1, sample.kind="Rounding") # if using R 3.6 or later
n <- 100
Sigma <- 9*matrix(c(1.0, 0.5, 0.5, 1.0), 2, 2)
dat <- MASS::mvrnorm(n = 100, c(69, 69), Sigma) %>%
data.frame() %>% setNames(c("x", "y"))
```
We will build 100 linear models using the data above and calculate the mean and standard deviation of the combined models. First, set the seed to 1 again (make sure to use ```sample.kind="Rounding"``` if your R is version 3.6 or later). Then, within a ```replicate()``` loop, (1) partition the dataset into test and training sets with ```p = 0.5``` and using ```dat$y``` to generate your indices, (2) train a linear model predicting ```y``` from ```x```, (3) generate predictions on the test set, and (4) calculate the RMSE of that model. Then, report the mean and standard deviation (SD) of the RMSEs from all 100 models.
Report all answers to at least 3 significant digits.
```{r}
# set.seed(1) # if using R 3.5 or earlier
set.seed(1, sample.kind="Rounding") # if using R 3.6 or later
rmse <- replicate(100, {
test_index <- createDataPartition(dat$y, times = 1, p = 0.5, list = FALSE)
train_set <- dat %>% slice(-test_index)
test_set <- dat %>% slice(test_index)
fit <- lm(y ~ x, data = train_set)
y_hat <- predict(fit, newdata = test_set)
sqrt(mean((y_hat-test_set$y)^2))
})
mean(rmse)
sd(rmse)
```
2. Now we will repeat the exercise above but using larger datasets. Write a function that takes a size ```n```, then (1) builds a dataset using the code provided at the top of Q1 but with ```n``` observations instead of 100 and without the ```set.seed(1)```, (2) runs the ```replicate()``` loop that you wrote to answer Q1, which builds 100 linear models and returns a vector of RMSEs, and (3) calculates the mean and standard deviation of the 100 RMSEs.
Set the seed to 1 (if using R 3.6 or later, use the argument ```sample.kind="Rounding"```) and then use ```sapply()``` or ```map()``` to apply your new function to ```n <- c(100, 500, 1000, 5000, 10000)```.
Hint: You only need to set the seed once before running your function; do not set a seed within your function. Also be sure to use ```sapply()``` or ```map()``` as you will get different answers running the simulations individually due to setting the seed.
```{r}
# set.seed(1) # if R 3.5 or earlier
set.seed(1, sample.kind="Rounding") # if R 3.6 or later
n <- c(100, 500, 1000, 5000, 10000)
res <- sapply(n, function(n){
Sigma <- 9*matrix(c(1.0, 0.5, 0.5, 1.0), 2, 2)
dat <- MASS::mvrnorm(n, c(69, 69), Sigma) %>%
data.frame() %>% setNames(c("x", "y"))
rmse <- replicate(100, {
test_index <- createDataPartition(dat$y, times = 1, p = 0.5, list = FALSE)
train_set <- dat %>% slice(-test_index)
test_set <- dat %>% slice(test_index)
fit <- lm(y ~ x, data = train_set)
y_hat <- predict(fit, newdata = test_set)
sqrt(mean((y_hat-test_set$y)^2))
})
c(avg = mean(rmse), sd = sd(rmse))
})
res
```
3. What happens to the RMSE as the size of the dataset becomes larger?
- [X] A. On average, the RMSE does not change much as n gets larger, but the variability of the RMSE decreases.
- [ ] B. Because of the law of large numbers the RMSE decreases; more data means more precise estimates.
- [ ] C. n = 10000 is not sufficiently large. To see a decrease in the RMSE we would need to make it larger.
- [ ] D. The RMSE is not a random variable.
4. Now repeat the exercise from Q1, this time making the correlation between ```x``` and ```y``` larger, as in the following code:
```{r}
# set.seed(1) # if using R 3.5 or earlier
set.seed(1, sample.kind="Rounding") # if using R 3.6 or later