This repository has been archived by the owner on May 6, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 72
/
Copy path05-random-forest.Rmd
95 lines (64 loc) · 4.61 KB
/
05-random-forest.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
# Random Forests
## Load packages
```{r load_packages}
library(ranger)
library(vip)
library(ggplot2)
```
## Load data
Load `train_x_class`, `train_y_class`, `test_x_class`, and `test_y_class` variables we defined in 02-preprocessing.Rmd for this *classification* task.
```{r setup_data}
# Objects: task_reg, task_class
load("data/preprocessed.RData")
```
## Overview
The random forest algorithm seeks to improve on the performance of a single decision tree by taking the average of many trees. Thus, a random forest can be viewed as an **ensemble** method, or model averaging approach. The algorithm was invented by UC Berkeley's own Leo Breiman in 2001, who was also a co-creator of decision trees (see his [1984 CART book](https://www.amazon.com/Classification-Regression-Wadsworth-Statistics-Probability/dp/0412048418)).
Random forests are an extension of **bagging**, in which multiple samples of the original data are drawn with replacement (aka "bootstrap samples"). An algorithm is fit separately to each sample, then the average of those estimates is used for prediction. While bagging can be used by any algorithm, random forest uses decision trees as its base learner. Random forests add another level of randomness by also randomly sampling the features (or covariates) at each split in each decision tree. This makes the decision trees use different covariates and therefore be more unique. As a result, the average of these trees tends to be more accurate overall.
## Fit model
Fit a random forest model that predicts the number of people with heart disease using the other variables as our X predictors. If our Y variable is a factor, `ranger` will by default perform classification; if it is numeric/integer regression will be performed and if it is omitted it will run an unsupervised analysis.
```{r rf_fit}
set.seed(1)
(rf1 = ranger::ranger(train_y_class ~ .,
data = train_x_class,
# Number of trees
num.trees = 500,
# Number of variables randomly sampled as candidates at each split.
mtry = 5,
# Grow a probability forest?
probability = TRUE,
# We want the importance of predictors to be assessed.
importance = "permutation"))
```
The "OOB estimate of error rate" shows us how accurate our model is. $accuracy = 1 - error rate$. OOB stands for "out of bag" - and bag is short for "bootstrap aggregation". So OOB estimates performance by comparing the predicted outcome value to the actual value across all trees using only the observations that were not part of the training data for that tree.
We can examine the relative variable importance in table and graph form, but without all the hard coding that we did in 04-decision-trees.Rmd. Random Forest estimates variable importance by separately examining each variable and estimating how much the model's accuracy drops when that variable's values are randomly shuffled (permuted). The shuffling temporarily removes any relationship between that covariate's value and the outcome. If a variable is important then the model's accuracy will suffer a large drop when it is randomly shuffled. But if the model's accuracy doesn't change it means the variable is not important to the model - e.g. maybe it was never even chosen as a split in any of the decision trees.
## Investigate Results
```{r rf_varimp_plot}
vip::vip(rf1) + theme_bw()
# Raw data
vip::vi(rf1)
# Unhashtag to see all variables - tibbles are silly!
# View(vip::vi(rf1))
```
Read up on the [gini coefficient](https://en.wikipedia.org/wiki/Gini_coefficient) here. It's basically a measure of diversity or dispersion - a higher gini means the model is classifying better. The gini version does not randomly shuffle the variable values.
Now, the goal is to see how the model performs on the test dataset:
```{r}
# This will predict the outcome class.
predicted_label = as.integer(predict(rf1, data = test_x_class)$predictions[, 1] > 0.5)
str(predicted_label)
table(predicted_label, test_y_class)
```
Check the accuracy of the test set:
```{r prob_hist}
mean(predicted_label == test_y_class)
# We can also generated probability predictions, which are more granular.
predicted_prob = as.data.frame(predict(rf1, data = test_x_class)$predictions)
colnames(predicted_prob) = c("no", "yes")
summary(predicted_prob)
ggplot(predicted_prob, aes(x = yes)) +
geom_histogram() +
theme_minimal()
# TODO: add terminal node count in for ranger.
```
How did it do? Are the accuracies for the training and test sets similar?
## Challenge 3
Open Challenge 3 in the "Challenges" folder.