---
title: "Predicting Customer Churn: The Imbalanced HR Dataset"
output:
  html_document:
    toc: yes
    toc_float: yes
    code_folding: hide
---
# Situation

You are the Director of HR at a large multinational. Your boss, a VP who answers to the CEO, is concerned about employee churn over the last few years and wants you to fix it.

You have tasked your elite data science team with figuring out who is leaving and why, and, better yet, with building a model that predicts which employees are likely to leave.

Your lead data scientist hands you the report below. It outlines a number of different models, each with pros and cons. You must decide which to use. How do you proceed?
# The Report
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)
library(purrr) # for functional programming (map)
library(rsample) # contains the IBM attrition data set
```
```{r}
# Helper function to print the confusion matrix and other performance metrics of a model.
printPerformance = function(pred, actual, positive = "Yes") {
  print(caret::confusionMatrix(data = pred, reference = actual,
                               positive = positive, dnn = c("Predicted", "Actual")))
}
```
First, let's load the data and take a peek.
```{r}
data(attrition, package="rsample")
df <- attrition
str(df)
head(df)
table(df$Attrition)
```
Next, let's split the data into training and testing sets.
```{r}
set.seed(123) # Set the seed to make it reproducible
train.index <- createDataPartition(df$Attrition, p = .8, list = FALSE)
train <- df[ train.index,]
test <- df[-train.index,]
```
Let's look at the imbalance of the classes in the full, training, and testing data sets.
```{r}
table(df$Attrition)/nrow(df)
table(train$Attrition)/nrow(train)
table(test$Attrition)/nrow(test)
```
# Model Training
We train eight different models:

- Original. A decision tree, tuned with cross-validation (CV) using Accuracy as the assessment metric. (I.e., no adjustments are made to deal with the class imbalance problem.)
- Kappa. Same as Original, except the Kappa metric is used during CV instead of Accuracy.
- Weighted. Same as Original, except "Yes" observations are up-weighted by the No-to-Yes class ratio (roughly 5.2 here) and "No" observations are given weight 1.
- Cost FN. Same as Original, except false negatives are given a cost of 4 and false positives a cost of 1.
- Cost FP. Same as Original, except false positives are given a cost of 4 and false negatives a cost of 1.
- Down. Same as Original, except the majority class is down-sampled within each CV resample.
- SMOTE. Same as Original, except synthetic minority-class training data is generated with the SMOTE method within each CV resample.
- All. Same as Original, except the Kappa metric is used during CV and the training data is down-sampled.
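To illustrate the SMOTE idea before fitting anything, here is a toy sketch (hypothetical 2-D points, not the attrition data): a synthetic minority observation is a random interpolation between a real minority observation and one of its nearest minority-class neighbours.

```{r}
# Toy sketch of SMOTE's core step (hypothetical points, not the attrition data).
x  <- c(1, 2)                  # a minority-class observation
nn <- c(2, 3)                  # one of its nearest minority-class neighbours
set.seed(1)
gap <- runif(1)                # random interpolation weight in (0, 1)
synthetic <- x + gap * (nn - x)
synthetic                      # lies on the segment between x and nn
```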
```{r, cache=TRUE}
metric = "Accuracy"
actual = test$Attrition
formula = Attrition ~ .
positive = "Yes"
set.seed(123)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5, classProbs = FALSE)

# Original: plain decision tree, tuned on Accuracy
orig_fit <- train(formula, data = train, method = "rpart", metric = metric, trControl = ctrl)

# Kappa: same tree, tuned on Kappa instead of Accuracy
kappa_fit <- train(formula, data = train, method = "rpart", metric = "Kappa", trControl = ctrl)

# Weighted: up-weight the minority "Yes" class by the No-to-Yes class ratio
weight = table(train$Attrition)["No"] / table(train$Attrition)["Yes"]
model_weights <- ifelse(train$Attrition == "Yes", weight, 1)
weight_fit <- train(formula, data = train, method = "rpart", metric = metric,
                    weights = model_weights, trControl = ctrl)

# Cost FN: false negatives cost 4, false positives cost 1
# (rpart loss matrix: rows = actual class, columns = predicted class)
FN_cost = 4
FP_cost = 1
cost_fn <- train(formula, data = train, method = "rpart", metric = metric,
                 parms = list(loss = matrix(c(0, FP_cost, FN_cost, 0), byrow = TRUE, nrow = 2)),
                 trControl = ctrl)

# Cost FP: false positives cost 4, false negatives cost 1
FN_cost = 1
FP_cost = 4
cost_fp <- train(formula, data = train, method = "rpart", metric = metric,
                 parms = list(loss = matrix(c(0, FP_cost, FN_cost, 0), byrow = TRUE, nrow = 2)),
                 trControl = ctrl)

# Down: down-sample the majority class inside each CV resample
ctrl$sampling = "down"
down_fit <- train(formula, data = train, method = "rpart", metric = metric, trControl = ctrl)

# SMOTE: generate synthetic minority observations inside each CV resample
ctrl$sampling = "smote"
smote_fit <- train(formula, data = train, method = "rpart", metric = metric, trControl = ctrl)

# All: down-sampling combined with the Kappa metric
ctrl$sampling = "down"
metric = "Kappa"
all_fit <- train(formula, data = train, method = "rpart", metric = metric, trControl = ctrl)
```
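For reference, rpart's loss matrix has the actual class in rows and the predicted class in columns, in factor-level order ("No", "Yes" here). Printed with dimnames added purely for readability (they are not passed to rpart above), the Cost FN matrix looks like this:

```{r}
# The "Cost FN" loss matrix, with dimnames added here only for readability:
# rows = actual class, columns = predicted class, in factor-level order.
FN_cost <- 4
FP_cost <- 1
loss <- matrix(c(0,       FP_cost,
                 FN_cost, 0),
               byrow = TRUE, nrow = 2,
               dimnames = list(actual = c("No", "Yes"),
                               predicted = c("No", "Yes")))
loss
```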
# Performance Assessment

Here's a summary of each technique's performance on the held-out test set.
```{r}
assessModel = function(m_name, m) {
  pred = predict(m, newdata = test)
  a = caret::confusionMatrix(data = pred, reference = actual, positive = positive,
                             dnn = c("Predicted", "Actual"))
  data.frame(name = m_name,
             accuracy = a$overall["Accuracy"],
             precision = a$byClass["Precision"],
             recall = a$byClass["Recall"],
             specificity = a$byClass["Specificity"],
             kappa = a$overall["Kappa"])
}

res = rbind(assessModel("orig", orig_fit),
            assessModel("kappa", kappa_fit),
            assessModel("weights", weight_fit),
            assessModel("cost fn", cost_fn),
            assessModel("cost fp", cost_fp),
            assessModel("down", down_fit),
            assessModel("smote", smote_fit),
            assessModel("all", all_fit))
row.names(res) = NULL
res
```
```{r}
# Helper to show the confusion matrix and the resulting tree for a model
showResults = function(model) {
  pred = predict(model, test)
  print(caret::confusionMatrix(data = pred, reference = actual, positive = positive,
                               dnn = c("Predicted", "Actual")))
  rpart.plot(model$finalModel, extra = 2, type = 2)
}
```
Here are the details of the performance of each technique.
## Original
```{r}
showResults(orig_fit)
```
## Kappa
```{r}
showResults(kappa_fit)
```
## Weights
```{r}
showResults(weight_fit)
```
## Costs - High FP Cost
```{r}
showResults(cost_fp)
```
## Costs - High FN Cost
```{r}
showResults(cost_fn)
```
## Down sampling
```{r}
showResults(down_fit)
```
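## SMOTE

The SMOTE model appears in the summary table but deserves its own detail section; here it is, shown with the same helper as the other techniques:

```{r}
showResults(smote_fit)
```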
## All
```{r}
showResults(all_fit)
```