Ensembles_Banking.Rmd

---
title: "Illustrating Ensemble Models - Banking Data"
output: 
  html_document:
      toc: yes
      toc_float: yes
      code_folding: hide
---

This case comes from the UCL Machine Learning Repository, and is called the [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/bank+marketing).

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') bought.

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r, warning = FALSE, message = FALSE}
library(tidyverse)
library(MLmetrics)
```


```{r}
# Helper function to print the confusion matrix and other performance metrics of the models.
printPerformance = function(pred, actual, positive="yes") {
  print(table(actual, pred))
  print("")
  
  print(sprintf("Accuracy:    %.3f", Accuracy(y_true=actual, y_pred=pred)))
  print(sprintf("Precision:   %.3f", Precision(y_true=actual, y_pred=pred, positive=positive)))
  print(sprintf("Recall:      %.3f", Recall(y_true=actual, y_pred=pred, positive=positive)))
  print(sprintf("F1 Score:    %.3f", F1_Score(pred, actual, positive=positive)))
  print(sprintf("Sensitivity: %.3f", Sensitivity(y_true=actual, y_pred=pred, positive=positive)))
  print(sprintf("Specificity: %.3f", Specificity(y_true=actual, y_pred=pred, positive=positive)))
}
```

# Read in the data

```{r}
df <- read_csv("C:/Users/st50/Documents/sandbox/data/bank-additional/bank-additional-full.csv")
df = df %>% 
  dplyr::rename(bought=y) %>%
  mutate_if(is.character, as.factor)
head(df)
summary(df)
```

# Splitting the Data

```{r}
set.seed(123) # Set the seed to make it reproducible
train <- sample_frac(df, 0.8)
test <- setdiff(df, train)
train <- as.data.frame(train)
test <- as.data.frame(test)
actual = test$bought
formula = bought ~ .
positive = "yes"
```

# Decision Tree

```{r, warning = FALSE}
library(rpart)
library(rpart.plot) # For pretty trees
set.seed(123)
tree <- rpart(formula, method="class", data=train)
rpart.plot(tree, extra=2, type=2)
predicted = predict(tree, test, type="class") 
printPerformance(predicted, actual, positive = positive)
```

# Random Forests

```{r, warning = FALSE}
library(randomForest)
set.seed(123) 
rf = randomForest(formula, data=train, mtry=3, ntree=100, importance=TRUE)
rf.predicted = predict(rf, test, type="class") 
printPerformance(rf.predicted, actual, positive = positive)
varImpPlot(rf)
```

# Boosting

```{r, warning = FALSE}
library(fastAdaboost)
set.seed(123)
boost = adaboost(formula, data=train, nIter=20)
boost.predicted = predict(boost, newdata=test)
printPerformance(boost.predicted$class, actual, positive = positive)
```