CustomerChurnPredictionAnswers.Rmd

---
title: "Telco Customer Churn Prediction"
author: "Matthew Buddensick"
date: "9/25/2020"
output: 
    html_document:
        toc: true
        toc_depth: 2
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r include=FALSE}
pacman::p_load(tidyverse, tidymodels, themis, naniar, janitor, cowplot)
theme_set(theme_classic())
```

## Data

```{r include=FALSE}
churn <- read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv") %>% 
    clean_names
```

In this markdown file we will be predicting whether a customer churns. The data can be found on [kaggle](https://www.kaggle.com/blastchar/telco-customer-churn) and has `r nrow(churn)` rows and `r ncol(churn)` columns. The data set we will be using is saved as **churn**.

```{r}
(missing_data <- miss_var_summary(churn))
```

```{r}
ggplot(missing_data[1:10, ], aes(x = variable, y = pct_miss / 100)) +
    geom_col(fill = "black") +
    labs(title = "Missing Data", x = "Variable", y = "Percent Missing") +
    scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
    coord_flip()
```

We can see from the above chart that there is almost no missing data. If we look at the results from the **miss_var_summary** function we can see that all the data is present except for about 0.16% of the data in the total_charges column. We will have to further explore each variable to make sure all the data seems reasonable, and fill in the missing values for the total_charges column. We will also drop the customer_id column since it provides no useful information for the analysis we will be doing.

```{r}
churn <- churn %>% 
    select(-customer_id)
```

```{r}
# Create a function to look at values in the data
# This function will take the data, type of variable (is.character, is.numeric, etc.) and a function. The function will be used inside of lapply, which will iterate over columns in the data and perform the specified function
my_apply <- function(data, type, rfunction, ...){
    data %>% 
        select(where(type)) %>% 
        lapply(FUN = rfunction, ...)
}
```

```{r}
# Look at all the unique values in character columns
my_apply(churn, is.character, unique)
```

Looking at all the categorical variables in our data set, and all the unique values in each variable, it does not look like anything is out of place. In other words, each unique value seems like it is reasonable to appear in that column. An example of something we would need to fix is if there were multiple spellings of the same thing, in this case we would need to re-code the values so that they are the same. For example, if we had another value called "Mailed checks" in the payment_method column, we would want to re-code that as "Mailed check", since that value is already present. 

We could also check how many times each value appears in each column to get a better idea of the class balance of each variable, most importantly the dependent variable we will be predicting, churn.

```{r}
# Look at how many times a value appears in character columns
my_apply(churn, is.character, table, useNA = "ifany")
```

Here we can see the class balance of all the character variables in the churn data. We can see that the class balances of the variable are okay, but no great, with phone service and multiple lines being the most unbalanced. We can also see that the variable we are going to be predicting, churn, is unbalanced. We will work on this when we start working on building a model.

```{r}
my_apply(churn, is.numeric, summary)
```

Here we can see that senior citizen should be coded as a factor variable, since it is just 0's and 1's. We can also see the range of the tenure (months customer has stayed with the company), monthly_charges and total_charges variables, and all the values seem reasonable. We will have to deal with the NA values in total_charges, which we can do once we start the modeling process.

```{r}
churn <- churn %>% 
    mutate_if(is.character, factor) %>% 
    mutate(senior_citizen = factor(ifelse(senior_citizen == 0, "No", "Yes")))
```

## EDA (Exploratory Data Analysis)

Before creating any models, we will explore the data by performing an EDA, or exploratory data analysis. This will allow us to analyze the data to find main characteristics, and visualize them in order to get a better understanding of the data.

### Categorical Variables

```{r}
churn %>% 
    count(churn) %>% 
    mutate(percent =  n / sum(n)) %>% 
    ggplot(aes(x = churn, y = percent, fill = churn)) +
    geom_col(position = "dodge") +
    scale_y_continuous(labels = scales::percent, limits = c(0,1)) +
    labs(title = "Percentage of Customer Churn", y = "Percent", x = "Churn") +
    theme(legend.position = "none")
```

* A little over 25% of customers left left within the last month

```{r}
plot_grid(
  ggplot(churn, aes(x = gender, fill = churn)) + 
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) +
    theme(legend.position = "none") + 
    labs(title = "How Demographic Information \nAffects Retention",
                                           y = "Percentage", x = "Gender"),
  ggplot(churn, aes(x = senior_citizen, fill = churn)) + 
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) +
    theme(legend.position = "none") + 
    theme(axis.title.y = element_blank()),
  ggplot(churn, aes(x = partner, fill = churn)) + 
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) +
    labs(y = "Percentage", x = "Partner") +
    theme(legend.position = "none") ,
  ggplot(churn, aes(x = dependents, fill = churn)) + 
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) +
    labs(x = "Dependents", fill = "Churn") +
    theme(axis.title.y = element_blank())
)
```

* Customer Churn is about even for Gender
* More senior citizens have left the service
* People with no partners seem to have left the service more often than people with partners
* People with no dependents have seem to have left the service more often than people with dependents

### Numeric Variables

```{r warning=FALSE}
ggplot(churn, aes(x = churn, y = total_charges, fill = churn)) +
  geom_boxplot() +
  labs(title = "Churn based on Total Charges", y = "Total Charges", x = "Churn") +
  theme(legend.position = "none")
```

It looks like the median amount of total charges for a customer who churned is less than the median amount of total charges for a customer who stayed with the service. The IQR for customers who churned is also varies less than the IQR for customers who did not churn. This is interesting behavior, and we would probably need more domain specific information to figure out why this is the case.

```{r}
ggplot(churn, aes(x = churn, y = monthly_charges, fill = churn)) +
  geom_boxplot() +
  labs(title = "Churn based on Monthly", y = "Monthly Charges", x = "Churn") +
  theme(legend.position = "none")
```

We can see from this graph that customers who left the service had higher monthly charges than customers who stayed with the service. This makes more sense than the plot of total charges, which might indicate that there is another variable affecting the total charge of a customer.

## Modeling

To create models we are going to be using the TidyModels package. You can find the [Tidymodels documentation here](https://www.tidymodels.org/). For a better look how you can use different models look at the [Parsnip models](https://www.tidymodels.org/find/parsnip/) page. For this project, we will create two models using a logistic regression and a XGBoost model.

### Preparations

```{r}
set.seed(42)
churn_split <- initial_split(churn, strata = churn)
churn_training <- training(churn_split)
churn_testing <- testing(churn_split)
```

**Initial_split** creates training and testing sets, and you can call the **training** and **testing** function on the output of that function to create your two data sets.

```{r}
set.seed(42)
churn_vfolds <- vfold_cv(churn_training)
```

**Vfold_cv** is the function used for cross-validation, and randomly splits the data into groups (10 by default) of equal size.

```{r}
churn_recipe <- recipe(churn ~., data = churn_training) %>% 
    step_knnimpute(total_charges) %>%
    step_dummy(all_nominal(), -all_outcomes()) %>% 
    step_normalize(all_numeric()) %>% 
    step_bsmote(churn)
```

One of the most important ideas of Tidymodels is the use of a **recipe**. With a recipe, you can create the formula you will be using to make models, as well as manipulate the data by adding additional steps. Here we used **step_knnimpute** to use a knn model to impute missing data in the total charges column. We used **step_dummy** to create dummy variables for all nominal columns besides the outcome variable (churn), we used **step_normalize** on all numeric columns, and **step_bsmote** (which creates new examples of the minority class using nearest neighbors) to handle the class imbalance of the churn variable. You can read more about creating recipes on the tidymodels [recipes page](https://recipes.tidymodels.org/index.html).

```{r}
churn_wf <- workflow() %>% 
    add_recipe(churn_recipe)
```

Another extremely important idea in the Tidymodels package is the use of a **workflow**. A workflow is an object that aggregates information that is used to fit and predict from a model. It can be a recipe or a model fit.

```{r}
metrics_used <- metric_set(roc_auc, accuracy, sensitivity, specificity)
```

### Logistic Regression

Logistic regression is one of the most used forms of regression, except it is used for classification problems! Here we will implement a logistic regression on the churn data using tidymodels. To read more about implementing logistic regression in tidymodels you can read the [documentation](https://parsnip.tidymodels.org/reference/logistic_reg.html).

```{r}
logistic_regression <- logistic_reg() %>% 
  set_mode("classification") %>% 
  set_engine("glm")
```

```{r warning=FALSE}
logistic_model <- churn_wf %>% 
    add_model(logistic_regression) %>%
    fit_resamples(
        resamples = churn_vfolds,
        metrics = metrics_used,
        control = control_resamples(save_pred = TRUE)
    )
```

```{r}
collect_metrics(logistic_model)
```

```{r}
logistic_model %>% 
    conf_mat_resampled()
```

```{r warning=FALSE}
logistic_final <- churn_wf %>% 
    add_model(logistic_regression) %>% 
    last_fit(churn_split)
```

```{r}
(logistic_results <- collect_metrics(logistic_final) %>% 
   cbind(Model = "Logistic"))
```

```{r}
collect_predictions(logistic_final) %>% 
    conf_mat(churn, .pred_class)
```

We can see that the logistic model results in an accuracy of about 79.20% and an roc_auc score of about 86.78%. This difference between accuracy and roc_auc is that accuracy is calculated by the proportion of true positives and negatives for the whole data set, while roc_auc is calculating the true positive rate and false positive rate trade off. Roc_auc scores are very popular when there is an unbalanced problem like the one we are working on. For example, if 99% of the objects in a data set are in the same class you can always pick that object and have 99% accuracy, however you would never pick a different object.

```{r}
(log_spec <- specificity(collect_predictions(logistic_final), churn, .pred_class, 
                         event_level = "second")) %>% 
  cbind(Model = "Logistic")
```

Specificity measures the proportion of negatives that are correctly identified (true negative). In this case, it would be percentage of customers that did not churn that were predicted correctly. Our logistic regression model correctly predicted about 80.50% of these cases. In our churn variable Yes is the 2nd class, but it should be our True Positive rate for calculations, so we have to set **event_level** to **second**. To calculate this from the confusion matrix we would do (1041 / (1041+252)).

```{r}
(log_sens <- sensitivity(collect_predictions(logistic_final), churn, .pred_class, 
                         event_level = "second")) %>% 
  cbind(Model = "Logistic")
```

Sensitivity measures the proportion of positives that were correctly identified (true positive). In this case, it would be the percentage of customers that did churn that were predicted correctly. Our logistic regression model correctly predicted about 75.59% of these cases. To calculate it from the confusion matrix we would do (353 / (353 +114)).

### XGBoost Model

XGBoost is becoming an extremely popular machine learning algorithm, and stands for Extreme Gradient Boosting. It is super fast, and usually performs as well if not better than other models. You can read more about how to use XGBoost models in tidymodels by reading the [documentation](https://parsnip.tidymodels.org/reference/boost_tree.html).

```{r}
boosted_tree <- boost_tree() %>% 
  set_mode("classification") %>% 
  set_engine("xgboost")
```

```{r}
boosted_model <- churn_wf %>% 
  add_model(boosted_tree) %>% 
  fit_resamples(
    resamples = churn_vfolds,
    metrics = metrics_used, 
    control = control_resamples(save_pred = TRUE)
  )
```

```{r}
collect_metrics(boosted_model)
```

```{r}
boosted_model %>% 
  conf_mat_resampled()
```

```{r}
boosted_final <- churn_wf %>% 
  add_model(boosted_tree) %>% 
  last_fit(churn_split)
```

```{r}
(boosted_results <- collect_metrics(boosted_final) %>% 
   cbind(Model = "XGBoost"))
```

Here we can see that the final XGBoost performed pretty well on the testing data with an accuracy of about 79.32% and an roc_auc score of about 85.46%.

```{r}
collect_predictions(boosted_final) %>% 
  conf_mat(churn, .pred_class)
```

```{r}
(xgb_spec <- specificity(collect_predictions(boosted_final), churn, .pred_class, 
                         event_level = "second")) %>% 
  cbind(Model = "XGBoost")
```

Here we can see that the XGBoost model predicted about 83.76% of the true negatives correctly. So if there was a person who did not churn, there is a 67% chance of correctly predicting it. The calculation from the confusion matrix would be (1083 / (1083 + 210)).

```{r}
(xgb_sens <- sensitivity(collect_predictions(boosted_final), churn, .pred_class, 
                         event_level = "second")) %>% 
  cbind(Model = "XGBoost")
```

The XGBoost model accurately predicted about 67.02% of the true positives. Which means if a person did churn, the model had a 67.02% chance of predicting it correctly. The calculation from the confusion matrix would be (313 / (313 + 154)).

### Comparing Models

```{r}
plot_grid(
  roc_curve(collect_predictions(logistic_final), churn, .pred_Yes) %>% 
    ggplot(aes(x = 1 - specificity, y = sensitivity)) +
    geom_path(color = "#F8766D", size = 1.1) +
    geom_abline(lty = 3) +
    coord_equal() +
    labs(title = "ROC Curve for Logistic Regression", y = "Sensitivity", x = "1 - Specificity"),
  roc_curve(collect_predictions(boosted_final), churn, .pred_Yes) %>% 
    ggplot(aes(x = 1 - specificity, y = sensitivity)) +
    labs(title = "ROC Curve for XGBoost Model", y = "Sensitivity", x = "1 - Specificity") +
    geom_path(color = "#00BFC4", size = 1.1) +
    geom_abline(lty = 3) +
    coord_equal()
)
```

```{r}
(final_results <- rbind(logistic_results, boosted_results))
```

```{r}
ggplot(final_results, aes(x = .metric, y = .estimate, fill = Model)) +
  geom_col(position = "dodge") +
  labs(title = "Final Results of Models", x = "Metric", y = "Estimate") +
  scale_y_continuous(labels = scales::percent, limits = c(0,1))
```

From the above graphs, we can see that both the logistic regression and the XGBoost model perform very similar. However, the logistic regression value has a slightly higher roc_auc score than the XGBoost model does. Since the logistic regression performs slightly better, and is a more simple model, we would choose it to use on any future data.

## Conclusion

In this project we worked on creating a model to predict customer churn. We used the Tidymodels package to create our two models, a logistic regression and XGBoost. As an alternative to the tidymodels, we could have used caret. The tidymodels package is in some ways a "successor" to the caret package, as Max Kuhn, who created the caret package, is now working on tidymodels. If you want to read about the caret package you can do so [here](http://topepo.github.io/caret/index.html).