---
title: 'Using Text Analytics to Predict Loan Defaults'
subtitle: "A case study of Kiva"
author: "Dr. Stephen W. Thomas, Smith School of Business, Queen's University"
date: "January 2018"
documentclass: article
fontsize: 11pt
output:
  pdf_document:
    highlight: pygments
    number_sections: no
    toc: no
    toc_depth: '2'
  word_document:
    toc: no
    toc_depth: '2'
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE, fig.align='center')
```
```{r}
library(tidyverse)
library(ggthemes)
library(scales)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
library(MLmetrics)
library(topicmodels)
library(tidytext)
library(knitr)
library(kableExtra)
```
Kiva Microfunds[^1] is a non-profit that allows individuals to lend money to low-income entrepreneurs and students around the world. Since starting in 2005, Kiva has crowd-funded millions of loans with a repayment rate of around 98%.
[^1]: https://www.kiva.org/
At Kiva, each loan request includes both traditional demographic information on the borrower, such as gender and location, and a personal story, because Kiva wants lenders to connect with borrowers on a human level. For example, consider the personal story for a borrower named Evelyn:
> *Evelyn is 40 years old and married with 3 kids. She is in the Karura Hope women group and her life has been changed by the first KIVA loan she received last year which she is completing this quarter. Before she received the loan, she used to sell 9 litres of milk daily to local residents. After receiving the loan she bought iron sheets, five cement packets, one lorry of sand, some ballast and animal feed for her cows and improved her cow shed. Today she sells a daily average of 40 litres of milk to the Kiamba Dairy cooperative society, which is affiliated to the Kenya Cooperative Creameries at a cost of USD 0.28 per litre. Her daily farming has really grown. Evelyn intends to buy another dairy cow and a tank of water for home consumption and for her cows. She intends to repay in monthly installments.*
Despite her uplifting story, and her previous successful loan, Evelyn defaulted on her next loan of 900 USD. Her lenders lost their money, and Evelyn is no longer allowed to use Kiva.
Kiva's data science team has been tasked with determining whether this unfortunate default---and all others like it---could have been predicted from the data in the loan application. If defaults can be predicted beforehand, then the risk to potential lenders is reduced, and loan resources go to the borrowers most likely to repay.
To tackle this task, the data science team decides to use machine learning to build a prediction model based on historical loan data. The data science team is especially curious if the personal story itself will add any predictive power to the model. Since the personal stories are textual, as opposed to numerical, the data science team will need to use _text analytics_---techniques to understand, organize, and transform textual data into numerical data---to process the stories.
Will text analytics help Kiva's data science team to build an effective prediction model for loan applications? Will the prediction model reduce risk for lenders and Kiva?
# Kiva's Data
Luckily for the data science team, Kiva carefully maintains a history of all previous loan applications, including items like the borrower's demographic data, the industry sector of the borrower, the requested loan amount, and whether the loan was eventually repaid.
Kiva's data science team extracts a snapshot of the historical loan data. The snapshot contains about 8,000 completed loans, of which about 50% were repaid and 50% defaulted. Exhibit 1 shows the data dictionary and a few sample rows, and Exhibit 2 contains a descriptive summary of each of the eight variables in the dataset.
# Beyond Numbers: Adding Text Variables
Most machine learning algorithms that construct prediction models can accept numerical and categorical data as inputs, but not raw text. This means the data science team can't use the `story` variable as-is. The team must first use text analytics techniques to turn each story into numerical data. One common technique is to split each story (also called a _document_ in the literature) into individual words and count how many times each word appears in each document. Each unique word then becomes its own variable in the dataset.
As an example, consider the following dataset, in which the `loan_purpose` variable contains raw text data and thus cannot be input to machine learning algorithms.
```{r}
tmpdf = data.frame(id=c(1, 2, 3, 4, "..."),
age=c(34, 56, 23, 29, ""),
country=c("Panama", "Mexico", "Uganda", "Kenya", ""),
loan_purpose=c("I'd like to purchase a new farm tractor for my farm.",
"Hire additional day labourors to tend to my expanding avocado farm.",
"My child care service is expanding and I really need to renovate my home.",
"To purchase 20 more cows for my dairy farm.", ""))
kable(tmpdf, "latex", booktab=TRUE) %>%
kable_styling(font_size = 10, full_width=TRUE, latex_options = c("striped", "scale_down")) %>%
column_spec(4, width = "32em")
```
Splitting each document as described above would result in the following dataset:
```{r}
tmpdf = data.frame(id=c(1, 2, 3, 4, "..."),
age=c(34, 56, 23, 29, ""),
country=c("Panama", "Mexico", "Uganda", "Kenya", ""),
farm=c(2, 1, 0, 1, ""),
purchase=c(1, 0, 0, 1, ""),
expanding=c(0, 1, 1, 0, ""),
cows=c(0,0,0,1, ""),
hire=c(0 ,1, 0, 0, ""),
tractor=c(1, 0, 0, 0, ""),
"..."=c("...","...","...","...", "")
)
kable(tmpdf, "latex", booktab=TRUE) %>%
kable_styling(font_size = 10, full_width=TRUE, latex_options = c("striped", "scale_down"))
```
Each unique word from the raw text documents now becomes its own variable, and the value of each variable is the number of times that word appears in the original document. For instance, the new `farm` variable contains a 2 in row 1, because the word "farm" occurs twice in the original `loan_purpose` variable in row 1. Then, after the original `loan_purpose` variable is removed, the remaining variables are all now numerical, and thus the dataset can be input into machine learning algorithms.
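In R, this splitting-and-counting step can be done in a few lines with the `tidytext` package. The sketch below is illustrative only; it assumes a hypothetical data frame `docs` with an `id` column and a `loan_purpose` text column:
```{r, eval=FALSE}
# Illustrative sketch (not run): turn raw text into word-count variables.
# `docs` is a hypothetical input with columns `id` and `loan_purpose`.
library(tidyverse)
library(tidytext)

word_vars <- docs %>%
  unnest_tokens(word, loan_purpose) %>%  # one row per word occurrence
  count(id, word) %>%                    # how often each word appears per document
  spread(word, n, fill = 0)              # one numerical column per unique word
```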
Another common text analytics technique is to use two- or three-word phrases, instead of individual words. The idea is the same: the phrases now become variables in the dataset.
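With `tidytext`, only the tokenizer changes. For example, again assuming the hypothetical `docs` table:
```{r, eval=FALSE}
# Illustrative sketch (not run): the same idea with two-word phrases (bigrams).
bigram_vars <- docs %>%
  unnest_tokens(bigram, loan_purpose, token = "ngrams", n = 2) %>%
  count(id, bigram) %>%
  spread(bigram, n, fill = 0)
```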
A recent advancement in text analytics is to use a _topic model_ to extract topics, as opposed to individual words or phrases, from each document. The most common topic model is Latent Dirichlet Allocation (LDA), which can automatically extract high-level topics from text documents. For example, LDA might determine that one document contains the topic "dairy farming," while another document contains two other topics, "clothing manufacturing" and "child care." LDA will discover which topics are in which documents. Note that LDA creates the topics from scratch, based on the input documents; it does not choose from a list of predefined topics. As a result, the topics are specific to the text data in the given dataset.
The data science team decides to use LDA to extract twelve overarching topics from the stories. Once the LDA algorithm is finished, the team removes the original `story` variable, and adds twelve new variables (i.e., one for each discovered topic) that specify the percentage of words in each story that come from each topic. Exhibit 3 shows the resulting topics, and Exhibit 4 shows a few sample rows in the final dataset, which is ready to be input to machine learning algorithms.
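In outline, this step builds a document-term matrix from the per-story word counts and hands it to the `LDA()` function from the `topicmodels` package (the team's actual preprocessing and modeling chunks appear later in this file). A minimal sketch, assuming a tidy table `word_counts` with one row per (document, word) pair:
```{r, eval=FALSE}
# Minimal sketch (not run) of the topic-extraction step.
library(topicmodels)
library(tidytext)

dtm <- word_counts %>% cast_dtm(id, word, count)      # document-term matrix
lda <- LDA(dtm, k = 12, control = list(seed = 1234))  # fit twelve topics
topic_props <- tidy(lda, matrix = "gamma")            # per-document topic proportions
```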
# Building the Prediction Model
Armed with the final dataset, the data science team can now use machine learning algorithms to build a prediction model for loan defaults. Many algorithms are available, ranging from simple logistic regression and Naive Bayes all the way to deep neural networks. For this task, the team decides to use a decision tree algorithm because of its speed, accuracy, and interpretability.
In order to determine whether the text (or, more precisely, the topics built from the text) has predictive power, the team builds two similar, but slightly different, decision tree models. The first is built by giving the decision tree algorithm only the numerical and categorical variables from the original dataset, i.e., `sector`, `country`, `gender`, `loan_amount`, and `non_payment`. The second model is built by giving the decision tree algorithm all of those variables plus the twelve topic variables. The team thus runs the algorithm twice, and receives two models in return that they can then compare. Exhibit 5 shows the two models and their performance characteristics.
# Case Discussion Questions
1. What is Kiva's value proposition?
2. What factors might go into a lender's decision to lend money to a borrower?
3. How does text data affect the prediction model's ability to predict a default?
4. According to the decision tree models, which variable(s) best predict a default?
5. What other text analytics/NLP techniques might be applied to the dataset to improve the prediction model?
6. What additional information might lead to a better prediction model?
7. How should Kiva operationalize the prediction model? What technical challenges and risks do you envision? What procedural challenges and risks do you envision?
<P style="page-break-before: always">
\newpage
<!--
# Loading the Data
-->
```{r, include=FALSE}
df <- read_csv("data/kiva.csv")
df = df %>%
rename(story = en)
```
```{r, include=FALSE}
str(df)
df$id = 1:nrow(df)
df$status = as.factor(df$status)
df$sector = as.factor(df$sector)
df$country = as.factor(df$country)
df$gender = as.factor(df$gender)
df$nonpayment = as.factor(df$nonpayment)
```
<!--
Let's look at a sample of our data.
-->
```{r, include=FALSE, eval=FALSE}
head(df, n=20)
summary(df)
```
<!--
# Data Cleaning
-->
```{r, include=FALSE}
# Remove HTML Tags
df = df %>%
mutate(story = gsub("<.*?>", "", story))
# Convert into tidytext format
text_df <- df %>%
select(id, status, story) %>%
unnest_tokens(word, story)
## Remove stopwords
custom_stop_words = data.frame(word=c("loan", "business"))
text_df <- text_df %>%
anti_join(stop_words, by=c("word"="word")) %>%
anti_join(custom_stop_words, by=c("word"="word")) %>%
arrange(id)
# Stem words (optional). If enabled, stemming should be applied per token:
#library(SnowballC)
#text_df = text_df %>%
#  mutate(word = wordStem(word))
```
<!--
# Feature Engineering
## Latent Dirichlet Allocation
Let's use a technique called Latent Dirichlet Allocation (LDA) to extract the topics from each document.
-->
```{r, include=FALSE}
# Count each word in each document.
word_counts = text_df %>%
group_by(id, word) %>%
summarize(count = n())
```
```{r, include=FALSE}
# Create a document term matrix
dtm = word_counts %>% cast_dtm(id, word, count)
# Remove sparse terms from the document term matrix.
library(tm)
dtm2.nosparse <- removeSparseTerms(dtm, 0.9995)
rowTotals <- apply(dtm2.nosparse, 1, sum)  # total number of words in each document
which(rowTotals == 0)                      # inspect which documents are now empty
dtm.new <- dtm2.nosparse[rowTotals > 0, ]  # drop the empty documents
```
<!--
Run the LDA model.
-->
```{r, include=FALSE}
num_topics = 12
# Because the LDA model can take quite a few minutes to run, and because
# this script is re-knit many times, don't re-run LDA every single time.
runModel = FALSE
if (runModel == TRUE) {
  # Run the model
  lda <- LDA(dtm.new, k = num_topics, control = list(seed = 1234))
  # Name each topic after its four top terms
  t = terms(lda, k=4)
  topic_names = apply(t, 2, function(x) paste(x, collapse = "_"))
  lda_beta <- tidy(lda, matrix = "beta")
  lda_gamma <- tidy(lda, matrix = "gamma")
  lda_gamma$document = as.integer(lda_gamma$document)
  # Save output
  readr::write_csv(lda_beta, sprintf("beta_%d.csv", num_topics))
  readr::write_csv(lda_gamma, sprintf("gamma_%d.csv", num_topics))
  readr::write_csv(as.data.frame(topic_names), sprintf("topicnames_%d.csv", num_topics))
} else {
  # Read the output from a previous run
  lda_beta = readr::read_csv(sprintf("beta_%d.csv", num_topics))
  lda_gamma = readr::read_csv(sprintf("gamma_%d.csv", num_topics))
  topic_names = t(readr::read_csv(sprintf("topicnames_%d.csv", num_topics)))
}
tn = data.frame(id=1:12, topic_name = as.character(t(topic_names)))
tn$topic_name = as.character(tn$topic_name)
tn$topic_name = sprintf("%02d: %s", 1:12, tn$topic_name)
```
<!--
Add the resulting document topic probabilities to the `df` dataframe.
-->
```{r, include=FALSE}
lda_gamma_new = lda_gamma %>% spread(topic, gamma)
df_new = df %>% left_join(lda_gamma_new, by=c("id" = "document"))
library(data.table)
setnames(df_new, old=sprintf("%d", c(1:12)), new=sprintf("topic %d: %s", c(1:12), topic_names))
```
# Exhibit 1: Data Dictionary
The table below describes the eight variables in the dataset.
```{r}
tmpdf = data.frame(Variable=c("id", "sector", "country", "gender", "loan_amount", "non_payment", "story", "status"),
Description=c("A unique identifier for the loan.",
"Industry sector of borrower.",
"Borrower's country of residence.",
"Borrower's gender.",
"Amount of the loan, in USD.",
"Who is liable if the loan defaults: the lender, or the partner*?",
"Borrower's personal story.",
"Whether borrower defaulted or repaid loan."
))
kable(tmpdf, "latex", booktab=TRUE) %>%
kable_styling(font_size = 10, full_width=TRUE, latex_options = c("striped", "scale_down")) %>%
column_spec(2, width = "35em")
```
`*` All loan applications have an associated field partner, which is a local microfinance institution with which Kiva works to find and fund loans. Every loan at Kiva is offered by a partner to a borrower, and the partner works with Kiva to get funding for that loan from lenders.
\vspace{10pt}
The table below shows 10 random rows of the dataset.
```{r}
set.seed(124)
sample_ids = floor(runif(10, min = 1, max = nrow(df_new)))
aa = df[sample_ids, ] %>%
select(-id) %>%
select(status, sector, country, gender, loan_amount, nonpayment, story)
# Just to make the table look better
aa$country = gsub("Dominican Republic", "D.R.", aa$country)
aa$story = strtrim(aa$story, 150)
aa$story = paste(aa$story, " (...)", sep = "")
kable(aa, "latex", booktab=TRUE, digits=2) %>%
kable_styling(latex_options = c("striped", "scale_down")) %>%
column_spec(7, width = "25em")
```
<P style="page-break-before: always">
\newpage
# Exhibit 2
Below are descriptive plots for the numerical and categorical variables in the dataset.
\vspace{20pt}
```{r fig.height=2, fig.width=2, out.width='.49\\linewidth', fig.show='hold',fig.align='center'}
#myt = theme_igray()
myt = theme_economist()
#myt = theme_fivethirtyeight()
qplot(status, data=df, geom="bar", fill=status, xlab="status") + myt +
theme(legend.position = "none")
qplot(gender, data=df, geom="bar", fill=status) + myt +
theme(legend.position = "none")
qplot(nonpayment, data=df, geom="bar", fill=status) + myt +
theme(legend.position = "none")
```
\vspace{20pt}
```{r fig.height=2.8, fig.width=3, out.width='.49\\linewidth', fig.show='hold',fig.align='center'}
tmpdf = df
tmpdf$country = gsub("Dominican Republic", "D.R.", tmpdf$country)
qplot(country, data=tmpdf, geom="bar", fill=status) + myt +
theme(legend.position = "none")
qplot(sector, data=df, geom="bar", fill=status, xlab="sector") + myt +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=1)) + theme(legend.position = "none")
rm(tmpdf)
```
\vspace{20pt}
```{r fig.width=4, fig.height=2.5, out.width='.49\\linewidth', fig.show='hold',fig.align='center'}
df %>%
ggplot(aes(loan_amount, colour=status, fill=status)) + myt +
geom_density(alpha=0.1) +
theme(legend.position = "none") +
labs(x = "loan_amount (USD)")
df %>%
mutate(en_length = nchar(story)) %>%
ggplot(aes(en_length, colour=status, fill=status)) + myt +
geom_density(alpha=0.1) +
theme(legend.position = "none") +
labs(x = "Number of letters in `story`")
```
<P style="page-break-before: always">
\newpage
### Top Words
The table below shows the top (i.e., most frequently occurring) words in the `story` variable.
```{r rows.print=20}
kable(text_df %>% group_by(word) %>%
summarize(count=n()) %>%
mutate(freq = count / sum(count)) %>%
arrange(desc(count)) %>%
top_n(17), "latex", booktab=TRUE) %>%
kable_styling(full_width=TRUE, latex_options = c("striped", "scale_down"), font_size=9)
```
### Most Biased Words
The plot below shows which words are most biased towards `paid` or `defaulted`, using the log odds ratio metric.
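For a word $w$, the log odds ratio used here is $\log_2\left(n_{\text{paid}}(w) / n_{\text{defaulted}}(w)\right)$, where $n_{\text{paid}}(w)$ and $n_{\text{defaulted}}(w)$ are the number of times $w$ appears in paid and defaulted stories, respectively. Positive values indicate a word is more associated with `paid`; negative values, with `defaulted`.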
```{r}
status_words_count = text_df %>% group_by(status, word) %>%
summarize(count=n()) %>%
arrange(desc(count))
log_ratios = status_words_count %>%
spread (status, count) %>%
select(-`<NA>`) %>%
mutate(defaulted = ifelse(is.na(defaulted), 0, defaulted)) %>%
mutate(paid = ifelse(is.na(paid), 0, paid)) %>%
mutate(total=defaulted+paid) %>%
mutate(log_ratio = log2(paid/defaulted))
```
```{r, fig.height=3.5}
log_ratios %>%
filter(total > 500) %>%
group_by(log_ratio < 0) %>%
top_n(14, abs(log_ratio)) %>%
ungroup() %>%
mutate(word = reorder(word, log_ratio)) %>%
ggplot(aes(word, log_ratio, fill = log_ratio < 0)) +
myt +
geom_col() +
coord_flip() +
ylab("log odds ratio") +
scale_fill_discrete(name = "", labels = c("paid", "default")) +
theme(legend.position = "right", legend.text = element_text(size = 8),
legend.margin = margin(0.5, 0.5, 0.5, 0.5, "pt"))
```
```{r, include=FALSE}
kable(log_ratios %>%
filter(total > 500) %>%
arrange(desc(log_ratio)) %>%
top_n(17), "latex", booktab=TRUE) %>%
kable_styling(full_width=TRUE, latex_options = c("striped", "scale_down"))
```
```{r rows.print=20, include=FALSE}
kable(log_ratios %>%
filter(total > 500) %>%
arrange((log_ratio)) %>%
top_n(-20), "latex", booktab=TRUE) %>%
kable_styling(full_width=TRUE, latex_options = c("striped", "scale_down"))
```
<P style="page-break-before: always">
\newpage
# Exhibit 3: LDA Topics
Latent Dirichlet Allocation (LDA) was applied to the `story` variable using the R package `topicmodels`. The Kiva team instructed LDA to find twelve topics.
### LDA Top Terms Per Topic
The figure below shows the top words related to each of the twelve discovered topics. LDA itself does not assign a human-readable name to each topic; the topics are defined only by their word probabilities. However, to make the topics easier to understand at a glance, Kiva's data science team has given each topic a name built from its four highest-probability words, of the form "Topic Number: TopWord1_TopWord2_TopWord3_TopWord4."
```{r,fig.width=10,fig.height=8.0}
ap_top_terms <- lda_beta %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
ap_top_terms %>%
mutate(term = reorder(term, beta)) %>%
left_join(tn, by=c("topic" = "id")) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
myt +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic_name, scales = "free_y", ncol=3) +
coord_flip() +
theme(strip.text = element_text(size=10)) # Make titles smaller
```
<P style="page-break-before: always">
\newpage
### Documents per LDA Topic
The figure below shows the number of documents that contain each topic (here, a document "contains" a topic if the topic accounts for at least 5% of its words).
```{r}
topic_totals = lda_gamma %>%
left_join(df, by=c("document" = "id")) %>%
select(c(-story)) %>%
filter(gamma >= 0.05) %>%
group_by(topic, status) %>%
summarize(count=n()) %>%
spread(status, count) %>%
mutate(total = defaulted + paid) %>%
left_join(tn, by=c("topic" = "id")) %>%
select(topic, topic_name, everything())
```
```{r fig.height=2}
tmp_gathered = topic_totals %>%
select(topic, topic_name, defaulted, paid) %>%
gather(Status, Value, defaulted, paid)
tmp_gathered$topic = as.factor(tmp_gathered$topic)
ggplot(tmp_gathered, aes(x=topic, y=Value, fill=Status)) + myt +
geom_bar(stat="identity") +
theme(legend.position = "right", legend.text = element_text(size = 8),
legend.margin = margin(0.5, 0.5, 0.5, 0.5, "pt"))
```
```{r, include=FALSE}
kable(topic_totals, "latex", booktab=TRUE) %>%
kable_styling(full_width=TRUE, latex_options = c("striped", "scale_down"))
```
### LDA Examples
Below is an example of a `story` that contained LDA topic 8 at 99%.
> *Senaida Agueda has a business in which she buys and resells clothing that she has been operating for about 2 years now with the help of loans from Esperanza. She has two children, 41 and 38, whom do not live with Mrs. Agueda. When not working with her clothing shop, she enjoys going to the beach and cooking, arroz con carne (rice with meat) being one of her favorite dishes. As Mrs. Agueda is in her elder years and has fully grown children, she simply wishes to sustain her business to support her and her husband. Mrs. Senaida Agueda is a member of an eight person group, Group 4, that is part of a larger micro-bank called Mujeres de Fe, "Women of Faith" in English. In the picture, Mrs. Agueda is third from the right along with members of her group and some others of Mujeres de Fe. (...)*
Below is an example for LDA topic 4:
> *Descripción del Negocio. La Sra. Angela se dedica a la venta de artículos para el hogar a crédito y de forma ambulante; además vende golosinas a los niños de una escuela. En la actualidad vende a personas de otros sectores que han sido recomendados por buenos clientes que le refieren esos sitios. Uso del Préstamo. Ella necesita el crédito para comprar más mercadería pues en estas épocas de fin de año le son muy solicitadas. Información Personal. Ella tiene 29 años y tiene dos hijos que estudian; su casa es de caña y tiene estabilidad familiar. Translated from Spanish by Kiva Volunteer Wendy Wise*
And finally, below is an example for LDA topic 1:
> *Mary is 65 years of age, married with six children. All her children are married and self-reliant. She is a member of St Jude group at Githunguri in Thika district. Mary is earns her income as a dairy farmer. She needs a USD 150 loan to help her buy another small high breed dairy calf, which she will raise to maturity. She plans to meet her repayments on monthly basis.*
```{r, eval=FALSE, include=FALSE}
ids = lda_gamma %>%
filter(topic==1) %>%
arrange(desc(gamma)) %>%
top_n(1000) %>%
left_join(df_new, by=c("document" = "id")) %>%
mutate(len = nchar(story)) %>%
arrange(len)
ids
ids = c(7830, 7306, 7258, 7105)
df[7306,3]
df_new[ids,]
kable(t(df_new[ids[1],]), "latex", booktab=TRUE) %>%
kable_styling(full_width=TRUE, latex_options = c("striped", "scale_down"))
```
<P style="page-break-before: always">
\newpage
# Exhibit 4: A Sample of Data
The table below shows the same 10 random rows from Exhibit 1, except from the final dataset. The columns named "T1", "T2", etc., contain the percentage of the original story's words that have been assigned to topic 1, topic 2, etc., as determined by LDA.
\vspace{20pt}
```{r}
aa = df_new[sample_ids, ] %>%
select(-story, -id)
# Just to make the table look better
aa$country = gsub("Dominican Republic", "D.R.", aa$country)
setnames(aa, new=sprintf("T%d", c(1:12)), old=sprintf("topic %d: %s", c(1:12), topic_names))
# Stories dropped during preprocessing have no topic proportions;
# set their NAs to 0.
aa[is.na(aa)] <- 0
kable(aa, "latex", booktab=TRUE, digits=2) %>%
kable_styling(latex_options = c("striped", "scale_down"))
```
<P style="page-break-before: always">
\newpage
# Exhibit 5: Model Performance
Decision tree models were created with the R package `rpart`. Here, the decision tree models are automatically created by a machine learning algorithm as the algorithm learns simple decision rules from the data. These automatically-learned rules can then be used both to understand the variables and to predict future data. A big advantage of decision trees over other models is that they are relatively simple for humans to understand and interpret.
A decision tree consists of nodes. Each node splits the data according to a rule, and a rule is based on a variable in the data. For example, a rule might be "Age greater than 30." In this case, the node splits the data by the `age` variable; those borrowers who satisfy the rule (i.e., are older than 30) follow the left path out of the node; the rest follow the right path. In this way, paths from the root node down to leaf nodes are created, describing the fate of certain types of borrowers.
A decision tree path always starts at the root node (node number 1), which contains the most important splitting rule. Each subsequent node contains the next most important rule. After the decision tree is automatically created by the machine learning algorithm, one can use it to classify an individual by simply following a path: start at the root node and apply each rule in turn until you reach a leaf.
When creating a decision tree from data, the analyst can control the number of nodes the machine learning algorithm creates. More nodes lead to a more accurate model, at the cost of a more complicated and harder-to-interpret model, as well as the risk of overfitting the training data. Likewise, fewer nodes usually lead to a less accurate model, but one that is easier to understand and interpret.
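In `rpart`, which the team uses below, tree size is controlled indirectly through settings such as the maximum depth and the complexity parameter. The values in this sketch are illustrative only, not the settings used in the models that follow; `train` stands in for a training set like the ones created below:
```{r, eval=FALSE}
# Illustrative sketch (not run): grow a deliberately small,
# easy-to-read tree by capping depth and raising the complexity
# parameter. `train` is a hypothetical training set.
small_tree <- rpart(status ~ ., data = train, method = "class",
                    control = rpart.control(maxdepth = 3, cp = 0.02))
rpart.plot(small_tree, extra = 2)
```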
## Model 1 (No text)
Below is the model that was created from only the original numerical and categorical variables.
```{r fig.height=2.5}
set.seed(123)
# Exclude id and story from prediction; the - sign doesn't work in rpart
# formulas, so select only the desired columns instead.
df_notext = subset(df_new, select=c(status, sector, country, gender, loan_amount, nonpayment))
# Split the data into training and testing.
train_notext <- sample_frac(df_notext, 0.8)
test_notext <- setdiff(df_notext, train_notext)
# Let's train the model.
form = as.formula(status ~ .)
tree <- rpart(form, train_notext, method="class")
rpart.plot(tree, extra=2)
```
<!--
\vspace{20pt}
The following table summarizes the predictions of the decision tree on the testing data.
-->
```{r, eval=TRUE, include=FALSE}
predicted = predict(tree, test_notext, type="class")
actual = test_notext$status
preds = data.frame((table(predicted, actual))) %>%
spread(actual, Freq) %>%
mutate(total = defaulted + paid) %>%
select(predicted, total, everything())
```
```{r, include=FALSE}
kable(preds, "latex", booktab=TRUE) %>%
kable_styling(full_width=TRUE, latex_options = c("striped", "scale_down")) %>%
add_header_above(c(" " = 2, "actual" = 2))
```
<P style="page-break-before: always">
\newpage
## Model 2 (With Text)
Below is the model that was created from all variables in the dataset.
```{r fig.height=3.0}
set.seed(123)
# Exclude id and story from prediction; the - sign doesn't work in rpart
# formulas, so drop them with subset() instead.
df_text = subset(df_new, select=c(-id, -story))
# Split the data into training and testing.
train_text <- sample_frac(df_text, 0.8)
test_text <- setdiff(df_text, train_text)
# Let's create the model.
form = as.formula(status ~ .)
tree.text <- rpart(form, train_text, method="class")
rpart.plot(tree.text, extra=2)
```
<!--
\vspace{20pt}
Below is a summary of its predictions:
-->
```{r, include=FALSE}
predicted.text = predict(tree.text, test_text, type="class")
actual.text = test_text$status
preds.text = data.frame((table(predicted.text, actual.text))) %>%
spread(actual.text, Freq) %>%
mutate(total = defaulted + paid) %>%
select(predicted.text, total, everything())
```
```{r, include=FALSE}
kable(preds.text, "latex", booktab=TRUE) %>%
kable_styling(full_width=TRUE, latex_options = c("striped", "scale_down")) %>%
add_header_above(c(" " = 2, "actual" = 2))
```
<!--
That is, the model predicted `defaulted` 74 times: 50 times correctly, and 24 times incorrectly. It predicted `paid` 159 times: 120 times correctly, and 39 times incorrectly.
-->
## Metrics
Below are the accuracy and other metrics for the two models.
```{r}
bb = data.frame(Metric=c("Accuracy", "Precision", "Recall", "F1 Score", "Sensitivity", "Specificity"),
"Model_1" =c(Accuracy(y_true=actual, y_pred=predicted),
Precision(y_true=actual, y_pred=predicted),
Recall(y_true=actual, y_pred=predicted),
F1_Score(predicted, actual),
Sensitivity(y_true=actual, y_pred=predicted),
Specificity(y_true=predicted, y_pred=actual)),
"Model_2" = c(Accuracy(y_true=actual.text, y_pred=predicted.text),
Precision(y_true=actual.text, y_pred=predicted.text),
Recall(y_true=actual.text, y_pred=predicted.text),
F1_Score(predicted.text, actual.text),
Sensitivity(y_true=actual.text, y_pred=predicted.text),
Specificity(y_true=predicted.text, y_pred=actual.text)))
kable(bb, "latex", booktab=TRUE, digits=3) %>%
kable_styling(full_width=TRUE, latex_options = c("striped", "scale_down"), font_size=9)
```
# Appendix 1: Data Collection
The data in this case study was collected from Build.Kiva[^2], Kiva's website that provides snapshots of Kiva loan data. In the full dataset, about 98% of loans are paid and 2% defaulted. In this case study, we look at only a sample of the data, where the split between paid and defaulted is closer to 50%-50%. This sample is available at [http://www.github.com/stepthom/sandbox/data/kiva.csv](http://www.github.com/stepthom/sandbox/data/kiva.csv).
[^2]: https://build.kiva.org
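A balanced subsample like this can be drawn in a few lines. The sketch below is illustrative only; `full_data` is a hypothetical name for the raw Build.Kiva extract, assumed to have the same `status` column:
```{r, eval=FALSE}
# Illustrative sketch (not run): draw roughly 4,000 loans of each status
# so that paid and defaulted are balanced. `full_data` is a hypothetical
# name for the raw Build.Kiva extract.
set.seed(123)
balanced <- full_data %>%
  group_by(status) %>%
  sample_n(4000) %>%
  ungroup()
```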