---
title: 'Using Text Analytics to Predict Loan Defaults'
subtitle: "A case study of Kiva"
author: "Dr. Stephen W. Thomas, Smith School of Business, Queen's University"
date: "January 2018"
documentclass: article
fontsize: 11pt
output:
  pdf_document:
    highlight: pygments
    number_sections: no
    toc: no
    toc_depth: '2'
  word_document:
    toc: no
    toc_depth: '2'
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE, fig.align='center')
```
```{r}
library(tidyverse)
library(ggthemes)
library(scales)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
library(MLmetrics)
library(topicmodels)
library(tidytext)
library(knitr)
library(kableExtra)
```
Kiva Microfunds[^1] is a non-profit that allows individuals to lend money to low-income entrepreneurs and students around the world. Since starting in 2005, Kiva has crowd-funded millions of loans with a repayment rate of around 98%.
[^1]: https://www.kiva.org/
At Kiva, each loan request includes both traditional demographic information on the borrower, such as gender and location, and a personal story, because Kiva wants lenders to connect with borrowers on a human level. For example, consider the personal story for a borrower named Evelyn:
> *Evelyn is 40 years old and married with 3 kids. She is in the Karura Hope women group and her life has been changed by the first KIVA loan she received last year which she is completing this quarter. Before she received the loan, she used to sell 9 litres of milk daily to local residents. After receiving the loan she bought iron sheets, five cement packets, one lorry of sand, some ballast and animal feed for her cows and improved her cow shed. Today she sells a daily average of 40 litres of milk to the Kiamba Dairy cooperative society, which is affiliated to the Kenya Cooperative Creameries at a cost of USD 0.28 per litre. Her daily farming has really grown. Evelyn intends to buy another dairy cow and a tank of water for home consumption and for her cows. She intends to repay in monthly installments.*
Despite her uplifting story, and her previous successful loan, Evelyn defaulted on her next loan of 900 USD. Her lenders lost their money, and Evelyn is no longer allowed to use Kiva.
Kiva's data science team has been tasked with determining whether this unfortunate default---and all others like it---could have been predicted from the data in the loan application. If defaults can be predicted beforehand, then the risk to potential lenders is reduced, and loan resources go to the borrowers most likely to repay.
To tackle this task, the data science team decides to use machine learning to build a prediction model based on historical loan data. The data science team is especially curious if the personal story itself will add any predictive power to the model. Since the personal stories are textual, as opposed to numerical, the data science team will need to use _text analytics_---techniques to understand, organize, and transform textual data into numerical data---to process the stories.
Will text analytics help Kiva's data science team to build an effective prediction model for loan applications? Will the prediction model reduce risk for lenders and Kiva?
# Kiva's Data
Luckily for the data science team, Kiva carefully maintains a history of all previous loan applications, including items like the borrower's demographic data, the industry sector of the borrower, the requested loan amount, and whether the loan was eventually repaid.
Kiva's data science team extracts a snapshot of the historical loan data. The snapshot contains about 8,000 completed loans, of which about 50% were repaid and 50% defaulted. Exhibit 1 shows the data dictionary and a few sample rows, and Exhibit 2 contains a descriptive summary of each of the eight variables in the dataset.
# Beyond Numbers: Adding Text Variables
Most machine learning algorithms that construct prediction models can accept numerical and categorical data as inputs, but not raw text. This means the data science team can't use the `story` variable as-is. The team must first use text analytics techniques to turn each story into numerical data. One common technique is to split each story (also called a _document_ in the literature) into individual words and count how many times each word appears in each document. Each unique word then becomes its own variable in the dataset.
As an example, consider the following dataset, in which the `loan_purpose` variable contains raw text data and thus cannot be input to machine learning algorithms.
```{r}
tmpdf = data.frame(id=c(1, 2, 3, 4, "..."),
age=c(34, 56, 23, 29, ""),
country=c("Panama", "Mexico", "Uganda", "Kenya", ""),
loan_purpose=c("I'd like to purchase a new farm tractor for my farm.",
"Hire additional day labourors to tend to my expanding avocado farm.",
"My child care service is expanding and I really need to renovate my home.",
"To purchase 20 more cows for my dairy farm.", ""))
kable(tmpdf, "latex", booktab=TRUE) %>%
kable_styling(font_size = 10, full_width=TRUE, latex_options = c("striped", "scale_down")) %>%
column_spec(4, width = "32em")
```
Splitting each document as described above would result in the following dataset:
```{r}
tmpdf = data.frame(id=c(1, 2, 3, 4, "..."),
age=c(34, 56, 23, 29, ""),
country=c("Panama", "Mexico", "Uganda", "Kenya", ""),
farm=c(2, 1, 0, 1, ""),
purchase=c(1, 0, 0, 1, ""),
expanding=c(0, 1, 1, 0, ""),
cows=c(0,0,0,1, ""),
hire=c(0 ,1, 0, 0, ""),
tractor=c(1, 0, 0, 0, ""),
"..."=c("...","...","...","...", "")
)
kable(tmpdf, "latex", booktab=TRUE) %>%
kable_styling(font_size = 10, full_width=TRUE, latex_options = c("striped", "scale_down"))
```
Each unique word from the raw text documents now becomes its own variable, and the value of each variable is the number of times that word appears in the original document. For instance, the new `farm` variable contains a 2 in row 1, because the word "farm" occurs twice in the original `loan_purpose` variable in row 1. Then, after the original `loan_purpose` variable is removed, the remaining variables are all now numerical, and thus the dataset can be input into machine learning algorithms.
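In R, this splitting-and-counting step can be done in a few lines with the `tidytext` package. The sketch below is illustrative only; it assumes a hypothetical data frame `docs` with an `id` column and a `loan_purpose` text column:
```{r, eval=FALSE}
# Illustrative sketch (not run): turn raw text into word-count variables.
# `docs` is a hypothetical input with columns `id` and `loan_purpose`.
library(tidyverse)
library(tidytext)

word_vars <- docs %>%
  unnest_tokens(word, loan_purpose) %>%  # one row per word occurrence
  count(id, word) %>%                    # how often each word appears per document
  spread(word, n, fill = 0)              # one numerical column per unique word
```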
Another common text analytics technique is to use two- or three-word phrases, instead of individual words. The idea is the same: the phrases now become variables in the dataset.
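With `tidytext`, only the tokenizer changes. For example, again assuming the hypothetical `docs` table:
```{r, eval=FALSE}
# Illustrative sketch (not run): the same idea with two-word phrases (bigrams).
bigram_vars <- docs %>%
  unnest_tokens(bigram, loan_purpose, token = "ngrams", n = 2) %>%
  count(id, bigram) %>%
  spread(bigram, n, fill = 0)
```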
A recent advancement in text analytics is to use a _topic model_ to extract topics, as opposed to individual words or phrases, from each document. The most common topic model is Latent Dirichlet Allocation (LDA), which can automatically extract high-level topics from text documents. For example, LDA might determine that one document contains the topic "dairy farming," while another document contains two other topics, "clothing manufacturing" and "child care." LDA will discover which topics are in which documents. Note that LDA creates the topics from scratch, based on the input documents; it does not choose from a list of predefined topics. As a result, the topics are specific to the text data in the given dataset.
The data science team decides to use LDA to extract twelve overarching topics from the stories. Once the LDA algorithm is finished, the team removes the original `story` variable, and adds twelve new variables (i.e., one for each discovered topic) that specify the percentage of words in each story that come from each topic. Exhibit 3 shows the resulting topics, and Exhibit 4 shows a few sample rows in the final dataset, which is ready to be input to machine learning algorithms.
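In outline, this step builds a document-term matrix from the per-story word counts and hands it to the `LDA()` function from the `topicmodels` package (the team's actual preprocessing and modeling chunks appear later in this file). A minimal sketch, assuming a tidy table `word_counts` with one row per (document, word) pair:
```{r, eval=FALSE}
# Minimal sketch (not run) of the topic-extraction step.
library(topicmodels)
library(tidytext)

dtm <- word_counts %>% cast_dtm(id, word, count)      # document-term matrix
lda <- LDA(dtm, k = 12, control = list(seed = 1234))  # fit twelve topics
topic_props <- tidy(lda, matrix = "gamma")            # per-document topic proportions
```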
# Building the Prediction Model
Armed with the final dataset, the data science team can now use machine learning algorithms to build a prediction model for loan defaults. Many algorithms are available, ranging from simple logistic regression and Naive Bayes all the way to deep neural networks. For this task, the team decides to use a decision tree algorithm because of its speed, accuracy, and interpretability.
In order to determine whether the text (or, more precisely, the topics built from the text) has predictive power, the team builds two similar, but slightly different, decision tree models. The first is built by giving the decision tree algorithm only the numerical and categorical variables from the original dataset, i.e., `sector`, `country`, `gender`, `loan_amount`, and `non_payment`. The second model is built by giving the decision tree algorithm all of those variables plus the twelve topic variables. The team thus runs the algorithm twice, and receives two models in return that they can then compare. Exhibit 5 shows the two models and their performance characteristics.
# Case Discussion Questions
1. What is Kiva's value proposition?
2. What factors might go into a lender's decision to lend money to a borrower?
3. How does text data affect the prediction model's ability to predict a default?
4. According to the decision tree models, which variable(s) best predict a default?
5. What other text analytics/NLP techniques might be applied to the dataset to improve the prediction model?
6. What additional information might lead to a better prediction model?
7. How should Kiva operationalize the prediction model? What technical challenges and risks do you envision? What procedural challenges and risks do you envision?
<P style="page-break-before: always">
\newpage
<!--
# Loading the Data
-->
```{r, include=FALSE}
df <- read_csv("data/kiva.csv")
df = df %>%
rename(story = en)
```
```{r, include=FALSE}
str(df)
df$id = 1:nrow(df)
df$status = as.factor(df$status)
df$sector = as.factor(df$sector)
df$country = as.factor(df$country)
df$gender = as.factor(df$gender)
df$nonpayment = as.factor(df$nonpayment)
```
<!--
Let's look at a sample of our data.
-->
```{r, include=FALSE, eval=FALSE}
head(df, n=20)
summary(df)
```
<!--
# Data Cleaning
-->
```{r, include=FALSE}
# Remove HTML Tags
df = df %>%
mutate(story = gsub("<.*?>", "", story))
# Convert into tidytext format
text_df <- df %>%
select(id, status, story) %>%
unnest_tokens(word, story)
## Remove stopwords
custom_stop_words = data.frame(word=c("loan", "business"))
text_df <- text_df %>%
anti_join(stop_words, by=c("word"="word")) %>%
anti_join(custom_stop_words, by=c("word"="word")) %>%
arrange(id)
# Stem words (optional). If enabled, stemming should be applied per token:
#library(SnowballC)
#text_df = text_df %>%
#  mutate(word = wordStem(word))
```
<!--
# Feature Engineering
## Latent Dirichlet Allocation
Let's use a technique called Latent Dirichlet Allocation (LDA) to extract the topics from each document.
-->
```{r, include=FALSE}
# Count each word in each document.
word_counts = text_df %>%
group_by(id, word) %>%
summarize(count = n())
```
```{r, include=FALSE}
# Create a document term matrix
dtm = word_counts %>% cast_dtm(id, word, count)
# Remove sparse terms from the document term matrix.
library(tm)
dtm2.nosparse <- removeSparseTerms(dtm, 0.9995)
rowTotals <- apply(dtm2.nosparse, 1, sum)  # total number of words in each document
which(rowTotals == 0)                      # inspect which documents are now empty
dtm.new <- dtm2.nosparse[rowTotals > 0, ]  # drop the empty documents
```
<!--
Run the LDA model.
-->
```{r, include=FALSE}
num_topics = 12
# Because the LDA model can take quite a few minutes to run, and because
# this script is re-knit many times, don't re-run LDA every single time.
runModel = FALSE
if (runModel == TRUE) {
  # Run the model
  lda <- LDA(dtm.new, k = num_topics, control = list(seed = 1234))
  # Name each topic after its four top terms
  t = terms(lda, k=4)
  topic_names = apply(t, 2, function(x) paste(x, collapse = "_"))
  lda_beta <- tidy(lda, matrix = "beta")
  lda_gamma <- tidy(lda, matrix = "gamma")
  lda_gamma$document = as.integer(lda_gamma$document)
  # Save output
  readr::write_csv(lda_beta, sprintf("beta_%d.csv", num_topics))
  readr::write_csv(lda_gamma, sprintf("gamma_%d.csv", num_topics))
  readr::write_csv(as.data.frame(topic_names), sprintf("topicnames_%d.csv", num_topics))
} else {
  # Read the output from a previous run
  lda_beta = readr::read_csv(sprintf("beta_%d.csv", num_topics))
  lda_gamma = readr::read_csv(sprintf("gamma_%d.csv", num_topics))
  topic_names = t(readr::read_csv(sprintf("topicnames_%d.csv", num_topics)))
}
tn = data.frame(id=1:12, topic_name = as.character(t(topic_names)))
tn$topic_name = as.character(tn$topic_name)
tn$topic_name = sprintf("%02d: %s", 1:12, tn$topic_name)
```
<!--
Add the resulting document topic probabilities to the `df` dataframe.
-->
```{r, include=FALSE}
lda_gamma_new = lda_gamma %>% spread(topic, gamma)
df_new = df %>% left_join(lda_gamma_new, by=c("id" = "document"))
library(data.table)
setnames(df_new, old=sprintf("%d", c(1:12)), new=sprintf("topic %d: %s", c(1:12), topic_names))
```
# Exhibit 1: Data Dictionary
The table below describes the eight variables in the dataset.
```{r}
tmpdf = data.frame(Variable=c("id", "sector", "country", "gender", "loan_amount", "non_payment", "story", "status"),
Description=c("A unique identifier for the loan.",
"Industry sector of borrower.",
"Borrower's country of residence.",
"Borrower's gender.",
"Amount of the loan, in USD.",
"Who is liable if the loan defaults: the lender, or the partner*?",
"Borrower's personal story.",
"Whether borrower defaulted or repaid loan."
))
kable(tmpdf, "latex", booktab=TRUE) %>%
kable_styling(font_size = 10, full_width=TRUE, latex_options = c("striped", "scale_down")) %>%
column_spec(2, width = "35em")
```
`*` All loan applications have an associated field partner, which is a local microfinance institution with which Kiva works to find and fund loans. Every loan at Kiva is offered by a partner to a borrower, and the partner works with Kiva to get funding for that loan from lenders.
\vspace{10pt}
The table below shows 10 random rows of the dataset.
```{r}
set.seed(124)
sample_ids = floor(runif(10, min = 1, max = nrow(df_new)))
aa = df[sample_ids, ] %>%
select(-id) %>%
select(status, sector, country, gender, loan_amount, nonpayment, story)
# Just to make the table look better
aa$country = gsub("Dominican Republic", "D.R.", aa$country)
aa$story = strtrim(aa$story, 150)
aa$story = paste(aa$story, " (...)", sep = "")
kable(aa, "latex", booktab=TRUE, digits=2) %>%
kable_styling(latex_options = c("striped", "scale_down")) %>%
column_spec(7, width = "25em")
```
<P style="page-break-before: always">
\newpage
# Exhibit 2
Below are descriptive plots for the numerical and categorical variables in the dataset.
\vspace{20pt}
```{r fig.height=2, fig.width=2, out.width='.49\\linewidth', fig.show='hold',fig.align='center'}
#myt = theme_igray()
myt = theme_economist()
#myt = theme_fivethirtyeight()
qplot(status, data=df, geom="bar", fill=status, xlab="status") + myt +
theme(legend.position = "none")
qplot(gender, data=df, geom="bar", fill=status) + myt +
theme(legend.position = "none")
qplot(nonpayment, data=df, geom="bar", fill=status) + myt +
theme(legend.position = "none")
```
\vspace{20pt}
```{r fig.height=2.8, fig.width=3, out.width='.49\\linewidth', fig.show='hold',fig.align='center'}
tmpdf = df
tmpdf$country = gsub("Dominican Republic", "D.R.", tmpdf$country)
qplot(country, data=tmpdf, geom="bar", fill=status) + myt +
theme(legend.position = "none")
qplot(sector, data=df, geom="bar", fill=status, xlab="sector") + myt +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=1)) + theme(legend.position = "none")
rm(tmpdf)
```
\vspace{20pt}
```{r fig.width=4, fig.height=2.5, out.width='.49\\linewidth', fig.show='hold',fig.align='center'}
df %>%
ggplot(aes(loan_amount, colour=status, fill=status)) + myt +
geom_density(alpha=0.1) +
theme(legend.position = "none") +
labs(x = "loan_amount (USD)")
df %>%
mutate(en_length = nchar(story)) %>%
ggplot(aes(en_length, colour=status, fill=status)) + myt +
geom_density(alpha=0.1) +
theme(legend.position = "none") +
labs(x = "Number of letters in `story`")
```
<P style="page-break-before: always">
\newpage
### Top Words
The table below shows the top (i.e., most frequently occurring) words in the `story` variable.
```{r rows.print=20}
kable(text_df %>% group_by(word) %>%
summarize(count=n()) %>%
mutate(freq = count / sum(count)) %>%
arrange(desc(count)) %>%
top_n(17), "latex", booktab=TRUE) %>%
kable_styling(full_width=TRUE, latex_options = c("striped", "scale_down"), font_size=9)
```
### Most Biased Words
The plot below shows which words are most biased towards `paid` or `defaulted`, using the log odds ratio metric.
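For a word $w$, the log odds ratio used here is $\log_2\left(n_{\text{paid}}(w) / n_{\text{defaulted}}(w)\right)$, where $n_{\text{paid}}(w)$ and $n_{\text{defaulted}}(w)$ are the number of times $w$ appears in paid and defaulted stories, respectively. Positive values indicate a word is more associated with `paid`; negative values, with `defaulted`.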
```{r}
status_words_count = text_df %>% group_by(status, word) %>%
summarize(count=n()) %>%
arrange(desc(count))
log_ratios = status_words_count %>%
spread (status, count) %>%
select(-`<NA>`) %>%
mutate(defaulted = ifelse(is.na(defaulted), 0, defaulted)) %>%
mutate(paid = ifelse(is.na(paid), 0, paid)) %>%
mutate(total=defaulted+paid) %>%
mutate(log_ratio = log2(paid/defaulted))
```
```{r, fig.height=3.5}
log_ratios %>%
filter(total > 500) %>%
group_by(log_ratio < 0) %>%
top_n(14, abs(log_ratio)) %>%
ungroup() %>%
mutate(word = reorder(word, log_ratio)) %>%
ggplot(aes(word, log_ratio, fill = log_ratio < 0)) +
myt +
geom_col() +
coord_flip() +
ylab("log odds ratio") +
scale_fill_discrete(name = "", labels = c("paid", "default")) +
theme(legend.position = "right", legend.text = element_text(size = 8),
legend.margin = margin(0.5, 0.5, 0.5, 0.5, "pt"))
```
```{r, include=FALSE}
kable(log_ratios %>%
filter(total > 500) %>%
arrange(desc(log_ratio)) %>%
top_n(17), "latex", booktab=TRUE) %>%
kable_styling(full_width=TRUE, latex_options = c("striped", "scale_down"))
```
```{r rows.print=20, include=FALSE}
kable(log_ratios %>%
filter(total > 500) %>%
arrange((log_ratio)) %>%
top_n(-20), "latex", booktab=TRUE) %>%
kable_styling(full_width=TRUE, latex_options = c("striped", "scale_down"))
```
<P style="page-break-before: always">
\newpage
# Exhibit 3: LDA Topics
Latent Dirichlet Allocation (LDA) was applied to the `story` variable using the R package `topicmodels`. The Kiva team instructed LDA to find twelve topics.
### LDA Top Terms Per Topic
The figure below shows the top words related to each of the twelve discovered topics. LDA itself does not assign a human-readable name to each topic; the topics are defined only by their word probabilities. However, to make the topics easier to understand at a glance, Kiva's data science team has given each topic a name built from its four highest-probability words, of the form "Topic Number: TopWord1_TopWord2_TopWord3_TopWord4."
```{r,fig.width=10,fig.height=8.0}
ap_top_terms <- lda_beta %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
ap_top_terms %>%
mutate(term = reorder(term, beta)) %>%
left_join(tn, by=c("topic" = "id")) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
myt +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic_name, scales = "free_y", ncol=3) +
coord_flip() +
theme(strip.text = element_text(size=10)) # Make titles smaller
```
<P style="page-break-before: always">
\newpage
### Documents per LDA Topic
The figure below shows the number of documents that contain each topic (here, a document "contains" a topic if the topic accounts for at least 5% of its words).
```{r}
topic_totals = lda_gamma %>%
left_join(df, by=c("document" = "id")) %>%
select(c(-story)) %>%
filter(gamma >= 0.05) %>%
group_by(topic, status) %>%
summarize(count=n()) %>%
spread(status, count) %>%
mutate(total = defaulted + paid) %>%
left_join(tn, by=c("topic" = "id")) %>%
select(topic, topic_name, everything())
```
```{r fig.height=2}
tmp_gathered = topic_totals %>%
select(topic, topic_name, defaulted, paid) %>%
gather(Status, Value, defaulted, paid)
tmp_gathered$topic = as.factor(tmp_gathered$topic)
ggplot(tmp_gathered, aes(x=topic, y=Value, fill=Status)) + myt +
geom_bar(stat="identity") +
theme(legend.position = "right", legend.text = element_text(size = 8),
legend.margin = margin(0.5, 0.5, 0.5, 0.5, "pt"))
```
```{r, include=FALSE}
kable(topic_totals, "latex", booktab=TRUE) %>%
kable_styling(full_width=TRUE, latex_options = c("striped", "scale_down"))
```
### LDA Examples
Below is an example of a `story` that contained LDA topic 8 at 99%.
> *Senaida Agueda has a business in which she buys and resells clothing that she has been operating for about 2 years now with the help of loans from Esperanza. She has two children, 41 and 38, whom do not live with Mrs. Agueda. When not working with her clothing shop, she enjoys going to the beach and cooking, arroz con carne (rice with meat) being one of her favorite dishes. As Mrs. Agueda is in her elder years and has fully grown children, she simply wishes to sustain her business to support her and her husband. Mrs. Senaida Agueda is a member of an eight person group, Group 4, that is part of a larger micro-bank called Mujeres de Fe, "Women of Faith" in English. In the picture, Mrs. Agueda is third from the right along with members of her group and some others of Mujeres de Fe. (...)*
Below is an example for LDA topic 4:
> *Descripción del Negocio. La Sra. Angela se dedica a la venta de artículos para el hogar a crédito y de forma ambulante; además vende golosinas a los niños de una escuela. En la actualidad vende a personas de otros sectores que han sido recomendados por buenos clientes que le refieren esos sitios. Uso del Préstamo. Ella necesita el crédito para comprar más mercadería pues en estas épocas de fin de año le son muy solicitadas. Información Personal. Ella tiene 29 años y tiene dos hijos que estudian; su casa es de caña y tiene estabilidad familiar. Translated from Spanish by Kiva Volunteer Wendy Wise*
And finally, below is an example for LDA topic 1:
> *Mary is 65 years of age, married with six children. All her children are married and self-reliant. She is a member of St Jude group at Githunguri in Thika district. Mary is earns her income as a dairy farmer. She needs a USD 150 loan to help her buy another small high breed dairy calf, which she will raise to maturity. She plans to meet her repayments on monthly basis.*
```{r, eval=FALSE, include=FALSE}
ids = lda_gamma %>%
filter(topic==1) %>%
arrange(desc(gamma)) %>%
top_n(1000) %>%
left_join(df_new, by=c("document" = "id")) %>%
mutate(len = nchar(story)) %>%
arrange(len)
ids
ids = c(7830, 7306, 7258, 7105)
df[7306,3]
df_new[ids,]
kable(t(df_new[ids[1],]), "latex", booktab=TRUE) %>%
kable_styling(full_width=TRUE, latex_options = c("striped", "scale_down"))
```
<P style="page-break-before: always">
\newpage
# Exhibit 4: A Sample of Data
The table below shows the same 10 random rows from Exhibit 1, except from the final dataset. The columns named "T1", "T2", etc., contain the percentage of the original story's words that have been assigned to topic 1, topic 2, etc., as determined by LDA.
\vspace{20pt}
```{r}
aa = df_new[sample_ids, ] %>%
select(-story, -id)
# Just to make the table look better
aa$country = gsub("Dominican Republic", "D.R.", aa$country)
setnames(aa, new=sprintf("T%d", c(1:12)), old=sprintf("topic %d: %s", c(1:12), topic_names))
# Stories dropped during preprocessing have no topic proportions;
# set their NAs to 0.
aa[is.na(aa)] <- 0
kable(aa, "latex", booktab=TRUE, digits=2) %>%
kable_styling(latex_options = c("striped", "scale_down"))
```
<P style="page-break-before: always">
\newpage
# Exhibit 5: Model Performance
Decision tree models were created with the R package `rpart`. Here, the decision tree models are automatically created by a machine learning algorithm as the algorithm learns simple decision rules from the data. These automatically-learned rules can then be used both to understand the variables and to predict future data. A big advantage of decision trees over other models is that they are relatively simple for humans to understand and interpret.
A decision tree consists of nodes. Each node splits the data according to a rule, and a rule is based on a variable in the data. For example, a rule might be "Age greater than 30." In this case, the node splits the data by the `age` variable; those borrowers who satisfy the rule (i.e., are older than 30) follow the left path out of the node; the rest follow the right path. In this way, paths from the root node down to leaf nodes are created, describing the fate of certain types of borrowers.
A decision tree path always starts at the root node (node number 1), which contains the most important splitting rule. Each subsequent node contains the next most important rule. After the decision tree is automatically created by the machine learning algorithm, one can use it to classify an individual by simply following a path: start at the root node and apply each rule in turn until you reach a leaf.
When creating a decision tree from data, the analyst can control the number of nodes the machine learning algorithm creates. More nodes lead to a more accurate model, at the cost of a more complicated and harder-to-interpret model, as well as the risk of overfitting the training data. Likewise, fewer nodes usually lead to a less accurate model, but one that is easier to understand and interpret.
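In `rpart`, which the team uses below, tree size is controlled indirectly through settings such as the maximum depth and the complexity parameter. The values in this sketch are illustrative only, not the settings used in the models that follow; `train` stands in for a training set like the ones created below:
```{r, eval=FALSE}
# Illustrative sketch (not run): grow a deliberately small,
# easy-to-read tree by capping depth and raising the complexity
# parameter. `train` is a hypothetical training set.
small_tree <- rpart(status ~ ., data = train, method = "class",
                    control = rpart.control(maxdepth = 3, cp = 0.02))
rpart.plot(small_tree, extra = 2)
```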
## Model 1 (No text)
Below is the model that was created from only the original numerical and categorical variables.
```{r fig.height=2.5}
set.seed(123)
# Exclude id and story from prediction; the - sign doesn't work in rpart
# formulas, so select only the desired columns instead.
df_notext = subset(df_new, select=c(status, sector, country, gender, loan_amount, nonpayment))
# Split the data into training and testing.
train_notext <- sample_frac(df_notext, 0.8)
test_notext <- setdiff(df_notext, train_notext)
# Let's train the model.
form = as.formula(status ~ .)
tree <- rpart(form, train_notext, method="class")
rpart.plot(tree, extra=2)
```
<!--
\vspace{20pt}
The following table summarizes the predictions of the decision tree on the testing data.
-->
```{r, eval=TRUE, include=FALSE}
predicted = predict(tree, test_notext, type="class")
actual = test_notext$status
preds = data.frame((table(predicted, actual))) %>%
spread(actual, Freq) %>%
mutate(total = defaulted + paid) %>%
select(predicted, total, everything())
```
```{r, include=FALSE}
kable(preds, "latex", booktab=TRUE) %>%
kable_styling(full_width=TRUE, latex_options = c("striped", "scale_down")) %>%
add_header_above(c(" " = 2, "actual" = 2))
```
<P style="page-break-before: always">
\newpage
## Model 2 (With Text)
Below is the model that was created from all variables in the dataset.
```{r fig.height=3.0}
set.seed(123)
# Exclude id and story from prediction; the - sign doesn't work in rpart
# formulas, so drop them with subset() instead.
df_text = subset(df_new, select=c(-id, -story))
# Split the data into training and testing.
train_text <- sample_frac(df_text, 0.8)
test_text <- setdiff(df_text, train_text)
# Let's create the model.
form = as.formula(status ~ .)
tree.text <- rpart(form, train_text, method="class")
rpart.plot(tree.text, extra=2)
```
<!--
\vspace{20pt}
Below is a summary of its predictions:
-->
```{r, include=FALSE}
predicted.text = predict(tree.text, test_text, type="class")
actual.text = test_text$status
preds.text = data.frame((table(predicted.text, actual.text))) %>%
spread(actual.text, Freq) %>%
mutate(total = defaulted + paid) %>%
select(predicted.text, total, everything())
```
```{r, include=FALSE}
kable(preds.text, "latex", booktab=TRUE) %>%
kable_styling(full_width=TRUE, latex_options = c("striped", "scale_down")) %>%
add_header_above(c(" " = 2, "actual" = 2))
```
<!--
That is, the model predicted `defaulted` 74 times: 50 times correctly, and 24 times incorrectly. It predicted `paid` 159 times: 120 times correctly, and 39 times incorrectly.
-->
## Metrics
Below are the accuracy and other metrics for the two models.
```{r}
bb = data.frame(Metric=c("Accuracy", "Precision", "Recall", "F1 Score", "Sensitivity", "Specificity"),
"Model_1" =c(Accuracy(y_true=actual, y_pred=predicted),
Precision(y_true=actual, y_pred=predicted),
Recall(y_true=actual, y_pred=predicted),
F1_Score(predicted, actual),
Sensitivity(y_true=actual, y_pred=predicted),
Specificity(y_true=predicted, y_pred=actual)),
"Model_2" = c(Accuracy(y_true=actual.text, y_pred=predicted.text),
Precision(y_true=actual.text, y_pred=predicted.text),
Recall(y_true=actual.text, y_pred=predicted.text),
F1_Score(predicted.text, actual.text),
Sensitivity(y_true=actual.text, y_pred=predicted.text),
Specificity(y_true=predicted.text, y_pred=actual.text)))
kable(bb, "latex", booktab=TRUE, digits=3) %>%
kable_styling(full_width=TRUE, latex_options = c("striped", "scale_down"), font_size=9)
```
# Appendix 1: Data Collection
The data in this case study was collected from Build.Kiva[^2], Kiva's website that provides snapshots of Kiva loan data. In the full dataset, about 98% of loans are paid and 2% defaulted. In this case study, we look at only a sample of the data, where the split between paid and defaulted is closer to 50%-50%. This sample is available at [http://www.github.com/stepthom/sandbox/data/kiva.csv](http://www.github.com/stepthom/sandbox/data/kiva.csv).
[^2]: https://build.kiva.org
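A balanced subsample like this can be drawn in a few lines. The sketch below is illustrative only; `full_data` is a hypothetical name for the raw Build.Kiva extract, assumed to have the same `status` column:
```{r, eval=FALSE}
# Illustrative sketch (not run): draw roughly 4,000 loans of each status
# so that paid and defaulted are balanced. `full_data` is a hypothetical
# name for the raw Build.Kiva extract.
set.seed(123)
balanced <- full_data %>%
  group_by(status) %>%
  sample_n(4000) %>%
  ungroup()
```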