---
title: "Classification with decision trees"
author: "Anton Barrera Mora ([email protected])"
date: "June 2023"
output:
  github_document:
    preserve_yaml: true
  word_document: default
  pdf_document:
    highlight: zenburn
    toc: yes
  html_document:
    highlight: default
    number_sections: yes
    theme: cosmo
    toc: yes
    toc_depth: 3
    includes:
      in_header: p_brand.html
editor_options:
  markdown:
    wrap: sentence
bibliography: references.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r librerias y paquetes, echo=FALSE, message=FALSE, warning=FALSE}
# Load the required libraries in the background, installing them if missing
if(!require('ggplot2')) install.packages('ggplot2'); library('ggplot2')
if(!require('dplyr')) install.packages('dplyr'); library('dplyr')
if(!require('janitor')) install.packages('janitor'); library('janitor')
if(!require('magrittr')) install.packages('magrittr'); library('magrittr')
```
# Introduction
We will address the creation of a supervised data mining project.
We will use a classification algorithm, specifically the decision tree model.
We will rely on the 'German Credit' dataset from @ucimach as our reference.
We will consider the 'default' variable as an indicator and label for credit defaults.
This is a classification problem because the outcome is a discrete variable: whether credits are paid or not, with only two classes.
Decision trees and random forests can be employed for this type of classification problem, as well as for supervised regression problems.
They can handle non-linear relationships and interactions between variables well, helping to understand which factors are driving the outcomes, in this case, the factors present in credit defaults.
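As a minimal sketch of the kind of model we will build, the following fits a small classification tree with `rpart` (a recommended package bundled with R, used here only for illustration; later in this project we use the C5.0 implementation instead):

```r
library(rpart)

# Fit a small classification tree on a built-in dataset
fit <- rpart(Species ~ ., data = iris, method = "class")

# Predict the class of the first observation (a setosa flower)
pred <- predict(fit, iris[1, ], type = "class")
print(pred)
```

The same two ingredients appear below: a discrete target variable and a set of predictors, only with the credit data instead of `iris`.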
# Phase 1: Understanding the Business
## Problem:
The requirement is to have the ability to predict which customers, based on certain variables, may default on credit in case of granting a loan.
## Data Collection:
We will base the project on the "[Statlog (German Credit Data) Data Set](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data))." The dataset from the year 1994 classifies individuals described by a set of attributes to determine if they are a good or bad credit risk.
It contains a total of 1000 records with around 20 variables, presented in two formats, one of which is numeric only, including a cost matrix.
This dataset is publicly available.
Additionally, the dataset is accompanied by [documentation](https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.doc) that explains the different attributes.
# Phase 2: Understanding the Data
## Initial Analysis
We start the work by conducting an analysis of the data and the different variables present in the dataset.
This initial exploration will give us insights into the structure of the dataset, the types of variables present, and an overview of their distribution and summary statistics.
### Exploring the dataset
Loading the dataset:
```{r carga del dataset, echo=TRUE, message=FALSE, warning=FALSE}
credit <- read.csv("credit.csv", header = TRUE, sep = ",")
attach(credit) # Attach the data frame to the search path so variables can be referenced by name alone, although this approach has known drawbacks
```
We will take a first look at the data to see what variables are present:
```{r glimpse1, echo=TRUE, message=FALSE, warning=FALSE}
# Inspect the variables and their types
glimpse(credit)
```
We can see that there are a large number of categorical variables.
We proceed with a detailed analysis:
#### Description of attributes or variables:
- Checking_balance \<chr\>.
It refers to the status of the checking account for the loan in Deutsche Mark (DM).
It is a categorical variable with four possible categories:
- Less than 0 DM
- Between 0 and 200 DM
- More than 200 DM or salary has been deposited into this account for at least 1 year
- No checking account
Although decision trees can handle categorical variables, in this case, we will convert it into dummy variables based on whether there is a balance or not in order to make the model easier to interpret.
- months_loan_duration \<int\>.
It refers to the duration of the loan repayment in months.
It is a numerical variable with a wide range.
In this case, we will scale the variable.
- credit_history \<chr\>.
It is a categorical variable that describes the credit history of the loan applicant.
It does not require transformation, but we will recode the levels.
- purpose \<chr\>.
This variable answers the question: What is the purpose of the credit application?
It is another categorical variable.
In this case, we will convert it into a dummy variable that answers the question: Is it for leisure?
Where 0 is false and 1 is true.
- amount \<int\>.
The granted credit amount.
The range of values is very wide, so we will normalize the variable to ensure that the attributes have a similar range of values.
- savings_balance \<chr\>.
It refers to the amount in a savings account, distinguishing it from a checking account.
Like attribute 1, it may be important for predicting whether someone will default on a loan or have difficulties with loan repayments.
We will use the same strategy as attribute 1.
- employment_length \<chr\>.
This attribute refers to the length of employment in the current job.
It is an ordinal variable.
We will encode the categories as integers.
- installment_rate \<int\>.
It refers to the percentage of disposable income that is allocated to loan installments.
The dataset authors define this attribute as "Installment rate in percentage of disposable income".
It takes values from 1 to 4, which seem to indicate ranges or categories instead of literal percentages.
We understand that 1 represents a low percentage of disposable income allocated to loan installments, and 4 a high percentage.
Therefore, a higher value in "installment_rate" indicates a higher percentage of disposable income dedicated to loan installments, which could increase the risk of default if financial difficulties arise.
We will keep the values as they are.
- personal_status \<chr\>.
It refers to marital status.
There is a risk of gender bias here, as it is scientifically questionable and ethically reproachable to assume that women or men are of a certain creditworthy nature based on their gender.
This variable presents many ethical problems, and we will choose to exclude it.
- other_debtors \<chr\>.
Other debtors refer to the presence of guarantors.
We will recode the levels.
- residence_history \<int\>.
Reviewing the dataset documentation, we observe that it refers to the length of time the person has been residing in the current residence.
It is not very relevant considering that attribute number 14 refers to the aspect of property ownership of the residence.
Therefore, we will choose not to consider it.
- property \<chr\>.
It refers to the types of property owned by the loan borrower.
It has specific categories such as real estate, building society savings agreement/life insurance, car or other, and unknown/no property.
It is a categorical variable.
We are interested in the differences between customers who have different properties, so we will create dummy variables for each property type.
- age \<int\>.
Age is a numerical variable.
Classifying individuals based on their age is clearly unethical and discriminatory, so we will exclude this variable from the dataset.
- installment_plan \<chr\>.
It refers to the existence of other installment plans or loans that the credit applicant may have, in addition to the loan being applied for.
Originally, it refers to whether it is with a bank, a store, or no installment plan.
In this case, we will convert it into a new dummy variable.
- housing \<chr\>.
It refers to the type of housing and its ownership.
In this case, we will create dummy variables for each category.
- existing_credits \<int\>.
This attribute refers to the number of credits the person has with the bank.
This attribute raises serious doubts since there is already a similar attribute (14).
The documentation does not clarify whether it refers to a credit already paid or in progress.
We choose to keep it because the formulation of the attribute name is in the present tense, so we assume that they are credits already requested, in progress, and pending payment.
We will keep the values as they are.
- default \<int\>.
The target variable.
It is encoded as 1 or 2.
We will modify it to 0 and 1.
- telephone \<chr\>.
Whether the client has a telephone installed or not can be an independent variable to consider when assessing economic capacity.
We will convert it into another dummy variable, although in the present day, the possession of a telephone in a household may not be very representative of its economic potential.
- foreign_worker \<chr\>.
It refers to whether the client who enjoyed the credit was a foreign worker or not.
The inclusion of certain characteristics, such as nationality, race, gender, religion, sexual orientation, among others, in credit decision models has been the subject of significant ethical and legal debate.
This is a case specific to the German society model, where this variable may have made sense in its time.
From a legal perspective, legislation varies depending on the country.
In the United States, the "**Equal Credit Opportunity Act**" prohibits discrimination in any aspect of a credit transaction based on race, color, religion, national origin, sex, marital status, age, among others.
The ethics are questionable, so we will choose to remove it from the model.
- dependents \<int\>.
It refers to the number of dependents the client has.
We choose to keep it in its current integer format.
- job \<chr\>.
Skilled worker or not.
It contains categories related to the legal status of the worker in the country.
We will create dummy variables for each category.
At this point, we would like to highlight several ethical issues raised by the data.
The inclusion of many of the variables present (age, worker's origin, marital status, gender) borders on illegality, if it is not outright illegal, under current legislation in many parts of the world.
They clearly deserve ethical scrutiny, and we have chosen not to include them.
This is an educational exercise, but in a real-life scenario we would refuse to include characteristics of this kind, which only serve to bias the model and support discriminatory policies.
We continue the analysis in search of 'NA' values and the distribution of the variables:
```{r summary1, echo=TRUE, message=FALSE, warning=FALSE}
# Look for NA values and study the distribution of the variables
summary(credit)
perdidos <- credit[!complete.cases(credit), ]
print(perdidos)
```
We observe the characteristics and how the variables are distributed.
At this point, it is interesting to highlight some information such as:
- The loan duration is centered around an average of 18 months.
- The granted amounts revolve around an average of 3271 DM.
- On average, customers had only one credit with the institution.
- And, regarding the customer prototype, they tend to have a dependent family member.
We ensure that there are no blank values.
```{r blank1, echo=TRUE, message=FALSE, warning=FALSE}
# Find rows containing at least one blank value
blank_rows <- rowSums(credit == "") > 0
# Print any such rows
print("Blanks")
print(credit[blank_rows, ])
# Check the dimensions of the dataset (should be 1000 x 21)
dim(credit)
```
We can confidently state that the dataset does not have any missing or blank values.
However, as part of the data preparation process, we will need to remove and transform variables as previously mentioned.
## Dataset preparation
1. We will exclude the attributes we decided not to consider: marital status, residence history, age, and foreign-worker status:
```{r eliminando atributos, echo=TRUE, message=FALSE, warning=FALSE}
# Drop the excluded columns
credit <- select(credit, -c(personal_status, residence_history, age, foreign_worker))
```
2. We are going to convert checking_balance\<chr\> into dummy variables. The current account balance can be relevant to whether defaults occur, so we generate a dummy variable for each category:
```{r checking_balance, echo=TRUE, message=FALSE, warning=FALSE}
# First of all, convert the attribute to a factor:
credit$checking_balance <- as.factor(credit$checking_balance)
# Convert 'checking_balance' into dummy variables (-1: one column per level, no reference category)
dummy_vars <- model.matrix(~checking_balance -1, data = credit)
# Convert the result to a data frame and assign appropriate column names
dummy_df <- as.data.frame(dummy_vars)
dummy_df[] <- lapply(dummy_df, as.integer) # model.matrix returns doubles; convert back to integer
colnames(dummy_df) <- levels(credit$checking_balance)
# Force more descriptive column names:
names(dummy_df) <- c("checking_balance_lt_0", "checking_balance_gt_200", "checking_balance_1_200", "checking_balance_unknown")
# Bind the new variables to the original data frame
credit <- cbind(credit, dummy_df)
# Drop the original 'checking_balance' attribute:
credit$checking_balance <- NULL
# Inspect the resulting data frame:
str(credit)
names(credit)
```
And we check that the changes have been made correctly.
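The mechanics of `model.matrix` used above can be seen on a toy factor; the `-1` in the formula keeps one indicator column per level instead of absorbing the first level into an intercept as a reference category:

```r
# A small data frame with a three-level factor
d <- data.frame(f = factor(c("a", "b", "a", "c")))

# One 0/1 indicator column per level ("fa", "fb", "fc")
m <- model.matrix(~ f - 1, data = d)
print(m)
```

Each row has exactly one 1, marking the level that observation belongs to; this is the pattern repeated for every categorical attribute below.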
3. months_loan_duration\<int\>. The expected amortisation of the loan in months. We will scale the variable:
```{r escalado de months_loan_duration, echo=TRUE, message=FALSE, warning=FALSE}
# Scale the months_loan_duration variable (z-scores)
credit_months_loan_z <- scale(credit$months_loan_duration)
# Bind the new variable to the original data frame
credit <- cbind(credit, credit_months_loan_z)
# Drop the original 'months_loan_duration' attribute:
credit$months_loan_duration <- NULL
# Inspect the table
glimpse(credit)
```
We observe that everything is correct and proceed.
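What `scale()` does here is compute z-scores: it centres each value on the mean and divides by the standard deviation. A quick self-contained check:

```r
# z-scores by hand versus scale()
x <- c(6, 12, 24, 48)
z <- scale(x)

# scale() centres on the mean and divides by the standard deviation
check <- all.equal(as.vector(z), (x - mean(x)) / sd(x))
print(check)  # TRUE
```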
4. credit_history \<chr\>. Describes the credit history of the loan applicant. We will recode the levels to make it more understandable:
- critical: This category refers to applicants who have a history of critical credit behavior, such as not paying other credits that are not with the bank in question.
- delayed: This category refers to applicants who have had delays in the payment of their credits in the past.
- fully repaid: Refers to applicants who have fully repaid their credits in the past.
- fully repaid this bank: This category refers to applicants who have fully repaid their credits at the bank in question.
- repaid: Refers to applicants who have repaid their credits to date.
```{r recodificando credit_history, echo=TRUE, message=FALSE, warning=FALSE}
# Recode the variable levels
credit$credit_history <- recode(credit$credit_history,
                                "critical" = "Critical",
                                "delayed" = "PaymentDelayed",
                                "fully repaid" = "FullyRepaid",
                                "fully repaid this bank" = "FullyRepaidThisBank",
                                "repaid" = "Repaid")
# Convert the variable to a factor
credit$credit_history <- as.factor(credit$credit_history)
# Inspect the table
glimpse(credit)
```
We observe that the categories have the desired format.
We continue with the next variable:
5. purpose\<chr\>. What is the purpose of the loan application? This is another categorical variable. In this case, we will convert it into a dummy variable. Initially, we had planned to convert it into a binary variable, but at this point, it could be interesting to know which type of consumer loans would generate a higher default rate. Therefore, we create new attributes for each category.
```{r dummy purpose, echo=TRUE, message=FALSE, warning=FALSE}
# Create dummy variables for this attribute; first convert it to a factor:
credit$purpose <- as.factor(credit$purpose)
# Convert 'purpose' into dummy variables (-1: one column per level, no reference category dropped)
dummy_vars <- model.matrix(~ purpose -1, data = credit)
# Convert the result to a data frame and assign appropriate column names
dummy_df <- as.data.frame(dummy_vars)
dummy_df[] <- lapply(dummy_df, as.integer) # model.matrix returns doubles; convert back to integer
colnames(dummy_df) <- levels(credit$purpose)
# Bind the new variables to the original data frame
credit <- cbind(credit, dummy_df)
# Drop the original 'purpose' attribute:
credit$purpose <- NULL
# Normalise the attribute names:
credit <- credit %>% clean_names()
# Inspect the resulting data frame:
glimpse(credit)
# List the variables:
names(credit)
```
And indeed, the changes have gone in the desired direction.
6. amount\<int\>. The amount of credit granted. We normalise it so that the attributes have a similar range of values, expressed as standard scores.
```{r z amount, echo=TRUE, message=FALSE, warning=FALSE}
# Scale the amount variable (z-scores)
amount_z <- scale(credit$amount)
# Bind the new variable to the original data frame
credit <- cbind(credit, amount_z)
# Drop the original 'amount' attribute:
credit$amount <- NULL
# Inspect the table
glimpse(credit)
```
We have effectively converted the attribute to its z-scores.
7. savings_balance\<chr\>. The savings account. Like attribute 1, it can be important for predicting whether someone will default on a loan or have difficulty making repayments. We will use the same strategy as with attribute 1.
```{r dummy savings_balance, echo=TRUE, message=FALSE, warning=FALSE}
# Convert the attribute to a factor:
credit$savings_balance <- as.factor(credit$savings_balance)
# Convert 'savings_balance' into dummy variables (-1: no reference category)
dummy_vars <- model.matrix(~savings_balance -1, data = credit)
# Convert the result to a data frame and assign appropriate column names
dummy_df <- as.data.frame(dummy_vars)
dummy_df[] <- lapply(dummy_df, as.integer) # convert to integer
colnames(dummy_df) <- levels(credit$savings_balance)
# Force more descriptive column names:
names(dummy_df) <- c("savings_bal_lt_100", "savings_bal_gt_1000", "savings_bal_101_500", "savings_bal_501_1000", "savings_bal_unknown")
# Bind the new variables to the original data frame
credit <- cbind(credit, dummy_df)
# Drop the original 'savings_balance' attribute:
credit$savings_balance <- NULL
# Inspect the resulting data frame:
str(credit)
names(credit)
```
The changes have indeed taken place as expected, we continue with the eighth variable.
8. employment_length\<chr\>. Length of service in the job. This is an ordinal variable in which we will code the categories as integers.
```{r ordinal employment_length, echo=TRUE, message=FALSE, warning=FALSE}
# Encode the ordinal categories
credit$employment_length <- factor(credit$employment_length,
                                   levels = c("unemployed", "0 - 1 yrs", "1 - 4 yrs", "4 - 7 yrs", "> 7 yrs"),
                                   labels = c(0, 1, 2, 3, 4),
                                   ordered = TRUE)
# Inspect the first records of the modified column
head(credit$employment_length, 4)
```
We observe that the levels have been modified according to our intentions: 0 = unemployed, 1 = 0-1 yrs, 2 = 1-4 yrs, 3 = 4-7 yrs, 4 = \> 7 yrs.
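Note that the resulting column is still an ordered factor whose labels happen to be the digits 0-4. If a purely numeric column were needed later (some algorithms require it), the codes could be recovered as integers; a sketch on a hypothetical sample reusing the same levels:

```r
# Hypothetical sample with the same levels as employment_length
x <- factor(c("unemployed", "1 - 4 yrs", "> 7 yrs"),
            levels = c("unemployed", "0 - 1 yrs", "1 - 4 yrs", "4 - 7 yrs", "> 7 yrs"),
            ordered = TRUE)

# Internal factor codes start at 1, so subtract 1 to match the 0-4 labelling
codes <- as.integer(x) - 1L
print(codes)  # 0 2 4
```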
9. installment_rate\<int\>. The percentage of disposable income allocated to loan installments. It takes values from 1 to 4, which seem to indicate ranges or categories rather than literal percentages: 1 represents a low share of disposable income going to installments and 4 a high one, so a higher value could increase the risk of default if financial difficulties arise. We keep the values without modification.
10. personal_status\<chr\>. This refers to marital status. We excluded it at the beginning of this phase.
11. other_debtors\<chr\>. This attribute refers to the presence of guarantors. It does not require any changes.
12. residence_history\<int\>. It has been removed from the dataset.
13. property\<chr\>. This refers to the types of property owned by a borrower. It is a categorical variable. We will create dummy attributes for each property type.
```{r dummy property, echo=TRUE, message=FALSE, warning=FALSE}
# Convert the attribute to a factor:
credit$property <- as.factor(credit$property)
# Convert 'property' into dummy variables (-1: no reference category)
dummy_vars <- model.matrix(~property -1, data = credit)
# Convert the result to a data frame and assign appropriate column names
dummy_df <- as.data.frame(dummy_vars)
dummy_df[] <- lapply(dummy_df, as.integer) # convert to integer
colnames(dummy_df) <- levels(credit$property)
# Force more descriptive column names:
names(dummy_df) <- c("property_soc_savings", "property_other", "property_r_estate", "property_unk_none")
# Bind the new variables to the original data frame
credit <- cbind(credit, dummy_df)
# Drop the original 'property' attribute:
credit$property <- NULL
# Inspect the resulting data frame:
str(credit)
names(credit)
```
We checked that everything is correct.
We continue with the data preprocessing:
14. age\<int\>. Already removed from the dataset.
15. installment_plan\<chr\>. Other payment plans or loans that the credit applicant may have in addition to the credit being applied for. We will convert it into a binary variable without creating new attributes, reducing it to yes or no (1 and 0).
```{r mutate installment_plan, echo=TRUE, message=FALSE, warning=FALSE}
# Keeping the same column, recode to a binary variable: bank and stores become 1, the rest 0.
credit <- mutate(credit,
                 installment_plan = ifelse(installment_plan %in% c('bank', 'stores'), 1, 0))
# Inspect the modifications
head(credit$installment_plan, 5)
```
16. housing\<chr\>. This refers to the usual residence and its ownership. In this case we will proceed to create dummy variables for each category.
```{r dummy housing, echo=TRUE, message=FALSE, warning=FALSE}
# Convert the attribute to a factor:
credit$housing <- as.factor(credit$housing)
# Convert 'housing' into dummy variables (-1: no reference category)
dummy_vars <- model.matrix(~housing -1, data = credit)
# Convert the result to a data frame and assign appropriate column names
dummy_df <- as.data.frame(dummy_vars)
dummy_df[] <- lapply(dummy_df, as.integer) # convert to integer
colnames(dummy_df) <- levels(credit$housing)
# Force more descriptive column names:
names(dummy_df) <- c("housing_free", "housing_own", "housing_rent")
# Bind the new variables to the original data frame
credit <- cbind(credit, dummy_df)
# Drop the original 'housing' attribute:
credit$housing <- NULL
# Inspect the resulting data frame:
str(credit)
names(credit)
```
And we observe once again that the required dummy variables have been created.
17. existing_credits\<int\>. This refers to the number of existing credits with the bank. We choose to keep it as is because the name of the variable is in the present tense, so we assume that these are credits that have already been applied for, in progress, and pending payment. We will keep the values unchanged.
18. default\<int\>. The target variable. It is currently encoded as 1 or 2, and we will modify it to 0 and 1.
```{r default recode, echo=TRUE, message=FALSE, warning=FALSE}
# Simply subtract 1 so that the values become 0 and 1
credit$default <- credit$default - 1
# Inspect the variable (referencing the data frame explicitly, since the attached copy of 'default' still holds the old values)
head(credit$default, 8)
```
Based on the documentation, assessing the information from @ucimach:
"This dataset requires use of a cost matrix (see below)

```
        1   2
  1     0   1
  2     5   0
```

(1 = Good, 2 = Bad)"
We will assume that 0 represents 'good', indicating no payment issues or 0=FALSE, meaning no default, and 1=TRUE, indicating there were payment problems.
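That cost matrix can be encoded now for later use. As a sketch, relabelled to our 0 = good, 1 = bad convention (the row/column orientation expected by a given learner, e.g. the `costs` argument of C5.0, should be confirmed against its documentation before use):

```r
# Cost matrix from the dataset documentation, relabelled to 0 = good, 1 = bad:
# the costly error (cost 5) is treating a bad credit as good.
cost_matrix <- matrix(c(0, 1,
                        5, 0),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(c("0", "1"), c("0", "1")))
print(cost_matrix)

# The asymmetric entry: truth "1" (bad) handled as "0" (good)
cost_matrix["1", "0"]  # 5
```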
Continuing with the data preprocessing:
19. telephone\<chr\>. We will convert it to a dummy or binary variable.
```{r bin telephone, echo=TRUE, message=FALSE, warning=FALSE}
# Keeping the same column, recode to a binary variable according to telephone ownership.
credit <- mutate(credit,
                 telephone = ifelse(telephone %in% c('yes'), 1, 0))
# Inspect the modifications
glimpse(credit)
```
20. foreign_worker\<chr\>. Removed from the dataset.
21. dependents\<int\>. Client dependents. No changes.
22. job\<chr\>. Qualification. We will create dummy variables for each category.
```{r job, echo=TRUE, message=FALSE, warning=FALSE}
# Convert the attribute to a factor:
credit$job <- as.factor(credit$job)
# Convert 'job' into dummy variables (-1: no reference category)
dummy_vars <- model.matrix(~job -1, data = credit)
# Convert the result to a data frame and assign appropriate column names
dummy_df <- as.data.frame(dummy_vars)
dummy_df[] <- lapply(dummy_df, as.integer) # convert to integer
colnames(dummy_df) <- levels(credit$job)
# Force more descriptive column names:
names(dummy_df) <- c("job_mang_self", "job_skill_emp", "job_unemp", "job_unskill")
# Bind the new variables to the original data frame
credit <- cbind(credit, dummy_df)
# Drop the original 'job' attribute:
credit$job <- NULL
# Inspect the resulting data frame:
str(credit)
names(credit)
```
We make the last corrections on some attribute types to unify criteria in 'default' and 'telephone':
```{r}
# Convert the default column to integer
credit$default <- as.integer(credit$default)
# Convert the telephone column to integer
credit$telephone <- as.integer(credit$telephone)
# Inspect the final table on which we will apply the algorithm
glimpse(credit)
```
## Visualising the dataset
To improve the understanding of the data, we will use different visualisations to help us in this task.
```{r libreriasII,echo=FALSE, message=FALSE, warning=FALSE }
# Load additional libraries in the background
if(!require('ggpubr')) install.packages('ggpubr'); library('ggpubr')
if(!require('grid')) install.packages('grid'); library('grid')
if(!require('gridExtra')) install.packages('gridExtra'); library('gridExtra')
if(!require('C50')) install.packages('C50'); library('C50')
if(!require('tidyverse')) install.packages('tidyverse'); library('tidyverse')
if(!require('ggcorrplot')) install.packages('ggcorrplot'); library('ggcorrplot')
if(!require('randomForest')) install.packages('randomForest'); library('randomForest')
```
```{r tema custom, echo=FALSE, message=FALSE, warning=FALSE}
mi_tema <- function() {
theme(
panel.border = element_rect(colour = "black",
fill = NA,
linetype = 1),
panel.background = element_rect(fill = "white",
color = 'grey50'),
panel.grid.major = element_line(colour = "grey80", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.text = element_text(colour = "black",
face = "plain",
family = "serif",
size = 12),
axis.title = element_text(colour = "black",
family = "serif",
face = "bold",
size = 14),
axis.ticks = element_line(colour = "black"),
axis.ticks.length = unit(0.15, "cm"),
plot.title = element_text(size = 23,
hjust = 0.5,
family = "serif",
face = "bold",
margin = margin(0, 0, 10, 0)),
plot.subtitle=element_text(size=16,
hjust = 0.5,
margin = margin(0, 0, 10, 0)),
plot.caption = element_text(colour = "black",
face = "italic",
family = "serif",
size = 10,
margin = margin(10, 0, 0, 0)),
legend.background = element_rect(fill = "white"),
legend.key = element_rect(fill = "white"),
legend.title = element_text(size = 12, face = "bold"),
legend.text = element_text(size = 12),
legend.position = "right"
)
}
```
We create different plots.
We are interested in visualising how the "good" and "bad borrowers" are distributed, and we will do this by a histogram:
```{r viz defaultI, echo=TRUE, message=FALSE, warning=FALSE}
# Distribution of default
ggplot(credit, aes(x = default)) +
geom_bar(fill = 'skyblue') +
## mi_tema() +
labs(x = "Default status", y = "Count", title = "Distribution of defaults in the German credit data") +
scale_x_continuous(breaks = c(0, 1), labels = c("No default", "Default"))
```
We can see that 30% of the loans ended in default.
Next we will analyse the attribute 'amount_z', the standardised credit amount, against the default status:
```{r viz defaultII, echo=TRUE, message=FALSE, warning=FALSE}
# Visualise amount vs. default
ggplot(credit, aes(x = as.factor(default), y = amount_z)) +
geom_boxplot(outlier.shape = NA) +
## mi_tema() +
labs(x = "Default status", y = "Amount granted", title = "Credit amount by default status") +
scale_x_discrete(labels = c("good", "bad"))
```
Everything seems to indicate that the amounts granted to credits that ended in default were higher.
The boxplots also show greater variability for the "bad" credits, as reflected in their wider interquartile range (note that individual outlier points are hidden by `outlier.shape = NA`).
The median, the horizontal black line within each box, further supports the reading that larger amounts were granted to credits that later defaulted.
In conclusion, we can infer that the criteria for granting the credits that ended in default were less strict.
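This visual reading can be backed with a quick numeric summary (a sketch, assuming the `credit` data frame and the `amount_z` and `default` columns used above):

```{r amount_por_default, echo=TRUE, message=FALSE, warning=FALSE}
# Five-number summary of the standardised amount, split by default status
tapply(credit$amount_z, credit$default, summary)
```

If the boxplot reading is right, the median and third quartile for the default group (1) should sit above those of the non-default group (0).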
We continue the visual analysis by exploring the dummy variables derived from "purpose", which indicate what customers intend to use the credit for.
```{r long_data purposeI, echo=TRUE, message=FALSE, warning=FALSE}
# Convert the data to long format
long_data <- credit %>%
select(education, furniture, radio_tv, repairs, retraining, others) %>%
pivot_longer(everything(), names_to = "Purpose", values_to = "Count")
# Create a bar chart for each purpose
ggplot(long_data, aes(x = factor(Count, levels = c(0, 1)), fill = Purpose)) +
geom_bar(position = "dodge") +
## mi_tema() +
facet_wrap(~ Purpose, scales = "free") +
labs(x = "Purpose", y = "Number of cases", title = "Distribution of credit purposes") +
scale_x_discrete(labels = c("No", "Yes")) # level 0 = purpose absent, level 1 = present
```
The loans granted are concentrated in TV and radio, furniture and, to a lesser extent, education.
```{r long_data purposeII, echo=TRUE, message=FALSE, warning=FALSE}
# Rebuild the purpose dummies in long format
long_data <- credit %>%
pivot_longer(cols = c(education, furniture, radio_tv, repairs, retraining, others),
names_to = "purpose",
values_to = "value")
# Keep only the rows where value == 1, i.e. where the credit had that purpose
long_data <- long_data[long_data$value == 1,]
# Select the columns we need for the chart
long_data <- long_data[, c("default", "purpose")]
```
Using a grouped bar chart, we represent which types of loans have produced the most defaults:
```{r viz tipo_creditos, echo=TRUE, message=FALSE, warning=FALSE}
# Create a grouped bar chart
ggplot(long_data, aes(x = purpose, fill = as.factor(default))) +
geom_bar(position = "dodge") +
## mi_tema() +
scale_fill_discrete(name = "Default", labels = c("No", "Yes")) +
labs(x = "Credit purpose", y = "Number of credits",
title = "Distribution of defaults by credit purpose",
fill = "Default") +
coord_flip()
```
This grouped bar chart confirms what we observed in the faceted bar chart above: the two most frequent consumer loans - furniture and television/radio - also account for the largest absolute number of defaults.
However, they are also the most requested loans, so raw counts can be misleading.
To compare purposes fairly, we will calculate and visualise the default rates.
```{r tasa_impagos, echo=TRUE, message=FALSE, warning=FALSE}
# Calculate the default rate
default_rate <- long_data %>%
group_by(purpose) %>%
summarise(total = n(), defaults = sum(default == 1)) %>% # count the defaults (1) per purpose
mutate(default_rate = defaults / total) # divide the defaults by the totals
# Visualise the default rate
ggplot(default_rate, aes(x = purpose, y = default_rate)) +
geom_col(fill = 'skyblue') +
labs(x = "Credit purpose", y = "Default rate",
title = "Default rate by credit purpose") +
coord_flip()
```
And this corrects our first impression: education and "others" loans show the highest default rates, which is plausible.
Finally, to conclude the visualization section, we turn to a heatmap of the numeric variables "default," "amount_z," and "credit_months_loan_z." This heatmap will help us understand the relationship between loan repayment and loan amounts and duration.
```{r heat_mapI, echo=TRUE, message=FALSE, warning=FALSE}
# Select the variables of interest (default vs. amount and months)
vars_de_interesI <- c("default", "amount_z", "credit_months_loan_z")
# Compute the correlation matrix for these variables only
cor_matrix <- cor(credit[vars_de_interesI])
# Visualise the correlation matrix
ggcorrplot(cor_matrix, title = "Correlation matrix of variables I")
```
And we easily observe a correlation between the loan duration and the amount granted, which is also very logical.
We repeat the same strategy to study the relationship of another set of variables:
```{r heat_mapII, echo=TRUE, message=FALSE, warning=FALSE}
# Select the variables of interest (default vs. job qualification)
vars_de_interesII <- c("default", "job_mang_self", "job_skill_emp", "job_unemp", "job_unskill")
# Compute the correlation matrix for these variables only
cor_matrix <- cor(credit[vars_de_interesII])
# Visualise the correlation matrix
ggcorrplot(cor_matrix, title = "Correlation matrix of variables II")
```
There does not seem to be any correlation between the type of job qualification and default or non-payment.
We could repeat some charts with other variables, but at this point we close the visual analysis with functions that return numerical summaries:
```{r str_viz, echo=TRUE, message=FALSE, warning=FALSE}
# Summarise the data
# Using str()
str(credit)
# Using summary()
summary(credit)
```
And finally, on the number of defaults:
```{r media_default, echo=TRUE, message=FALSE}
# Starting from an N, or sample, of 1000 granted credits:
# Number of bad credits
total_defaults <- sum(credit$default == 1)
# Good credits
total_no_defaults <- sum(credit$default == 0)
# Proportion of credits with problems
avg_defaults <- mean(credit$default == 1)
print(paste("Number of defaults:", total_defaults))
print(paste("Number of repaid credits:", total_no_defaults))
print(paste("Proportion of defaulted credits:", avg_defaults))
```
At this point we finalise the visualisation of the data and proceed with the decision tree.
# Phase 3. Data Preparation for the Model
In order to evaluate the decision tree, it is necessary to split the dataset into a training set and a test set.
We will use a 2/3 ratio for the training set and a 1/3 ratio for the test set.
Based on the example provided by the teaching team regarding the decision tree, we apply the model to the dataset we are working with.
**The target variable** is the indicator of whether the credit was paid or defaulted, 'default'.
```{r libreriasIII,echo=FALSE, message=FALSE, warning=FALSE }
# Load libraries in the background
if(!require('caret')) install.packages('caret'); library('caret')
if(!require('pROC')) install.packages('pROC'); library('pROC')
```
```{r conjuntos_arbol, echo=TRUE, message=FALSE, warning=FALSE}
# Set the seed for reproducibility
set.seed(777)
# Extract the target vector (column 7 holds 'default')
y <- credit[, 7]
# and a data frame of predictors
X <- credit[, 1:41]
# Remove the variable to be predicted
X$default <- NULL
```
Reviewing the literature, many authors describe a method that applies supervised learning - a decision tree - after an unsupervised learning step - clustering - in order to verify and compare the assignments made by both algorithms.
Here, however, we will apply the supervised algorithm directly to predict credit default.
First, we will separate the data for the training and test set:
```{r conjuntos, echo=TRUE, message=FALSE, warning=FALSE}
# Define the proportion of data for the training set
train_ratio <- 2/3
# Create the partition indices
index <- createDataPartition(y, p = train_ratio, list = FALSE)
# Create the training set
trainX <- X[index,]
trainy <- y[index]
# Create the test set
testX <- X[-index,]
testy <- y[-index]
```
We will carry out an analysis of the data to ensure that the data is not skewed in any of the cases:
```{r anal_sets, echo=TRUE, message=FALSE, warning=FALSE}
# Training set X
summary(trainX)
# Training target variable
glimpse(trainy)
# Test set
summary(testX)
# Test target variable
glimpse(testy)
```
Although the binary format is not the most suitable for displaying data, we did not observe any serious differences.
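A more direct check on class balance (a sketch using the `trainy` and `testy` vectors created above) is to compare the proportion of defaults in each split; `createDataPartition` samples within each class, so both should be close to the overall 30%:

```{r balance_clases, echo=TRUE, message=FALSE, warning=FALSE}
# Proportion of each class (0 = no default, 1 = default) per split
prop.table(table(trainy))
prop.table(table(testy))
```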
# Phase 4. Model creation
We create the decision tree using the C5.0 algorithm:
```{r modeloI, echo=TRUE, message=FALSE, warning=FALSE}
# Build the decision tree
trainy <- as.factor(trainy) # convert to factor
model <- C50::C5.0(trainX, trainy, rules = TRUE)
summary(model)
```
The C5.0 algorithm was implemented to train a decision tree model using a dataset of 667 cases, each containing 41 attributes or variables.
The model generated a total of 19 rules based on this training data.
Among these rules, we want to highlight some of particular relevance, especially considering the lift indicators or the relationship between the results obtained with and without a prediction model:
The first rule of the model states that if the savings_bal_unknown feature is greater than 0, the predicted class is 0.
This rule covered 125 training cases with 19 errors, a roughly 20% improvement over the base rate of the predicted class, as indicated by the lift of 1.2.
Similarly, the second rule states that if savings_bal_unknown is less than or equal to 0, the predicted class is again 0.
This rule covered 542 cases with 170 errors, giving a lift of 1.0.
The model shows a pattern of multiple conditions leading to a predicted classification.
This is the case with the third rule.
With a lift of 3.1, it involves a total of seven conditions, including employment_length = 4, installment_rate \> 2, and telephone \<= 0.
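C5.0's reported lift compares a rule's estimated accuracy with the base rate of the class it predicts. As a rough hand check (C5.0 itself applies a Laplace correction, so its figures differ slightly), the lift quoted for the first rule can be approximated from the numbers above:

```{r lift_a_mano, echo=TRUE, message=FALSE, warning=FALSE}
# Rough reproduction of the lift reported for rule 1
n_cases <- 125                                   # cases covered by the rule
n_errors <- 19                                   # errors made by the rule
rule_accuracy <- (n_cases - n_errors) / n_cases  # ~0.85
base_rate <- 0.70                                # ~70% of loans are class 0
rule_accuracy / base_rate                        # ~1.21, close to the reported 1.2
```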
The analysis of the training data revealed that the decision tree model has an error rate of 13.2%, with a total of 88 misclassifications out of 667 cases.
According to the confusion matrix of the model, it correctly classified 443 cases as class 0 and 136 cases as class 1.
However, there were also cases of misclassification: 35 cases were classified as class 0 when they actually belonged to class 1, and 53 cases were classified as class 1 when they actually belonged to class 0.
Furthermore, we observed the frequency with which each attribute was included in the model's rules.
The savings_bal_unknown attribute was the most used, as it was included in all the model's rules.
This finding suggests that this particular attribute plays a crucial role in the model's decisions.
Despite the complexity of the training data, the C5.0 algorithm was able to generate the model in a practically insignificant time, clearly highlighting its virtues.
The results on the training data indicate satisfactory performance, but this must be validated with an independent test dataset to assess the model's ability to generalise and to detect any overfitting to the training data.
We continue by displaying the obtained tree:
```{r arbolI, echo=TRUE, message=FALSE, warning=FALSE}
# Visualise the model tree (this code does not finish running)
# model <- C50::C5.0(trainX, trainy)
# plot(model, gp = gpar(fontsize = 9.5))
```
## Model validation
We proceed to check the quality by predicting the default for the test data:
```{r prediccion, echo=TRUE, message=FALSE, warning=FALSE}
# Predict with the model
predicted_model <- predict(model, testX, type = "class")
print(sprintf("Tree accuracy: %.4f %%", 100 * sum(predicted_model == testy) / length(predicted_model)))
```
The accuracy of the model on the test set is approximately 69.37%: it correctly classified 69.37% of the cases in the test dataset.
This figure must be read against the class distribution: since about 70% of the loans were repaid, a trivial model that always predicts "no default" would already reach roughly 70% accuracy, so the tree barely improves on the majority-class baseline.
Accuracy is just one performance metric and can be misleading when classes are imbalanced, as is moderately the case here (70/30).
Therefore, we consider other performance metrics such as:
- Area Under the ROC Curve (AUC-ROC): plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) for different classification thresholds. A perfect model has an AUC-ROC of 1, while a random model has an AUC-ROC of 0.5.
- Sensitivity (recall): the proportion of true positives (TP) among the sum of true positives and false negatives (FN). It indicates the percentage of positive cases that were correctly identified.
- Specificity: the proportion of true negatives (TN) among the sum of true negatives and false positives (FP). It indicates the percentage of negative cases that were correctly identified.
- Precision: the proportion of true positives (TP) among the sum of true positives and false positives (FP). It indicates the percentage of positive predictions that were correct.
- F1 Score: combines precision and recall (their harmonic mean). It is especially useful when dealing with an unequal class distribution.
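These metrics can be computed with the `caret` and `pROC` packages already loaded above (a sketch, assuming the `model`, `testX`, `testy` and `predicted_model` objects created earlier):

```{r metricas_extra, echo=TRUE, message=FALSE, warning=FALSE}
# Confusion-matrix-based metrics, treating class 1 (default) as positive
testy_f <- factor(testy, levels = c(0, 1))
cm <- caret::confusionMatrix(predicted_model, testy_f, positive = "1")
cm$byClass[c("Sensitivity", "Specificity", "Precision", "F1")]
# AUC-ROC from the predicted probabilities of class 1
probs <- predict(model, testX, type = "prob")[, "1"]
roc_obj <- pROC::roc(response = testy_f, predictor = probs, quiet = TRUE)
pROC::auc(roc_obj)
```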