---
title: "House Prices"
subtitle: "MAP 535 Regression"
author: ""
date: ""
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(GGally)
library(lattice)
library(corrplot)
library(magrittr)
library(xgboost)
theme_set(theme_bw())
```
## Preliminary
### Loading the data set
```{r loading housing data}
train <- read_csv(file = 'data/train_raw.csv')
```
### Checking variable types
```{r}
str(train)
```
### Preprocessing I: types
First, we use the variable `Id` to index the observations and then remove the column. Note that `read_csv` returns a tibble, which does not support row names, so we convert to a plain data frame first.
```{r}
train <- as.data.frame(train)
row.names(train) <- train$Id
train <- select(train, -Id)
```
We cast the categorical variables (encoded here as character strings) into factors. `MSSubClass` is stored as a number but is actually a categorical code, so we include it as well.
```{r}
var.quali <- sapply(select(train, -SalePrice), is.character)
var.quali["MSSubClass"] <- TRUE  # numeric code, but categorical in meaning
# mutate_each_() is defunct; across() is the current tidyverse idiom
train %<>% mutate(across(all_of(names(var.quali)[var.quali]), as.factor))
```
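A minimal sketch of the same cast on toy data (hypothetical column names `a` and `b`): character columns become factors, numeric columns are left alone.

```r
library(dplyr)

toy <- data.frame(a = c("x", "y"), b = 1:2, stringsAsFactors = FALSE)
is_char <- sapply(toy, is.character)
toy <- mutate(toy, across(all_of(names(is_char)[is_char]), as.factor))
str(toy)  # a is now a factor, b stays integer
```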
We check the types again:
```{r}
str(train)
```
### Preprocessing II: missing data
The following chunk detects and removes the variables whose proportion of missing values exceeds 40%.
```{r missing}
missing_threshold <- .4
is_too_scarce <- map_lgl(select(train, -SalePrice), ~ mean(is.na(.x)) > missing_threshold)
not_too_scarce <- names(is_too_scarce)[!is_too_scarce]
train %<>% select(SalePrice, all_of(not_too_scarce))
```
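The filter above can be illustrated on a toy data frame (hypothetical columns `mostly_na` and `complete`): any column with more than 40% missing values is dropped.

```r
library(purrr)
library(dplyr)

toy <- data.frame(
  mostly_na = c(NA, NA, NA, 1),  # 75% missing -> dropped
  complete  = 1:4                # 0% missing  -> kept
)
is_too_scarce <- map_lgl(toy, ~ mean(is.na(.x)) > 0.4)
toy <- select(toy, all_of(names(is_too_scarce)[!is_too_scarce]))
names(toy)  # "complete"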
### Preprocessing III: imputation, standardization
For the variables with a proportion of missing data below 40%, we apply an elementary imputation scheme:
- For quantitative variables, we impute with the k-nearest-neighbours technique, delete the variables whose variance is too close to 0 (`nzv`), and apply centering, rescaling, and a Yeo-Johnson transformation.
```{r imputation continuous pred}
# preProcess estimates the transformations on the predictors only;
# predict() then applies them (SalePrice is passed through unchanged)
imputedData <- preProcess(
  select(train, -SalePrice),
  method = c("center", "scale", "knnImpute", "nzv", "YeoJohnson")
)
trainTrans <- predict(imputedData, train)
```
- For categorical variables, we impute with the most frequent level (the mode).
```{r}
trainTrans <- map_df(trainTrans, function(x) {
  # factors are left untouched by preProcess, so impute them with the mode
  if (anyNA(x)) x[is.na(x)] <- names(which.max(table(x)))
  x
})
```
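Mode imputation on a single factor, as applied column-wise above: `table()` ignores `NA` by default, so `which.max` picks the most frequent observed level.

```r
# impute the missing entry with the most frequent level ("a")
x <- factor(c("a", "a", "b", NA))
x[is.na(x)] <- names(which.max(table(x)))
x  # a a b a
```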
As an alternative, we also keep a non-standardized version of the data: factors are imputed with their mode, and the remaining missing quantitative values are imputed with `mice`.
```{r}
trainImputed <- map_df(train[, colnames(trainTrans)], function(x) {
  if (anyNA(x) && is.factor(x)) x[is.na(x)] <- names(which.max(table(x)))
  x
})
colnames(trainImputed) <- make.names(colnames(trainImputed))
# single imputation (m = 1) of the remaining quantitative predictors
mice_mice <- mice::mice(select(trainImputed, -SalePrice), m = 1, print = FALSE)
# SalePrice is first in trainTrans, so it goes first here as well
trainImputed <- cbind(trainImputed$SalePrice, mice::complete(mice_mice, 1))
colnames(trainImputed) <- colnames(trainTrans)
```
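A minimal sketch of single imputation with `mice` on toy data (hypothetical columns `a` and `b`; assumes the `mice` package is installed). By default `mice` uses predictive mean matching for numeric variables.

```r
library(mice)

toy <- data.frame(
  a = c(1, 2, NA, 4, 5),
  b = c(2, 4, 6, 8, 10)
)
imp  <- mice(toy, m = 1, print = FALSE)  # one imputed data set
done <- mice::complete(imp, 1)           # extract the completed data
anyNA(done)  # FALSE
```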
### Save the preprocessed data
```{r}
# row.names = FALSE avoids writing the 1..n row indices as an extra column
write.csv(trainTrans, 'train_preprocessed.csv', row.names = FALSE)
write.csv(trainImputed, 'train_imputed.csv', row.names = FALSE)
```