-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathEDAkaggle.Rmd
410 lines (290 loc) · 13.1 KB
/
EDAkaggle.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
---
title: "EDA Kaggle"
author: "Gideon Wolf"
date: "2/27/2024"
output: html_document
---
```{r setup, include=FALSE, message=FALSE, echo = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(mice)
library(xts)
library(ggplot2)
library(gridExtra)
house <- read_csv("train.csv")
```
# EDA Goal: Explore the Kaggle housing dataset and prepare for model fitting
---
## Initial Questions:
**How large is the data? What is the quality of the data?**
-What type of variables are we working with?
**Is there a substantial amount of missing data?**
-If so, what kind of imputation would be most useful?
-Do we have any outliers?
## Data Handling {.tabset}
### Data Wrangling
We need to replace some 'NA's with another label because the data uses 'NA' as an
indicator for some variables. The 'NA' label indicates that the house does not have
said feature (as noted in the data_description.txt file), so we have to change
the NA to something else to have it be considered as a level of each factor in
the model.
```{r, message = FALSE}
NAtoNone <- c(2,3,7,10,24,25,26,31,32,33,34,36,54,56,58,59,61,64,65,73,74,75,79)
for (i in NAtoNone) {
housetmp <- house
housetmp[i][is.na(housetmp[i])] <- "None"
house <- housetmp
}
```
Now, we must change the character variables to factor variables. Additionally,
some other variables are categories that use integer category names which must
be changed individually.
```{r, include = FALSE, message=FALSE}
house <- house %>% mutate_if(is.character, as.factor)
house <- house %>% mutate_at(c('MSSubClass','MoSold'), as_factor)
YrMoSold <- do.call(paste, c(sep='',list(house$YrSold),list(rep("-", times = length(house$YrSold))),list(house$MoSold),list(rep("-01", times = length(house$YrSold)))))
YrMoSold <- as_date(YrMoSold)
house <- as.data.frame(append(house, list(YrMoSold = YrMoSold), after = 78))
```
### Size and Quality of Data
Summary and Variable Types:
```{r}
summary(house)
```
**ADD**: what does this summary tell us?
### Handling GarageYrBuilt
P: This variable is tricky. It is the date of garage construction with an NA meaning that there is no garage attached to the house. To impute an NA for this variable would be meaningless because there would be no garage to have been built on an imputed day- imputation would ignore the real-world meaning of this variable and potentially skew sales price. However, leaving the NA's without imputation could impede upon model creation. Additionally, this variable could hold important information about the sales price, so it may be unwise to simply take the column out without knowing if it is valuable.
S: We will investigate the similarity between other garage variables and GarageYrBlt to see if the impact of GarageYrBlt on SalePrice can be summarized by other variables. After analyzing the relationship between GarageYrBlt, Sales Price, and other garage variables, we feel comfortable taking out the variable entirely. We will also try and create a categorical variable for GarageYrBlt (1900-1940 old, 1940-1980 mid, 1980-2010 new).
```{r}
p <- print(select(house, row.names = c("GarageYrBlt", "GarageArea")))
p[p[2]== 0,]
p[is.na(p[1]),]
```
The above shows that each instance of an NA date corresponds to a 0 GarageArea and vice versa.
```{r}
hist(house$GarageYrBlt,breaks = 30)
ggplot(aes(x = GarageYrBlt, y = SalePrice), data = house) +
geom_point()
ggplot(aes(x = GarageYrBlt, y = SalePrice, color = GarageArea), data = house) +
geom_point()
ggplot(aes(x = GarageYrBlt, y = SalePrice, color = GarageQual), data = house) +
geom_point()
ggplot(aes(x = GarageYrBlt, y = SalePrice, color = GarageCond), data = house) +
geom_point()
ggplot(aes(x = GarageYrBlt, y = SalePrice, color = GarageFinish), data = house) +
geom_point()
ggplot(aes(x = GarageYrBlt, y = SalePrice, color = GarageType), data = house) +
geom_point()
```
**ADD**: How do we interpret these graphs?
```{r}
house <- house[-60] # Taking out GarageYrBlt
```
### Imputation
Missing Data:
```{r}
md.pattern(house)
for (i in 1:length(house)) {
print(sum(is.na(house[i])))
}
```
P: LotFrontage has 259 missing values. What imputation technique should we use to handle it?
S: MICE's PMM (predictive mean matching) can be used because we are imputing numeric data.
```{r}
house1 = house
hosue1 = house1 %>% filter(LotFrontage < 200) ## This doesn't work
ggplot(data = house) +
geom_boxplot(aes(x = MSSubClass, y = LotFrontage))
ggplot(data = house) +
geom_boxplot(aes(x = MSZoning, y = LotFrontage))
ggplot(data = house,aes(x = LotArea, y = LotFrontage)) +
geom_point() +
geom_smooth()
ggplot(data = house) +
geom_boxplot(aes(x = Street, y = LotFrontage))
ggplot(data = house) +
geom_boxplot(aes(x = Alley, y = LotFrontage))
ggplot(data = house) +
geom_boxplot(aes(x = LotShape, y = LotFrontage))
ggplot(data = house) +
geom_boxplot(aes(x = LandContour, y = LotFrontage))
ggplot(data = house) +
geom_boxplot(aes(x = LotConfig, y = LotFrontage))
ggplot(data = house) +
geom_boxplot(aes(x = LandSlope, y = LotFrontage))
ggplot(data = house) +
geom_boxplot(aes(x = Condition1, y = LotFrontage))
ggplot(data = house) +
geom_boxplot(aes(x = Condition2, y = LotFrontage))
ggplot(data = house) +
geom_boxplot(aes(x = BldgType, y = LotFrontage))
ggplot(data = house) +
geom_boxplot(aes(x = HouseStyle, y = LotFrontage))
ggplot(data = house,aes(x = YearBuilt, y = LotFrontage)) +
geom_point() +
geom_smooth()
ggplot(data = house) +
geom_boxplot(aes(x = Foundation, y = LotFrontage))
ggplot(data = house1 ,aes(x = GrLivArea, y = LotFrontage)) +
geom_point() +
geom_smooth()
cor(house$LotFrontage, house$LotArea) ## Returns NA
```
**ADD**: description of these graphs, what does it mean that the cov function returns an NA?
```{r}
numhouse <- select_if(house, is.numeric)
imputed_house <- mice(numhouse,
m = 10,
method = 'pmm',
seed = 1)
imputed_house$loggedEvents
```
P: While this command runs, many of the variables are marked as "(nearly) multi-collinear." The logged events command asks "Does your dataset contain duplicates, linear transformation, or factors with unique respondent names?" How do we approach this problem?
S: Since it is just a warning we can leave it for now and return to it later if we have time.
---
Jeremiah's Note on Multillinearity:
Multicollinearity doesn't affect the model's predictions, but it affects the coefficients for the predictors. This means that for the purpose of the Kaggle competition we don't need to worry about it, but for the presentation it should be addressed. In assessing the presence of multicollinearity, Id was removed because it's irrelevant, Utilities was removed because lm() gave an error that suggested that some categorical predictors had only one factor level, which is simply not true, but removing Utilities fixes it somehow. The rest were removed because they were aliases, meaning that they are very highly multicollinear (TotalBsmtSF is a linear transformation of two other predictors, for instance). Once those are all removed, vif() actually gives results. YrMoSold is obviously multicollinear with the two variables it's based on, and removing it also decreases their scores. The only other high VIF (above 10, which corresponds to a R^2 of .9 for the variable modeled using all other predictors) is PoolArea, but unfortunately R doesn't conclude which variable(s) is/are causing this.
```{r Jeremiah's Note on Multillinearity}
train.df = subset(house3, select=-c(Id, Utilities, BsmtCond, BsmtFinType1, TotalBsmtSF, GrLivArea, GarageFinish, GarageQual, GarageCond, BldgType, Exterior2nd))
viftest <- lm(SalePrice ~ ., train.df)
#alias(viftest)$Complete
vif(viftest)
```
---
```{r}
set.seed(1)
imphouse <- complete(imputed_house)
housetmp <- house
s = 1
for (i in names(imphouse)[2:length(names(imphouse))]) {
housetmp[i] <- imphouse[i]
}
house <- housetmp
summary(house)
md.pattern(house)
```
Now we only have one column with NA's left: GarageYrBlt.
### Outliers
```{r}
housetmp <- select_if(house, is.numeric)
for (i in 1:length(housetmp)) {
print(names(housetmp[i]))
col <- housetmp[,i]
z <- vector("numeric", length = length(col))
outliers <- vector("numeric")
for (j in 1:length(col)) {
z[j] <- (col[j]-mean(col))/sd(col)
outliers <- append(outliers, if(abs(z[j]) > 7) j) }
cat("Index of outliers: ", outliers, "\n")
cat("Value of outliers: ", col[outliers], "\n")
cat(" Min, 1st, Med, Mean, 3rd, Max\n")
cat("Quartiles: ", summary(housetmp[,i]), "\n\n")
}
```
**ADD**: What to do with these outliers? Some of these variables are categorical variables so we will probably not take those "outliers" out. It is currently choosing outliers 5 standard deviations away from the mean, is that fine or should it be changed?
use par(c(2,2)) to create 4 boxplots to display more obvious outliers
Answer by Jeremiah: This could wait until the modeling stage. When we use R to determine which predictors should be included, we can then individually assess their relationship with SalePrice to ensure that the outlier isn't the sole cause of the correlation.
What commands should we use "to determine which predictors should be included?" Will we use models that employ variable elimination or should we perform elimination ourselves using another command to be determined?
---
\
\
\
## Questions:
To answer:
<details>
<summary>**How do the variables covary?**</summary>
-Does month sold affect the price sold?
```{r}
ggplot(data = house, aes(x = YrMoSold, y = SalePrice)) +
geom_point() +
geom_smooth(method='lm',se = TRUE)
h_lm <- lm(SalePrice~YrMoSold, data= house)
plot(h_lm)
```
Slope ~= 0, no relationship
-Is there a connection between Rennovation date and overall quality?
```{r}
ggplot(data = house, aes(x = YearRemodAdd, y = OverallCond)) +
geom_jitter() +
geom_smooth(method='lm',se = TRUE)
h_lm <- lm(SalePrice~YrMoSold, data= house)
plot(h_lm)
```
slope ~= 0, no correlation relationship.
-Does zone and house square feet have any correlation?
```{r}
ggplot(house, aes(x= MSZoning, y= (TotalBsmtSF + `X1stFlrSF` + `X2ndFlrSF`))) +
geom_boxplot()
summary(house$MSZoning)
```
Similar means even with few observations for C (all) and RH variables
-Mice noted multicollinearity in many variables, let's look at the variables more.
```{r}
ggplot( house, aes(x = BsmtFullBath, y =BsmtFinSF1 )) +
geom_point() +
labs(title = "Outlier makes an otherwise regular relationship \nfan-like, taking it out should solve this issue.") +
xlab("Number of Full Bathrooms in the Basement") +
ylab("Square Feet of Basement Finish")
```
```{r}
ggplot( house, aes(x = log(BsmtFinSF1),y = GrLivArea)) +
geom_point() +
labs(title = "Log transformation of basement square footage \nfixes most multicolinear relationship.") +
xlab("Log of Square Feet of Basement Finish") +
ylab("Above Ground Square Feet")
```
</details>
\
<details>
<summary>**How does each variable explain the data?**</summary>
-How much impact does house size have on sales price?
```{r}
p1 <- ggplot(house, aes(x=log(scale(TotalBsmtSF)),y=SalePrice)) + geom_point() + xlab("Log Scale of Basement\n Square Feet") + ylab("Sale Price")
p2 <- ggplot(house, aes(x=log(LotFrontage),y=SalePrice)) +
geom_point() + ylab("Sale Price") + xlab("Log of Lot Frontage")
p3 <- ggplot(house, aes(x=log(scale(`X1stFlrSF`)),y=SalePrice)) + geom_point() + ylab("Sale Price") + xlab("Log Scale of 1st Floor\n Square Feet")
p4 <- ggplot(house, aes(x=`X2ndFlrSF`,y=SalePrice)) +
geom_point() + ylab("Sale Price") + xlab("2nd Floor Square feet")
grid.arrange(p1, p2, p3, p4, ncol=2, top ="Transformations took care of multicollinearity\n of flagged variables.")
ggplot(house, aes(x = (TotalBsmtSF + `1stFlrSF` + `2ndFlrSF`), y = SalePrice)) +
geom_point()
```
-How does sale price vary over time?
Answered by Jeremiah
```{r}
ggplot(house,aes(x=YrSold, y=SalePrice)) +
geom_jitter()
```
-What years do we have the most data for?
```{r}
summary(house$YearBuilt)
summary(house$YearRemodAdd)
summary(house$YrSold)
```
Answered by Jeremiah
-How do sale price and number of houses in the data per year relate?
```{r}
hist(house$YrMoSold,breaks = 30)
plot(house$YrMoSold,house$SalePrice)
```
**ADD**: description of these graphs
</details>
\
<details>
<summary>**Extra questions**</summary>
-Once finished with everything else, how similar is the Ames dataset?
Answered by Jeremiah
-How does the 2008 housing crash affect prices, does it at all?
Answered by Jeremiah
-Can we make our response categorical?
Our output for Kaggle must be continuous, but is there a way to narrow down this response by using categories? Are there natural delineations in sales price(i.e. low, medium, high)?
</details>
\
<details>
<summary>**Modelling preparation**</summary>
-What kinds of models would be best fit for our data?
Using lazypredict in python, we may get a better idea of what the best model to use is.
Dr. Khormali's lazypredict: RF, lasso larsic, ridge cv, lin reg, gradient boosting regression, bagging
Find 9-12 variables that are most important for user:
</details>\
\