EDAkaggle.Rmd

---
title: "EDA Kaggle"
author: "Gideon Wolf"
date: "2/27/2024"
output: html_document
---

```{r setup, include=FALSE, message=FALSE, echo = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(mice)
library(xts)
library(ggplot2)
library(gridExtra)

house <- read_csv("train.csv")
``` 


# EDA Goal: Explore the Kaggle housing dataset and prepare for model fitting

---

## Initial Questions:

**How large is the data? What is the quality of the data?**
-What type of variables are we working with?
**Is there a substantial amount of missing data?**
-If so, what kind of imputation would be most useful?
-Do we have any outliers?

## Data Handling {.tabset}

### Data Wrangling
We need to replace some 'NA's with another label because the data uses 'NA' as an 
indicator for some variables. The 'NA' label indicates that the house does not have 
said feature (as noted in the data_description.txt file), so we have to change 
the NA to something else to have it be considered as a level of each factor in 
the model.

```{r, message = FALSE}
NAtoNone <- c(2,3,7,10,24,25,26,31,32,33,34,36,54,56,58,59,61,64,65,73,74,75,79)
for (i in NAtoNone) {
  housetmp <- house
  housetmp[i][is.na(housetmp[i])] <- "None"
  house <- housetmp
}
```


Now, we must change the character variables to factor variables. Additionally, 
some other variables are categories that use integer category names which must 
be changed individually.

```{r, include = FALSE, message=FALSE}
house <- house %>% mutate_if(is.character, as.factor)
house <- house %>% mutate_at(c('MSSubClass','MoSold'), as_factor) 

YrMoSold <- do.call(paste, c(sep='',list(house$YrSold),list(rep("-", times = length(house$YrSold))),list(house$MoSold),list(rep("-01", times = length(house$YrSold)))))
YrMoSold <- as_date(YrMoSold)
house <- as.data.frame(append(house, list(YrMoSold = YrMoSold), after = 78))
```


### Size and Quality of Data

Summary and Variable Types:

```{r}
summary(house)
```

**ADD**: what does this summary tell us?


### Handling GarageYrBuilt

P: This variable is tricky. It is the date of garage construction with an NA meaning that there is no garage attached to the house. To impute an NA for this variable would be meaningless because there would be no garage to have been built on an imputed day- imputation would ignore the real-world meaning of this variable and potentially skew sales price. However, leaving the NA's without imputation could impede upon model creation. Additionally, this variable could hold important information about the sales price, so it may be unwise to simply take the column out without knowing if it is valuable. 

S: We will investigate the similarity between other garage variables and GarageYrBlt to see if the impact of GarageYrBlt on SalePrice can be summarized by other variables. After analyzing the relationship between GarageYrBlt, Sales Price, and other garage variables, we feel comfortable taking out the variable entirely. We will also try and create a categorical variable for GarageYrBlt (1900-1940 old, 1940-1980 mid, 1980-2010 new). 

```{r}
p <- print(select(house, row.names = c("GarageYrBlt", "GarageArea")))
p[p[2]== 0,]
p[is.na(p[1]),] 
```

The above shows that each instance of an NA date corresponds to a 0 GarageArea and vice versa. 

```{r}
hist(house$GarageYrBlt,breaks = 30)
ggplot(aes(x = GarageYrBlt, y = SalePrice), data = house) +
  geom_point()
ggplot(aes(x = GarageYrBlt, y = SalePrice, color = GarageArea), data = house) +
  geom_point()
ggplot(aes(x = GarageYrBlt, y = SalePrice, color = GarageQual), data = house) +
  geom_point()
ggplot(aes(x = GarageYrBlt, y = SalePrice, color = GarageCond), data = house) +
  geom_point()
ggplot(aes(x = GarageYrBlt, y = SalePrice, color = GarageFinish), data = house) +
  geom_point()
ggplot(aes(x = GarageYrBlt, y = SalePrice, color = GarageType), data = house) +
  geom_point()
```

**ADD**: How do we interpret these graphs?

```{r}
house <- house[-60] # Taking out GarageYrBlt
```


### Imputation 

Missing Data:

```{r}
md.pattern(house)

for (i in 1:length(house)) {
  print(sum(is.na(house[i])))
}
```

P: LotFrontage has 259 missing values. What imputation technique should we use to handle it?

S: MICE's PMM (predictive mean matching) can be used because we are imputing numeric data. 

```{r}
house1 = house
hosue1 = house1 %>% filter(LotFrontage < 200) ## This doesn't work

ggplot(data = house) +
  geom_boxplot(aes(x = MSSubClass, y = LotFrontage))

ggplot(data = house) +
  geom_boxplot(aes(x = MSZoning, y = LotFrontage))

ggplot(data = house,aes(x = LotArea, y = LotFrontage)) +
  geom_point() +
  geom_smooth()

ggplot(data = house) +
  geom_boxplot(aes(x = Street, y = LotFrontage))

ggplot(data = house) +
  geom_boxplot(aes(x = Alley, y = LotFrontage))

ggplot(data = house) +
  geom_boxplot(aes(x = LotShape, y = LotFrontage))

ggplot(data = house) +
  geom_boxplot(aes(x = LandContour, y = LotFrontage))

ggplot(data = house) +
  geom_boxplot(aes(x = LotConfig, y = LotFrontage))

ggplot(data = house) +
  geom_boxplot(aes(x = LandSlope, y = LotFrontage))

ggplot(data = house) +
  geom_boxplot(aes(x = Condition1, y = LotFrontage))

ggplot(data = house) +
  geom_boxplot(aes(x = Condition2, y = LotFrontage))

ggplot(data = house) +
  geom_boxplot(aes(x = BldgType, y = LotFrontage))

ggplot(data = house) +
  geom_boxplot(aes(x = HouseStyle, y = LotFrontage))

ggplot(data = house,aes(x = YearBuilt, y = LotFrontage)) +
  geom_point() +
  geom_smooth()

ggplot(data = house) +
  geom_boxplot(aes(x = Foundation, y = LotFrontage))

ggplot(data = house1 ,aes(x = GrLivArea, y = LotFrontage)) +
  geom_point() +
  geom_smooth()

cor(house$LotFrontage, house$LotArea) ## Returns NA
```

**ADD**: description of these graphs, what does it mean that the cov function returns an NA?

```{r}
numhouse <- select_if(house, is.numeric)
imputed_house <- mice(numhouse, 
                       m = 10,  
                       method = 'pmm', 
                       seed = 1)
imputed_house$loggedEvents
```

P: While this command runs, many of the variables are marked as "(nearly) multi-collinear." The logged events command asks "Does your dataset contain duplicates, linear transformation, or factors with unique respondent names?" How do we approach this problem?

S: Since it is just a warning we can leave it for now and return to it later if we have time. 

---
Jeremiah's Note on Multillinearity: 

Multicollinearity doesn't affect the model's predictions, but it affects the coefficients for the predictors.  This means that for the purpose of the Kaggle competition we don't need to worry about it, but for the presentation it should be addressed.  In assessing the presence of multicollinearity, Id was removed because it's irrelevant, Utilities was removed because lm() gave an error that suggested that some categorical predictors had only one factor level, which is simply not true, but removing Utilities fixes it somehow.  The rest were removed because they were aliases, meaning that they are very highly multicollinear (TotalBsmtSF is a linear transformation of two other predictors, for instance).  Once those are all removed, vif() actually gives results.  YrMoSold is obviously multicollinear with the two variables it's based on, and removing it also decreases their scores.  The only other high VIF (above 10, which corresponds to a R^2 of .9 for the variable modeled using all other predictors) is PoolArea, but unfortunately R doesn't conclude which variable(s) is/are causing this.

```{r Jeremiah's Note on Multillinearity}
train.df = subset(house3, select=-c(Id, Utilities, BsmtCond, BsmtFinType1, TotalBsmtSF, GrLivArea, GarageFinish, GarageQual, GarageCond, BldgType, Exterior2nd))
viftest <- lm(SalePrice ~ ., train.df)
#alias(viftest)$Complete
vif(viftest)
```

---

```{r}
set.seed(1)
imphouse <- complete(imputed_house)
housetmp <- house
s = 1
for (i in names(imphouse)[2:length(names(imphouse))]) {
  housetmp[i] <- imphouse[i]
}
house <- housetmp
summary(house)
md.pattern(house)
```
Now we only have one column with NA's left: GarageYrBlt. 


### Outliers

```{r}
housetmp <- select_if(house, is.numeric)
for (i in 1:length(housetmp)) {
  print(names(housetmp[i]))
  col <- housetmp[,i]
  z <- vector("numeric", length = length(col))
  outliers <- vector("numeric")
  for (j in 1:length(col)) {
    z[j] <- (col[j]-mean(col))/sd(col)
    outliers <- append(outliers, if(abs(z[j]) > 7) j) }
  cat("Index of outliers: ", outliers, "\n")
  cat("Value of outliers: ", col[outliers], "\n")
  cat("           Min, 1st, Med, Mean, 3rd, Max\n")
  cat("Quartiles: ", summary(housetmp[,i]), "\n\n")
}
```

**ADD**: What to do with these outliers? Some of these variables are categorical variables so we will probably not take those "outliers" out. It is currently choosing outliers 5 standard deviations away from the mean, is that fine or should it be changed? 

use par(c(2,2)) to create 4 boxplots to display more obvious outliers

Answer by Jeremiah: This could wait until the modeling stage.  When we use R to determine which predictors should be included, we can then individually assess their relationship with SalePrice to ensure that the outlier isn't the sole cause of the correlation.

What commands should we use "to determine which predictors should be included?" Will we use models that employ variable elimination or should we perform elimination ourselves using another command to be determined? 


---

\
\


\
## Questions: 

To answer: 

<details>
  <summary>**How do the variables covary?**</summary>
  
  -Does month sold affect the price sold? 
```{r}
ggplot(data = house, aes(x = YrMoSold, y = SalePrice)) +
  geom_point() +
  geom_smooth(method='lm',se = TRUE)

h_lm <- lm(SalePrice~YrMoSold, data= house)
plot(h_lm)
```
  Slope ~= 0, no relationship

  -Is there a connection between Rennovation date and overall quality?
```{r}
ggplot(data = house, aes(x = YearRemodAdd, y = OverallCond)) +
  geom_jitter() +
  geom_smooth(method='lm',se = TRUE)

h_lm <- lm(SalePrice~YrMoSold, data= house)
plot(h_lm)
```
  slope ~= 0, no correlation relationship.

  -Does zone and house square feet have any correlation?
```{r}
ggplot(house, aes(x= MSZoning, y= (TotalBsmtSF + `X1stFlrSF` + `X2ndFlrSF`))) +
  geom_boxplot()
summary(house$MSZoning)
```
  Similar means even with few observations for C (all) and RH variables

  -Mice noted multicollinearity in many variables, let's look at the variables more. 
  
```{r}
ggplot( house, aes(x = BsmtFullBath, y =BsmtFinSF1 )) +
  geom_point() + 
  labs(title = "Outlier makes an otherwise regular relationship \nfan-like, taking it out should solve this issue.") +
  xlab("Number of Full Bathrooms in the Basement") + 
  ylab("Square Feet of Basement Finish")
```

```{r}
ggplot( house, aes(x = log(BsmtFinSF1),y =  GrLivArea)) +
  geom_point() + 
  labs(title = "Log transformation of basement square footage \nfixes most multicolinear relationship.") +
  xlab("Log of Square Feet of Basement Finish") + 
  ylab("Above Ground Square Feet")
```
  

</details> 
\


<details>
  <summary>**How does each variable explain the data?**</summary>
  
  -How much impact does house size have on sales price?
  
```{r}
p1 <- ggplot(house, aes(x=log(scale(TotalBsmtSF)),y=SalePrice)) + geom_point() + xlab("Log Scale of Basement\n Square Feet") + ylab("Sale Price")
p2 <- ggplot(house, aes(x=log(LotFrontage),y=SalePrice)) +
  geom_point() + ylab("Sale Price") + xlab("Log of Lot Frontage")
p3 <- ggplot(house, aes(x=log(scale(`X1stFlrSF`)),y=SalePrice)) + geom_point() + ylab("Sale Price") + xlab("Log Scale of 1st Floor\n Square Feet")
p4 <- ggplot(house, aes(x=`X2ndFlrSF`,y=SalePrice)) +
  geom_point() + ylab("Sale Price") + xlab("2nd Floor Square feet")

grid.arrange(p1, p2, p3, p4, ncol=2, top ="Transformations took care of multicollinearity\n of flagged variables.") 

ggplot(house, aes(x = (TotalBsmtSF + `1stFlrSF` + `2ndFlrSF`), y = SalePrice)) + 
  geom_point()
```
  
  -How does sale price vary over time? 
  
Answered by Jeremiah
```{r}
ggplot(house,aes(x=YrSold, y=SalePrice)) + 
  geom_jitter()
```

  -What years do we have the most data for? 
  
```{r}
summary(house$YearBuilt)
summary(house$YearRemodAdd)
summary(house$YrSold)
```

Answered by Jeremiah
  
  -How do sale price and number of houses in the data per year relate? 

```{r}
hist(house$YrMoSold,breaks = 30)
plot(house$YrMoSold,house$SalePrice)
```

  **ADD**: description of these graphs
  
</details> 
\


<details>
  <summary>**Extra questions**</summary>
  
  -Once finished with everything else, how similar is the Ames dataset?
  
Answered by Jeremiah
  
  -How does the 2008 housing crash affect prices, does it at all?

Answered by Jeremiah
  
  -Can we make our response categorical?
    
  Our output for Kaggle must be continuous, but is there a way to narrow down this response by using categories? Are there natural delineations in sales price(i.e. low, medium, high)? 
  
</details> 
\


<details>
  <summary>**Modelling preparation**</summary>
  
  -What kinds of models would be best fit for our data?
  
  Using lazypredict in python, we may get a better idea of what the best model to use is.
    Dr. Khormali's lazypredict: RF, lasso larsic, ridge cv, lin reg, gradient boosting regression, bagging
    Find 9-12 variables that are most important for user: 
      
  
</details>\
\