---
title: ""
---
text here.
text here.
<br>
<h1>The Variables</h1>
The variables used in this analysis were (unless noted otherwise, the data source is the American Community Survey):
<br>
<br>
<ul>
<li>Median household income (imputed where missing)</li>
<li>% Race (disaggregated by groups)</li>
<li>% Aged 65+</li>
<li>% Aged 18 & under</li>
<li>Median home value (imputed where missing)</li>
<li>Median gross rent (imputed where missing)</li>
<li>% Renters</li>
<li>Total housing units (Met Council estimate)</li>
<li>% limited English proficiency (disaggregated by largest groups in region)</li>
<li>% concentration of poverty (income below 185% of the Federal Poverty Line)</li>
<li>Average household size</li>
<li>Population (Met Council estimate)</li>
<li>% households residing in mobile homes</li>
</ul>
<br>
<h1>The Timeframe</h1>
The timepoints examined were 2000, 2010, and 2017. While American Community Survey (ACS) data are available for the years between 2010 and 2017 and for some years prior to 2010, we chose these three timepoints specifically to smooth over small year-to-year variations and instead emphasize the larger trends occurring across these decades.
<br>
<br>
<h1>kml3d: The Cluster Algorithm</h1>
<br>
This project used a form of unsupervised machine learning called clustering, in which observations are classified into groups based on their distances from one another. The form of clustering used in our analysis, known as kml3d, is shape-respecting: when observations are changing in much the same way but at different magnitudes, the algorithm favors grouping observations whose trajectories are similar in both shape and magnitude. This means that slightly staggered changes of the same shape (for example, tracts rapidly increasing in total housing units and decreasing in % aged 65+, but starting in different years) will still be classified together.
<br>
<br>
This is in contrast to other time-series clustering methods (such as PROC FASTCLUS in SAS) that group changes together regardless of magnitude. In PROC FASTCLUS, for example, slow growth starting from a median income of \$20,000 would, in theory, be treated the same as slow growth (i.e. the same slope of change) starting from a median income of \$120,000. For our study this was not desirable behavior, as the starting and ending magnitudes of demographic, built environment, and housing market variables are crucial to planning decisions.
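
To make the contrast concrete, here is a small illustrative example with invented numbers (not project data): two income trajectories with an identical slope but very different levels are far apart under the Euclidean distance used in our analysis, so they would not be grouped together.
```{r}
# Two hypothetical median-income trajectories with the same slope
# (+$5,000 per timepoint) but very different starting levels.
low_start  <- c(20000, 25000, 30000)
high_start <- c(120000, 125000, 130000)

# The change over time (the "slope") is identical...
diff(low_start)
diff(high_start)

# ...but the Euclidean distance between the full trajectories is large,
# so a distance-based method keeps them in separate clusters.
dist(rbind(low_start, high_start))
```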
<br>
<br>
<h1>How kml3d Works</h1>
To see how kml3d measures the distance between two tracts, consider median gross rent at each timepoint for two example tracts:
```{r, eval=FALSE}
library(readr)   # read_csv()
library(dplyr)   # select(), filter(), mutate()
library(pander)  # pander()

# Load the imputed three-timepoint dataset (not evaluated when knitting;
# requires the data file to be available locally).
snc <- read_csv("Data/3-Timepoint RF-Imputed Parcel EMV Dataset.csv")

# Median gross rent at each timepoint for two example tracts.
snc_top <- snc %>%
  dplyr::select(Tract, MEDGRENT_2000, MEDGRENT_2010, MEDGRENT_2017) %>%
  filter(Tract %in% c(27003050109, 27003050110)) %>%
  mutate(Tract = as.character(Tract))

pander(snc_top)
```
We can use the Euclidean distance between rent at each timepoint to determine how "close" the two tracts are:
$$\sqrt{\sum_{i=1}^{n}\left|x_{i,\,\text{tract09}} - x_{i,\,\text{tract10}}\right|^{2}}$$
where $i$ indexes the $n = 3$ timepoints, or, with the rents above,
$$\sqrt{\left|1309-921.3\right|^{2} + \left|1348-1208\right|^{2} + \left|1563-1292\right|^{2}}$$
which will give us the same result as applying the function `dist()` to the rent columns of `snc_top`. <br>
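
As a quick check, the hand calculation and `dist()` agree; the rent values quoted above are entered directly here so the chunk is self-contained:
```{r}
# Median gross rent in 2000, 2010, and 2017 for the two example tracts,
# using the values quoted above.
rents <- rbind(tract09 = c(1309, 1348, 1563),
               tract10 = c(921.3, 1208, 1292))

# Hand calculation of the Euclidean distance...
sqrt(sum((rents["tract09", ] - rents["tract10", ])^2))

# ...matches dist() applied to the two rent trajectories.
dist(rents)
```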
Before these distances are computed across multiple variables on very different scales, each variable is standardized:
$$y'_{ijX} = \frac{y_{ijX}-\bar{y}_{\cdot\cdot X}}{s_{\cdot\cdot X}}$$
where
$y_{ijX}$: value of a variable for a given tract at a given timepoint <br>
$\bar{y}_{\cdot\cdot X}$: mean of $y_{\cdot\cdot X}$ (all values of the variable at all timepoints) <br>
$s_{\cdot\cdot X}$: standard deviation of $y_{\cdot\cdot X}$ (all values of the variable at all timepoints)
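
A minimal sketch of this standardization in R, assuming a hypothetical long-format data frame `snc_long` with one row per tract-timepoint and illustrative variable names (the project's actual column names may differ):
```{r, eval=FALSE}
library(dplyr)

# Grand-mean / grand-SD standardization: each variable is centered and
# scaled using its mean and standard deviation over all tracts and all
# timepoints pooled together.
snc_std <- snc_long %>%
  mutate(across(
    c(MEDHHINC, MEDGRENT, MEDHOMEVAL),   # illustrative variable names
    ~ (.x - mean(.x, na.rm = TRUE)) / sd(.x, na.rm = TRUE)
  ))
```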
<br>
<br>
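Putting these pieces together, the call pattern for the kml3d R package looks roughly like the following. This is a minimal sketch with hypothetical object names (`traj_array`, `tract_ids`) and illustrative variable names and cluster counts, not the project's exact code:
```{r, eval=FALSE}
library(kml3d)

# traj_array: tracts x timepoints (2000, 2010, 2017) x standardized variables
cld <- clusterLongData3d(
  traj     = traj_array,
  idAll    = tract_ids,
  time     = c(2000, 2010, 2017),
  varNames = c("MEDHHINC", "MEDGRENT", "MEDHOMEVAL")
)

# Run k-means for joint longitudinal trajectories over a range of cluster
# counts, with several random restarts ("redrawings") per candidate k.
kml3d(cld, nbClusters = 2:8, nbRedrawing = 20)

# Extract the partition for a chosen number of clusters.
clusters <- getClusters(cld, nbCluster = 5)
```
<br>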
<h1>Opening the Black Box: Reverse-engineering the Clusters with a k-fold Random Forest Model</h1>
After clustering, it was important to our users to know how we obtained the clusters we did: which variables, and which changes in those variables, were driving the groupings? To answer this question, we "reverse-engineered" our clusters. Similar to physical reverse engineering, in which one starts with the end product and duplicates it without the aid of any drawings or instructions, we started with our clusters and then used all of our variables as predictors of those clusters. There are a variety of random forest and related tree-ensemble approaches, including out-of-bag validation, gradient boosting machines, and other boosted-tree methods. In a k-fold random forest, the original data frame is partitioned into k folds; the model is then "trained" on k-1 folds and tested on the remaining fold, and this is rotated until every fold has served as the test fold in one iteration. In our case, we used 9 folds in total, with 8 training folds and 1 test fold in every iteration. We chose a k-fold random forest not only because of its high predictive accuracy, but also because rotating every fold through a test iteration allowed us to easily extract the algorithm's probabilistic predictions for each observation (tract).
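
A minimal sketch of this setup, assuming a hypothetical data frame `snc_clustered` with one row per tract, the kml3d cluster assignment as a factor column `cluster`, and the predictor variables in the remaining columns. The fold count mirrors the 9-fold design described above; the caret and randomForest packages are used here for illustration and may differ from the project's exact implementation:
```{r, eval=FALSE}
library(caret)

# 9-fold cross-validation: every fold takes one turn as the test fold,
# and the out-of-fold class probabilities are saved for each tract.
# (classProbs = TRUE requires the cluster levels to be valid R names.)
ctrl <- trainControl(
  method          = "cv",
  number          = 9,
  classProbs      = TRUE,
  savePredictions = "final"
)

set.seed(1)
rf_fit <- train(
  cluster ~ .,              # predict the kml3d cluster from all other variables
  data      = snc_clustered,
  method    = "rf",         # random forest via the randomForest package
  trControl = ctrl
)

# Which variables drove the cluster assignments?
varImp(rf_fit)

# Out-of-fold probabilistic predictions for each observation (tract).
head(rf_fit$pred)
```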