analysis.qmd

---
title: "Data Analysis"
html:
  code-fold: true
  code-summary: "Show the code"
---

## Overview
The goal of this analysis is to understand how dependency ratios in Washington State are evolving temporally and spatially. How are the statewide ratios evolving? Which counties have the highest ratios? Are there regional differences in dependency ratios? If so, are there any patterns to the distribution of high dependency ratios?

```{r setup}
#| message: false
#| echo: false
library(tidyverse)
library(sf)
library(tigris)
library(gt)
library(broom)
library(ggrepel)

page_theme <- theme_minimal() +
  theme(
    plot.caption.position = "plot",
    plot.title.position = "plot",
    plot.subtitle = element_text(size=14),
    plot.title = element_text(size=18)
  )

map_theme <- theme_void() +
theme(
  plot.subtitle = element_text(size=14),
  plot.title = element_text(size=18),
  strip.text = element_text(size = 12),
  legend.position = "bottom",
  plot.caption.position = "plot",
  plot.title.position = "plot"
)
theme_set(page_theme)
```

```{r read in county dataframe}
#| echo: false
df_counties <- read_rds("intermediate/df_countries_proced.rds")
```

```{r store year range in variables}
#| echo: false
years <- levels(df_counties$year)
start_year <- first(years)
last_year <- last(years)

start_year_num <- as.numeric(start_year)
last_year_num <- as.numeric(last_year)
```

## Statewide dependency ratios
As the chart and table below show, the total dependency ratio for the State is increasing. Child and aged dependency ratios are moving in opposite directions. Child dependency is decreasing slowly while the aged dependency ratio is increasing. The aged dependency ratio is driving the trend in total dependency over the `r last_year_num - start_year_num` years from `r start_year` to `r last_year`.

```{r statewide dependency ratios}
#| code-fold: true
#| code-summary: "Show the code"

df_counties |> 
  filter(geography == "Washington State") |> 
  select(-(age_65:age_1)) |> 
  pivot_longer(total_dep_ratio:aged_dep_ratio, names_to = "dep_ratio", values_to = "ratio_val") |> 
  mutate(
    dep_ratio = fct_rev(fct_recode(as_factor(dep_ratio), 
                             "Total" = "total_dep_ratio",
                             "Child" = "child_dep_ratio",
                             "Aged" = "aged_dep_ratio"))
  ) |> 
  ggplot(aes(x=year, y=ratio_val, fill=dep_ratio)) +
  geom_col(position = "dodge", colour="white") +
  labs(
    x = element_blank(),
    y = "Value",
    fill = "Dependency ratio",
    caption="Data source: Washington State County Demography Dashboard",
    title = paste("Total dependency increases from", start_year, "to", last_year),
    subtitle = "Child and aged dependencies move in opposite directions"
  ) +
  theme(
    legend.position = "bottom"
  )
  
```
```{r table of DR values at state level}
#| code-fold: true
#| code-summary: "Show the code"
df_counties |> 
  filter(geography == "Washington State") |> 
  select( "Year"=year, "Child"=child_dep_ratio, "Aged"=aged_dep_ratio, "Total"=total_dep_ratio) |> 
  mutate(
    pct_change_total = round((Total - lag(Total, n=1))/lag(Total, n=1)*100, 2)
  ) |> 
  gt() |> 
  tab_header(
    title = paste("Changes in dependency ratios in Washington State,", start_year, "-", last_year)
  ) |> 
  cols_label(
    pct_change_total = "Percent Change Total"
  ) |> 
  tab_options(
    table.width = pct(100)
  )
  
```

### Understanding the total dependency ratio trend

In order to quantify how total dependency is changing in Washington a simple linear model was employed. The model indicates that in each subsequent year the total dependency ratio increases by about 0.67 and the intercept can be interpreted as the value of the total dependency ratio in 2010.

```{r TDR trends}
#| code-fold: true
#| code-summary: "Show the code"

lm_fit <- df_counties |> 
  filter(geography == "Washington State") |>
  mutate(
    year = as.numeric(year)
  ) |> 
  lm(total_dep_ratio ~ year, data = _)

model_estimates <- broom::tidy(lm_fit)

model_estimates |> 
  gt() |> 
  tab_header(
    title = "Simple linear model of total depency ratio"
  ) |> 
  fmt_number(
    decimals = 2
  )
```

The plot below shows the model fit and confidence interval.

```{r plot total dependency trend}
#| message: false
#| code-fold: true
#| code-summary: "Show the code"
df_counties |> 
  filter(geography == "Washington State") |>
  ggplot(aes(x=as.numeric(year), y=total_dep_ratio)) +
  geom_point() +
  geom_smooth(method="lm", se=TRUE) +
  scale_x_continuous(
    breaks = 1:11,
    minor_breaks = NULL,
    labels = as.character(2011:2021)
  ) +
  labs(
    title = paste("Washington State total dependecy ratio trend",start_year,"-",last_year),
    x = "Year",
    y = "Total dependency ratio",
    caption="Data source: Washington State County Demography Dashboard"
  )
```
The confidence interval is very tight around the trend line.

### Prediciting future total dependency ratios
We can project this model forward a few years to see what total dependency ratios will be in subsequent years. When the 2022-23 demographic data become available, they can be compared with model predictions and the model can be updated.

```{r TDR predictions}
#| code-fold: true
#| code-summary: "Show the code"
#| message: false
# 2022 and 2023 are the 12th and 13th years in the series.
TDR_preds <- predict(lm_fit, newdata = data.frame(year = c(12,13)), type="response")
```
In this case, the model predicts total dependency ratios of `r round(TDR_preds[1],2)` and `r round(TDR_preds[2],2)` for 2022 and 2023, respectively.

## Counties with highest total dependency ratios
The table below shows the counties with the highest total dependency ratios in 2021. The counties in this list all have much higher aged dependency ratios than the statewide value (25.71) while generally having typical child dependency ratios. Jefferson county being the notable exception with a child dependency ratio of 18.58.

```{r identify high ratio counties}
#| code-fold: true
#| code-summary: "Show the code"

df_high_ratio <- df_counties |> 
  filter(geography != "Washington State" & year == 2021) |> 
  arrange(desc(total_dep_ratio)) |> 
  select("County"= geography, "Child"=child_dep_ratio, "Aged"=aged_dep_ratio, "Total"=total_dep_ratio) |>
  head(10)

df_high_ratio |> 
  gt() |> 
  tab_header(
    title = paste("Counties with highest dependency ratios in",last_year)
  ) |> 
  tab_options(
    table.width = pct(100)
  )
```

### Where these counties are located
In order to understand how dependency ratios vary across the state, shape files for the counties are imported and the dependency ratios are overlaid on a map of the state.

```{r create WA geometries and map high ratio counties}
#| warning: false
#| code-fold: true
#| code-summary: "Show the code"
#| fig-width: 8
#| fig-height: 6

wa_county <- counties(state = "WA", cb = TRUE, class = "sf", progress_bar = FALSE)

label_size <- 3.5

wa_county |> 
  left_join(df_high_ratio, by=c("NAME"="County")) |>
  mutate(
    NAME = if_else(is.na(Total), "", NAME)
  ) |>
  ggplot() +
  geom_sf(aes(fill=Total), colour="white", ) +
  geom_label_repel(aes(label = NAME, geometry = geometry),
                  stat = "sf_coordinates", size = label_size) +
  # for the same colour scale for all dependency maps with limits
  scale_fill_continuous(type = "viridis",limits=c(0, 100)) +
  map_theme +
  labs(
    title="Counties with highest total dependency ratios in 2021",
    caption="Data source: Washington State County Demography Dashboard",
    fill="Total dependency ratio"
  )
```
From this map, it is clear that the distribution of the high dependency counties is not random. They are located on the east/west extremes of the state. Furthermore, 5 of the top 6 are located on the west coast, while the other five are located in the eastern quarter of the state.

#### Child dependency comparisons
These counties have roughly similar rates of child dependency, with Jefferson and San Juan Counties being notable exceptions. 

```{r map high ratio counties child dependency rates}
#| warning: false
#| code-fold: true
#| code-summary: "Show the code"
#| fig-width: 8
#| fig-height: 6
wa_county |> 
  left_join(df_high_ratio, by=c("NAME"="County")) |>
  mutate(
    NAME = if_else(is.na(Child), "", NAME)
  ) |>
  ggplot() +
  geom_sf(aes(fill=Child), colour="white", ) +
  geom_label_repel(aes(label = NAME, geometry = geometry),
                  stat = "sf_coordinates", size = label_size) +
  # for the same colour scale for all dependency maps with limits
  scale_fill_continuous(type = "viridis",limits=c(0, 100)) +
  map_theme +
  labs(
    title="Child dependency ratios for counties with highest\ntotal dependency ratios in 2021",
    caption="Data source: Washington State County Demography Dashboard",
    fill="Child dependency ratio"
  )
```
That said, there is a statistically significant difference, with $\alpha=0.05$, between the mean child dependency ratios of the west coast and eastern counties.

```{r comparing west coast and eastern counties child dependency ratios}
#| code-fold: true
#| code-summary: "Show the code"
west_coast_CDR <- c(18.85, 24.40, 24.47, 24.55, 19.69)
eastern_CDR <- c(31.6, 30.61, 27.34, 25.46, 27.13)
state_CDR_mean <- 27.73

t.test(
  west_coast_CDR, # west coast counties
  eastern_CDR  # eastern counties
)
```
If we compare the west coast counties to the statewide mean, there is a statistically significant difference at the $\alpha=0.05$ significance level.

```{r testing west coast against statewide CDR}
#| code-fold: true
#| code-summary: "Show the code"
t.test(
  x=west_coast_CDR,
  mu = state_CDR_mean
)
```
If we compare the eastern counties to the statewide mean, we can see that there is *not* a statistically significant difference at the $\alpha=0.05$ significance level.

```{r testing eastern counties against statewide CDR}
#| code-fold: true
#| code-summary: "Show the code"
t.test(
  x=eastern_CDR,
  mu = state_CDR_mean
)
```
#### Aged dependency comparisons
The map shows aged dependency ratios for these counties. It is clear that the west coast counties have higher ratios than the eastern counties.

```{r map high ratio counties aged dependency rates}
#| warning: false
#| code-fold: true
#| code-summary: "Show the code"
#| fig-width: 8
#| fig-height: 6

wa_county |> 
  left_join(df_high_ratio, by=c("NAME"="County")) |>
  mutate(
    NAME = if_else(is.na(Aged), "", NAME)
  ) |>
  ggplot() +
  geom_sf(aes(fill=Aged), colour="white", ) +
  geom_label_repel(aes(label = NAME, geometry = geometry),
                  stat = "sf_coordinates", size = label_size) +
  # for the same colour scale for all dependency maps with limits
  scale_fill_continuous(type = "viridis",limits=c(0, 100)) +
  map_theme +
  labs(
    title="Aged dependency ratios for counties with highest\ntotal dependency ratios in 2021",
    caption="Data source: Washington State County Demography Dashboard",
    fill="Aged dependency ratio"
  )
```
With $\alpha = 0.05$, there is a statistically significant difference between the means of the two regions' counties.

```{r comparing west coast and eastern counties aged dependency ratios}
#| code-fold: true
#| code-summary: "Show the code"
t.test(
  c(79.39, 66.51, 64.54, 62.54, 67.15), # west coast counties
  c(59.65, 53.93, 51.14, 52.23, 50.42)  # eastern counties
)
```
The aged dependency ratios for both county groups is so much higher than the statewide mean that performing a statistical test is unnecessary.

::: {.callout-tip} 
## Key takeaway
Among the counties with the highest total dependencies there are significant differences between these counties and the statewide value. Furthermore, there are differences between the regional subgroups in these 10 counties.
:::

### Modeling changes in total dependency
Predicting future changes to the total dependency ratio for counties can help county planners understand what budgetary pressures they might expect to face in the coming years. While attempting to forecast far into the future comes with risk, we can use simple linear models to project into the near future for the counties with the highest total dependency ratios.

The trend lines for all 10 counties show a clear upward movement.

```{r build linear models}
#| message: false
#| code-fold: true
#| code-summary: "Show the code"
#| fig-width: 10
#| fig-height: 10

high_ratio_counties <- df_high_ratio |> 
  select(County) |> 
  pull()

df_county_models <- df_counties |> 
  filter(geography %in% high_ratio_counties) |> 
  select(-c(age_65:age_1, child_dep_ratio, aged_dep_ratio)) |>
  mutate(
    year = as.numeric(year)
  ) |> 
  group_by(geography) |> 
  nest() |> 
  mutate(
    model = map(.x=data, ~lm(total_dep_ratio ~ year, data=.x) |> tidy())
  )

df_county_models |> 
  unnest(data) |>
  ggplot(aes(x=year, y=total_dep_ratio)) +
  geom_point() +
  geom_smooth(method = "lm", se=TRUE) +
  facet_wrap(~geography, ncol=2) +
  scale_x_continuous(
    breaks = c(1:11),
    labels = as.character(2011:2021),
    minor_breaks = NULL,
  ) +
  labs(
    x=element_blank(),
    y="Total dependency ratio",
    title="Simple regression models of total dependency ratios",
    caption="Data source: Washington State County Demography Dashboard"
  ) +
  theme(
    axis.text.x = element_text(
      angle = 45
    )
  )
```
The following table shows the model slopes for each of the ten counties. The interpretation of these slopes is that for each year the total dependency ratio will change by the amount of the slope. For Jefferson County, the state with the highest total dependency ratio, we can expect the total dependency ratio to change by 3.31 points each year. This almost 5 times higher than Washington State's model slope of 0.67. Garfield County has the steepest slope, but as discussed below, the integrity of the Garfield County data is in question. The standard error for the Garfield County slope reflects this.

```{r table of model results}
#| code-fold: true
#| code-summary: "Show the code"

df_county_models |> 
  select(-data) |> 
  unnest(model) |>
  select(-statistic) |> 
  pivot_wider(names_from = term, values_from = estimate) |> 
  rename(Intercept = "(Intercept)", "Slope"=year) |> 
  filter(is.na(Intercept)) |> 
  select(-Intercept, County = "geography", Slope, "Std Error" = "std.error", "P Value"="p.value") |> 
  relocate(Slope, .before = "Std Error") |> 
  arrange(desc(Slope)) |> 
  ungroup() |> 
  gt() |> 
  tab_header(
    title = "Rates of change for total dependency ratio by county"
  ) |>
  fmt_number(
    decimals = 2
  ) |> 
  cols_align(
    align = "left",
    columns = County
  )
```

#### Potential problems with Garfield County data
The Garfield County data has an interesting bend in it that occurred in 2016. This behaviour isn't seen in the other counties. The Garfield County data for 2015-2016 shows the following:

```{r Garfield County 2015-2016}
#| code-fold: true
#| code-summary: "Show the code"
df_counties |> 
  filter(geography == "Garfield" & year %in% c(2015,2016)) |>
  select(-(age_65:age_1)) |>
  select("Child"=child_dep_ratio, "Aged"=aged_dep_ratio, "Total"=total_dep_ratio) |> 
  gt() |> 
  tab_header(
    title = paste("Garfield County dependency data, 2015-2016")
  ) |> 
  tab_options(
    table.width = pct(100)
  )
```

There is a noticeable jump in both the child and aged dependency rates from 2015 to 2016. There is not a significant change in the total population of the county over the decade from 2011 to 2021 as shown in the following table:

```{r}
#| code-fold: true
#| code-summary: "Show the code"
df_counties |> 
  filter(geography=="Garfield") |> 
  select("Year" = year, age_65:age_1) |> 
  pivot_longer(cols=age_65:age_1, names_to = "brackets", values_to = "value") |> 
  group_by(Year) |> 
  summarise(`Total Population` = ceiling(sum(value))) |> 
  gt()
```

There is a small decline from 2015 to 2016. In looking for an explanation for this jump I consulted the [Washington Regional Economic Analysis Project](https://washington.reaproject.org/analysis/comparative-trends-analysis/population/tools/530023/530000/) as well as [Garfield County, WA](https://datausa.io/profile/geo/garfield-county-wa/) looking for a possible explanation for this change. 

::: {.callout-warning}
## More information needed
Additional consultation with Washington State Department of Health is required to understand what caused this pattern in the Garfield County data.
:::

<section class="button-container">
  <a href="conclusions.html" role="button" class="btn btn-primary">Continue to Conclusions</a>
</section>