07-multilevel-01.qmd

# Multilevel Modelling - Part 1 {#sec-chp7}

This chapter provides an introduction to multi-level data structures and multi-level modelling and draws on the following references:

-   @gelman2006data provides an excellent and intuitive explanation of multilevel modelling and data analysis in general. Read Part 2A for a really good explanation of multilevel models.
-   @bristol2020 is an useful online resource on multilevel modelling and is free!

## Dependencies

We will use the following dependencies

```{r}
#| include = FALSE
source("./style/data-visualisation_theme.R")
```

```{r}
#| message: false
#| warning: false
# Data manipulation, transformation and visualisation
library(tidyverse)
# Nice tables
library(kableExtra)
# Spatial data manitulation
library(sf) 
# Thematic maps
library(tmap) 
# Colour palettes
library(viridis) 
# Fitting multilevel models
library(lme4)
# Tools for extracting information generated by lme4
library(merTools)
# Exportable regression tables
library(jtools)
```

## Data

For this chapter, we will data for Liverpool from England's 2011 Census. The original source is the [Office of National Statistics](https://www.nomisweb.co.uk/home/census2001.asp) and the dataset comprises a number of selected variables capturing demographic, health and socio-economic attributes of the local resident population at four geographic levels: Output Area (OA), Lower Super Output Area (LSOA), Middle Super Output Area (MSOA) and Local Authority District (LAD). The variables include population counts and percentages. For a description of the variables, see the readme file in the mlm data folder.[^07-multilevel-01-1]

[^07-multilevel-01-1]: Read the file in R by executing `read_tsv("data/mlm/readme.txt")` . Ensure the library `readr` is installed before running `read_tsv`.

Let us read the data:

```{r}
# read data
oa_shp <- st_read("data/mlm/OA.shp")
```

We can now attach and visualise the structure of the data.

```{r}
#| message: false
# attach data frame
attach(oa_shp)

# sort data by oa
oa_shp <- oa_shp[order(oa_cd),]
head(oa_shp)
```

![Fig. 1. Data Structure.](figs/ch5/datastr.png)

The data are hierarchically structured: OAs nested within LSOAs; LSOAs nested within MSOAs; and, MSOAs nested within LADs. Observations nested within higher geographical units may be correlated.

This is one type of hierarchical structure. There is a range of data structures:

-   Strict nested data structures eg. an individual unit is nested within only one higher unit

-   Repeated measures structures eg. various measurements for an individual unit

-   Crossed classified structures eg. individuals may work and live in different neighbourhoods

-   Multiple membership structure eg. individuals may have two different work places

*Why should we care about the structure of the data?*

-   *Draw correct statistical inference*: Failing to recognise hierarchical structures will lead to underestimated standard errors of regression coefficients and an overstatement of statistical significance. Standard errors for the coefficients of higher-level predictor variables will be the most affected by ignoring grouping.

-   *Link context to individual units*: We can link and understand the extent of group effects on individual outcomes eg. how belonging to a certain socio-economic group influences on future career opportunities.

-   *Spatial dependency*: Recognising the hierarchical structure of data may help mitigate the effects of severe spatial autocorrelation.

Quickly, let us get a better idea about the data and look at the number of OAs nested within LSOAs and MSOAs

```{r}
# mean of nested OAs within LSOAs and MSOAs
lsoa_cd %>% table() %>%
  mean() %>%
  round(, 2)

msoa_cd %>% table() %>%
  mean() %>%
  round(, 2)

# number of OAs nested within LSOAs and MSOAs
lsoa_cd %>% table() %>%
  sort() %>%
  plot()

msoa_cd %>% table() %>%
  sort() %>%
  plot()
```

## Modelling

We should now be persuaded that ignoring the hierarchical structure of data may be a major issue. Let us now use a simple example to understand the intuition of multilevel model using the census data. We will seek to understand the spatial distribution of the proportion of population in unemployment in Liverpool, particularly why and where concentrations in this proportion occur. To illustrate the advantages of taking a multilevel modelling approach, we will start by estimating a linear regression model and progressively building complexity. We will first estimate a model and then explain the intuition underpinning the process. We will seek to gain a general understanding of multilevel modelling. If you are interested in the statistical and mathemathical formalisation of the underpinning concepts, please refer to @gelman2006data.

We first need to want to understand our dependent variable: its density ditribution;

```{r}
ggplot(data = oa_shp) +
  geom_density(alpha=0.8, colour="black", fill="lightblue", aes(x = unemp)) +
   theme_plot_tufte()
```

```{r}
summary(unemp)
```

and, its spatial distribution:

```{r}
# ensure geometry is valid
oa_shp = sf::st_make_valid(oa_shp)

# create a map
legend_title = expression("% unemployment")
map_oa = tm_shape(oa_shp) +
  tm_fill(col = "unemp", title = legend_title, palette = magma(256, begin = 0.25, end = 1), style = "cont") + 
  tm_borders(col = "white", lwd = .01)  + 
  tm_compass(type = "arrow", position = c("right", "top") , size = 4) + 
  tm_scale_bar(breaks = c(0,1,2), text.size = 0.5, position =  c("center", "bottom")) 
map_oa
```

Let us look at those areas:

```{r}
# high %s
oa_shp %>% filter(unemp > 0.2) %>% 
  dplyr::select(oa_cd, pop, unemp) 
```

### Baseline Linear Regression Model

Now let us estimate a simple linear regression model with the intercept only:

```{r}
# specify a model equation
eq1 <- unemp ~ 1
model1 <- lm(formula = eq1, data = oa_shp)

# estimates
summary(model1)
```

To understand the differences between the linear regression model and multilevel models, let us consider the model we have estimated:

$$y_{i} = \beta_{0} + e_{i}$$ where $y_{i}$ represents the proportion of the unemployed resident population in the OA $i$; $\beta_{0}$ is the regression intercept and measures the average proportion of the unemployed resident population across OAs; and, $e_{i}$ is the error term. But how do we deal with the hierarchical structure of the data?

#### Limitations

Before looking at the answer, let's first understand some of the key limitations of the linear regression model to handle the hierarchical structure of data. A key limitation of the linear regression model is that it only captures average relationships in the data. It does not capture variations in the relationship between variables across areas or groups. Another key limitation is that the linear regression model can capture associations at either macro or micro levels, but it does not simultaneously measure their interdependencies.

To illustrate this, let us consider the regression intercept. It indicates that the average percentage of unemployed population at the OA level is 0.12 but this model ignores any spatial clustering ie. the percentage of unemployed population tends to be similar across OAs nested within a same LSOA or MSOA. A side effect of ignoring this is that our standard errors are biased, and thus claims about statistical significance based on them would be misleading. Additionally, this situation also means we cannot explore variations in the percentage of unemployed population across LSOAs or MSOAs ie. how the percentage of unemployed population may be dependent on various contextual factors at these geographical scales.

#### Fixed Effect Approach

An alternative approach is to adopt a fixed effects approach, or no-pooling model; that is, adding dummy variables indicating the group classification into the regression model eg. the way OAs is nested within LSOAs (or MSOAs). This approach has limitations. First, there is high risk of overfitting. The number of groups may be too large, relative to the number of observations. Second, the estimation of multiple parameters may be required so that measuring differences between groups may be challenging. Third, a fixed effects approach does not allow including group-level explanatory variables. You can try fitting a linear regression model extending our estimated model to include dummy variables for individual LSOAs (and/or MSOAs) so you can compare this to the multilevel model below.

An alternative is fitting separate linear regression models for each group. This approach is not always possible if there are groups with small sizes.

## Multilevel Modelling: Random Intercept Model

We use multilevel modelling to account for the hierarchical nature of the data by explicitly recognising that OAs are nested within LSOAs and MSOAs. Multilevel models can easily be estimated using in R using the package `lme4`. We implement an two-level model to allow for variation across LSOAs. We estimate an only intercept model allowing for variation across LSOAs. In essence, we are estimating a model with varying intercept coefficient by LSOA. As you can see in the code chunk below, the equation has an additional component. This is the group component or LSOA effect. The `(1 | lsoa_cd)` means that we are allowing the intercept, represented by 1, to vary by LSOA.

```{r}
# specify a model equation
eq2 <- unemp ~ 1 + (1 | lsoa_cd)
model2 <- lmer(eq2, data = oa_shp)

# estimates
summary(model2)
```

We can estimate a three-level model by replacing `(1 | lsoa_cd)` for `(1 | msoa_cd/lsoa_cd)` to allow the intercept to also vary by MSOAs and account for the nesting structure of LSOAs within MSOAs. In multilevel modelling, these types of models are formally known as *nested random effects* and they differ from a different set of models known as *crossed random effects*.

::: column-margin 
::: callout-note 
A crossed random effect model in our example would be expressed as follows:

`unemp ~ 1 + (1 | lsoa_cd) + (1 | msoa_cd)`
::: 
:::

```{r}
# specify a model equation
eq3 <- unemp ~ 1 + (1 | msoa_cd/lsoa_cd)
model3 <- lmer(eq3, data = oa_shp)

# estimates
summary(model3)
```

We see two sets of coefficients: *fixed effects* and *random effects*. *Fixed effects* correspond to the standard linear regression coefficients. Their interpretation is as usual. *Random effects* are the novelty. It is a term in multilevel modelling and refers to varying coefficients i.e. the randomness in the probability of the model for the group-level coefficients. Specifically they relate to estimates of the average variance and standard deviation within groups (i.e. LSOAs or MSOAs). Intiutively, variance and standard deviation indicate the extent to which the intercept, on average, varies by LSOAs and MSOAs.

![Fig. 2. Variation of observations around their level 1 group mean.](figs/ch5/nm_dist_obs.png)

![Fig. 3. Variation of level 1 group mean around their level 2 group mean.](figs/ch5/nm_dist_msoa.png)

![Fig. 4. Grand mean.](figs/ch5/nm_dist_grand.png)

More formally, we first estimated the simplest regression model which is an intercept-only model and equivalent to the sample mean (i.e. the *fixed* part of the model):

$$y_{ijk} = \mu + e_{ijk}$$ and then we made the *random* part of the model ($e_{ijk}$) more complex to account for the hierarchical structure of the data by estimating the following three-level regression model:

$$y_{ijk} = \mu + u_{i..} + u_{ij.} + e_{ijk}$$

where $y_{ijk}$ represents the proportion of unemployed population in OA $i$ nested within LSOA $j$ and MSOA $k$; $\mu$ represents the sample mean and the *fixed* part of the model; $e_{ijk}$ is the deviation of an observation from its LSOA mean; $u_{ij.}$ is the deviation of the LSOA mean from its MSOA mean; $u_{i..}$ is the deviation of the MSOA mean from the fixed part of the model $\mu$. Conceptually, this model is decomposing the variance of the model in terms of the hierarchical structure of the data. It is partitioning the observation's residual into three parts or *variance components*. These components measure the relative extent of variation of each hierarchical level ie. LSOA, MSOA and grand means. To estimate the set of residuals, they are assumed to follow a normal distribution and are obtained after fitting the model and are based on the estimates of the model parameters (i.e. intercept and variances of the random parameters).

Let's now return to our three-level model (reported again below), we see that the intercept or fixed part of the model is the same as for the linear regression. The multilevel model reports greater standard errors. Multilevel models capture the hierarchical structure of the data and thus more precisely estimate the standard errors for our parameters.

```{r}
# report model 3
summary(model3)
```

### Interpretation

> Fixed effects

We start by examining the fixed effects or estimated model averaging over LSOAs and MSOAs, $y_{ijk} = 0.115288$ which can also be called by executing:

```{r}
fixef(model3)
```

Th estimated intercept indicates that the overall mean taken across LSOAs and MSOAs is estimated as `0.1152881`.

> Random effects

The set of random effects contains three estimates of variance and standard deviation and refer to the variance components discussed above. The `lsoa_cd:msoa_cd`, `msoa_cd` and `Residual` estimates indicate that the extent of estimated LSOA-, MSOA- and individual-level variance is `0.0007603`, `0.0020735` and `0.0025723`, respectively.

### Variance Partition Coefficient (VPC)

The purpose of multilevel models is to partition variance in the outcome between the different groupings in the data. We thus often want to know the percentage of variation in the dependent variable accounted by differences across groups i.e. what proportion of the total variance is attributable to variation within-groups, or how much is found between-groups. The statistic to obtain this is termed the variance partition coefficient (VPC), or intraclass correlation.[^07-multilevel-01-2] For our case, the VPC at the MSOA level indicates that 38% of the variation in percentage of unemployed resident population across OAs can be explained by differences across MSOAs.

[^07-multilevel-01-2]: The VPC is equal to the intra-class correlation coefficient which is the correlation between the observations of the dependent variable selected randomly from the same group. For instance, if the VPC is 0.1, we would say that 10% of the variation is between groups and 90% within. The correlation between randomly chosen pairs of observations belonging to the same group is 0.1.

::: column-margin
::: {.callout-tip appearance="simple" icon="false"}
**Task** What is the VPC at the LSOA level?
:::
:::

```{r}
vpc_msoa <- 0.0020735 / (0.0007603 + 0.0020735 + 0.0025723)
vpc_msoa * 100
```

You can also obtain the VPC by executing:

```{r}
summ(model3)
```

### Uncertainty of Estimates

You may have noticed that `lme4` does not provide p-values, because of [various reasons](https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html) as explained by Doug Bates, one of the author of `lme4`. These explanations mainly refer to the complexity of dealing with varying sample sizes at a given hierarchical level. The number of observations at each hierarchical level varies across individual groupings (i.e. LSOA or MSOA). It may even be one single observation. This has implications for the distributional assumptions, denominator degrees of freedom and how to approximate a "best" solution. Various approaches exist to compute the statistical significance of estimates. We use the `confint` function available within `lme4` to obtain confidence intervals.

```{r}
confint(model3, level = 0.95)
```

`.sig01` refers to the LSOA level; `.sig02` refers to the MSOA level; and, `.sigma` refers to the OA level.

### Assessing Group-level Variation

*Estimated regression coefficients*

In multilevel modelling, our primary interest is in knowing differences across groups. To visualise the estimated model within each group (ie. LSOA and MSOA), we type:

```{r}
coef_m3 <- coef(model3)
head(coef_m3$lsoa_cd,5)
```

The results indicate that the estimated regression line is $y = 0.09915456$ for LSOA `E01006512` within MSOA `E02001377`; $y = 0.09889615$ for LSOA `E01006513` within MSOA `E02006932` and so forth.

::: column-margin
::: {.callout-tip appearance="simple" icon="false"}
**Task** Try getting the estimated model within each MSOA.
:::
:::

*Random effects*

We can look at the estimated group-level (or LSOA-level and MSOA-level) errors; that is, *random effects*:

```{r}
ranef_m3 <- ranef(model3)
head(ranef_m3$lsoa_cd, 5)
```

Group-level errors indicate how much the intercept is shifted up or down in particular groups (ie. LSOAs or MSOAs). Thus, for example, in LSOA `E01006512`, the estimated intercept is `-0.01613353` lower than average, so that the regression line is `(0.1152881 - 0.01613353)` `= 0.09915457` which is what we observed from the call to `coef()`.

We can also obtain group-level errors (*random effects*) by using a simulation approach, labelled "Empirical Bayes" and discussed [here](https://stat.ethz.ch/pipermail/r-sig-mixed-models/2009q4/002984.html). To this end, we run:

```{r}
# obtain estimates
merTools::REsim(model3) %>% 
  head(10)
```

The results contain the estimated mean, median and standard deviation for the intercept within each group (e.g. LSOA). The mean estimates are similar to those obtained from `ranef` with some small differences due to rounding.

To gain an undertanding of the general pattern of the *random effects*, we can use caterpillar plots via `plotREsim` - reported below. The plot on the right shows the estimated random effects for each MSOA and their respective interval estimate. Note that random effects are on average zero, represented by the red horizontal line. Intervals that do not include zero are in bold. Also note that the width of the confidence interval depends on the standard error of the respective residual estimate, which is inversely related to the size of the sample. The residuals represent an observation departures from the grand mean, so an observation whose confidence interval does not overlap the line at zero (representing the mean proportion of unemployed population across all areas) is said to differ significantly from the average at the 5% level.

```{r}
# plot
plotREsim(REsim(model3)) 
```

Focusing on the plot on the right, we see MSOAs whose mean proportion of unemployed population, assuming no explanatory variables, is lower than average. These are the dots below the horizontal red line. On the right-hand side of the plot, you will see MSOAs whose mean proportion is higher than average. The MSOAs with the smallest residuals include the districts of Allerton and Hunt Cross, Church, Childwall, Wavertree and Woolton.

::: column-margin
::: {.callout-tip appearance="simple" icon="false"}
**Task** What districts do we have at the other extreme? Have a go at identifying them.
:::
:::

```{r}
re <- REsim(model3)
oa_shp %>% dplyr::select(msoa_cd, ward_nm, unemp) %>%
    filter(as.character(msoa_cd) == "E02001387" | as.character(msoa_cd) == "E02001393")
```

We can also map the MSOA-level *random effects*. To this end, we first need to read a shapefile containing data at the MSOA level and merge it with the *random effects* estimates.

```{r}
# read data
msoa_shp <- st_read("data/mlm/MSOA.shp")

# create a dataframe for MSOA-level random effects
re_msoa <- re %>% filter(groupFctr == "msoa_cd")
str(re_msoa)

# merge data
msoa_shp <- merge(x = msoa_shp, y = re_msoa, by.x = "MSOA_CD", by.y = "groupID")
```

Now we can create our map:

```{r}
# ensure geometry is valid
msoa_shp = sf::st_make_valid(msoa_shp)

# create a map
legend_title = expression("MSOA-level residuals")
map_msoa = tm_shape(msoa_shp) +
  tm_fill(col = "mean", title = legend_title, palette = magma(256, begin = 0, end = 1), style = "cont") + 
  tm_borders(col = "white", lwd = .01)  + 
  tm_compass(type = "arrow", position = c("right", "top") , size = 4) + 
  tm_scale_bar(breaks = c(0,1,2), text.size = 0.5, position =  c("center", "bottom")) 
map_msoa
```

### Adding Individual-level Predictors {#sec-indlevel}

In this example, $\mu$ represents the sample mean but it could include a collection of independent variables or predictors. To explain the logic, we will assume that unemployment is strongly associated to long-term illness. We could expect that long-term illness (`lt_ill`) will reduce the chances of working and therefore being unemployed. Note that our focus is on the relationship, not on establishing causation. Specifically we want to estimate the relationship between unemployment and long-term illness and we are interested in variations in OA-level unemployment by MSOAs so we will estimate the following two-level model:

OA-level:

$$y_{ij} = \beta_{0j} + \beta_{1}x_{ij} + e_{ij}$$ MSOA-level:

$$\beta_{0j} = \beta_{0} + u_{0j}$$ Replacing the first equation into the second, we have:

$$y_{ij} = (\beta_{0} + u_{0j}) + \beta_{1}x_{ij} + e_{ij}$$ where $y$ the proportion of unemployed population in OA $i$ within MSOA $j$; $\beta_{0}$ is the fixed intercept (averaging over all MSOAs); $u_{0j}$ represents the MSOA-level residuals or *random effects*; $\beta_{0}$ and $u_{0j}$ together represent the varying-intercept; $\beta_{1}$ is the slope coefficient; $x_{ij}$ represents the percentage of long-term illness population; and, $e_{ij}$ is the individual-level residuals.

We estimate the model executing:

```{r}
# change to proportion
oa_shp$lt_ill <- lt_ill/100

# specify a model equation
eq4 <- unemp ~ lt_ill + (1 | msoa_cd)
model4 <- lmer(eq4, data = oa_shp)

# estimates
summary(model4)
```

*Fixed effects*: model averaging over MSOAs

```{r}
fixef(model4)
```

yields an estimated regression line in an average McSOA: $y = 0.04681959 + 0.29588110x$

*Random effects*: MSOA-level errors

```{r}
ranef_m4 <- ranef(model4)
head(ranef_m4$msoa_cd, 5)
```

yields an estimated intercept for MSOA `E02001347` which is `0.017474815` lower than the average with a regression line: `(0.04681959 - 0.017474815) + 0.29588110x` `=` `0.02934478 + 0.29588110x`. You can confirm this by looking at the estimated model within each MSOA by executing on the first row:

```{r}
coef(model4) %>% head(n = 5 )
```

*Fixed effect correlations*

In the bottom of the output, we have the correlations between the fixed-effects estimates. In our example, it refers to the correlation between $\beta_{0}$ and $\beta_{1}$. It is negative indicating that in MSOAs where the relationship between unemployment and long-term illness is greater, as measured by $\beta_{1}$, the average proportion of unemployed people tends to be smaller, as captured by $\beta_{0}$.

### Adding Group-level Predictors {#sec-grouplevel}

We can also add group-level predictors. We use the formulation:

OA-level:

$$y_{ij} = \beta_{0j} + \beta_{1}x_{ij} + e_{ij}$$

MSOA-level:

$$\beta_{0j} = \beta_{0} + \gamma_{1}m_{j} + u_{0j}$$

where $x_{ij}$ is the OA-level proportion of population suffering long-term illness and $m_{j}$ is the MSOA-level proportion of male population. We first need to create this group-level predictor:

```{r}
#| message: false
#| warning: false
# detach OA shp and attach MSOA shp
detach(oa_shp)
attach(msoa_shp)

# group-level predictor
msoa_shp$pr_male <- males/pop

# remove geometries
msoa_df <- `st_geometry<-`(msoa_shp, NULL)

# select variables
msoa_df <- msoa_df %>% dplyr::select(MSOA_CD, pop, pr_male)

# merge data sets
oa_shp <- merge(x=oa_shp, y=msoa_df, by.x = "msoa_cd", by.y="MSOA_CD")

# inspect data
head(oa_shp[1:10, c("msoa_cd", "oa_cd", "unemp", "pr_male")])
```

We can now estimate our model:

```{r}
#| message: false
detach(msoa_shp)
attach(oa_shp)

# specify a model equation
eq5 <- unemp ~ lt_ill + pr_male + (1 | msoa_cd)
model5 <- lmer(eq5, data = oa_shp)

# estimates
summary(model5)
```

This model includes the proportion of males and intercepts that vary by MSOA. The `lmer()` function only accepts predictors at the individual level, so we have included data on the proportion of male population at this level. Explore and interpret the model running the functions below:

```{r}
# fixed effects
fixef(model5)
```

```{r}
# random effects
ranef_m5 <- ranef(model5)
head(ranef_m5$msoa_cd, 5)
```

Adding group-level predictors tends to improve inferences for group coefficients. Examine the confidence intervals, in order to evalute how the precision of our estimates of the MSOA intercepts have changed. *Have confidence intervals for the intercepts of Model 4 and 5 increased or reduced?* Hint: look at how to get the confidence intervals above.

## Questions

For the second assignment, we will be using a different dataset comprising information on COVID-19 cases, census data and the Index of Multiple Deprivation (IMD) for England. The data set is similar in structured to that used in this chapter. It is hierarchically organised into 149 Upper Tier Local Authority Districts (UTLADs) within 9 Regions and has 508 variables - see Chapter \@ref(datasets) for a more detailed description of the data.

```{r}
sdf <- st_read("data/assignment_2_covid/covid19_eng.gpkg")
```

Here we see a selection of 10 variables for 5 UTLADs.

```{r}
head(sdf[1:5,c(3,4,9,10,381,385,386,387,403,406)])
```

-   `ctyua19nm`: Upper Tier Local Authority District name
-   `Region`: Region name
-   `X2020.01.31`: COVID-19 cases on January 31st 2020
-   `X2020.02.01`: COVID-19 cases on February 1st 2020
-   `IMD...Average.score`: Average IMD score for UTLADs - see [File 11: upper-tier local authority summaries](https://www.gov.uk/government/statistics/english-indices-of-deprivation-2019) for information on this and associated indicators.
-   `Residents`: Number of residents
-   `Households`: Number of households
-   `Dwellings`: Number of dwellings
-   `Age_85plus`: Number of people aged 85 and over
-   `White_British_and_Irish`: Number of white British and Irish people

Note that variable names relating to the daily COVID-19 cases are organised in the following way: `X` stands for daily COVID-19 cases, followed by the year (i.e. 2020, 2021); month (i.e. January to December); and day (i.e. 01 to 31).

Using these data, you are required to address the following challenges:

1.  Fit a varying-intercept model with no explanatory variables. Let the intercept to vary by region.

2.  Fit a varying-intercept model with including at least three explanatory variables.

3.  Compute the Variance Partition Coefficient (VPC) for the models estimated according to points 1 and 2 above.

4.  Create caterpillar plots to visualise the varying intercepts.

Analyse and discuss: 1. the extent of variation in the dependent variables at the two geographical scales (variation at which geographical scale explains most of variance in your dependent variable); 2. the varying intercept estimate(s) from your model(s) (what can they tell you about the difference between groups / areas? are they statistically significantly different?);

Ensure you appropriately describe the structure of the data and identify the various geographical scales of analysis (i.e. level-1 and level-2 units)

In addressing the challenges in this and following chapters, you have some flexibility to be creative. A set of key factors to consider:

1.  *Dependent Variable*: We will seek to explain daily COVID-19 cases, and you will need to make a decision as to:

-   *Daily vs cumulative COVID-19 cases*. Given that we will be focusing on cross-sectional models (i.e. models for one snapshot), you can focus on modelling daily cases at one specific date or cumulative daily cases over a period of time.

-   *Time period*. You can select the date or period of time that you will focus your analysis on.

-   *Use risk of COVID-19 infection*. The dependent variable should be the risk or rate of COVID-19 infection.

For example, the risk of COVID-19 infection for the period (i.e. between Dec. 1st, 2020 - January 29th, 2021) comprising the third wave of the pandemic in the United Kingdom can be computed as:

```{r}
# computing cumulative COVID cases for  01/12/2020 - 29/01/2021
sdf[, 509] <- sdf %>% dplyr::select("X2020.12.01":"X2021.01.29") %>% # select COVID cases 01/12/2020 - 29/01/2021
  mutate(cum_covid = rowSums(across(where(is.numeric)))) %>% # sum daily cases
  dplyr::select(cum_covid) %>% # select cumulative cases
   st_set_geometry(., NULL) # set geometry to NULL

# computing risk of infection
sdf <- sdf %>%  mutate(
  covid19_r = round((cum_covid / Residents ) * 1000) 
  )
```

2.  *Explanatory variables*:

-   *At least 3*. Use at least 3 explanatory variables. There is no maximum limit but consider your model to be parsimonious.

-   *Choice your set*. Select the set of variables you consider appropriate / interesting. Make sure you justify your choice based on evidence and/or theory.

-   *Percentages / Proportions*. Use percentages or proportions to capture the composition of places, rather than numbers of people, households or dwellings. For this, ensure you are using the appropriate denominator.

For instance, if you want to capture the relationship between cumulative COVID-19 cases and overcrowding, share of elderly population and nonwhite minorities, use the following variables

```{r}
sdf <- sdf %>% mutate(
  crowded_hou = Crowded_housing / Households, # share of crowded housing
  elderly = (Age_85plus) / Residents, # share of population aged 65+
  ethnic = (Mixed + Indian + Pakistani + Bangladeshi + Chinese + Other_Asian + Black + Other_ethnicity) / Residents, # share of nonwhite population
)
```

**ADVICE**: Create a new spatial data frame including only the variables you will analyse. For example:

```{r}
nsdf <- sdf  %>%  dplyr::select(objectid, 
                         ctyua19cd, 
                         ctyua19nm, 
                         Region, 
                         covid19_r, 
                         crowded_hou, 
                         elderly, 
                         ethnic, 
                         Residents)
```