Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linear reg and broom content and exercises #18

Merged
merged 38 commits into from
Sep 1, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
38 commits
Select commit Hold shift + click to select a range
6e4f64f
Create 24-linear-reg-broom.Rmd
quirksahern Aug 8, 2024
069a01a
Update 24-linear-reg-broom.Rmd episode info
quirksahern Aug 8, 2024
2baca5b
Update 24-linear-reg-broom.Rmd added data import
quirksahern Aug 14, 2024
8b7a2d2
Update 24-linear-reg-broom.Rmd
quirksahern Aug 14, 2024
5e56ebb
Update 24-linear-reg-broom.Rmd simple linear regression
quirksahern Aug 14, 2024
10ec4f5
Update 24-linear-reg-broom.Rmd predicted values and residuals
quirksahern Aug 14, 2024
e451c63
Update 24-linear-reg-broom.Rmd robust regression
quirksahern Aug 14, 2024
9c148e5
Update 24-linear-reg-broom.Rmd cat variable
quirksahern Aug 14, 2024
6bedd4d
Update 24-linear-reg-broom.Rmd logit to log transform
quirksahern Aug 14, 2024
2ed6268
Update 24-linear-reg-broom.Rmd added cat with int terms
quirksahern Aug 19, 2024
25d2589
Update 24-linear-reg-broom.Rmd assumptions text
quirksahern Aug 19, 2024
fa3644b
Update 24-linear-reg-broom.Rmd moving diagnostics
quirksahern Aug 19, 2024
26a3ba3
Create 25-linreg-diagnostics content moved to new file
quirksahern Aug 19, 2024
0bb678a
Update 25-linreg-diagnostics placeholder intro text added
quirksahern Aug 19, 2024
d1ad33a
Update 25-linreg-diagnostics examples complete
quirksahern Aug 19, 2024
c59f6c3
Update 24-linear-reg-broom.Rmd updated contents and questions
quirksahern Aug 19, 2024
78c012e
Update 24-linear-reg-broom.Rmd initial broom examples added
quirksahern Aug 19, 2024
16b5d11
Update 24-linear-reg-broom.Rmd broom content finalised
quirksahern Aug 19, 2024
2a3fd10
Update 24-linear-reg-broom.Rmd exercises added
quirksahern Aug 20, 2024
d919cdc
Update 25-linreg-diagnostics exercise added
quirksahern Aug 20, 2024
f48cb22
Update 24-linear-reg-broom.Rmd Log transform added
quirksahern Aug 20, 2024
2c3a896
Merge branch 'main' into linear-reg-broom
milanmlft Aug 23, 2024
177fbfa
Update renv
milanmlft Aug 23, 2024
053ad1b
Add file extension to episode 25
milanmlft Aug 23, 2024
28a4b6f
Add linreg episodes to config
milanmlft Aug 23, 2024
52981ff
Formatting: whitespace, code style, etc.
milanmlft Aug 23, 2024
91b0fb3
Fix questions callout
milanmlft Aug 23, 2024
4f995d5
Add missing div close tags
milanmlft Aug 23, 2024
c781738
Fix `objectives` callout
milanmlft Aug 23, 2024
efa533a
Timings should be just integer numbers
milanmlft Aug 23, 2024
ff3c8f8
More code formatting, avoid long lines
milanmlft Aug 23, 2024
f5fc7f5
Update renv; add ep 25 dependencies
milanmlft Aug 23, 2024
35c895d
Fix data path
milanmlft Aug 23, 2024
66f38cf
Update episodes/24-linear-reg-broom.Rmd
quirksahern Aug 23, 2024
84d7cbf
Update 24-linear-reg-broom.Rmd chisq
quirksahern Aug 23, 2024
d86dabf
Update 25-linreg-diagnostics.Rmd line 147
quirksahern Aug 23, 2024
17d47b3
Merge branch 'main' into linear-reg-broom
quirksahern Aug 29, 2024
17a3395
Merge branch 'main' into linear-reg-broom
milanmlft Aug 30, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 2 additions & 3 deletions config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -87,9 +87,8 @@ episodes:
# basic stats
- 22-quarto.md
- 23-statistics.Rmd
# 23-regression
# 24-logregression
# 25-broom
- 24-linear-reg-broom.Rmd
- 25-linreg-diagnostics.Rmd
# 26-r-sql
- 27-wrap-up.Rmd

Expand Down
5 changes: 2 additions & 3 deletions episodes/23-statistics.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -277,9 +277,8 @@ Is the difference between the income ranks statistically significant?
## Doing a t-test

```{r}

t.test(health_london_rank ~ city, data = lon_dims_imd_2019)$statistic
t.test(health_london_rank ~ city, data = lon_dims_imd_2019)$parameter
t.test(health_london_rank ~ city, data = lon_dims_imd_2019)$statistic
t.test(health_london_rank ~ city, data = lon_dims_imd_2019)$parameter
```

Notice that the summary()** of the test contains more data than is output by default.
Expand Down
340 changes: 340 additions & 0 deletions episodes/24-linear-reg-broom.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,340 @@
---
title: "Linear Regression and Broom"
teaching: 80
exercises: 20
source: Rmd
---

::::::::::::::::::::::::::::::::::::::: objectives

- To be able to explore relationships between variables
- To be able to calculate predicted variables and residuals
- To be able to construct linear regression models
- To be able to present model outcomes using Broom

::::::::::::::::::::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::: questions

- How can I explore relationships between variables in my data?
- How can I present model outputs in an easier to read way?

::::::::::::::::::::::::::::::::::::::::::::::::::

## Content

- Linear Regression Models
- Use of Log transform
- Use of Categorical Variables
- Use of Broom

## Data

```{r libraries, message=FALSE, warning=FALSE}
# We will need these libraries and this data later.
library(ggplot2)
library(tidyverse)
library(lmtest)
library(sandwich)
library(broom)

lon_dims_imd_2019 <- read.csv("data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv")
```

We are going to use the data from the Consumer Data Research Centre, specifically the London IMD 2019 (English IMD 2019 Domains rebased).

Atribution: Data provided by the Consumer Data Research Centre, an ESRC Data Investment: ES/L011840/1, ES/L011891/1

The statistical unit areas across the country are Lower layer Super Output Areas (LSOAs). We will explore the relationships between the different dimensions of the Indices of Multiple Deprivation.

## Linear Regression

Linear Regression enables use to to explore the the linear relationship of the dependent variable Y and independent variable(s) X(s).
We are going to explore the linear relationship between the Health Deprivation and Disability Domain and the Living Environment Deprivation Domain.

The Health Deprivation and Disability Domain measures the risk of premature death and the impairment of quality of life through poor physical or mental health. The domain measures morbidity, disability and premature mortality but not aspects of behaviour or environment that may be predictive of future health deprivation.

The Living Environment Deprivation Domain measures the quality of the local environment. The indicators fall into two sub-domains. The ‘indoors’ living environment measures the quality of housing; while the ‘outdoors’ living environment contains measures of air quality and road traffic accidents.

Reference:
McLennan, David et al. The English Indices of Deprivation 2019 : Technical Report. Ministry of Housing, Communities and Local Government, 2019. Print.


### Simple Linear Regression
In the simple linear regression example we have only one dependent variable (health_london_rank) and one independent variable (livingEnv_london_rank).

```{r}
reg_LivEnv_health <- lm(health_london_rank ~ livingEnv_london_rank, data = lon_dims_imd_2019)
# We put the dependent variable to the left of the '~' and the independent variable(s) to the right
# and we tell R which dataset we are referring to.

summary(reg_LivEnv_health)
```

From the result of this analysis, we can see that the Living Environment Deprivation Domain rank has a significant(small p-value, general rule of thumb <0.05) and positive relationship(positive coefficient) with the Health Deprivation and Disability Domain rank.

One way of interpreting the result is: One unit increase in the Living Environment rank is related to around 0.343 (3.430e-01) points increase of the Health Deprivation and Disability rank.

R-square shows the amount of variance of Y explained by X. In this case the Living Environment rank explains 6.225% of the variance in the Health Deprivation and Disability rank. Adj R2(6.205%) shows the same as R2 but adjusted by the # of cases and # of variables. When the # of variables is small and the # of cases is very large then Adj R2 is closer to R2.

### Log transform

If your data is skwewed, it can be useful to transform a variable to it's log form when doing the regression.
You can either transform the variable beforehand or do so in the equation.

```{r}
reg_logbarriers_health <- lm(health_london_rank ~ log(barriers_london_rank), data = lon_dims_imd_2019)

summary(reg_logbarriers_health)
```
The interpretation of the log-transformed variable is a bit different.
In this example only the predictor variable is log tranformed, therefore to interpret the slope coefficient we divide it by 100 (2917.0/100=29.170).

If the dependent/response variable is solely log-transformed. Exponentiate the coefficient. This gives the multiplicative factor for every one-unit increase in the independent variable. Example: the coefficient is 0.198. exp(0.198) = 1.218962. For every one-unit increase in the independent variable, our dependent variable increases by a factor of about 1.22, or 22%. Recall that multiplying a number by 1.22 is the same as increasing the number by 22%. Likewise, multiplying a number by, say 0.84, is the same as decreasing the number by 1 – 0.84 = 0.16, or 16%.

If both are transformed. nterpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable. Example: the coefficient is 0.198. For every 1% increase in the independent variable, our dependent variable increases by about 0.20%. For x percent increase, calculate 1.x to the power of the coefficient, subtract 1, and multiply by 100. Example: For every 20% increase in the independent variable, our dependent variable increases by about (1.20 0.198 - 1) * 100 = 3.7 percent.

### Predicted values and Residuals

We can expand our simple linear regression example to incorporate the Barriers to Housing and Services Domain rank.
The Barriers to Housing and Services Domain measures the physical and financial accessibility of housing and local services. The indicators fall into two sub-domains: ‘geographical barriers’, which relate to the physical proximity of local services, and ‘wider barriers’ which includes issues relating to access to housing, such as affordability.

```{r}
reg_LivEnv_barriers_health <- lm(
health_london_rank ~ livingEnv_london_rank + barriers_london_rank,
data = lon_dims_imd_2019
)

summary(reg_LivEnv_barriers_health)
```

After running the regression model, we can access the model predicted values and the residuals compared to the real observations.

```{r}
# first we fit the predictions
health_rank_pred <- fitted(reg_LivEnv_barriers_health)
health_rank_pred <- as.data.frame(health_rank_pred)

# now we add the residual values too
health_rank_resid <- residuals(reg_LivEnv_barriers_health)
health_rank_pred$resid <- health_rank_resid

# We can thenview the predictions and residuals
head(health_rank_pred)
```

```{r, eval=FALSE}
# You can view the full data in RStudio with the View() function
View(health_rank_pred)
```
quirksahern marked this conversation as resolved.
Show resolved Hide resolved

### Robust Regression

We can run the robust standard error regressions(control for heteroskedasticity, meaning unequal variances):

```{r}
reg_LivEnv_barriers_health$robse <- vcovHC(reg_LivEnv_barriers_health, type = "HC1")
coeftest(reg_LivEnv_barriers_health, reg_LivEnv_barriers_health$robse)
```

In addition, we can access the cluster-robust standard errors regression results:

```{r}
# cluster-robust standard errors
coeftest(reg_LivEnv_barriers_health, reg_LivEnv_barriers_health$clse)
```

::::::::::::::::::::::::::::::::::::::: challenge

## Challenge 1

Use the `gapminder` data to create a linear model between two continuous variables.

Discuss your question and your findings.

::::::::::::::::::::::::::::::::::::::::::::::::::

### Regression with Categorical independent variables

We will explore the use of categorical independent variables in linear regression in this episode.
When the dependent variable is a categorical variable, you may consider the alternatives of linear regression like logit regression and multinomial regression.

```{r}
# As a categorical variable we have added la19nm, these are the names of the London boroughs
reg_cat_var <- lm(health_london_rank ~ livingEnv_london_rank + barriers_london_rank + la19nm, data = lon_dims_imd_2019)

summary(reg_cat_var)
```

R automatically recognizes la19nm as a factor and treats it accordingly.
The missing one in the coefficient summary (Barking and Dagenham) is treated as a base line, therefore the value is 0.
However, we can also modify our model to show for all:

```{r}
reg_cat_var_showall <- lm(
health_london_rank ~ 0 + livingEnv_london_rank + barriers_london_rank + la19nm,
data = lon_dims_imd_2019
)

summary(reg_cat_var_showall)
```
### Categorical variables with interaction terms

Sometimes we are interested in how a variable interacts with another variable.
We can explore any interactions between locations (la19nm) and the living environment and barrier ranks.

```{r}
reg_cat_var_int <- lm(health_london_rank ~ la19nm * (livingEnv_london_rank + barriers_london_rank), data = lon_dims_imd_2019)

summary(reg_cat_var_int)
```

::::::::::::::::::::::::::::::::::::::: challenge

## Challenge 2

Using the `gapminder` data to create a linear model between a categorical and a continuous variable .

Discuss your question and your findings.

::::::::::::::::::::::::::::::::::::::::::::::::::

## Broom

The 'broom' package offers an alternative way of presenting the output of statistical analysis.
It centers around three S3 methods, each of which take common objects produced by R statistical functions (lm, t.test, nls, etc) and convert them into a tibble.

These are:
* tidy: constructs a tibble that summarizes the model’s statistical findings. This includes coefficients and p-values for each term in a regression, per-cluster information in clustering applications, or per-test information for multtest functions.
* augment: add columns to the original data that was modeled. This includes predictions, residuals, and cluster assignments.
* glance: construct a concise one-row summary of the model. This typically contains values such as R^2, adjusted R^2, and residual standard error that are computed once for the entire model.


Let's revisit our initial linear model:
```{r}
reg_LivEnv_health <- lm(health_london_rank ~ livingEnv_london_rank, data = lon_dims_imd_2019)

summary(reg_LivEnv_health)
```

There is a lot of useful information, but it not available in a way so that you can combine it with other models or do further analysis.
We can convert this to tabular data using the 'tidy' function.

```{r}
tidy(reg_LivEnv_health)
```
The row names have been moved into a column called term, and the column names are simple and consistent (and can be accessed using $).

Information about the model can be explored with 'augment'. The function augments the original data with information from the model,
such as the fitted values and residuals for each of the original points in the regression.


```{r}
augment(reg_LivEnv_health)
```

Some of the data presented by 'augment' will be discussed in the episode Linear Regression Diagnostics.

Summary statistics are computed for the entire regression, such as R^2 and the F-statistic can be accessed with the 'glance' function:


```{r}
glance(reg_LivEnv_health)
```
### Generalised linear models

We can also use the 'broom' functions to present data from Generalised linear and non-linear models.
For example, if we wanted to explore the Income Rank in relation to whether or not an area was within the City of London.

```{r}
# add a variable to indicate whether or not an area is within the City of London
lon_dims_imd_2019 <- lon_dims_imd_2019 %>% mutate(city = la19nm == "City of London")

# create a Generalised Linear Model
glmlondims <- glm(city ~ Income_london_rank, lon_dims_imd_2019, family = "binomial")
summary(glmlondims)
```

Use of 'tidy':
```{r}
tidy(glmlondims)
```

Use of 'augment':
```{r}
augment(glmlondims)
```

Use of 'glance':
```{r}
glance(glmlondims)
```
You will notice that the statistics computed by 'glance' are different for glm objects than for lm (e.g. deviance rather than R^2).

### Hypothesis Testing

The tidy function can also be applied a range of hypotheses tests, such as built-in functions like t.test, cor.test, and wilcox.test.

### t-test
```{r}
tt <- t.test(Income_london_rank ~ city, lon_dims_imd_2019)
tidy(tt)
```
Some cases might have fewer columns (for example, no confidence interval).

Wilcox test:
```{r}
wt <- wilcox.test(Income_london_rank ~ city, lon_dims_imd_2019)
tidy(wt)
```

Since the 'tidy' output is already only one row, glance returns the same output:
```{r}
# t-test
glance(tt)

# Wilcox test
glance(wt)
```
The chisq.test enables use to investigate whether changes in one categorical variable are related to changes in another categorical variable.

The 'augment' method is defined only for chi-squared tests, since there is no meaningful sense, for other tests,
in which a hypothesis test produces output about each initial data point.

```{r}
# convert IDAOP_london_decile to a factor so it is not interprested as continuous data
lon_dims_imd_2019$IDAOP_london_decile <- factor(lon_dims_imd_2019$IDAOP_london_decile)

# xtabs creates a frequency table of IMD deciles within London borooughs
chit <- chisq.test(xtabs(~ la19nm + IDAOP_london_decile, data = lon_dims_imd_2019))
quirksahern marked this conversation as resolved.
Show resolved Hide resolved

tidy(chit)
```

```{r}
augment(chit)
```
There are a number of underlying assumptions of a Chi-Square test, these are:
* Independence: The Chi-Square test assumes that the observations in the data are independent of each other. This means that the outcome of one observation should not influence the outcome of another.
* Random Sampling: The data should be obtained through random sampling to ensure that it is representative of the population from which it was drawn.
* Expected Frequency: The Chi-Square test assumes that the expected frequency count for each cell in the contingency table should be at least 5. If this assumption is not met, the test results may not be reliable.

As we have received a warning about the reliability of our test, it is likely that one of these assumptions has not been met,
and that this is not a suitable test for this data.


::::::::::::::::::::::::::::::::::::::: challenge

## Challenge 3

Use broom to amend the display of your model outputs.

Which function(s) did you use and why?

::::::::::::::::::::::::::::::::::::::::::::::::::

### Conventions
There are some conventions that enable consistency across the broom functions, these are:
* The output of the tidy, augment and glance functions is always a tibble.
* The output never has rownames. This ensures that you can combine it with other tidy outputs without fear of losing information (since rownames in R cannot contain duplicates).
* Some column names are kept consistent, so that they can be combined across different models and so that you know what to expect (in contrast to asking “is it pval or PValue?” every time).
Loading
Loading