From 6e4f64face245344684cb1305198ee0b5e9da327 Mon Sep 17 00:00:00 2001 From: quirksahern Date: Thu, 8 Aug 2024 15:36:23 +0100 Subject: [PATCH 01/35] Create 24-linear-reg-broom.Rmd create file for new episode --- episodes/24-linear-reg-broom.Rmd | 1 + 1 file changed, 1 insertion(+) create mode 100644 episodes/24-linear-reg-broom.Rmd diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/episodes/24-linear-reg-broom.Rmd @@ -0,0 +1 @@ + From 069a01a5c58b6041ae4d3eb652381d595d412039 Mon Sep 17 00:00:00 2001 From: quirksahern Date: Thu, 8 Aug 2024 16:19:00 +0100 Subject: [PATCH 02/35] Update 24-linear-reg-broom.Rmd episode info objectives, questions and content items added --- episodes/24-linear-reg-broom.Rmd | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index 8b1378917..9884d7308 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -1 +1,33 @@ +--- +title: "Linear Regression and Broom" +teaching: 60 +exercises: 20 +source: Rmd +--- +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to explore relationships between variables +- To be able to calculate predicted variables and residuals +- To be able to construct linear regression models +- To be able to check assumptions for linear regressiona models +- To be able to present model outcomes using Broom + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I explore relationships between variables in my data? +- How can I check that my data is suitable for use in a linear regression model? +- How can I present model outputs in an easier to read way? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Content + +- Linear Regression Models +- Use of Logit +- Assumption Diagnostics and Regression Troubleshooting +- Use of Broom + +## Data From 2baca5b04f5b9ed1ebdbca5409817b6177b859a4 Mon Sep 17 00:00:00 2001 From: quirksahern Date: Wed, 14 Aug 2024 11:03:47 +0100 Subject: [PATCH 03/35] Update 24-linear-reg-broom.Rmd added data import --- episodes/24-linear-reg-broom.Rmd | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index 9884d7308..685034db7 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -31,3 +31,19 @@ source: Rmd - Use of Broom ## Data + +'''R +lon_dims_imd_2019 <- read.csv(".data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv") +ahah_2022 <- read.csv('data/AHAH_V3_0.csv') + +#add London Dims to AHA data +combinedLondonDims <- merge(ahah_2022,lon_dims_imd_2019,by.x="lsoa19",by.y="ls11cd",all.y + +''' + +We are going to use the data from the Consumer Data Research Centre, specifically the London IMD 2019 (English IMD 2019 Domains rebased) and the Access to Healthy Assets & Hazards (AHAH) 2022 data. + +Atribution: Data provided by the Consumer Data Research Centre, an ESRC Data Investment: ES/L011840/1, ES/L011891/1 + +The statistical unit areas across the country are Lower layer Super Output Areas (LSOAs). We have combined the data using a right join to focus on London. + From 8b7a2d2b72932256b34601c6cc029efb4152339b Mon Sep 17 00:00:00 2001 From: quirksahern Date: Wed, 14 Aug 2024 11:11:57 +0100 Subject: [PATCH 04/35] Update 24-linear-reg-broom.Rmd --- episodes/24-linear-reg-broom.Rmd | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index 685034db7..3c10d87e7 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -32,18 +32,18 @@ source: Rmd ## Data -'''R -lon_dims_imd_2019 <- read.csv(".data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv") -ahah_2022 <- read.csv('data/AHAH_V3_0.csv') +```{r libraries, message=FALSE, warning=FALSE} + +# We will need this library and this data later. +library(ggplot2) -#add London Dims to AHA data -combinedLondonDims <- merge(ahah_2022,lon_dims_imd_2019,by.x="lsoa19",by.y="ls11cd",all.y +lon_dims_imd_2019 <- read.csv(".data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv") -''' +``` -We are going to use the data from the Consumer Data Research Centre, specifically the London IMD 2019 (English IMD 2019 Domains rebased) and the Access to Healthy Assets & Hazards (AHAH) 2022 data. +We are going to use the data from the Consumer Data Research Centre, specifically the London IMD 2019 (English IMD 2019 Domains rebased). Atribution: Data provided by the Consumer Data Research Centre, an ESRC Data Investment: ES/L011840/1, ES/L011891/1 -The statistical unit areas across the country are Lower layer Super Output Areas (LSOAs). We have combined the data using a right join to focus on London. +The statistical unit areas across the country are Lower layer Super Output Areas (LSOAs). We will explore the relationships between the different dimensions of the Indices of Multiple Deprivation. From 5e56ebb9ced90d032db86a47cade1e623922a387 Mon Sep 17 00:00:00 2001 From: quirksahern Date: Wed, 14 Aug 2024 12:14:43 +0100 Subject: [PATCH 05/35] Update 24-linear-reg-broom.Rmd simple linear regression --- episodes/24-linear-reg-broom.Rmd | 31 +++++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index 3c10d87e7..649ce4182 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -47,3 +47,34 @@ Atribution: Data provided by the Consumer Data Research Centre, an ESRC Data Inv The statistical unit areas across the country are Lower layer Super Output Areas (LSOAs). We will explore the relationships between the different dimensions of the Indices of Multiple Deprivation. +## Linear Regression + +Linear Regression enables use to to explore the the linear relationship of the dependent variable Y and independent variable(s) X(s). +We are going to explore the linear relationship between the Health Deprivation and Disability Domain and the Living Environment Deprivation Domain. + +The Health Deprivation and Disability Domain measures the risk of premature death and the impairment of quality of life through poor physical or mental health. The domain measures morbidity, disability and premature mortality but not aspects of behaviour or environment that may be predictive of future health deprivation. + +The Living Environment Deprivation Domain measures the quality of the local environment. The indicators fall into two sub-domains. The ‘indoors’ living environment measures the quality of housing; while the ‘outdoors’ living environment contains measures of air quality and road traffic accidents. + +Reference: + McLennan, David et al. The English Indices of Deprivation 2019 : Technical Report. Ministry of Housing, Communities and Local Government, 2019. Print. + + +### Simple Linear Regression +In the simple linear regression example we have only one dependent variable (health_london_rank) and one independent variable (livingEnv_london_rank). + +```{r} + +reg_LivEnv_health <- lm(health_london_rank ~ livingEnv_london_rank, data = lon_dims_imd_2019) +# We put the dependent variable to the left of the '~' and the independent variable(s) to the right +# and we tell R which dataset we are referring to. + +summary(reg_LivEnv_health) + +``` + +From the result of this analysis, we can see that the Living Environment Deprivation Domain rank has a significant(small p-value, general rule of thumb <0.05) and positive relationship(positive coefficient) with the Health Deprivation and Disability Domain rank. + +One way of interpreting the result is: One unit increase in the Living Environment rank is related to around 0.343 (3.430e-01) points increase of the Health Deprivation and Disability rank. + +R-square shows the amount of variance of Y explained by X. In this case the Living Environment rank explains 6.225% of the variance in the Health Deprivation and Disability rank. Adj R2(6.205%) shows the same as R2 but adjusted by the # of cases and # of variables. When the # of variables is small and the # of cases is very large then Adj R2 is closer to R2. From 10ec4f5869dd57f2853e6da1b9b2f14b956aae67 Mon Sep 17 00:00:00 2001 From: quirksahern Date: Wed, 14 Aug 2024 13:05:49 +0100 Subject: [PATCH 06/35] Update 24-linear-reg-broom.Rmd predicted values and residuals --- episodes/24-linear-reg-broom.Rmd | 35 ++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index 649ce4182..dbe044e4f 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -78,3 +78,38 @@ From the result of this analysis, we can see that the Living Environment Depriva One way of interpreting the result is: One unit increase in the Living Environment rank is related to around 0.343 (3.430e-01) points increase of the Health Deprivation and Disability rank. R-square shows the amount of variance of Y explained by X. In this case the Living Environment rank explains 6.225% of the variance in the Health Deprivation and Disability rank. Adj R2(6.205%) shows the same as R2 but adjusted by the # of cases and # of variables. When the # of variables is small and the # of cases is very large then Adj R2 is closer to R2. + +### Logit + +Don't know what to put here yet + +### Predicted values and Residuals + +We can expand our simple linear regression example to incorporate the Barriers to Housing and Services Domain rank. +The Barriers to Housing and Services Domain measures the physical and financial accessibility of housing and local services. The indicators fall into two sub-domains: ‘geographical barriers’, which relate to the physical proximity of local services, and ‘wider barriers’ which includes issues relating to access to housing, such as affordability. + +```{r} + +reg_LivEnv_barriers_health <- lm(health_london_rank ~ livingEnv_london_rank + barriers_london_rank, data = lon_dims_imd_2019) + +summary(reg_LivEnv_barriers_health) + +``` + +After running the regression model, we can access the model predicted values and the residuals compared to the real observations. + +```{r} +#first we fit the predictions +health_rank_pred<-fitted(reg_LivEnv_barriers_health) +health_rank_pred<-as.data.frame(health_rank_pred) + +#now we add the residual values too +health_rank_resid<-residuals(reg_LivEnv_barriers_health) +health_rank_pred$resid<-health_rank_resid + +#We can thenview the predictions and residuals +View(health_rank_pred) +``` + +### Robust Regression + From e451c6387624cd5c3061f17651d93d4224565d7f Mon Sep 17 00:00:00 2001 From: quirksahern Date: Wed, 14 Aug 2024 15:00:29 +0100 Subject: [PATCH 07/35] Update 24-linear-reg-broom.Rmd robust regression --- episodes/24-linear-reg-broom.Rmd | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index dbe044e4f..b7112d342 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -34,8 +34,10 @@ source: Rmd ```{r libraries, message=FALSE, warning=FALSE} -# We will need this library and this data later. +# We will need these libraries and this data later. library(ggplot2) +library(lmtest) +library(sandwich) lon_dims_imd_2019 <- read.csv(".data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv") @@ -113,3 +115,17 @@ View(health_rank_pred) ### Robust Regression +We can run the robust standard error regressions(control for heteroskedasticity, meaning unequal variances): + +```{r} +reg_LivEnv_barriers_health$robse <- vcovHC(reg_LivEnv_barriers_health, type="HC1") +coeftest(reg_LivEnv_barriers_health,reg_LivEnv_barriers_health$robse) +``` + +In addition, we can access the cluster-robust standard errors regression results: + +```{r} +#cluster-robust standard errors +coeftest(reg_LivEnv_barriers_health, reg_LivEnv_barriers_health$clse) +``` + From 9c148e544afe0df2dd1b89a17498794041b25ebb Mon Sep 17 00:00:00 2001 From: quirksahern Date: Wed, 14 Aug 2024 16:33:26 +0100 Subject: [PATCH 08/35] Update 24-linear-reg-broom.Rmd cat variable --- episodes/24-linear-reg-broom.Rmd | 24 +++++++++++++++++++++++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index b7112d342..97cd4c2a1 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -81,7 +81,7 @@ One way of interpreting the result is: One unit increase in the Living Environme R-square shows the amount of variance of Y explained by X. In this case the Living Environment rank explains 6.225% of the variance in the Health Deprivation and Disability rank. Adj R2(6.205%) shows the same as R2 but adjusted by the # of cases and # of variables. When the # of variables is small and the # of cases is very large then Adj R2 is closer to R2. -### Logit +### Log transform Don't know what to put here yet @@ -129,3 +129,25 @@ In addition, we can access the cluster-robust standard errors regression result coeftest(reg_LivEnv_barriers_health, reg_LivEnv_barriers_health$clse) ``` +### Regression with Categorical independent variables + +We will explore the use of categorical independent variables in linear regression in this episode. +When the dependent variable is a categorical variable, you may consider the alternatives of linear regression like logit regression and multinomial regression. + +```{r} +# As a categorical variable we have added la19nm, these are the names of the London boroughs +reg_cat_var <- lm(health_london_rank ~ livingEnv_london_rank + barriers_london_rank + la19nm, data = lon_dims_imd_2019) + +summary(reg_cat_var) +``` + +R automatically recognizes la19nm as a factor and treats it accordingly. +The missing one in the coefficient summary (Barking and Dagenham) is treated as a base line, therefore the value is 0. +However, we can also modify our model to show for all: + +```{r} +reg_cat_var_showall <- lm(health_london_rank ~ 0 + livingEnv_london_rank + barriers_london_rank + la19nm, data = lon_dims_imd_2019) + +summary(reg_cat_var_showall) + +``` From 6bedd4d9c0cc4d793105a0984ab1ba3fd0b65483 Mon Sep 17 00:00:00 2001 From: quirksahern Date: Wed, 14 Aug 2024 16:34:27 +0100 Subject: [PATCH 09/35] Update 24-linear-reg-broom.Rmd logit to log transform --- episodes/24-linear-reg-broom.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index 97cd4c2a1..98868b197 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -26,7 +26,7 @@ source: Rmd ## Content - Linear Regression Models -- Use of Logit +- Use of Log transform - Assumption Diagnostics and Regression Troubleshooting - Use of Broom From 2ed62683e9a49a17a4fd4cf66619e8cca93deebf Mon Sep 17 00:00:00 2001 From: quirksahern Date: Mon, 19 Aug 2024 10:09:18 +0100 Subject: [PATCH 10/35] Update 24-linear-reg-broom.Rmd added cat with int terms Example of use of ategorical variable and interaction terms added --- episodes/24-linear-reg-broom.Rmd | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index 98868b197..afd293d3c 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -151,3 +151,15 @@ reg_cat_var_showall <- lm(health_london_rank ~ 0 + livingEnv_london_rank + barri summary(reg_cat_var_showall) ``` +### Categorical variables with interaction terms + +Sometimes we are interested in how a variable interacts with another variable. +We can explore any interactions between locations (la19nm) and the living environment and barrier ranks. + +```{r} +reg_cat_var_int <- lm(health_london_rank ~ la19nm*(livingEnv_london_rank + barriers_london_rank), data = lon_dims_imd_2019) + +summary(reg_cat_var_int) +``` + +## Assumption Diagnostics and Regression Trouble Shooting From 25d25897ab547e2892cb0085b5c025dd778bacc0 Mon Sep 17 00:00:00 2001 From: quirksahern Date: Mon, 19 Aug 2024 10:23:41 +0100 Subject: [PATCH 11/35] Update 24-linear-reg-broom.Rmd assumptions text --- episodes/24-linear-reg-broom.Rmd | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index afd293d3c..0bc80bf21 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -1,6 +1,6 @@ --- title: "Linear Regression and Broom" -teaching: 60 +teaching: 80 exercises: 20 source: Rmd --- @@ -163,3 +163,16 @@ summary(reg_cat_var_int) ``` ## Assumption Diagnostics and Regression Trouble Shooting + +For accurate model interpretation and prediction, there are a number of assumptions about linear regression models that need to be verified. +These are: +* Linear Relationship: The core premise of multiple linear regression is the existence of a linear relationship between the dependent (outcome) variable and the independent variables. T +* Multivariate Normality: The analysis assumes that the residuals (the differences between observed and predicted values) are normally distributed. This assumption can be assessed by examining histograms or Q-Q plots of the residuals, or through statistical tests such as the Kolmogorov-Smirnov test. +* No Multicollinearity: It is essential that the independent variables are not too highly correlated with each other, a condition known as multicollinearity. This can be checked using: + + Correlation matrices, where correlation coefficients should ideally be below 0.80. + + Variance Inflation Factor (VIF), with VIF values above 10 indicating problematic multicollinearity. +* Homoscedasticity: The variance of error terms (residuals) should be consistent across all levels of the independent variables. + +We can perform a number of tests to see if the assumptions are met. + +### Normality Check From fa3644bc2866e22e6721077cd4bdfdd74a43ac35 Mon Sep 17 00:00:00 2001 From: quirksahern Date: Mon, 19 Aug 2024 11:53:02 +0100 Subject: [PATCH 12/35] Update 24-linear-reg-broom.Rmd moving diagnostics --- episodes/24-linear-reg-broom.Rmd | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index 0bc80bf21..abf63d8cc 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -162,17 +162,3 @@ reg_cat_var_int <- lm(health_london_rank ~ la19nm*(livingEnv_london_rank + barri summary(reg_cat_var_int) ``` -## Assumption Diagnostics and Regression Trouble Shooting - -For accurate model interpretation and prediction, there are a number of assumptions about linear regression models that need to be verified. -These are: -* Linear Relationship: The core premise of multiple linear regression is the existence of a linear relationship between the dependent (outcome) variable and the independent variables. T -* Multivariate Normality: The analysis assumes that the residuals (the differences between observed and predicted values) are normally distributed. This assumption can be assessed by examining histograms or Q-Q plots of the residuals, or through statistical tests such as the Kolmogorov-Smirnov test. -* No Multicollinearity: It is essential that the independent variables are not too highly correlated with each other, a condition known as multicollinearity. This can be checked using: - + Correlation matrices, where correlation coefficients should ideally be below 0.80. - + Variance Inflation Factor (VIF), with VIF values above 10 indicating problematic multicollinearity. -* Homoscedasticity: The variance of error terms (residuals) should be consistent across all levels of the independent variables. - -We can perform a number of tests to see if the assumptions are met. - -### Normality Check From 26a3ba3d26dca66272f5ac4fda0c9e92a26c3daf Mon Sep 17 00:00:00 2001 From: quirksahern Date: Mon, 19 Aug 2024 11:54:18 +0100 Subject: [PATCH 13/35] Create 25-linreg-diagnostics content moved to new file --- episodes/25-linreg-diagnostics | 72 ++++++++++++++++++++++++++++++++++ 1 file changed, 72 insertions(+) create mode 100644 episodes/25-linreg-diagnostics diff --git a/episodes/25-linreg-diagnostics b/episodes/25-linreg-diagnostics new file mode 100644 index 000000000..a7d842741 --- /dev/null +++ b/episodes/25-linreg-diagnostics @@ -0,0 +1,72 @@ + + + + +## Assumption Diagnostics and Regression Trouble Shooting + +For accurate model interpretation and prediction, there are a number of assumptions about linear regression models that need to be verified. +These are: +* Linear Relationship: The core premise of multiple linear regression is the existence of a linear relationship between the dependent (outcome) variable and the independent variables. T +* Multivariate Normality: The analysis assumes that the residuals (the differences between observed and predicted values) are normally distributed. This assumption can be assessed by examining histograms or Q-Q plots of the residuals, or through statistical tests such as the Kolmogorov-Smirnov test. +* No Multicollinearity: It is essential that the independent variables are not too highly correlated with each other, a condition known as multicollinearity. This can be checked using: + + Correlation matrices, where correlation coefficients should ideally be below 0.80. + + Variance Inflation Factor (VIF), with VIF values above 10 indicating problematic multicollinearity. +* Homoscedasticity: The variance of error terms (residuals) should be consistent across all levels of the independent variables. + +We can perform a number of tests to see if the assumptions are met. + +We will first undertake some residual diagnostics using the package 'olsrr'. + +### Residual Diagnostics + +#### Residual QQ Plot +Plot for detecting violation of normality assumption. +```{r} +library(olsrr) + +model1 <- lm(health_london_rank ~ livingEnv_london_rank + barriers_london_rank + la19nm, data = lon_dims_imd_2019) + +ols_plot_resid_qq(model1) +``` + +#### Residual Normality Test +Test for detecting violation of normality assumption. +```{r} + +ols_test_normality(model1) + +``` + +Correlation between observed residuals and expected residuals under normality. +```{r} + +ols_test_correlation(model1) + +``` +From these tests we can see that our assumptions are seemingly correct. + +#### Residual vs Fitted Values Plot +We can create a scatter plot of residuals on the y axis and fitted values on the x axis to detect non-linearity, unequal error variances, and outliers. + +Characteristics of a well behaved residual vs fitted plot: +* The residuals spread randomly around the 0 line indicating that the relationship is linear. +* The residuals form an approximate horizontal band around the 0 line indicating homogeneity of error variance. +* No one residual is visibly away from the random pattern of the residuals indicating that there are no outliers. + +```{r} + +ols_plot_resid_fit(model1) + +``` + +#### Residual Histogram +Additionally, we can create a histogram of residuals for detecting violation of normality assumption. + +```{r} + +ols_plot_resid_hist(model1) + +``` + +### Additional Diagnostics +In addition the residual diagnostics, we can also assess our model for Heteroskedasticity and Multicollinearity From 0bb678a1ebb2ce0000901ac0dcebf7285d91152a Mon Sep 17 00:00:00 2001 From: quirksahern Date: Mon, 19 Aug 2024 11:56:21 +0100 Subject: [PATCH 14/35] Update 25-linreg-diagnostics placeholder intro text added --- episodes/25-linreg-diagnostics | 41 +++++++++++++++++++++++++++++++++- 1 file changed, 40 insertions(+), 1 deletion(-) diff --git a/episodes/25-linreg-diagnostics b/episodes/25-linreg-diagnostics index a7d842741..36481c739 100644 --- a/episodes/25-linreg-diagnostics +++ b/episodes/25-linreg-diagnostics @@ -1,8 +1,47 @@ +--- +title: Assumption Diagnostics and Regression Trouble Shooting +teaching: 40 +exercises: 20 +source: Rmd +--- +::::::::::::::::::::::::::::::::::::::: objectives +- To be able to explore relationships between variables +- To be able to calculate predicted variables and residuals +- To be able to construct linear regression models +- To be able to check assumptions for linear regressiona models +- To be able to present model outcomes using Broom +:::::::::::::::::::::::::::::::::::::::::::::::::: -## Assumption Diagnostics and Regression Trouble Shooting +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I explore relationships between variables in my data? +- How can I check that my data is suitable for use in a linear regression model? +- How can I present model outputs in an easier to read way? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Content + +- Linear Regression Models +- Use of Log transform +- Assumption Diagnostics and Regression Troubleshooting +- Use of Broom + +## Data + +```{r libraries, message=FALSE, warning=FALSE} + +# We will need these libraries and this data later. +library(ggplot2) +library(lmtest) +library(sandwich) + +lon_dims_imd_2019 <- read.csv(".data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv") + +``` For accurate model interpretation and prediction, there are a number of assumptions about linear regression models that need to be verified. These are: From d1ad33a76373d0e0efb43a857828652448464a41 Mon Sep 17 00:00:00 2001 From: quirksahern Date: Mon, 19 Aug 2024 13:40:10 +0100 Subject: [PATCH 15/35] Update 25-linreg-diagnostics examples complete --- episodes/25-linreg-diagnostics | 111 +++++++++++++++++++++++++-------- 1 file changed, 86 insertions(+), 25 deletions(-) diff --git a/episodes/25-linreg-diagnostics b/episodes/25-linreg-diagnostics index 36481c739..792c2d626 100644 --- a/episodes/25-linreg-diagnostics +++ b/episodes/25-linreg-diagnostics @@ -1,34 +1,27 @@ --- title: Assumption Diagnostics and Regression Trouble Shooting teaching: 40 -exercises: 20 source: Rmd --- -::::::::::::::::::::::::::::::::::::::: objectives +::::::::::::::::::::::::::::::::::::::: objective -- To be able to explore relationships between variables -- To be able to calculate predicted variables and residuals -- To be able to construct linear regression models -- To be able to check assumptions for linear regressiona models -- To be able to present model outcomes using Broom +- To be able to check assumptions for linear regression models :::::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: questions +:::::::::::::::::::::::::::::::::::::::: question -- How can I explore relationships between variables in my data? - How can I check that my data is suitable for use in a linear regression model? -- How can I present model outputs in an easier to read way? :::::::::::::::::::::::::::::::::::::::::::::::::: ## Content -- Linear Regression Models -- Use of Log transform -- Assumption Diagnostics and Regression Troubleshooting -- Use of Broom +- Residual diagnostics +- Heteroskedasticity Check +- Multicollinearity Check +- Identification of Influential Points ## Data @@ -36,8 +29,8 @@ source: Rmd # We will need these libraries and this data later. library(ggplot2) -library(lmtest) -library(sandwich) +library(olsrr) +library(car) lon_dims_imd_2019 <- read.csv(".data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv") @@ -54,21 +47,18 @@ These are: We can perform a number of tests to see if the assumptions are met. -We will first undertake some residual diagnostics using the package 'olsrr'. - -### Residual Diagnostics +## Residual Diagnostics -#### Residual QQ Plot +### Residual QQ Plot Plot for detecting violation of normality assumption. ```{r} -library(olsrr) model1 <- lm(health_london_rank ~ livingEnv_london_rank + barriers_london_rank + la19nm, data = lon_dims_imd_2019) ols_plot_resid_qq(model1) ``` -#### Residual Normality Test +### Residual Normality Test Test for detecting violation of normality assumption. ```{r} @@ -98,7 +88,7 @@ ols_plot_resid_fit(model1) ``` -#### Residual Histogram +### Residual Histogram Additionally, we can create a histogram of residuals for detecting violation of normality assumption. ```{r} @@ -107,5 +97,76 @@ ols_plot_resid_hist(model1) ``` -### Additional Diagnostics -In addition the residual diagnostics, we can also assess our model for Heteroskedasticity and Multicollinearity +## Additional Diagnostics +In addition the residual diagnostics, we can also assess our model for Heteroskedasticity, Multicollinearity and any Influential/High leverage points. + +### Heteroskedasticity Check + +We can use the ncvTest() function to test for equal variances. + +```{r} + +ncvTest(model1) + +``` +The p-value here is low, indicating that this model may have a problem of unequal variances. + +### Multicollinearity Check +We want to know whether we have too many variables that have high correlation with each other. +If the value of GVIF is greater than 4, it suggests collinearity. + +```{r} + +vif(model1) + +``` + +### Outlier Identification +We can identify outliers in our data by creating a QQPlot and running statistical tests on the data. +The outlierTest() reports the Bonferroni p-values for testing each observation in turn to be a mean-shift outlier, based Studentized residuals in +linear (t-tests), generalized linear models (normal tests), and linear mixed models. + +```{r} +#Create a qqPlot +qqPlot(model1,id.n=2) + +#Test for outliers +outlierTest(model1) +``` + +The null hypothesis for the Bonferonni adjusted outlier test is the observation is an outlier. Here observation related to ‘4612’ is an outlier. +In our data '4612' is difficult to identify as it may refer to either a Living Environment or a Barriers rank. + +### Identifying Influence Points + +We can identify potential influential points via an Influence Plot, these may be Ouliers and/or High Leverage points. +A point can be considered influential if it's exclusion causes substantial change in the estimated regression. + +An influence plot summarises the 3 metrics for influence analysis in one single glance. +The X-axis plots the leverage, which is normalised between 0 and 1. +The Y-axis plots the studentized residuals, which can be positive or negative. +The size of the circles for each data point reflect its Cook’s distance, degree of influence. + +Vertical reference lines are drawn at twice and three times the average hat value, horizontal reference lines at -2, 0, and 2 on the Studentized-residual scale. +```{r} +#Influence Plots +influencePlot(model1) +``` +Identified influential points are returned in a data frame with the hat values, Studentized residuals and Cook's distance of the identified points. +If no points are identified, nothing is returned. + +Influence points can be further explored with an Influence Index Plot which provides index plots of influence and related diagnostics for a regression model. + +```{r} +influenceIndexPlot(model1, id.n=3) +``` + +If an observation is influential then that observation can change the fit of the linear model. + +An observation may be influential because: +1. Someone made a recording error +2. Someone made a fundamental mistake collecting the observation +3. The data point is perfectly valid, in which case the model cannot account for the behaviour. + +If the case is 1 or 2, then you can remove the point (or correct it). +If it's 3, it's not worth deleting a valid point; the data may be better suited a non-linear model. From c59f6c33399e38f522b0cc9bc16aad52609a01ba Mon Sep 17 00:00:00 2001 From: quirksahern Date: Mon, 19 Aug 2024 13:41:38 +0100 Subject: [PATCH 16/35] Update 24-linear-reg-broom.Rmd updated contents and questions --- episodes/24-linear-reg-broom.Rmd | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index abf63d8cc..17b540b97 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -10,7 +10,6 @@ source: Rmd - To be able to explore relationships between variables - To be able to calculate predicted variables and residuals - To be able to construct linear regression models -- To be able to check assumptions for linear regressiona models - To be able to present model outcomes using Broom :::::::::::::::::::::::::::::::::::::::::::::::::: @@ -18,7 +17,6 @@ source: Rmd :::::::::::::::::::::::::::::::::::::::: questions - How can I explore relationships between variables in my data? -- How can I check that my data is suitable for use in a linear regression model? - How can I present model outputs in an easier to read way? :::::::::::::::::::::::::::::::::::::::::::::::::: @@ -27,7 +25,7 @@ source: Rmd - Linear Regression Models - Use of Log transform -- Assumption Diagnostics and Regression Troubleshooting +- Use of Categorical Variables - Use of Broom ## Data From 78c012e385b7d42cb6baf70f6f3af1860e0f1905 Mon Sep 17 00:00:00 2001 From: quirksahern Date: Mon, 19 Aug 2024 15:15:42 +0100 Subject: [PATCH 17/35] Update 24-linear-reg-broom.Rmd initial broom examples added --- episodes/24-linear-reg-broom.Rmd | 51 ++++++++++++++++++++++++++++++++ 1 file changed, 51 insertions(+) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index 17b540b97..4fa28f90a 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -36,6 +36,7 @@ source: Rmd library(ggplot2) library(lmtest) library(sandwich) +library(broom) lon_dims_imd_2019 <- read.csv(".data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv") @@ -160,3 +161,53 @@ reg_cat_var_int <- lm(health_london_rank ~ la19nm*(livingEnv_london_rank + barri summary(reg_cat_var_int) ``` +## Broom + +The 'broom' package offers an alternative way of presenting the output of statistical analysis. +It centers around three S3 methods, each of which take common objects produced by R statistical functions (lm, t.test, nls, etc) and convert them into a tibble. + +These are: +* tidy: constructs a tibble that summarizes the model’s statistical findings. This includes coefficients and p-values for each term in a regression, per-cluster information in clustering applications, or per-test information for multtest functions. +* augment: add columns to the original data that was modeled. This includes predictions, residuals, and cluster assignments. +* glance: construct a concise one-row summary of the model. This typically contains values such as R^2, adjusted R^2, and residual standard error that are computed once for the entire model. + + +Let's revisit our initial linear model: +```{r} + +reg_LivEnv_health <- lm(health_london_rank ~ livingEnv_london_rank, data = lon_dims_imd_2019) + +summary(reg_LivEnv_health) + +``` + +There is a lot of useful information, but it not available in a way so that you can combine it with other models or do further analysis. +We can convert this to tabular data using the 'tidy' function. + +```{r} + +tidy(reg_LivEnv_health) + +``` +The row names have been moved into a column called term, and the column names are simple and consistent (and can be accessed using $). + +Information about the model can be explored with 'augment'. The function augments the original data with information from the model, +such as the fitted values and residuals for each of the original points in the regression. + + +```{r} + +augment(reg_LivEnv_health) + +``` + +Some of the data presented by 'augment' will be discussed in the episode Linear Regression Diagnostics. + +Summary statistics are computed for the entire regression, such as R^2 and the F-statistic can be accessed with the 'glance' function: + + +```{r} + +glance(reg_LivEnv_health) + +``` From 16b5d1179a0e0d1ef18c20e46ec5cb2219c6cd0e Mon Sep 17 00:00:00 2001 From: quirksahern Date: Mon, 19 Aug 2024 16:40:07 +0100 Subject: [PATCH 18/35] Update 24-linear-reg-broom.Rmd broom content finalised --- episodes/24-linear-reg-broom.Rmd | 92 ++++++++++++++++++++++++++++++++ 1 file changed, 92 insertions(+) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index 4fa28f90a..4258005c6 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -34,6 +34,7 @@ source: Rmd # We will need these libraries and this data later. library(ggplot2) +library(tidyverse) library(lmtest) library(sandwich) library(broom) @@ -211,3 +212,94 @@ Summary statistics are computed for the entire regression, such as R^2 and the F glance(reg_LivEnv_health) ``` +### Generalised linear models + +We can also use the 'broom' functions to present data from Generalised linear and non-linear models. +For example, if we wanted to explore the Income Rank in relation to whether or not an area was within the City of London. + +```{r} + +# add a variable to indicate whether or not an area is within the City of London +lon_dims_imd_2019 <- lon_dims_imd_2019 %>% mutate(city = la19nm == "City of London") + +# create a Generalised Linear Model +glmlondims <- glm(city ~ Income_london_rank, lon_dims_imd_2019, family = "binomial") +summary(glmlondims) + +``` + +Use of 'tidy': +```{r} + +tidy(glmlondims) + +``` + +Use of 'augment': +```{r} + +augment(glmlondims) + +``` + +Use of 'glance': +```{r} + +glance(glmlondims) + +``` +You will notice that the statistics computed by 'glance' are different for glm objects than for lm (e.g. deviance rather than R^2). + +### Hypothesis Testing + +The tidy function can also be applied a range of hypotheses tests, such as built-in functions like t.test, cor.test, and wilcox.test. + +### t-test +```{r} + +tt <- t.test(Income_london_rank ~ city, lon_dims_imd_2019) +tidy(tt) + +``` +Some cases might have fewer columns (for example, no confidence interval). + +Wilcox test: +```{r} + +wt <- wilcox.test(Income_london_rank ~ city, lon_dims_imd_2019) +tidy(wt) + +``` + +Since the 'tidy' output is already only one row, glance returns the same output: +```{r} +#t-test +glance(tt) + +#Wilcox test +glance(wt) + +``` + +The 'augment' method is defined only for chi-squared tests, since there is no meaningful sense, for other tests, +in which a hypothesis test produces output about each initial data point. + +```{r} +# xtabs creates a frequency table of IMD deciles within London borooughs +chit <- chisq.test(xtabs(~la19nm + IDAOP_london_decile,data = lon_dims_imd_2019)) + +tidy(chit) + +``` + +```{r} + +augment(chit) + +``` + +### Conventions +There are some conventions that enable consistency across the broom functions, these are: +* The output of the tidy, augment and glance functions is always a tibble. +* The output never has rownames. This ensures that you can combine it with other tidy outputs without fear of losing information (since rownames in R cannot contain duplicates). +* Some column names are kept consistent, so that they can be combined across different models and so that you know what to expect (in contrast to asking “is it pval or PValue?” every time). From 2a3fd109d27e634bb3161ef753cef869f787f765 Mon Sep 17 00:00:00 2001 From: quirksahern Date: Tue, 20 Aug 2024 10:54:04 +0100 Subject: [PATCH 19/35] Update 24-linear-reg-broom.Rmd exercises added --- episodes/24-linear-reg-broom.Rmd | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index 4258005c6..f615a65e3 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -128,6 +128,13 @@ In addition, we can access the cluster-robust standard errors regression result #cluster-robust standard errors coeftest(reg_LivEnv_barriers_health, reg_LivEnv_barriers_health$clse) ``` +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Using the GapMinder data to create a linear model between two continuous variables. + +Discuss your question and your findings. ### Regression with Categorical independent variables @@ -162,6 +169,14 @@ reg_cat_var_int <- lm(health_london_rank ~ la19nm*(livingEnv_london_rank + barri summary(reg_cat_var_int) ``` +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Using the GapMinder data to create a linear model between a categorical and a continuous variable . + +Discuss your question and your findings. + ## Broom The 'broom' package offers an alternative way of presenting the output of statistical analysis. @@ -298,6 +313,14 @@ augment(chit) ``` +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Use broom to amend the display of your model outputs. + +Which function(s) did you use and why? + ### Conventions There are some conventions that enable consistency across the broom functions, these are: * The output of the tidy, augment and glance functions is always a tibble. From d919cdc043e30ea37c00ab0e03f4d0f6aeb2f426 Mon Sep 17 00:00:00 2001 From: quirksahern Date: Tue, 20 Aug 2024 10:55:56 +0100 Subject: [PATCH 20/35] Update 25-linreg-diagnostics exercise added --- episodes/25-linreg-diagnostics | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/episodes/25-linreg-diagnostics b/episodes/25-linreg-diagnostics index 792c2d626..d70caedd8 100644 --- a/episodes/25-linreg-diagnostics +++ b/episodes/25-linreg-diagnostics @@ -1,6 +1,7 @@ --- title: Assumption Diagnostics and Regression Trouble Shooting -teaching: 40 +teaching: 40 minutes +exercises: 20 minutes source: Rmd --- @@ -170,3 +171,11 @@ An observation may be influential because: If the case is 1 or 2, then you can remove the point (or correct it). If it's 3, it's not worth deleting a valid point; the data may be better suited a non-linear model. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Choose 3 of the of the regression diagnostics, check that the assumptions are met for the linear models you have created using the GapMinder data. + +State what you tested and discuss your findings. From f48cb221f7ec6aa31caa903e44ba5ebd3291c7ac Mon Sep 17 00:00:00 2001 From: quirksahern Date: Tue, 20 Aug 2024 11:28:16 +0100 Subject: [PATCH 21/35] Update 24-linear-reg-broom.Rmd Log transform added --- episodes/24-linear-reg-broom.Rmd | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index f615a65e3..865c20a4f 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -83,7 +83,22 @@ R-square shows the amount of variance of Y explained by X. In this case the Livi ### Log transform -Don't know what to put here yet +If your data is skwewed, it can be useful to transform a variable to it's log form when doing the regression. +You can either transform the variable beforehand or do so in the equation. + +```{r} + +reg_logbarriers_health <- lm(health_london_rank ~ log (barriers_london_rank), data = lon_dims_imd_2019) + +summary(reg_logbarriers_health) + +``` +The interpretation of the log-transformed variable is a bit different. +In this example only the predictor variable is log tranformed, therefore to interpret the slope coefficient we divide it by 100 (2917.0/100=29.170). + +If the dependent/response variable is solely log-transformed. Exponentiate the coefficient. This gives the multiplicative factor for every one-unit increase in the independent variable. Example: the coefficient is 0.198. exp(0.198) = 1.218962. For every one-unit increase in the independent variable, our dependent variable increases by a factor of about 1.22, or 22%. Recall that multiplying a number by 1.22 is the same as increasing the number by 22%. Likewise, multiplying a number by, say 0.84, is the same as decreasing the number by 1 – 0.84 = 0.16, or 16%. + +If both are transformed. nterpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable. Example: the coefficient is 0.198. For every 1% increase in the independent variable, our dependent variable increases by about 0.20%. For x percent increase, calculate 1.x to the power of the coefficient, subtract 1, and multiply by 100. Example: For every 20% increase in the independent variable, our dependent variable increases by about (1.20 0.198 - 1) * 100 = 3.7 percent. ### Predicted values and Residuals From 177fbfa76b8f081a97618e772030e088ab88c387 Mon Sep 17 00:00:00 2001 From: Milan Malfait <38256462+milanmlft@users.noreply.github.com> Date: Fri, 23 Aug 2024 10:41:18 +0100 Subject: [PATCH 22/35] Update renv --- renv/profiles/lesson-requirements/renv.lock | 41 +++++++++++++++++++++ 1 file changed, 41 insertions(+) diff --git a/renv/profiles/lesson-requirements/renv.lock b/renv/profiles/lesson-requirements/renv.lock index 2fde3baa5..2306a1ee5 100644 --- a/renv/profiles/lesson-requirements/renv.lock +++ b/renv/profiles/lesson-requirements/renv.lock @@ -849,6 +849,19 @@ ], "Hash": "b8552d117e1b808b09a832f589b79035" }, + "lmtest": { + "Package": "lmtest", + "Version": "0.9-40", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "graphics", + "stats", + "zoo" + ], + "Hash": "c6fafa6cccb1e1dfe7f7d122efd6e6a7" + }, "lubridate": { "Package": "lubridate", "Version": "1.9.3", @@ -1221,6 +1234,19 @@ ], "Hash": "0bcf0c6f274e90ea314b812a6d19a519" }, + "sandwich": { + "Package": "sandwich", + "Version": "3.1-0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "stats", + "utils", + "zoo" + ], + "Hash": "1cf6ae532f0179350862fefeb0987c9b" + }, "sass": { "Package": "sass", "Version": "0.4.9", @@ -1591,6 +1617,21 @@ "Source": "Repository", "Repository": "CRAN", "Hash": "51dab85c6c98e50a18d7551e9d49f76c" + }, + "zoo": { + "Package": "zoo", + "Version": "1.8-12", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics", + "lattice", + "stats", + "utils" + ], + "Hash": "5c715954112b45499fb1dadc6ee6ee3e" } } } From 053ad1b62e9c6b9918c4c36c0a16e0dba936af5f Mon Sep 17 00:00:00 2001 From: Milan Malfait <38256462+milanmlft@users.noreply.github.com> Date: Fri, 23 Aug 2024 10:44:34 +0100 Subject: [PATCH 23/35] Add file extension to episode 25 --- episodes/{25-linreg-diagnostics => 25-linreg-diagnostics.Rmd} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename episodes/{25-linreg-diagnostics => 25-linreg-diagnostics.Rmd} (100%) diff --git a/episodes/25-linreg-diagnostics b/episodes/25-linreg-diagnostics.Rmd similarity index 100% rename from episodes/25-linreg-diagnostics rename to episodes/25-linreg-diagnostics.Rmd From 28a4b6ffeada56f623781e454795ca27d21507bf Mon Sep 17 00:00:00 2001 From: Milan Malfait <38256462+milanmlft@users.noreply.github.com> Date: Fri, 23 Aug 2024 10:45:08 +0100 Subject: [PATCH 24/35] Add linreg episodes to config --- config.yaml | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/config.yaml b/config.yaml index 6aff2efbc..c5819a9b3 100644 --- a/config.yaml +++ b/config.yaml @@ -87,9 +87,8 @@ episodes: # basic stats - 22-quarto.md - 23-statistics.Rmd - # 23-regression - # 24-logregression - # 25-broom + - 24-linear-reg-broom.Rmd + - 25-linreg-diagnostics.Rmd # 26-r-sql - 27-wrap-up.Rmd From 52981ffc20794dc5c65e31e318d3d90d11ca1643 Mon Sep 17 00:00:00 2001 From: Milan Malfait <38256462+milanmlft@users.noreply.github.com> Date: Fri, 23 Aug 2024 10:46:40 +0100 Subject: [PATCH 25/35] Formatting: whitespace, code style, etc. --- episodes/24-linear-reg-broom.Rmd | 91 ++++++++++-------------------- episodes/25-linreg-diagnostics.Rmd | 51 ++++++----------- 2 files changed, 47 insertions(+), 95 deletions(-) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index 865c20a4f..25ae0bc9e 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -8,7 +8,7 @@ source: Rmd ::::::::::::::::::::::::::::::::::::::: objectives - To be able to explore relationships between variables -- To be able to calculate predicted variables and residuals +- To be able to calculate predicted variables and residuals - To be able to construct linear regression models - To be able to present model outcomes using Broom @@ -31,7 +31,6 @@ source: Rmd ## Data ```{r libraries, message=FALSE, warning=FALSE} - # We will need these libraries and this data later. library(ggplot2) library(tidyverse) @@ -40,25 +39,24 @@ library(sandwich) library(broom) lon_dims_imd_2019 <- read.csv(".data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv") - ``` We are going to use the data from the Consumer Data Research Centre, specifically the London IMD 2019 (English IMD 2019 Domains rebased). -Atribution: Data provided by the Consumer Data Research Centre, an ESRC Data Investment: ES/L011840/1, ES/L011891/1 +Atribution: Data provided by the Consumer Data Research Centre, an ESRC Data Investment: ES/L011840/1, ES/L011891/1 The statistical unit areas across the country are Lower layer Super Output Areas (LSOAs). We will explore the relationships between the different dimensions of the Indices of Multiple Deprivation. ## Linear Regression -Linear Regression enables use to to explore the the linear relationship of the dependent variable Y and independent variable(s) X(s). +Linear Regression enables use to to explore the the linear relationship of the dependent variable Y and independent variable(s) X(s). We are going to explore the linear relationship between the Health Deprivation and Disability Domain and the Living Environment Deprivation Domain. The Health Deprivation and Disability Domain measures the risk of premature death and the impairment of quality of life through poor physical or mental health. The domain measures morbidity, disability and premature mortality but not aspects of behaviour or environment that may be predictive of future health deprivation. The Living Environment Deprivation Domain measures the quality of the local environment. The indicators fall into two sub-domains. The ‘indoors’ living environment measures the quality of housing; while the ‘outdoors’ living environment contains measures of air quality and road traffic accidents. -Reference: +Reference: McLennan, David et al. The English Indices of Deprivation 2019 : Technical Report. Ministry of Housing, Communities and Local Government, 2019. Print. @@ -66,34 +64,30 @@ Reference: In the simple linear regression example we have only one dependent variable (health_london_rank) and one independent variable (livingEnv_london_rank). ```{r} - reg_LivEnv_health <- lm(health_london_rank ~ livingEnv_london_rank, data = lon_dims_imd_2019) -# We put the dependent variable to the left of the '~' and the independent variable(s) to the right +# We put the dependent variable to the left of the '~' and the independent variable(s) to the right # and we tell R which dataset we are referring to. summary(reg_LivEnv_health) - ``` From the result of this analysis, we can see that the Living Environment Deprivation Domain rank has a significant(small p-value, general rule of thumb <0.05) and positive relationship(positive coefficient) with the Health Deprivation and Disability Domain rank. -One way of interpreting the result is: One unit increase in the Living Environment rank is related to around 0.343 (3.430e-01) points increase of the Health Deprivation and Disability rank. +One way of interpreting the result is: One unit increase in the Living Environment rank is related to around 0.343 (3.430e-01) points increase of the Health Deprivation and Disability rank. R-square shows the amount of variance of Y explained by X. In this case the Living Environment rank explains 6.225% of the variance in the Health Deprivation and Disability rank. Adj R2(6.205%) shows the same as R2 but adjusted by the # of cases and # of variables. When the # of variables is small and the # of cases is very large then Adj R2 is closer to R2. ### Log transform -If your data is skwewed, it can be useful to transform a variable to it's log form when doing the regression. +If your data is skwewed, it can be useful to transform a variable to it's log form when doing the regression. You can either transform the variable beforehand or do so in the equation. ```{r} - -reg_logbarriers_health <- lm(health_london_rank ~ log (barriers_london_rank), data = lon_dims_imd_2019) +reg_logbarriers_health <- lm(health_london_rank ~ log(barriers_london_rank), data = lon_dims_imd_2019) summary(reg_logbarriers_health) - ``` -The interpretation of the log-transformed variable is a bit different. +The interpretation of the log-transformed variable is a bit different. In this example only the predictor variable is log tranformed, therefore to interpret the slope coefficient we divide it by 100 (2917.0/100=29.170). If the dependent/response variable is solely log-transformed. Exponentiate the coefficient. This gives the multiplicative factor for every one-unit increase in the independent variable. Example: the coefficient is 0.198. exp(0.198) = 1.218962. For every one-unit increase in the independent variable, our dependent variable increases by a factor of about 1.22, or 22%. Recall that multiplying a number by 1.22 is the same as increasing the number by 22%. Likewise, multiplying a number by, say 0.84, is the same as decreasing the number by 1 – 0.84 = 0.16, or 16%. @@ -106,25 +100,23 @@ We can expand our simple linear regression example to incorporate the Barriers t The Barriers to Housing and Services Domain measures the physical and financial accessibility of housing and local services. The indicators fall into two sub-domains: ‘geographical barriers’, which relate to the physical proximity of local services, and ‘wider barriers’ which includes issues relating to access to housing, such as affordability. ```{r} - reg_LivEnv_barriers_health <- lm(health_london_rank ~ livingEnv_london_rank + barriers_london_rank, data = lon_dims_imd_2019) summary(reg_LivEnv_barriers_health) - ``` After running the regression model, we can access the model predicted values and the residuals compared to the real observations. ```{r} -#first we fit the predictions -health_rank_pred<-fitted(reg_LivEnv_barriers_health) -health_rank_pred<-as.data.frame(health_rank_pred) +# first we fit the predictions +health_rank_pred <- fitted(reg_LivEnv_barriers_health) +health_rank_pred <- as.data.frame(health_rank_pred) -#now we add the residual values too -health_rank_resid<-residuals(reg_LivEnv_barriers_health) -health_rank_pred$resid<-health_rank_resid +# now we add the residual values too +health_rank_resid <- residuals(reg_LivEnv_barriers_health) +health_rank_pred$resid <- health_rank_resid -#We can thenview the predictions and residuals +# We can thenview the predictions and residuals View(health_rank_pred) ``` @@ -133,15 +125,15 @@ View(health_rank_pred) We can run the robust standard error regressions(control for heteroskedasticity, meaning unequal variances): ```{r} -reg_LivEnv_barriers_health$robse <- vcovHC(reg_LivEnv_barriers_health, type="HC1") -coeftest(reg_LivEnv_barriers_health,reg_LivEnv_barriers_health$robse) +reg_LivEnv_barriers_health$robse <- vcovHC(reg_LivEnv_barriers_health, type = "HC1") +coeftest(reg_LivEnv_barriers_health, reg_LivEnv_barriers_health$robse) ``` In addition, we can access the cluster-robust standard errors regression results: ```{r} -#cluster-robust standard errors -coeftest(reg_LivEnv_barriers_health, reg_LivEnv_barriers_health$clse) +# cluster-robust standard errors +coeftest(reg_LivEnv_barriers_health, reg_LivEnv_barriers_health$clse) ``` ::::::::::::::::::::::::::::::::::::::: challenge @@ -163,7 +155,7 @@ reg_cat_var <- lm(health_london_rank ~ livingEnv_london_rank + barriers_london_r summary(reg_cat_var) ``` -R automatically recognizes la19nm as a factor and treats it accordingly. +R automatically recognizes la19nm as a factor and treats it accordingly. The missing one in the coefficient summary (Barking and Dagenham) is treated as a base line, therefore the value is 0. However, we can also modify our model to show for all: @@ -171,7 +163,6 @@ However, we can also modify our model to show for all: reg_cat_var_showall <- lm(health_london_rank ~ 0 + livingEnv_london_rank + barriers_london_rank + la19nm, data = lon_dims_imd_2019) summary(reg_cat_var_showall) - ``` ### Categorical variables with interaction terms @@ -179,7 +170,7 @@ Sometimes we are interested in how a variable interacts with another variable. We can explore any interactions between locations (la19nm) and the living environment and barrier ranks. ```{r} -reg_cat_var_int <- lm(health_london_rank ~ la19nm*(livingEnv_london_rank + barriers_london_rank), data = lon_dims_imd_2019) +reg_cat_var_int <- lm(health_london_rank ~ la19nm * (livingEnv_london_rank + barriers_london_rank), data = lon_dims_imd_2019) summary(reg_cat_var_int) ``` @@ -205,31 +196,25 @@ These are: Let's revisit our initial linear model: ```{r} - reg_LivEnv_health <- lm(health_london_rank ~ livingEnv_london_rank, data = lon_dims_imd_2019) summary(reg_LivEnv_health) - ``` There is a lot of useful information, but it not available in a way so that you can combine it with other models or do further analysis. We can convert this to tabular data using the 'tidy' function. ```{r} - tidy(reg_LivEnv_health) - ``` The row names have been moved into a column called term, and the column names are simple and consistent (and can be accessed using $). -Information about the model can be explored with 'augment'. The function augments the original data with information from the model, -such as the fitted values and residuals for each of the original points in the regression. +Information about the model can be explored with 'augment'. The function augments the original data with information from the model, +such as the fitted values and residuals for each of the original points in the regression. ```{r} - augment(reg_LivEnv_health) - ``` Some of the data presented by 'augment' will be discussed in the episode Linear Regression Diagnostics. @@ -238,9 +223,7 @@ Summary statistics are computed for the entire regression, such as R^2 and the F ```{r} - glance(reg_LivEnv_health) - ``` ### Generalised linear models @@ -248,35 +231,27 @@ We can also use the 'broom' functions to present data from Generalised linear an For example, if we wanted to explore the Income Rank in relation to whether or not an area was within the City of London. ```{r} - # add a variable to indicate whether or not an area is within the City of London lon_dims_imd_2019 <- lon_dims_imd_2019 %>% mutate(city = la19nm == "City of London") # create a Generalised Linear Model glmlondims <- glm(city ~ Income_london_rank, lon_dims_imd_2019, family = "binomial") summary(glmlondims) - ``` Use of 'tidy': ```{r} - tidy(glmlondims) - ``` Use of 'augment': ```{r} - augment(glmlondims) - ``` Use of 'glance': ```{r} - glance(glmlondims) - ``` You will notice that the statistics computed by 'glance' are different for glm objects than for lm (e.g. deviance rather than R^2). @@ -286,46 +261,38 @@ The tidy function can also be applied a range of hypotheses tests, such as built ### t-test ```{r} - tt <- t.test(Income_london_rank ~ city, lon_dims_imd_2019) tidy(tt) - ``` Some cases might have fewer columns (for example, no confidence interval). Wilcox test: ```{r} - wt <- wilcox.test(Income_london_rank ~ city, lon_dims_imd_2019) tidy(wt) - ``` Since the 'tidy' output is already only one row, glance returns the same output: ```{r} -#t-test +# t-test glance(tt) -#Wilcox test +# Wilcox test glance(wt) - ``` -The 'augment' method is defined only for chi-squared tests, since there is no meaningful sense, for other tests, +The 'augment' method is defined only for chi-squared tests, since there is no meaningful sense, for other tests, in which a hypothesis test produces output about each initial data point. ```{r} # xtabs creates a frequency table of IMD deciles within London borooughs -chit <- chisq.test(xtabs(~la19nm + IDAOP_london_decile,data = lon_dims_imd_2019)) +chit <- chisq.test(xtabs(~ la19nm + IDAOP_london_decile, data = lon_dims_imd_2019)) tidy(chit) - ``` ```{r} - augment(chit) - ``` ::::::::::::::::::::::::::::::::::::::: challenge @@ -340,4 +307,4 @@ Which function(s) did you use and why? There are some conventions that enable consistency across the broom functions, these are: * The output of the tidy, augment and glance functions is always a tibble. * The output never has rownames. This ensures that you can combine it with other tidy outputs without fear of losing information (since rownames in R cannot contain duplicates). -* Some column names are kept consistent, so that they can be combined across different models and so that you know what to expect (in contrast to asking “is it pval or PValue?” every time). +* Some column names are kept consistent, so that they can be combined across different models and so that you know what to expect (in contrast to asking “is it pval or PValue?” every time). diff --git a/episodes/25-linreg-diagnostics.Rmd b/episodes/25-linreg-diagnostics.Rmd index d70caedd8..d168b323d 100644 --- a/episodes/25-linreg-diagnostics.Rmd +++ b/episodes/25-linreg-diagnostics.Rmd @@ -21,20 +21,18 @@ source: Rmd - Residual diagnostics - Heteroskedasticity Check -- Multicollinearity Check +- Multicollinearity Check - Identification of Influential Points ## Data ```{r libraries, message=FALSE, warning=FALSE} - # We will need these libraries and this data later. library(ggplot2) library(olsrr) library(car) lon_dims_imd_2019 <- read.csv(".data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv") - ``` For accurate model interpretation and prediction, there are a number of assumptions about linear regression models that need to be verified. @@ -44,16 +42,15 @@ These are: * No Multicollinearity: It is essential that the independent variables are not too highly correlated with each other, a condition known as multicollinearity. This can be checked using: + Correlation matrices, where correlation coefficients should ideally be below 0.80. + Variance Inflation Factor (VIF), with VIF values above 10 indicating problematic multicollinearity. -* Homoscedasticity: The variance of error terms (residuals) should be consistent across all levels of the independent variables. +* Homoscedasticity: The variance of error terms (residuals) should be consistent across all levels of the independent variables. -We can perform a number of tests to see if the assumptions are met. +We can perform a number of tests to see if the assumptions are met. ## Residual Diagnostics ### Residual QQ Plot Plot for detecting violation of normality assumption. ```{r} - model1 <- lm(health_london_rank ~ livingEnv_london_rank + barriers_london_rank + la19nm, data = lon_dims_imd_2019) ols_plot_resid_qq(model1) @@ -62,16 +59,12 @@ ols_plot_resid_qq(model1) ### Residual Normality Test Test for detecting violation of normality assumption. ```{r} - ols_test_normality(model1) - ``` Correlation between observed residuals and expected residuals under normality. ```{r} - ols_test_correlation(model1) - ``` From these tests we can see that our assumptions are seemingly correct. @@ -84,18 +77,14 @@ Characteristics of a well behaved residual vs fitted plot: * No one residual is visibly away from the random pattern of the residuals indicating that there are no outliers. ```{r} - ols_plot_resid_fit(model1) - ``` ### Residual Histogram Additionally, we can create a histogram of residuals for detecting violation of normality assumption. ```{r} - ols_plot_resid_hist(model1) - ``` ## Additional Diagnostics @@ -106,9 +95,7 @@ In addition the residual diagnostics, we can also assess our model for Heteroske We can use the ncvTest() function to test for equal variances. ```{r} - ncvTest(model1) - ``` The p-value here is low, indicating that this model may have a problem of unequal variances. @@ -117,9 +104,7 @@ We want to know whether we have too many variables that have high correlation wi If the value of GVIF is greater than 4, it suggests collinearity. ```{r} - vif(model1) - ``` ### Outlier Identification @@ -128,10 +113,10 @@ The outlierTest() reports the Bonferroni p-values for testing each observation i linear (t-tests), generalized linear models (normal tests), and linear mixed models. ```{r} -#Create a qqPlot -qqPlot(model1,id.n=2) +# Create a qqPlot +qqPlot(model1, id.n = 2) -#Test for outliers +# Test for outliers outlierTest(model1) ``` @@ -140,26 +125,26 @@ In our data '4612' is difficult to identify as it may refer to either a Living E ### Identifying Influence Points -We can identify potential influential points via an Influence Plot, these may be Ouliers and/or High Leverage points. +We can identify potential influential points via an Influence Plot, these may be Ouliers and/or High Leverage points. A point can be considered influential if it's exclusion causes substantial change in the estimated regression. -An influence plot summarises the 3 metrics for influence analysis in one single glance. -The X-axis plots the leverage, which is normalised between 0 and 1. -The Y-axis plots the studentized residuals, which can be positive or negative. +An influence plot summarises the 3 metrics for influence analysis in one single glance. +The X-axis plots the leverage, which is normalised between 0 and 1. +The Y-axis plots the studentized residuals, which can be positive or negative. The size of the circles for each data point reflect its Cook’s distance, degree of influence. -Vertical reference lines are drawn at twice and three times the average hat value, horizontal reference lines at -2, 0, and 2 on the Studentized-residual scale. +Vertical reference lines are drawn at twice and three times the average hat value, horizontal reference lines at -2, 0, and 2 on the Studentized-residual scale. ```{r} -#Influence Plots +# Influence Plots influencePlot(model1) ``` -Identified influential points are returned in a data frame with the hat values, Studentized residuals and Cook's distance of the identified points. -If no points are identified, nothing is returned. +Identified influential points are returned in a data frame with the hat values, Studentized residuals and Cook's distance of the identified points. +If no points are identified, nothing is returned. -Influence points can be further explored with an Influence Index Plot which provides index plots of influence and related diagnostics for a regression model. +Influence points can be further explored with an Influence Index Plot which provides index plots of influence and related diagnostics for a regression model. ```{r} -influenceIndexPlot(model1, id.n=3) +influenceIndexPlot(model1, id.n = 3) ``` If an observation is influential then that observation can change the fit of the linear model. @@ -169,12 +154,12 @@ An observation may be influential because: 2. Someone made a fundamental mistake collecting the observation 3. The data point is perfectly valid, in which case the model cannot account for the behaviour. -If the case is 1 or 2, then you can remove the point (or correct it). +If the case is 1 or 2, then you can remove the point (or correct it). If it's 3, it's not worth deleting a valid point; the data may be better suited a non-linear model. ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Challenge Choose 3 of the of the regression diagnostics, check that the assumptions are met for the linear models you have created using the GapMinder data. From 91b0fb3e59a26d706f879f510c2d7a499e32b09b Mon Sep 17 00:00:00 2001 From: Milan Malfait <38256462+milanmlft@users.noreply.github.com> Date: Fri, 23 Aug 2024 10:47:14 +0100 Subject: [PATCH 26/35] Fix questions callout --- episodes/25-linreg-diagnostics.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/episodes/25-linreg-diagnostics.Rmd b/episodes/25-linreg-diagnostics.Rmd index d168b323d..bf71d7c1c 100644 --- a/episodes/25-linreg-diagnostics.Rmd +++ b/episodes/25-linreg-diagnostics.Rmd @@ -11,7 +11,7 @@ source: Rmd :::::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: question +:::::::::::::::::::::::::::::::::::::::: questions - How can I check that my data is suitable for use in a linear regression model? From 4f995d5257755cb2eb1be914890e922c7e452eb8 Mon Sep 17 00:00:00 2001 From: Milan Malfait <38256462+milanmlft@users.noreply.github.com> Date: Fri, 23 Aug 2024 10:52:58 +0100 Subject: [PATCH 27/35] Add missing div close tags --- episodes/24-linear-reg-broom.Rmd | 13 ++++++++++--- episodes/25-linreg-diagnostics.Rmd | 4 +++- 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index 25ae0bc9e..0228b0f24 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -135,14 +135,17 @@ In addition, we can access the cluster-robust standard errors regression result # cluster-robust standard errors coeftest(reg_LivEnv_barriers_health, reg_LivEnv_barriers_health$clse) ``` -::::::::::::::::::::::::::::::::::::::: challenge + +::::::::::::::::::::::::::::::::::::::: challenge ## Challenge 1 -Using the GapMinder data to create a linear model between two continuous variables. +Use the `gapminder` data to create a linear model between two continuous variables. Discuss your question and your findings. +:::::::::::::::::::::::::::::::::::::::::::::::::: + ### Regression with Categorical independent variables We will explore the use of categorical independent variables in linear regression in this episode. @@ -179,10 +182,12 @@ summary(reg_cat_var_int) ## Challenge 2 -Using the GapMinder data to create a linear model between a categorical and a continuous variable . +Using the `gapminder` data to create a linear model between a categorical and a continuous variable . Discuss your question and your findings. +:::::::::::::::::::::::::::::::::::::::::::::::::: + ## Broom The 'broom' package offers an alternative way of presenting the output of statistical analysis. @@ -303,6 +308,8 @@ Use broom to amend the display of your model outputs. Which function(s) did you use and why? +:::::::::::::::::::::::::::::::::::::::::::::::::: + ### Conventions There are some conventions that enable consistency across the broom functions, these are: * The output of the tidy, augment and glance functions is always a tibble. diff --git a/episodes/25-linreg-diagnostics.Rmd b/episodes/25-linreg-diagnostics.Rmd index bf71d7c1c..4697cedac 100644 --- a/episodes/25-linreg-diagnostics.Rmd +++ b/episodes/25-linreg-diagnostics.Rmd @@ -161,6 +161,8 @@ If it's 3, it's not worth deleting a valid point; the data may be better suited ## Challenge -Choose 3 of the of the regression diagnostics, check that the assumptions are met for the linear models you have created using the GapMinder data. +Choose 3 of the of the regression diagnostics, check that the assumptions are met for the linear models you have created using the `gapminder` data. State what you tested and discuss your findings. + +:::::::::::::::::::::::::::::::::::::::::::::::::: From c781738714681ce0a1d0c8f89812ed29aca51261 Mon Sep 17 00:00:00 2001 From: Milan Malfait <38256462+milanmlft@users.noreply.github.com> Date: Fri, 23 Aug 2024 10:54:55 +0100 Subject: [PATCH 28/35] Fix `objectives` callout --- episodes/25-linreg-diagnostics.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/episodes/25-linreg-diagnostics.Rmd b/episodes/25-linreg-diagnostics.Rmd index 4697cedac..5a2408a6d 100644 --- a/episodes/25-linreg-diagnostics.Rmd +++ b/episodes/25-linreg-diagnostics.Rmd @@ -5,7 +5,7 @@ exercises: 20 minutes source: Rmd --- -::::::::::::::::::::::::::::::::::::::: objective +::::::::::::::::::::::::::::::::::::::: objectives - To be able to check assumptions for linear regression models From efa533a0c20992edc266651275acfd5bbf3e53db Mon Sep 17 00:00:00 2001 From: Milan Malfait <38256462+milanmlft@users.noreply.github.com> Date: Fri, 23 Aug 2024 10:56:50 +0100 Subject: [PATCH 29/35] Timings should be just integer numbers --- episodes/25-linreg-diagnostics.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/episodes/25-linreg-diagnostics.Rmd b/episodes/25-linreg-diagnostics.Rmd index 5a2408a6d..2e9588420 100644 --- a/episodes/25-linreg-diagnostics.Rmd +++ b/episodes/25-linreg-diagnostics.Rmd @@ -1,7 +1,7 @@ --- title: Assumption Diagnostics and Regression Trouble Shooting -teaching: 40 minutes -exercises: 20 minutes +teaching: 80 +exercises: 20 source: Rmd --- From ff3c8f8b07856e8e243e36db0c1a8b1ea2f00680 Mon Sep 17 00:00:00 2001 From: Milan Malfait <38256462+milanmlft@users.noreply.github.com> Date: Fri, 23 Aug 2024 11:06:53 +0100 Subject: [PATCH 30/35] More code formatting, avoid long lines --- episodes/23-statistics.Rmd | 5 ++--- episodes/24-linear-reg-broom.Rmd | 10 ++++++++-- 2 files changed, 10 insertions(+), 5 deletions(-) diff --git a/episodes/23-statistics.Rmd b/episodes/23-statistics.Rmd index be62da523..77eba4d2f 100644 --- a/episodes/23-statistics.Rmd +++ b/episodes/23-statistics.Rmd @@ -277,9 +277,8 @@ Is the difference between the income ranks statistically significant? ## Doing a t-test ```{r} - - t.test(health_london_rank ~ city, data = lon_dims_imd_2019)$statistic - t.test(health_london_rank ~ city, data = lon_dims_imd_2019)$parameter +t.test(health_london_rank ~ city, data = lon_dims_imd_2019)$statistic +t.test(health_london_rank ~ city, data = lon_dims_imd_2019)$parameter ``` Notice that the summary()** of the test contains more data than is output by default. diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index 0228b0f24..917dc1be8 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -100,7 +100,10 @@ We can expand our simple linear regression example to incorporate the Barriers t The Barriers to Housing and Services Domain measures the physical and financial accessibility of housing and local services. The indicators fall into two sub-domains: ‘geographical barriers’, which relate to the physical proximity of local services, and ‘wider barriers’ which includes issues relating to access to housing, such as affordability. ```{r} -reg_LivEnv_barriers_health <- lm(health_london_rank ~ livingEnv_london_rank + barriers_london_rank, data = lon_dims_imd_2019) +reg_LivEnv_barriers_health <- lm( + health_london_rank ~ livingEnv_london_rank + barriers_london_rank, + data = lon_dims_imd_2019 +) summary(reg_LivEnv_barriers_health) ``` @@ -163,7 +166,10 @@ The missing one in the coefficient summary (Barking and Dagenham) is treated as However, we can also modify our model to show for all: ```{r} -reg_cat_var_showall <- lm(health_london_rank ~ 0 + livingEnv_london_rank + barriers_london_rank + la19nm, data = lon_dims_imd_2019) +reg_cat_var_showall <- lm( + health_london_rank ~ 0 + livingEnv_london_rank + barriers_london_rank + la19nm, + data = lon_dims_imd_2019 +) summary(reg_cat_var_showall) ``` From f5fc7f500835ea2c8c7757a5fab004044a649a94 Mon Sep 17 00:00:00 2001 From: Milan Malfait <38256462+milanmlft@users.noreply.github.com> Date: Fri, 23 Aug 2024 11:07:14 +0100 Subject: [PATCH 31/35] Update renv; add ep 25 dependencies --- renv/profiles/lesson-requirements/renv.lock | 459 ++++++++++++++++++++ 1 file changed, 459 insertions(+) diff --git a/renv/profiles/lesson-requirements/renv.lock b/renv/profiles/lesson-requirements/renv.lock index 2306a1ee5..b79c82e53 100644 --- a/renv/profiles/lesson-requirements/renv.lock +++ b/renv/profiles/lesson-requirements/renv.lock @@ -28,6 +28,16 @@ ], "Hash": "065ae649b05f1ff66bb0c793107508f5" }, + "Deriv": { + "Package": "Deriv", + "Version": "4.1.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "methods" + ], + "Hash": "bc727aba82bcda74db0bbe188ec6757c" + }, "DiagrammeR": { "Package": "DiagrammeR", "Version": "1.0.11", @@ -88,6 +98,19 @@ ], "Hash": "1920b2f11133b12350024297d8a4ff4a" }, + "MatrixModels": { + "Package": "MatrixModels", + "Version": "0.5-3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "Matrix", + "R", + "methods", + "stats" + ], + "Hash": "0776bf7526869e0286b0463cb72fb211" + }, "R6": { "Package": "R6", "Version": "2.5.1", @@ -108,6 +131,56 @@ ], "Hash": "45f0398006e83a5b10b72a90663d8d8c" }, + "Rcpp": { + "Package": "Rcpp", + "Version": "1.0.13", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "methods", + "utils" + ], + "Hash": "f27411eb6d9c3dada5edd444b8416675" + }, + "RcppEigen": { + "Package": "RcppEigen", + "Version": "0.3.4.0.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "Rcpp", + "stats", + "utils" + ], + "Hash": "598ee071e758b5f667b0ebe83f5d5085" + }, + "SparseM": { + "Package": "SparseM", + "Version": "1.84-2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "graphics", + "methods", + "stats", + "utils" + ], + "Hash": "e78499cbcbbca98200254bd171379165" + }, + "abind": { + "Package": "abind", + "Version": "1.4-5", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "methods", + "utils" + ], + "Hash": "4f57884290cc75ab22f4af9e9d4ca862" + }, "askpass": { "Package": "askpass", "Version": "1.2.0", @@ -174,6 +247,18 @@ ], "Hash": "40415719b5a479b87949f3aa0aee737c" }, + "boot": { + "Package": "boot", + "Version": "1.3-30", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "graphics", + "stats" + ], + "Hash": "96abeed416a286d4a0f52e550b612343" + }, "broom": { "Package": "broom", "Version": "1.0.6", @@ -240,6 +325,40 @@ ], "Hash": "d7e13f49c19103ece9e58ad2d83a7354" }, + "car": { + "Package": "car", + "Version": "3.1-2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "MASS", + "R", + "abind", + "carData", + "grDevices", + "graphics", + "lme4", + "mgcv", + "nlme", + "nnet", + "pbkrtest", + "quantreg", + "scales", + "stats", + "utils" + ], + "Hash": "839b351f31d56e0147439eb22c00a66a" + }, + "carData": { + "Package": "carData", + "Version": "3.0-5", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "ac6cdb8552c61bd36b0e54d07cf2aab7" + }, "cellranger": { "Package": "cellranger", "Version": "1.1.0", @@ -287,6 +406,13 @@ ], "Hash": "d954cb1c57e8d8b756165d7ba18aa55a" }, + "commonmark": { + "Package": "commonmark", + "Version": "1.9.1", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "5d8225445acb167abf7797de48b2ee3c" + }, "conflicted": { "Package": "conflicted", "Version": "1.2.0", @@ -300,6 +426,23 @@ ], "Hash": "bb097fccb22d156624fd07cd2894ddb6" }, + "cowplot": { + "Package": "cowplot", + "Version": "1.1.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "ggplot2", + "grDevices", + "grid", + "gtable", + "methods", + "rlang", + "scales" + ], + "Hash": "8ef2084dd7d28847b374e55440e4f8cb" + }, "cpp11": { "Package": "cpp11", "Version": "0.4.7", @@ -382,6 +525,30 @@ ], "Hash": "fd6824ad91ede64151e93af67df6376b" }, + "doBy": { + "Package": "doBy", + "Version": "4.6.22", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "Deriv", + "MASS", + "Matrix", + "R", + "boot", + "broom", + "cowplot", + "dplyr", + "ggplot2", + "methods", + "microbenchmark", + "modelr", + "rlang", + "tibble", + "tidyr" + ], + "Hash": "a9b56885b9596c284168df26d6179c40" + }, "dplyr": { "Package": "dplyr", "Version": "1.1.4", @@ -569,6 +736,17 @@ ], "Hash": "e0b3a53876554bd45879e596cdb10a52" }, + "goftest": { + "Package": "goftest", + "Version": "1.2-3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "stats" + ], + "Hash": "dbe0201f91eeb15918dd3fbf01ee689a" + }, "googledrive": { "Package": "googledrive", "Version": "2.1.1", @@ -622,6 +800,20 @@ ], "Hash": "d6db1667059d027da730decdc214b959" }, + "gridExtra": { + "Package": "gridExtra", + "Version": "2.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "grDevices", + "graphics", + "grid", + "gtable", + "utils" + ], + "Hash": "7d7f283939f563670a697165b2cf5560" + }, "gtable": { "Package": "gtable", "Version": "0.3.5", @@ -714,6 +906,21 @@ ], "Hash": "04291cc45198225444a397606810ac37" }, + "httpuv": { + "Package": "httpuv", + "Version": "1.6.15", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "Rcpp", + "later", + "promises", + "utils" + ], + "Hash": "d55aa087c47a63ead0f6fc10f8fa1ee0" + }, "httr": { "Package": "httr", "Version": "1.4.7", @@ -821,6 +1028,17 @@ ], "Hash": "b64ec208ac5bc1852b285f665d6368b3" }, + "later": { + "Package": "later", + "Version": "1.3.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "Rcpp", + "rlang" + ], + "Hash": "a3e051d405326b8b0012377434c62b37" + }, "lattice": { "Package": "lattice", "Version": "0.22-6", @@ -849,6 +1067,32 @@ ], "Hash": "b8552d117e1b808b09a832f589b79035" }, + "lme4": { + "Package": "lme4", + "Version": "1.1-35.5", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "MASS", + "Matrix", + "R", + "Rcpp", + "RcppEigen", + "boot", + "graphics", + "grid", + "lattice", + "methods", + "minqa", + "nlme", + "nloptr", + "parallel", + "splines", + "stats", + "utils" + ], + "Hash": "16a08fc75007da0d08e0c0388c7c33e6" + }, "lmtest": { "Package": "lmtest", "Version": "0.9-40", @@ -913,6 +1157,17 @@ ], "Hash": "110ee9d83b496279960e162ac97764ce" }, + "microbenchmark": { + "Package": "microbenchmark", + "Version": "1.4.10", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "graphics", + "stats" + ], + "Hash": "db81b552e393ed092872cf7023469bc2" + }, "mime": { "Package": "mime", "Version": "0.12", @@ -923,6 +1178,16 @@ ], "Hash": "18e9c28c1d3ca1560ce30658b22ce104" }, + "minqa": { + "Package": "minqa", + "Version": "1.2.8", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "Rcpp" + ], + "Hash": "785ef8e22389d4a7634c6c944f2dc07d" + }, "modelr": { "Package": "modelr", "Version": "0.1.11", @@ -966,6 +1231,64 @@ ], "Hash": "2769a88be217841b1f33ed469675c3cc" }, + "nloptr": { + "Package": "nloptr", + "Version": "2.1.1", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "27550641889a3abf3aec4d91186311ec" + }, + "nnet": { + "Package": "nnet", + "Version": "7.3-19", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "stats", + "utils" + ], + "Hash": "2c797b46eea7fb58ede195bc0b1f1138" + }, + "nortest": { + "Package": "nortest", + "Version": "1.0-4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "stats" + ], + "Hash": "e587e7a30c737ad415590976481332e4" + }, + "numDeriv": { + "Package": "numDeriv", + "Version": "2016.8-1.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "df58958f293b166e4ab885ebcad90e02" + }, + "olsrr": { + "Package": "olsrr", + "Version": "0.6.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "car", + "ggplot2", + "goftest", + "graphics", + "gridExtra", + "nortest", + "stats", + "utils", + "xplorerr" + ], + "Hash": "fc700aca442f1467f237803f5b0387c8" + }, "openssl": { "Package": "openssl", "Version": "2.2.0", @@ -976,6 +1299,24 @@ ], "Hash": "2bcca3848e4734eb3b16103bc9aa4b8e" }, + "pbkrtest": { + "Package": "pbkrtest", + "Version": "0.5.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "MASS", + "Matrix", + "R", + "broom", + "doBy", + "dplyr", + "lme4", + "methods", + "numDeriv" + ], + "Hash": "938e6bbc4ac57534f8b43224506a8966" + }, "pillar": { "Package": "pillar", "Version": "1.9.0", @@ -1040,6 +1381,22 @@ ], "Hash": "f4625e061cb2865f111b47ff163a5ca6" }, + "promises": { + "Package": "promises", + "Version": "1.3.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R6", + "Rcpp", + "fastmap", + "later", + "magrittr", + "rlang", + "stats" + ], + "Hash": "434cd5388a3979e74be5c219bcd6e77d" + }, "ps": { "Package": "ps", "Version": "1.7.7", @@ -1066,6 +1423,24 @@ ], "Hash": "1cba04a4e9414bdefc9dcaa99649a8dc" }, + "quantreg": { + "Package": "quantreg", + "Version": "5.98", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "MASS", + "Matrix", + "MatrixModels", + "R", + "SparseM", + "graphics", + "methods", + "stats", + "survival" + ], + "Hash": "017561f17632c065388b7062da030952" + }, "ragg": { "Package": "ragg", "Version": "1.3.2", @@ -1294,6 +1669,49 @@ ], "Hash": "3838071b66e0c566d55cc26bd6e27bf4" }, + "shiny": { + "Package": "shiny", + "Version": "1.9.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "bslib", + "cachem", + "commonmark", + "crayon", + "fastmap", + "fontawesome", + "glue", + "grDevices", + "htmltools", + "httpuv", + "jsonlite", + "later", + "lifecycle", + "methods", + "mime", + "promises", + "rlang", + "sourcetools", + "tools", + "utils", + "withr", + "xtable" + ], + "Hash": "6a293995a66e12c48d13aa1f957d09c7" + }, + "sourcetools": { + "Package": "sourcetools", + "Version": "0.1.7-1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "5f5a7629f956619d519205ec475fe647" + }, "stringi": { "Package": "stringi", "Version": "1.8.4", @@ -1324,6 +1742,22 @@ ], "Hash": "960e2ae9e09656611e0b8214ad543207" }, + "survival": { + "Package": "survival", + "Version": "3.7-0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "Matrix", + "R", + "graphics", + "methods", + "splines", + "stats", + "utils" + ], + "Hash": "5aaa9cbaf4aba20f8e06fdea1850a398" + }, "sys": { "Package": "sys", "Version": "3.4.2", @@ -1611,6 +2045,31 @@ ], "Hash": "1d0336142f4cd25d8d23cd3ba7a8fb61" }, + "xplorerr": { + "Package": "xplorerr", + "Version": "0.1.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "Rcpp", + "shiny", + "utils" + ], + "Hash": "88a988b19df700056d77966b74cdfdab" + }, + "xtable": { + "Package": "xtable", + "Version": "1.8-4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "stats", + "utils" + ], + "Hash": "b8acdf8af494d9ec19ccb2481a9b11c2" + }, "yaml": { "Package": "yaml", "Version": "2.3.10", From 35c895da487d77b2fcad09c0b97a9fc1d3b491cf Mon Sep 17 00:00:00 2001 From: Milan Malfait <38256462+milanmlft@users.noreply.github.com> Date: Fri, 23 Aug 2024 11:09:35 +0100 Subject: [PATCH 32/35] Fix data path --- episodes/24-linear-reg-broom.Rmd | 2 +- episodes/25-linreg-diagnostics.Rmd | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index 917dc1be8..5e46f0eda 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -38,7 +38,7 @@ library(lmtest) library(sandwich) library(broom) -lon_dims_imd_2019 <- read.csv(".data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv") +lon_dims_imd_2019 <- read.csv("data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv") ``` We are going to use the data from the Consumer Data Research Centre, specifically the London IMD 2019 (English IMD 2019 Domains rebased). diff --git a/episodes/25-linreg-diagnostics.Rmd b/episodes/25-linreg-diagnostics.Rmd index 2e9588420..bb1663453 100644 --- a/episodes/25-linreg-diagnostics.Rmd +++ b/episodes/25-linreg-diagnostics.Rmd @@ -32,7 +32,7 @@ library(ggplot2) library(olsrr) library(car) -lon_dims_imd_2019 <- read.csv(".data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv") +lon_dims_imd_2019 <- read.csv("data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv") ``` For accurate model interpretation and prediction, there are a number of assumptions about linear regression models that need to be verified. From 66f38cf8d95bee880a9ec4ea0243aba60379a607 Mon Sep 17 00:00:00 2001 From: quirksahern Date: Fri, 23 Aug 2024 12:50:05 +0100 Subject: [PATCH 33/35] Update episodes/24-linear-reg-broom.Rmd Fix view file option Co-authored-by: Milan Malfait <38256462+milanmlft@users.noreply.github.com> --- episodes/24-linear-reg-broom.Rmd | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index 5e46f0eda..666083d79 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -120,6 +120,11 @@ health_rank_resid <- residuals(reg_LivEnv_barriers_health) health_rank_pred$resid <- health_rank_resid # We can thenview the predictions and residuals +head(health_rank_pred) +``` + +```{r, eval=FALSE} +# You can view the full data in RStudio with the View() function View(health_rank_pred) ``` From 84d7cbf1e2b80f30786d51419b942497164f9b52 Mon Sep 17 00:00:00 2001 From: quirksahern Date: Fri, 23 Aug 2024 13:09:37 +0100 Subject: [PATCH 34/35] Update 24-linear-reg-broom.Rmd chisq decile converted to factor Some text about ChiSq assumptions added --- episodes/24-linear-reg-broom.Rmd | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/episodes/24-linear-reg-broom.Rmd b/episodes/24-linear-reg-broom.Rmd index 666083d79..5a99dc10e 100644 --- a/episodes/24-linear-reg-broom.Rmd +++ b/episodes/24-linear-reg-broom.Rmd @@ -296,11 +296,15 @@ glance(tt) # Wilcox test glance(wt) ``` +The chisq.test enables use to investigate whether changes in one categorical variable are related to changes in another categorical variable. The 'augment' method is defined only for chi-squared tests, since there is no meaningful sense, for other tests, in which a hypothesis test produces output about each initial data point. ```{r} +# convert IDAOP_london_decile to a factor so it is not interprested as continuous data +lon_dims_imd_2019$IDAOP_london_decile <- factor(lon_dims_imd_2019$IDAOP_london_decile) + # xtabs creates a frequency table of IMD deciles within London borooughs chit <- chisq.test(xtabs(~ la19nm + IDAOP_london_decile, data = lon_dims_imd_2019)) @@ -310,6 +314,14 @@ tidy(chit) ```{r} augment(chit) ``` +There are a number of underlying assumptions of a Chi-Square test, these are: +* Independence: The Chi-Square test assumes that the observations in the data are independent of each other. This means that the outcome of one observation should not influence the outcome of another. +* Random Sampling: The data should be obtained through random sampling to ensure that it is representative of the population from which it was drawn. +* Expected Frequency: The Chi-Square test assumes that the expected frequency count for each cell in the contingency table should be at least 5. If this assumption is not met, the test results may not be reliable. + +As we have received a warning about the reliability of our test, it is likely that one of these assumptions has not been met, +and that this is not a suitable test for this data. + ::::::::::::::::::::::::::::::::::::::: challenge From d86dabf9e8f3aac4b7407ca80fa018578c2acff3 Mon Sep 17 00:00:00 2001 From: quirksahern Date: Fri, 23 Aug 2024 13:16:34 +0100 Subject: [PATCH 35/35] Update 25-linreg-diagnostics.Rmd line 147 parameter id.n = 3 removed as causing a large number of errors --- episodes/25-linreg-diagnostics.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/episodes/25-linreg-diagnostics.Rmd b/episodes/25-linreg-diagnostics.Rmd index bb1663453..df830e845 100644 --- a/episodes/25-linreg-diagnostics.Rmd +++ b/episodes/25-linreg-diagnostics.Rmd @@ -144,7 +144,7 @@ If no points are identified, nothing is returned. Influence points can be further explored with an Influence Index Plot which provides index plots of influence and related diagnostics for a regression model. ```{r} -influenceIndexPlot(model1, id.n = 3) +influenceIndexPlot(model1) ``` If an observation is influential then that observation can change the fit of the linear model.