UCL-ARC · quirksahern · Aug 8, 2024 · Aug 8, 2024
diff --git a/episodes/23-statistics.Rmd b/episodes/23-statistics.Rmd
@@ -33,16 +33,20 @@ source: Rmd
 ```{r libraries, message=FALSE, warning=FALSE}
 # We will need these libraries and this data later.
 library(tidyverse)
+library(ggplot2)
+
 # loading data
 lon_dims_imd_2019 <- read.csv("data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv")
-# Commenting out as not used in this version
-# library(lubridate)
-#library(gapminder)
+
 # create a binary membership variable for City of London (for later examples)
 lon_dims_imd_2019 <- lon_dims_imd_2019 %>% mutate(city = la19nm == "City of London")
 ```
 
-We are going to use the data from the gapminder package.  We have added a variable *European* indicating if a country is in Europe.
+We are going to use the data from the Consumer Data Research Centre, specifically the London IMD 2019 (English IMD 2019 Domains rebased) data.
+Attribution: Data provided by the Consumer Data Research Centre, an ESRC Data Investment: ES/L011840/1, ES/L011891/1
+
+The statistical unit areas used to provide indices of relative deprivation across the country are Lower layer Super Output Areas (LSOAs). Dimensions of deprivation include income, employment, education, health, crime, barriers to housing and services, and the living environment.
+We have added a variable *city* indicating whether an LSOA is within the City of London.
 
 ## The big picture
 
@@ -246,7 +250,7 @@ It all starts with a hypothesis
 
 ## Comparing means
 
-Is there an absolute difference between the income ranks of the Lower-layer Super Output Areas
+Is there an absolute difference between the income ranks of the Lower-layer Super Output Areas?
 
 ```{r}
 lon_dims_imd_2019 %>%
@@ -295,14 +299,6 @@ Testing supported the rejection of the null hypothesis that there is no differen
 While the t-test is sufficient where there are two levels of the IV, for situations where there are more than two, we use the **ANOVA** family of procedures. To show this, we will create a variable that subsets our data by *per capita GDP* levels. If the ANOVA result is statistically significant, we will use a post-hoc test method to do pairwise comparisons (here Tukey's Honest Significant Differences.)
 
 ```{r}
-# quantile(gapminder$gdpPercap)
-# IQR(gapminder$gdpPercap)
-
-# gapminder$gdpGroup <- cut(gapminder$gdpPercap, breaks = c(241.1659, 1202.0603, 3531.8470, 9325.4623, 113523.1329), labels = FALSE)
-
-# gapminder$gdpGroup <- factor(gapminder$gdpGroup)
-
-# anovamodel <- aov(gapminder$pop ~ gapminder$gdpGroup)
 anovamodel <- aov(lon_dims_imd_2019$health_london_rank ~ lon_dims_imd_2019$la19nm)
 summary(anovamodel)
 
@@ -314,10 +310,8 @@ TukeyHSD(anovamodel)
 The most common use of regression modelling is to explore the relationship between two continuous variables, for example between `Income_london_rank` and `health_london_rank` in our data. We can first determine whether there is any significant correlation between the values, and if there is, plot the relationship.
 
 ```{r}
-# cor.test(gapminder$gdpPercap, gapminder$lifeExp)
 cor.test(lon_dims_imd_2019$Income_london_rank, lon_dims_imd_2019$health_london_rank)
 
-# ggplot(gapminder, aes(gdpPercap, log(lifeExp))) +
 ggplot(lon_dims_imd_2019, aes(Income_london_rank, health_london_rank)) +
   geom_point() +
   geom_smooth()
@@ -332,7 +326,7 @@ summary(modelone)
 
 ## Regression with a categorical IV (the t-test)
 
-Run the following code chunk and compare the results to the t test conducted earlier.
+Run the following code chunk and compare the results to the t-test conducted earlier.
 
 ```{r}
 lon_dims_imd_2019 %>%