Update 23-statistics.Rmd removal of gapminder references #15

Merged
merged 1 commit into from
Aug 8, 2024
26 changes: 10 additions & 16 deletions episodes/23-statistics.Rmd
@@ -33,16 +33,20 @@ source: Rmd
```{r libraries, message=FALSE, warning=FALSE}
# We will need these libraries and this data later.
library(tidyverse)
library(ggplot2)

# loading data
lon_dims_imd_2019 <- read.csv("data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv")
# Commenting out as not used in this version
# library(lubridate)
#library(gapminder)

# create a binary membership variable for City of London (for later examples)
lon_dims_imd_2019 <- lon_dims_imd_2019 %>% mutate(city = la19nm == "City of London")
```

We are going to use the data from the gapminder package. We have added a variable *European* indicating if a country is in Europe.
We are going to use the data from the Consumer Data Research Centre, specifically the London IMD 2019 (English IMD 2019 Domains rebased) data.
Attribution: Data provided by the Consumer Data Research Centre, an ESRC Data Investment: ES/L011840/1, ES/L011891/1.

The statistical unit areas used to provide indices of relative deprivation across the country are Lower layer Super Output Areas (LSOAs). Dimensions of deprivation include income, employment, education, health, crime, barriers to housing and services, and the living environment.
We have added a variable *city* indicating whether an LSOA is within the City of London.
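
As a minimal sketch of how a binary membership variable like *city* behaves, the example below builds a small toy data frame and derives the flag in base R. The data frame `toy_imd` and its rows are invented for illustration; only the column name `la19nm` and the comparison against "City of London" come from the lesson code.

```r
# Toy stand-in for the real data: the column names follow the lesson,
# but these rows are invented for illustration only.
toy_imd <- data.frame(
  la19nm = c("City of London", "Camden", "Hackney", "City of London"),
  Income_london_rank = c(4500, 2100, 900, 4700)
)

# Same logic as the mutate() call in the lesson, written in base R:
# TRUE when the LSOA's local authority is the City of London.
toy_imd$city <- toy_imd$la19nm == "City of London"

# Count LSOAs outside (FALSE) and inside (TRUE) the City of London.
table(toy_imd$city)
```

A logical column of this kind can be used directly as a two-level grouping variable in later comparisons.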

## The big picture

@@ -246,7 +250,7 @@ It all starts with a hypothesis

## Comparing means

Is there an absolute difference between the income ranks of the Lower-layer Super Output Areas
Is there an absolute difference between the income ranks of the Lower-layer Super Output Areas?

```{r}
lon_dims_imd_2019 %>%
@@ -295,14 +299,6 @@ Testing supported the rejection of the null hypothesis that there is no differen
While the t-test is sufficient where there are two levels of the IV, for situations where there are more than two, we use the **ANOVA** family of procedures. To show this, we will compare *health ranks* across local authorities (`la19nm`). If the ANOVA result is statistically significant, we will use a post-hoc test to do pairwise comparisons (here, Tukey's Honest Significant Differences).

```{r}
# quantile(gapminder$gdpPercap)
# IQR(gapminder$gdpPercap)

# gapminder$gdpGroup <- cut(gapminder$gdpPercap, breaks = c(241.1659, 1202.0603, 3531.8470, 9325.4623, 113523.1329), labels = FALSE)

# gapminder$gdpGroup <- factor(gapminder$gdpGroup)

# anovamodel <- aov(gapminder$pop ~ gapminder$gdpGroup)
anovamodel <- aov(lon_dims_imd_2019$health_london_rank ~ lon_dims_imd_2019$la19nm)
summary(anovamodel)

@@ -314,10 +310,8 @@ TukeyHSD(anovamodel)
The most common use of regression modelling is to explore the relationship between two continuous variables, for example between `Income_london_rank` and `health_london_rank` in our data. We can first determine whether there is any significant correlation between the values, and if there is, plot the relationship.
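
The "test the correlation first, then look at the relationship" workflow can be sketched on simulated data. This is an illustration under stated assumptions, not the lesson's analysis: `income_rank` and `health_rank` are hypothetical stand-ins generated below, and the 0.05 cutoff is an assumed conventional threshold.

```r
# Simulated stand-ins for two rank variables (not the lesson's data).
set.seed(42)
n <- 200
income_rank <- sample(1:5000, n)
# Build a second variable that partly depends on the first, plus noise.
health_rank <- 0.6 * income_rank + rnorm(n, sd = 800)

# Pearson correlation test (the default method of cor.test).
ct <- cor.test(income_rank, health_rank)
ct$estimate   # correlation coefficient
ct$p.value    # p-value for the null hypothesis of zero correlation

# Only describe the relationship further if the correlation is significant.
# The 0.05 threshold is an assumption made for this sketch.
if (ct$p.value < 0.05) {
  fit <- lm(health_rank ~ income_rank)
  print(coef(fit))  # intercept and slope of the fitted line
}
```

With the real data, the same two steps would use the columns of `lon_dims_imd_2019` in place of the simulated vectors.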

```{r}
# cor.test(gapminder$gdpPercap, gapminder$lifeExp)
cor.test(lon_dims_imd_2019$Income_london_rank, lon_dims_imd_2019$health_london_rank)

# ggplot(gapminder, aes(gdpPercap, log(lifeExp))) +
ggplot(lon_dims_imd_2019, aes(Income_london_rank, health_london_rank)) +
geom_point() +
geom_smooth()
@@ -332,7 +326,7 @@ summary(modelone)

## Regression with a categorical IV (the t-test)

Run the following code chunk and compare the results to the t test conducted earlier.
Run the following code chunk and compare the results to the t-test conducted earlier.

```{r}
lon_dims_imd_2019 %>%