Kasper Welbers & Wouter van Atteveldt October 2020
In this tutorial we show how to use Generalized Linear Models and Multilevel models in R. The focus is primarily on the code, but we do provide some general information about when to use these models and how to interpret them, so you can follow this tutorial even if these methods are new to you. Still, to properly apply these methods in your own research, we recommend also consulting material that focuses on the statistical models themselves.
For this tutorial we use the tidyverse and sjPlot packages to manage the data and present the models. For the statistical models we use the stats package (loaded by default) and the amazing lme4 package.
library(tidyverse)
library(sjPlot)
library(lme4)
In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. (Wikipedia)
Probably the most common use of generalized linear models is logistic regression (also called binary or binomial regression), which allows for a dichotomous response variable. Another common use is Poisson regression, which allows for a response variable with a Poisson distribution, such as count data.
Generalized linear models can be fitted with the regular stats package, in a way that is very similar to regular linear models. Where regular linear models use the lm function, generalized linear models use the glm function. Other than this, the main difference is that the family function, or link function, of the glm needs to be given.
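As a minimal sketch (using hypothetical data frames and variable names, just to show the syntax), the family argument is what distinguishes a glm call from an lm call:

## logistic regression for a dichotomous outcome (hypothetical data frame df1)
m_logistic = glm(passed ~ hours_studied, family = binomial, data = df1)
## Poisson regression for a count outcome (hypothetical data frame df2)
m_poisson = glm(n_visits ~ age, family = poisson, data = df2)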
In this tutorial we will mainly focus on how to use the glm function, using logistic regression as an example. If you are not familiar with GLMs, beware that the interpretation of coefficients and model fit can be quite different from ordinary linear regression. For a more detailed tutorial that includes pointers on interpretation, see the Generalized Linear Models tutorial on the R course material GitHub page.
We’ll use the iris demo data, which contains measurements of sepal and petal length and width for three species of iris. First, we create a copy of the iris data with a dichotomous variable for whether a case (row) is of the species “versicolor”. As an example case, we’ll then use logistic regression to predict whether the species is versicolor based on the sepal size (length and width).
d = mutate(iris, versicolor = as.numeric(Species == 'versicolor'))
head(d)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species versicolor
## 1 5.1 3.5 1.4 0.2 setosa 0
## 2 4.9 3.0 1.4 0.2 setosa 0
## 3 4.7 3.2 1.3 0.2 setosa 0
## 4 4.6 3.1 1.5 0.2 setosa 0
## 5 5.0 3.6 1.4 0.2 setosa 0
## 6 5.4 3.9 1.7 0.4 setosa 0
The glm function works mostly the same as the lm function, and uses the same type of formula: dependent ~ independent1 + independent2 + .... The main difference is that we’ll need to specify the error distribution family and link function with the family argument. For logistic regression, we need to use the binomial family (which by default uses the canonical logit link function). To interpret the model we use the tab_model function from the sjPlot package.
m = glm(versicolor ~ Sepal.Length + Sepal.Width, family = binomial, data = d)
tab_model(m)
Dependent variable: versicolor

| Predictors | Odds Ratios | CI | p |
|---|---|---|---|
| (Intercept) | 3270.76 | 38.30 – 473371.01 | 0.001 |
| Sepal.Length | 1.14 | 0.70 – 1.86 | 0.600 |
| Sepal.Width | 0.04 | 0.01 – 0.13 | <0.001 |
| Observations | 150 |  |  |
| R2 Tjur | 0.231 |  |  |
The output in the regression table looks very similar to that of ordinary linear regression, but there are some notable differences. Most importantly, note that the coefficients represent odds ratios[1]. In ordinary linear regression coefficients are additive, meaning that for every unit increase of an independent variable the ‘predicted’ value of the dependent variable goes up by the coefficient value. Odds ratios, however, are multiplicative: for every unit increase, the odds that y = 1 (versus y = 0) are multiplied by the coefficient value. For example, in the current model the odds ratio for Sepal.Width (0.04) indicates that for every unit increase of Sepal.Width, the odds are multiplied by 0.04.

It’s important to realize that this means that the effect of Sepal.Width is negative! For every unit increase of sepal width, the odds that the iris species is versicolor (y = 1) decrease. More generally, be careful when odds ratios are reported: a positive value does not by itself indicate a positive effect. If the value is between 0 and 1, the effect is negative, because the odds decrease. If the value is higher than 1, the effect is positive.
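Under the hood, glm estimates these coefficients on the log odds scale, and tab_model exponentiates them to odds ratios (see the footnote below). You can verify this yourself with the fitted model m:

coef(m)       ## raw coefficients: log odds ratios
exp(coef(m))  ## exponentiated: the odds ratios reported by tab_model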
Note that odds can be transformed to probabilities: probability = odds / (1 + odds). Thus, we can compute the probability that the species is versicolor. Here we’ll be brief (see the dedicated tutorial on GLM for details). We’ll calculate the probability for the case that Sepal.Length is 1 and Sepal.Width is 2.
odds = 3270.76 ## intercept
odds = odds * 1.14 ## sepal length is 1, so multiply by 1.14 once
odds = odds * 0.04^2 ## sepal width is 2, so multiply by 0.04 twice
odds / (odds + 1) ## transform odds to probability
## [1] 0.8564428
Alternatively, we can use the predict function to predict the response for a given combination of values for independent variables.
predict(m, type='response',
newdata = data.frame(Sepal.Length = 1, Sepal.Width = 2))
## 1
## 0.8577438
Just as in ordinary linear regression, p-values are reported to interpret whether an effect is significant. In this case, we see that only the effect of Sepal.Width is significant.
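If you prefer to see the raw coefficients on the log odds scale, together with their standard errors and p-values, the regular summary function also works on glm objects:

summary(m)   ## coefficients are log odds ratios here, with z statistics and p-values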
To evaluate model fit, you can compare models using the anova function. For example, here we first make a base model that contains no independent variables (other than the intercept), then a model with the Sepal measurements (identical to the model above), and finally a model that also includes the Petal information. We then compare how the model fit changes.
m_base = glm(versicolor ~ 1, family = binomial, data = d)
m1 = glm(versicolor ~ Sepal.Length + Sepal.Width, family = binomial, data = d)
m2 = glm(versicolor ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, family = binomial, data = d)
anova(m_base, m1,m2, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: versicolor ~ 1
## Model 2: versicolor ~ Sepal.Length + Sepal.Width
## Model 3: versicolor ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 149 190.95
## 2 147 151.65 2 39.304 2.919e-09 ***
## 3 145 145.07 2 6.581 0.03724 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that adding the variables improves the model fit, in particular for m1, but also for m2. The Chi-squared test shows whether the change in model fit is statistically significant for consecutive models (m_base to m1, and m1 to m2). If we look at the model, we also see that Petal.Width is a significant predictor.
tab_model(m1,m2)
Dependent variable: versicolor

| Predictors | Odds Ratios (m1) | CI (m1) | p (m1) | Odds Ratios (m2) | CI (m2) | p (m2) |
|---|---|---|---|---|---|---|
| (Intercept) | 3270.76 | 38.30 – 473371.01 | 0.001 | 1601.16 | 14.60 – 280625.32 | 0.003 |
| Sepal.Length | 1.14 | 0.70 – 1.86 | 0.600 | 0.78 | 0.21 – 2.79 | 0.706 |
| Sepal.Width | 0.04 | 0.01 – 0.13 | <0.001 | 0.06 | 0.01 – 0.26 | <0.001 |
| Petal.Length |  |  |  | 3.72 | 1.02 – 15.09 | 0.055 |
| Petal.Width |  |  |  | 0.06 | 0.01 – 0.55 | 0.018 |
| Observations | 150 |  |  | 150 |  |  |
| R2 Tjur | 0.231 |  |  | 0.275 |  |  |
Note that the tab_model function also reports the R2 Tjur, which is a so-called pseudo R2 measure. Pseudo R2 measures are designed to be interpretable in the same way as the regular R2 (i.e. a value between 0 and 1, where 0 means no fit and 1 means perfect fit). These are not computed by the glm function, but the sjPlot package has implementations for various popular pseudo R2 measures, and by default reports the R2 Tjur for logistic regression. Long story short: if you report these measures, beware that they are not real R2 values, that there are several alternatives, and that there is some debate on whether and which pseudo R2 measures are appropriate.
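For intuition: Tjur’s R2 is simply the difference between the average predicted probability for the cases where y = 1 and for the cases where y = 0. A quick way to check this for the model m fitted above (the result should match the 0.231 reported by tab_model, up to rounding):

mean(fitted(m)[d$versicolor == 1]) - mean(fitted(m)[d$versicolor == 0])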
Finally, to illustrate how fitting a glm is different from using a regular lm, we can plot the regression line for Sepal.Width. For this we can use the amazing and extremely versatile plot_model function from the sjPlot package. Here we plot the prediction of the model for different values of Sepal.Width.
plot_model(m2, type = 'pred', terms = 'Sepal.Width')
The curved prediction line shows that the probability approaches 100% the smaller the Sepal.Width gets, and approaches 0 the larger the Sepal.Width gets.
You can also produce this plot for all independent variables in a grid.
p = plot_model(m2, type = 'pred') ## creates a list of plots
plot_grid(p)
Multilevel models are, simply put, linear models that can account for multiple levels in the data. Here we briefly explain what multilevel analysis is and how to apply it in R with the lme4 package.
The examples in the lme4 package use the sleepstudy data, which measures the reaction time of participants after sleep deprivation. The data contains three variables: Reaction, Days and Subject. Subjects were limited to 3 hours of sleep each night, so the question is whether their reaction time slows after more days of limited sleep.
head(sleepstudy)
## Reaction Days Subject
## 1 249.5600 0 308
## 2 258.7047 1 308
## 3 250.8006 2 308
## 4 321.4398 3 308
## 5 356.8519 4 308
## 6 414.6901 5 308
The sleepstudy data requires multilevel analysis, because the observations are nested in Subjects. Linear regression models assume that observations are independent, but that is not the case here. Different subjects might have different reaction speeds in general, and might also be more or less affected by sleep deprivation.
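A quick exploratory way to see these between-subject differences (just a descriptive check, not part of the modeling itself) is to summarize the reaction times per Subject:

sleepstudy %>%
  group_by(Subject) %>%
  summarize(mean_reaction = mean(Reaction), sd_reaction = sd(Reaction))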
To account for this, multilevel models can have random intercepts and random slopes. By using random intercepts, each Subject gets its own intercept, which accounts for differences between subjects in overall reaction speed. Random slopes can be applied for each independent variable in the model, so that each Subject also gets its own slope (i.e. coefficient) for this variable. This can be used simply to control for the implications of nested data (to not violate the independence assumption of linear regression). But moreover, it can be used to better investigate variance and effects at different levels. For instance, to what extent is student learning success explained by individual-level factors (doing homework, participating in class) or class-level factors (class size, experience of the teacher)?
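As a preview of the lme4 formula syntax (we demonstrate and explain both forms in detail below), random intercepts and random slopes are added to the formula like this, here using the sleepstudy data:

## random intercepts: each Subject gets its own intercept
m_sleep_ri = lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
## random intercepts and random slopes: each Subject also gets its own effect of Days
m_sleep_rs = lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)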
We will not try to explain exactly what a multilevel model is and when and how you should use it (this would be a workshop in itself). The main goal here is to show how to perform multilevel modeling in R, and to visually illustrate what happens when you add higher level variables to a linear model. For this we’ll use toy data in which the random intercepts (and slopes) are extremely obvious. We’ll stick to the names of the sleepstudy data for sake of simplicity.
d = data.frame(Reaction = c(0,1,7,9,17,16,12,10,29,27,24,22,39,36,33,30,49,47,42,42),
Days = c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4),
Subject = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5))
Here the Subjects have clearly different intercepts (average Reaction) and slopes (effect of Days on Reaction). We can show this with a scatterplot, in which different subjects are given different colors.
cols = rainbow(5) # make colors for 5 subjects
plot(d$Days, d$Reaction, col=cols[d$Subject], pch=16)
If we just look at the dots without the colors, we hardly see any patterns. Taking the colors into account, we see that the average reaction time (in our toy data) is vastly different for each Subject. Also, we see that overall the reaction time decreases for most Subjects, but with some differences in how fast it decreases, and for one subject it even increases.
To show how we can model this with random intercepts and random slopes, we’ll fit this data with three models: regular linear model, multilevel model with random intercepts, multilevel model with random slopes.
Let’s see what happens if we fit a regular linear regression, and plot the regression line.
m = lm(Reaction ~ Days, data=d)
tab_model(m)
Dependent variable: Reaction

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 28.20 | 10.33 – 46.07 | 0.004 |
| Days | -1.44 | -7.96 – 5.08 | 0.648 |
| Observations | 20 |  |  |
| R2 / R2 adjusted | 0.012 / -0.043 |  |  |
The model states that there is no (significant) relation between Days and Reaction. We can visualize the problem of fitting a regular linear model to this data by adding the coefficient line to the plot.
plot(d$Days, d$Reaction, col=cols[d$Subject], pch=16)
abline(coef(m)[1], coef(m)[2])
It does capture the overall pattern (Reaction time goes down), but only roughly.
To fit a linear multilevel model we will use the lmer function from the lme4 package. Notice the similarity to the lm function. The difference is that you need to specify the higher level in the formula, which looks like this.
m_ri = lmer(Reaction ~ Days + (1 | Subject), data=d)
The multilevel part is the added + (1 | Subject). The part between the parentheses has two components:
- The part after the | symbol gives the name of the higher-level group, in this case Subject.
- The part before the | specifies the model for this higher level. In the current example, this is only 1, which refers to the intercept. Normally, the intercept is implicit, but here you always need to specify it.
Here we can also use the tab_model function (from the sjPlot package) to print the regression table.
tab_model(m_ri)
Dependent variable: Reaction

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 28.20 | 13.66 – 42.74 | <0.001 |
| Days | -1.44 | -2.73 – -0.15 | 0.029 |
| Random Effects |  |  |  |
| σ2 | 10.83 |  |  |
| τ00 Subject | 259.00 |  |  |
| ICC | 0.96 |  |  |
| N Subject | 5 |  |  |
| Observations | 20 |  |  |
| Marginal R2 / Conditional R2 | 0.010 / 0.960 |  |  |
Interestingly, we see that the intercept and the effect of Days are identical to the regular linear model (though the effect of Days is now statistically significant). This part of the output still shows the overall intercept (or grand intercept) and effect. What’s new is the lower part, which shows the random effects and the way the model is fit. As with the glm, there is no regular R2; to evaluate the model fit the common approach is to compare different models.

For now, focus on the τ00 Subject and σ2 rows. These report the variance at the two levels: τ00 Subject is the variance between Subjects, and σ2 is the residual variance at the level of individual observations. We see that the Subject variance (259.00) is much higher than the residual variance (10.83). This makes sense for our data, because the biggest differences in the Reaction scores can be explained by the differences between Subjects. We can see this more clearly by visualizing the effect of Days with the random intercepts. The random intercept values are not reported in the table above, but they are stored in the output (m_ri).
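You can inspect them with ranef (each Subject’s deviation from the overall intercept) or coef (the resulting intercept and slope per Subject, which is what the plotting code below uses):

ranef(m_ri)          ## per-Subject deviations from the fixed intercept
coef(m_ri)$Subject   ## per-Subject intercept and slope (fixed + random effects combined)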
plot(d$Days, d$Reaction, col=cols[d$Subject], pch=16)
for (i in 1:5) { ## for each subject
abline(coef(m_ri)$Subject[i,1], coef(m_ri)$Subject[i,2], col=cols[i])
}
Now each Subject has its own regression line for the effect of Days on Reaction. This also makes it clear why the variance is mainly on the Subject level. The distance of the observations (the dots) to the line of the same colour is relatively small; the bigger distances are between the lines.
In the random intercepts model the slope for the effect of Days is still the same for every Subject. So now, let’s fit the model with random intercepts AND random slopes. We do this by adding the Days variable to the multilevel part (the part between parentheses) of the formula.
m_rs = lmer(Reaction ~ Days + (1 + Days| Subject), data=d)
tab_model(m_rs)
Dependent variable: Reaction

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 28.20 | 9.36 – 47.04 | 0.003 |
| Days | -1.44 | -3.77 – 0.89 | 0.226 |
| Random Effects |  |  |  |
| σ2 | 1.02 |  |  |
| τ00 Subject | 460.52 |  |  |
| τ11 Subject.Days | 6.87 |  |  |
| ρ01 Subject | -0.86 |  |  |
| ICC | 1.00 |  |  |
| N Subject | 5 |  |  |
| Observations | 20 |  |  |
| Marginal R2 / Conditional R2 | 0.010 / 0.996 |  |  |
Again, the fixed effects for the Intercept and Days are the same. What is new is that the variance of the Days slope across Subjects is reported (the τ11 Subject.Days row). This is the variance between Subjects in the effect of Days on Reaction. In our data, this variance is most clearly seen in the effect for the ‘red’ Subject at the bottom of the graph, which is the only Subject for whom Reaction time somehow increased.
plot(d$Days, d$Reaction, col=cols[d$Subject], pch=16) ## redo the plot for clarity
for (i in 1:5) { ## for each subject
abline(coef(m_rs)$Subject[i,1], coef(m_rs)$Subject[i,2], col=cols[i])
}
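To see these per-Subject slopes as numbers rather than lines, you can again inspect the combined coefficients; the Days slope for the ‘red’ Subject should be the only positive one:

coef(m_rs)$Subject   ## per-Subject intercepts and Days slopes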
By comparing the models, we can see how the variance at different levels changes. Here we first make a base model, which is a random intercepts model without any predictor variables. Then we add Days at the individual level, and finally we add the random slopes for Days.
m_base = lmer(Reaction ~ (1 | Subject), data=d)
m1 = lmer(Reaction ~ Days + (1 | Subject), data=d)
m2 = lmer(Reaction ~ Days + (1 + Days| Subject), data=d)
anova(m_base,m1,m2)
## Data: d
## Models:
## m_base: Reaction ~ (1 | Subject)
## m1: Reaction ~ Days + (1 | Subject)
## m2: Reaction ~ Days + (1 + Days | Subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m_base 3 135.52 138.51 -64.762 129.52
## m1 4 133.11 137.10 -62.557 125.11 4.4104 1 0.03572 *
## m2 6 115.58 121.55 -51.789 103.58 21.5361 2 2.106e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The anova shows that each model is an improvement over the previous model. Interestingly, the improvement from m_base to m1 is not that great. This is because the overall (fixed) effect of Days, not taking random slopes into account, is fairly weak (remember that in the ordinary regression it was not even statistically significant).
tab_model(m_base,m1,m2)
Dependent variable: Reaction

| Predictors | Estimates (m_base) | CI | p | Estimates (m1) | CI | p | Estimates (m2) | CI | p |
|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 24.60 | 10.42 – 38.78 | 0.001 | 28.20 | 13.66 – 42.74 | <0.001 | 28.20 | 9.36 – 47.04 | 0.003 |
| Days |  |  |  | -1.44 | -2.73 – -0.15 | 0.029 | -1.44 | -3.77 – 0.89 | 0.226 |
| Random Effects |  |  |  |  |  |  |  |  |  |
| σ2 | 13.57 |  |  | 10.83 |  |  | 1.02 |  |  |
| τ00 | 258.31 Subject |  |  | 259.00 Subject |  |  | 460.52 Subject |  |  |
| τ11 |  |  |  |  |  |  | 6.87 Subject.Days |  |  |
| ρ01 |  |  |  |  |  |  | -0.86 Subject |  |  |
| ICC | 0.95 |  |  | 0.96 |  |  | 1.00 |  |  |
| N | 5 Subject |  |  | 5 Subject |  |  | 5 Subject |  |  |
| Observations | 20 |  |  | 20 |  |  | 20 |  |  |
| Marginal R2 / Conditional R2 | 0.000 / 0.950 |  |  | 0.010 / 0.960 |  |  | 0.010 / 0.996 |  |  |
It is interesting to look at the random effects to see how the variance is explained at different levels. From m_base to m1 (adding the Days effect) we see that variance is mostly explained at the residual level (σ2 drops from 13.57 to 10.83). This makes sense, because Days is a within-Subject variable. From m1 to m2 (adding the random slope for Days), we see that even more of the residual variance is explained (σ2 drops to 1.02). This again makes sense, because the lines within the Subjects now fit the data much better. That variance is now instead captured as variance between the slopes for Days, as seen in the τ11 Subject.Days row.
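If you want to extract these variance components in R rather than reading them from the table, lme4’s VarCorr function reports them for a fitted model:

VarCorr(m1)   ## random effect standard deviations (square them to get the variances above)
VarCorr(m2)   ## now also includes the standard deviation of the random Days slopes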
The lmer function can be seen as the lm function for multilevel models. Similarly, the lme4 package has the glmer function, which can be seen as the glm function for multilevel models. In terms of syntax, using glmer is very similar to lmer, and like in glm, you need to specify a family (e.g., binomial for logistic regression, poisson for Poisson regression).
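As a minimal sketch (using simulated data, since this tutorial has no suitable dichotomous multilevel dataset at hand), a multilevel logistic regression with glmer could look like this:

## simulate grouped binary data: 20 subjects, 30 trials each (hypothetical example)
set.seed(1)
sim = data.frame(subject = rep(1:20, each = 30), x = rnorm(600))
sim$y = rbinom(600, 1, plogis(-0.5 + 0.8 * sim$x + rnorm(20)[sim$subject]))

## multilevel logistic regression: random intercepts per subject
m_glmm = glmer(y ~ x + (1 | subject), family = binomial, data = sim)
tab_model(m_glmm)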
- [1] The actual coefficients in the logistic regression model are log odds ratios. To interpret these coefficients it’s easier to transform them to odds ratios (i.e. using the exponential function to undo the log). In the tab_model function this transformation is performed by default. Always check which values are reported to avoid misinterpretation.