---
title: "brms"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(magrittr)
library(tidyr)    # nest() / unnest()
library(purrr)    # map_dbl()
library(ggplot2)
library(brms)
library(gganimate)
library(tidybayes)
```
### Assessing uncertainty of total height models with brms and gganimate
Brian Clough · 2019/02/08 · 8 minute read
Assessing the expected uncertainty of a fitted model when predicting for new data is an important step in the model checking process, and one we do not pay enough attention to in forest biometrics. Understanding model uncertainty is key to reducing risk, since it provides important insight into the reliability of any value we attach to forests (in dollars, carbon credits, other ecosystem services, etc.) as a result of a model-based inventory.
In this post we’ll take a look at an example that features a few of my favorite R packages for quantifying and visualizing uncertainty: brms, tidybayes, and gganimate.
Our example will be total height models fitted to some FIA data. Let’s start by loading in the packages we’ll use for this session.
```{r}
library(readr)
heightSub <- read_csv("/home/kevin/github/brmsForestry/STData.csv")
```
The data we’ll use are FIA tree data, taken from plots measured in 2017 in the Arrowhead region of northern Minnesota, USA. These include species, diameter at breast height (in), and total height (ft) for all live trees with a field-measured total height. Downloading and manipulating FIA data in R to create model-fitting datasets like this will be the topic of a future blog post. You can download the already-formatted dataset used in this analysis by calling read_csv("https://s3.amazonaws.com/silviaterra-biometrics-public/arrowhead-height-data.csv").
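If you don’t have a local copy of the file read in above, the chunk below is the equivalent call against the public S3 bucket (the only assumption is that the file is still publicly accessible):
```{r, eval=FALSE}
# Equivalent to the local read above, but pulling from the public S3 bucket
heightSub <- read_csv("https://s3.amazonaws.com/silviaterra-biometrics-public/arrowhead-height-data.csv")
```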
After reading in the data, we’ll drop any species with fewer than 25 unique observations. This leaves us with a dataset containing 12 species to work with.
```{r}
heightSub <- heightSub %>%
group_by(common) %>%
nest() %>%
mutate(n = map_dbl(data, n_distinct)) %>%
filter(n >= 25) %>%
select(common, data) %>%
unnest()
ggplot(heightSub, aes(x = diameter, y = height)) +
geom_point(col = 'dark green', alpha = 0.5) +
xlab("diameter (inches)") +
ylab("height (feet)") +
facet_wrap(~common) +
theme_bw()
```
Now let’s fit the model. We’ll specify a nonlinear (Schumacher) height function of the form height = exp(a + b / diameter), then build a hierarchical model with species-level effects on both parameters. The bf() wrapper makes it easy to set up this structure, allowing us to specify a ‘submodel’ a + b ~ 1 + (1 | common) that establishes both the population- and group-level effects on the model parameters a and b. We complete the specification by setting nl = TRUE so that brms knows we are fitting a nonlinear model.
```{r}
priors <- prior(normal(0, 5), nlpar = "a") + prior(normal(0, 5), nlpar = "b")
htMod <- brm(
bf(height ~ exp(a + b / diameter), a + b ~ 1 + (1 | common), nl = TRUE),
data = heightSub, prior = priors, control = list(adapt_delta = 0.99),
family = gaussian()
)
summary(htMod)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: height ~ exp(a + b/diameter)
## a ~ 1 + (1 | common)
## b ~ 1 + (1 | common)
## Data: heightSub (Number of observations: 1684)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Group-Level Effects:
## ~common (Number of levels: 12)
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(a_Intercept) 0.24 0.08 0.13 0.44 1096 1.00
## sd(b_Intercept) 1.12 0.31 0.68 1.85 1028 1.00
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## a_Intercept 4.21 0.08 4.05 4.36 1061 1.00
## b_Intercept -2.66 0.35 -3.36 -1.99 979 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma 6.64 0.11 6.42 6.87 4764 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```
A quick look at the Rhat statistics and effective sample sizes suggests the model has converged, so we’ll proceed. Our next step is to generate posterior simulations from the fitted model for each of the species in our dataset. This is where tidybayes comes in: the tidybayes::add_fitted_draws() function, combined with a group_by() call, makes it very convenient to get the posterior draws we’ll need for our uncertainty visualizations.
```{r}
set.seed(123456)
nDraws <- 50
postSims <- heightSub %>%
  group_by(common) %>%
  add_fitted_draws(htMod, n = nDraws)
```
```{r}
postSims %>% head()
## # A tibble: 6 x 10
## # Groups: common, CN, SPCD, diameter, height, .row [1]
## common CN SPCD diameter height .row .chain .iteration .draw .value
## <chr> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <int> <dbl>
## 1 red m… 5.00e14 316 8.7 63 1 NA NA 319 50.3
## 2 red m… 5.00e14 316 8.7 63 1 NA NA 321 51.1
## 3 red m… 5.00e14 316 8.7 63 1 NA NA 386 52.2
## 4 red m… 5.00e14 316 8.7 63 1 NA NA 473 50.8
## 5 red m… 5.00e14 316 8.7 63 1 NA NA 528 51.3
## 6 red m… 5.00e14 316 8.7 63 1 NA NA 636 50.2
```
When we call head(postSims), we see that several new columns have been appended to our original dataset. The .row column indexes the original rows of our data frame, while .value contains the posterior draws of expected height for each tree.
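As a quick sanity check (not part of the original workflow), we can confirm that each tree contributed exactly nDraws fitted-value draws:
```{r}
# Each .row (tree) should appear nDraws (= 50) times in postSims
postSims %>%
  ungroup() %>%
  count(.row) %>%
  distinct(n)
```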
Alright, now we’re ready to visualize these results. We’ll take a look at some hypothetical outcome plots (HOPs), which are an increasingly popular way of visualizing uncertainty in model fit. The examples here are based on code from Matthew Kay’s tutorial on extracting and visualizing tidy draws from brms models. I definitely recommend checking it out for more in-depth information on tidybayes, as well as ideas for lots of great uncertainty visualizations.
Let’s first focus on a single species, red maple, and visualize all of the posterior fitted curves simultaneously.
```{r}
postSims %>%
filter(common == "red maple") %>%
ggplot(aes(x = diameter, y = height)) +
geom_point(
data = heightSub %>% filter(common == "red maple"),
col = "gray", alpha = 0.25
) +
geom_line(
aes(y = .value, group = .draw),
alpha = 1/20, color = "dark green"
) +
xlab("diameter (inches)") +
ylab("height (feet)") +
theme_bw()
```
This figure does visualize the overall uncertainty space of the fitted lines pretty well. It also keys in on some of the obvious patterns we’d expect, like the fact that model uncertainty is highest for the largest trees. It’s tough to get a sense of how the shape of the function may be changing, though. This is where gganimate comes in:
```{r}
postSims %>%
filter(common == "red maple") %>%
ggplot(aes(x = diameter, y = height)) +
geom_point(
data = heightSub %>% filter(common == 'red maple'),
col = 'gray', alpha = 0.25
) +
geom_line(
aes(y = .value, group = .draw),
alpha = 0.5, color = 'dark green'
) +
xlab('diameter (in)') +
ylab('total height (ft)') +
theme_bw() +
transition_manual(.draw)
```
Much better! Not only can we see the increasing uncertainty with increasing size, but we can also see the way the diameter-height relationship subtly changes across different posterior realizations. This model looks reasonable for red maple, as the shifts in expected height for a given diameter class are pretty small.
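If you want to slow the playback down or save the animation to disk, gganimate’s animate() and anim_save() can be applied to the gganim object. Here is a minimal sketch, assuming the plot above has been assigned to an object (the name redMapleHop, the fps value, and the file name are just illustrative choices):
```{r, eval=FALSE}
# Assumes the animated plot above was stored as `redMapleHop` (hypothetical name)
anim <- animate(redMapleHop, fps = 5)        # slower playback than the default
anim_save("red-maple-height-hops.gif", anim) # write the animation out as a gif
```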
Now let’s use the same methods to visualize uncertainty for the full dataset:
```{r}
postSims %>%
ggplot(aes(x = diameter, y = height)) +
geom_point(
data = heightSub,
col = 'gray', alpha = 0.25
) +
geom_line(
aes(y = .value, group = .draw),
alpha = 0.5, color = 'dark green'
) +
facet_wrap(~common) +
xlab('diameter (in)') +
ylab('total height (ft)') +
theme_bw() +
transition_manual(.draw)
```
I like this because we can quickly identify which species have a less stable model fit. The variation in the shape of the northern white-cedar, white spruce, and jack pine curves stands out. These results could certainly be the basis for considering a different model, at least for those three species and perhaps a few others.
Of course, we’re also interested in what the actual predictions look like. We can animate posterior predicted simulations using very similar code to what we wrote above. The key difference is that we’ll call tidybayes::add_predicted_draws() to generate simulations from the posterior predictive distribution. Then we’ll create a new object that contains the median, upper, and lower quantiles for these posterior predictions to use in our plot.
```{r}
predSpecies <- c('red maple', 'black spruce', 'white spruce')
predPostSims <- heightSub %>%
filter(common %in% predSpecies) %>%
group_by(common) %>%
add_predicted_draws(htMod, n = nDraws)
heightSum <- predPostSims %>%
summarise(
postMedian = quantile(.prediction, 0.5),
upperQt = quantile(.prediction, 0.975),
lowerQt = quantile(.prediction, 0.025)
) %>%
ungroup() %>%
select(common, diameter, height = postMedian, upperQt, lowerQt)
predPostSims %>%
ggplot(aes(x = diameter, y = height)) +
geom_point(
data = heightSum,
col = 'gray', alpha = 0.25
) +
geom_errorbar(
data = heightSum,
aes(x = diameter, ymin = lowerQt, ymax = upperQt),
col = 'gray', alpha = 0.25
) +
geom_point(
aes(y = .prediction, group = .draw),
alpha = 0.5, color = 'dark green'
) +
facet_wrap( ~ common) +
xlab('diameter (in)') +
ylab('predicted total height (ft)') +
theme_bw() +
transition_manual(.draw)
```
I really like this one. I am not sure it tells us much about our model that a static error bar plot wouldn’t, but it’s a nice visual to demonstrate how much the height predictions can change as we move across different realizations. One thing that jumps out is that, given the uncertainty in the model and the Gaussian residual error distribution, some of the height predictions from this model for small-diameter trees can be negative. In a future post we’ll dig into ways we can expand on standard nonlinear height models to prevent such undesirable behavior. For now, though, this is a great demonstration of the power of Bayes for prediction, where we can integrate across the full posterior for individual points rather than having to condition on a single point estimate.
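To put a rough number on that observation, here is a quick check that was not part of the original post; the cutoff at the 10th percentile of diameter is an arbitrary stand-in for "small" trees:
```{r}
# Share of posterior predictive draws below zero for the smallest trees,
# using the 10th percentile of diameter as an (arbitrary) cutoff
predPostSims %>%
  ungroup() %>%
  filter(diameter <= quantile(diameter, 0.1)) %>%
  group_by(common) %>%
  summarise(propNegative = mean(.prediction < 0))
```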
So there’s a taste of some techniques for visualizing uncertainty with a forestry example. Thanks to the magic of tidy data, this code can be easily adapted for any number of analyses. I am really looking forward to making some animated growth trajectories, or using tidybayes + gganimate to visualize uncertainty in our diameter distribution models. Look out for more data viz posts down the line.