---
title: "02 Intercept-only model"
author: "Anders Sundelin"
date: "2023-03-09"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(GGally)
library(tidyr)
library(dplyr)
library(brms)
library(tidybayes)
library(bayesplot)
library(patchwork)
```
```{r}
set.seed(12345)
source("ingest_data.R")
```
# Intercept-only model (baseline)
The simplest possible model contains only intercepts, at the population, committerteam, and repo levels.
This model assumes that each team has a unique intercept (the population-level intercept plus a team-level offset), and that each team-and-repo combination adds a further offset on top of that. The model does not take any statistics about the actual change into account - it only assumes "when team T changes code in repo R, then the likelihood of introducing duplicates is y".
As we are more interested in predicting the rate of introducing duplicates, we give the zero-inflation part the same intercept-only structure (a population-level intercept plus team and team:repo offsets), as the formula below shows.
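In mathematical notation, the model corresponds roughly to the following sketch, where $\phi$ is the negative-binomial shape parameter and $\pi_i$ the zero-inflation probability:

$$
\begin{aligned}
y_i &\sim \mathrm{ZINB}(\mu_i, \phi, \pi_i) \\
\log \mu_i &= \alpha + \alpha_{\mathrm{team}[i]} + \alpha_{\mathrm{team:repo}[i]} \\
\operatorname{logit} \pi_i &= \gamma + \gamma_{\mathrm{team}[i]} + \gamma_{\mathrm{team:repo}[i]}
\end{aligned}
$$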
```{r}
d <- data |> select(y = INTROD,
                    team = committerteam,
                    repo = repo)
formula <- bf(y ~ 1 + (1 | team) + (1 | team:repo),
              zi ~ 1 + (1 | team) + (1 | team:repo))
get_prior(data = d,
          family = zero_inflated_negbinomial,
          formula = formula)
```
```{r}
# First attempt, kept for reference - immediately overwritten below.
priors <- c(prior(normal(0, 0.5), class = Intercept),
            prior(weibull(2, .75), class = sd),
            prior(normal(0, 0.5), class = Intercept, dpar = zi),
            prior(weibull(2, 1), class = sd, dpar = zi),
            prior(gamma(1, 0.5), class = shape))
# Tightened priors actually used for the model.
priors <- c(prior(normal(0, 0.5), class = Intercept),
            prior(weibull(2, .25), class = sd),
            prior(normal(0, 0.5), class = Intercept, dpar = zi),
            prior(weibull(2, 0.25), class = sd, dpar = zi),
            prior(gamma(0.1, 0.1), class = shape))
validate_prior(prior = priors,
               formula = formula,
               data = d,
               family = zero_inflated_negbinomial)
```
Using the brms default shape prior (gamma(0.01, 0.01)) shows that we get unreasonably many zeros, and that the maximum simulated value is also incredibly large (3e7 or more). Because of this, we also have very few observations in the plausible range (1-1000 duplicates), and we get some (1%) divergent transitions.
Thus, we need to tighten that prior to a more realistic shape.
After some experimenting, we arrive at gamma(0.1, 0.1) as a good compromise.
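To illustrate why the shape prior matters this much, we can simulate negative-binomial counts under each prior. This is a quick sketch (not part of the model fit) that holds the mean fixed at $\mu = 1$, the center of the intercept prior; very small shape draws can overflow to NA, hence `na.rm = TRUE`:

```{r shape_prior_sketch}
# Draws from the default and the tightened shape priors.
shape_default <- rgamma(1e4, 0.01, 0.01)  # brms default
shape_tight   <- rgamma(1e4, 0.1, 0.1)    # our choice
# Negative-binomial counts at mu = 1 under each shape draw: tiny shape
# values yield almost all zeros plus occasional astronomical outliers.
quantile(rnbinom(1e4, mu = 1, size = shape_default),
         probs = c(0.5, 0.9, 0.99, 1), na.rm = TRUE)
quantile(rnbinom(1e4, mu = 1, size = shape_tight),
         probs = c(0.5, 0.9, 0.99, 1), na.rm = TRUE)
```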
## Prior Predictive Checks
```{r}
M_ppc <- brm(data = d,
             family = zero_inflated_negbinomial,
             formula = formula,
             prior = priors,
             warmup = 1000,
             iter = ITERATIONS,
             chains = CHAINS,
             cores = CORES,
             sample_prior = "only",
             backend = "cmdstanr",
             file_refit = "on_change",
             threads = threading(THREADS),
             save_pars = SAVE_PARS,
             adapt_delta = ADAPT_DELTA)
```
```{r}
m <- M_ppc
```
```{r prior_predict}
yrep <- posterior_predict(m)
```
Proportion of zeros
```{r}
ppc_stat(y = d$y, yrep, stat = function(y) mean(y == 0))
```
Our prior for the zero-inflation is not particularly strong - but at least we assume that the overall proportion of zeros should be above half of the changes, somewhat lower than the ~95% present in the data.
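As a rough check on that claim, the normal(0, 0.5) prior on the zi intercept implies the following spread on the probability scale (a sketch that ignores the group-level offsets, and the zeros that the negative-binomial part produces on its own):

```{r zi_prior_sketch}
# Implied prior on the zero-inflation probability: the inverse-logit of
# draws from the zi intercept prior, normal(0, 0.5).
quantile(plogis(rnorm(1e4, 0, 0.5)), probs = c(0.05, 0.5, 0.95))
```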
Our model decided that the maximum observed value is most likely somewhere below 50k issues. That is a bit much, but at least we know that if we ever see such data, our model will allow us to see it.
```{r}
sim_max <- ppc_stat(y = d$y, yrep, stat = "max")
sim_max
```
Scaling to more reasonable values shows that our model actually expects somewhat lower maximum values, on average - but at least it puts some probability on the observed maximum value.
```{r}
sim_max + scale_x_continuous(limits = c(0,1000))
```
The rootogram does not work for this set of priors, unfortunately - R runs out of memory before the rootogram completes. Probably some chains do not converge appropriately.
```{r rootogram, eval=FALSE}
rootogram <- ppc_rootogram(y = d$y, yrep, style="suspended")
rootogram
```
Rootogram, sized according to reasonable (observed) values.
```{r, eval=FALSE}
rootogram + scale_x_continuous(limits=c(0,50))
```
The rootogram indicates that we overestimate the number of observed duplicates - possibly because our zero-inflation is centered around 0.75, rather than the observed ~0.95. While the maximum simulated outcome is large (about 50k), it is no longer unreasonably large, as it was with the default shape prior.
```{r model_execution}
M_intercepts_only <-
  brm(data = d,
      family = zero_inflated_negbinomial,
      file = ".cache/added-M_intercepts_only",
      formula = formula,
      prior = priors,
      warmup = 1000,
      iter = ITERATIONS,
      chains = CHAINS,
      cores = CORES,
      backend = "cmdstanr",
      file_refit = "on_change",
      threads = threading(THREADS),
      save_pars = SAVE_PARS,
      adapt_delta = ADAPT_DELTA)
```
```{r loo}
M_intercepts_only <- add_criterion(M_intercepts_only, criterion = "loo")
```
```{r}
m <- M_intercepts_only
```
We have 6 observations with high Pareto k values, most likely due to sparse data for some observations.
We can compute exact values for these by refitting the model 6 times, each time excluding one of the problematic observations - this is what `reloo` does.
If these calculations produce reasonable Pareto k values (< 0.7), then we can still trust the results.
```{r, eval=FALSE}
reloo <- reloo(m, loo(m), chains = CHAINS)
reloo
```
Doing the reloo takes several hours, so it is best done manually (e.g. overnight). When run, it reported:
Monte Carlo SE of elpd_loo is 0.2.
All Pareto k estimates are good (k < 0.5).
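If refitting is too costly, moment matching can be a cheaper (though approximate) alternative to `reloo`. This is a sketch, and it assumes the model was fitted with all parameters saved (`save_pars = save_pars(all = TRUE)`):

```{r, eval=FALSE}
# Approximate the problematic observations via moment matching instead
# of refitting the model once per left-out observation.
loo_mm <- loo(m, moment_match = TRUE)
loo_mm
```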
## Model diagnostics
```{r}
# Plotting all traces at once is unreadable, so split the parameters
# into batches of 12 and produce one trace plot per batch.
p <- mcmc_trace(m)
pars <- levels(p[["data"]][["parameter"]])
starts <- seq(1, to = length(pars), by = 12)
lapply(starts, function(i) {
  mcmc_trace(m, pars = na.omit(pars[i:(i + 11)]))
})
```
```{r}
rhat(m) |> mcmc_rhat()
neff_ratio(m) |> mcmc_neff()
```
The MCMC trace plots, Rhat values, and Neff ratios all look good.
```{r loo_plot}
loo <- loo(m)
loo
plot(loo)
```
Six outlier values (possibly solved by reloo)
TODO reloo
## Posterior predictive checks
```{r posterior_predict}
yrep <- posterior_predict(m)
```
```{r rootogram_posterior}
rootogram <- ppc_rootogram(y = d$y, yrep, style="suspended")
rootogram
```
Rootogram, sized according to reasonable (observed) values.
```{r}
rootogram + scale_x_continuous(limits=c(0,50))
```
Proportion of zeros
```{r}
ppc_stat(y = d$y, yrep, stat = function(y) mean(y == 0))
```
The distribution of zeros is spot-on.
```{r}
sim_max <- ppc_stat(y = d$y, yrep, stat = "max")
sim_max
```
The max values seem to be underestimated; most of the samples have 50-60 as their maximum.
```{r}
ppc_stat(y = d$y, yrep, stat = "sd")
```
```{r}
ppc_stat_2d(d$y, yrep, stat = c("q95", "q99")) + ggtitle("Posterior predicted Q95 vs Q99")
```
```{r}
ppc_stat_grouped(y = d$y, yrep, stat = "max", group = d$team) + ggtitle("Posterior predictive max observation per team")
```
```{r}
ppc_stat_grouped(y = d$y, yrep, stat = "max", group = d$repo) + ggtitle("Posterior predictive max observation per repo")
```
```{r}
ppc_stat_grouped(y = d$y, yrep, stat = "q99", group = d$team) + ggtitle("Posterior predictive 99th percentile per team")
```
```{r}
ppc_stat_grouped(y = d$y, yrep, stat = "q99", group = d$repo) + ggtitle("Posterior predictive 99th percentile per repo")
```
```{r}
heatmap_by_team_and_repo(
  posterior_predict_by_team_and_repo(m,
                                     added = q95(data$ADD),
                                     removed = q95(data$DEL),
                                     complexity = q95(data$COMPLEX),
                                     duplicates = q95(data$DUP),
                                     summary = function(x) q99(x)),
  "Quantile 99%", decimals = 0)
```
## Conclusion
We will not investigate the model summary, as we can already see that the model is not a particularly good fit to our data.
While the zero-inflation part looks OK (the distribution is centered around the observed value, between 0.93 and 0.94), our model seems to underestimate both the maximum expected value and the variation in the data.
Can we do better?