---
title: "02 Intercept-only model"
author: "Anders Sundelin"
date: "2023-03-09"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(GGally)
library(tidyr)
library(dplyr)
library(brms)
library(tidybayes)
library(bayesplot)
library(patchwork)
```
```{r}
set.seed(12345)
source("ingest_data.R")
```
# Intercept-only model (baseline)
The simplest possible model contains only intercepts, at the population, committerteam, and repo levels.
This model assumes that each team has a unique intercept (the population-level intercept plus a team-level offset), and that each team-and-repo combination adds a further offset on top of that. The model does not take any statistics about the actual change into account - it only assumes "when team T changes code in repo R, then the likelihood of introducing duplicates is y".
As we are more interested in predicting the rate of introducing duplicates, we give the zero-inflation part the same intercept-only structure (a population-level intercept plus team and team:repo offsets), as the formula below shows.
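In mathematical notation, the model corresponds roughly to the following sketch, where $\phi$ is the negative-binomial shape parameter and $\pi_i$ the zero-inflation probability:

$$
\begin{aligned}
y_i &\sim \mathrm{ZINB}(\mu_i, \phi, \pi_i) \\
\log \mu_i &= \alpha + \alpha_{\mathrm{team}[i]} + \alpha_{\mathrm{team:repo}[i]} \\
\operatorname{logit} \pi_i &= \gamma + \gamma_{\mathrm{team}[i]} + \gamma_{\mathrm{team:repo}[i]}
\end{aligned}
$$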
```{r}
d <- data |> select(y = INTROD,
                    team = committerteam,
                    repo = repo)
formula <- bf(y ~ 1 + (1 | team) + (1 | team:repo),
              zi ~ 1 + (1 | team) + (1 | team:repo))
get_prior(data = d,
          family = zero_inflated_negbinomial,
          formula = formula)
```
```{r}
# First attempt, kept for reference - immediately overwritten below.
priors <- c(prior(normal(0, 0.5), class = Intercept),
            prior(weibull(2, .75), class = sd),
            prior(normal(0, 0.5), class = Intercept, dpar = zi),
            prior(weibull(2, 1), class = sd, dpar = zi),
            prior(gamma(1, 0.5), class = shape))
# Tightened priors actually used for the model.
priors <- c(prior(normal(0, 0.5), class = Intercept),
            prior(weibull(2, .25), class = sd),
            prior(normal(0, 0.5), class = Intercept, dpar = zi),
            prior(weibull(2, 0.25), class = sd, dpar = zi),
            prior(gamma(0.1, 0.1), class = shape))
validate_prior(prior = priors,
               formula = formula,
               data = d,
               family = zero_inflated_negbinomial)
```
Using the brms default shape prior (gamma(0.01, 0.01)) shows that we get unreasonably many zeros, and that the maximum simulated value is also incredibly large (3e7 or more). Because of this, we also have very few observations in the plausible range (1-1000 duplicates), and we get some (1%) divergent transitions.
Thus, we need to tighten that prior to a more realistic shape.
After some experimenting, we arrive at gamma(0.1, 0.1) as a good compromise.
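To illustrate why the shape prior matters this much, we can simulate negative-binomial counts under each prior. This is a quick sketch (not part of the model fit) that holds the mean fixed at $\mu = 1$, the center of the intercept prior; very small shape draws can overflow to NA, hence `na.rm = TRUE`:

```{r shape_prior_sketch}
# Draws from the default and the tightened shape priors.
shape_default <- rgamma(1e4, 0.01, 0.01)  # brms default
shape_tight   <- rgamma(1e4, 0.1, 0.1)    # our choice
# Negative-binomial counts at mu = 1 under each shape draw: tiny shape
# values yield almost all zeros plus occasional astronomical outliers.
quantile(rnbinom(1e4, mu = 1, size = shape_default),
         probs = c(0.5, 0.9, 0.99, 1), na.rm = TRUE)
quantile(rnbinom(1e4, mu = 1, size = shape_tight),
         probs = c(0.5, 0.9, 0.99, 1), na.rm = TRUE)
```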
## Prior Predictive Checks
```{r}
M_ppc <- brm(data = d,
             family = zero_inflated_negbinomial,
             formula = formula,
             prior = priors,
             warmup = 1000,
             iter = ITERATIONS,
             chains = CHAINS,
             cores = CORES,
             sample_prior = "only",
             backend = "cmdstanr",
             file_refit = "on_change",
             threads = threading(THREADS),
             save_pars = SAVE_PARS,
             adapt_delta = ADAPT_DELTA)
```
```{r}
m <- M_ppc
```
```{r prior_predict}
yrep <- posterior_predict(m)
```
Proportion of zeros
```{r}
ppc_stat(y = d$y, yrep, stat = function(y) mean(y == 0))
```
Our prior for the zero-inflation is not particularly strong - but at least we assume that the overall proportion of zeros should be above half of the changes, somewhat lower than the ~95% present in the data.
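As a rough check on that claim, the normal(0, 0.5) prior on the zi intercept implies the following spread on the probability scale (a sketch that ignores the group-level offsets, and the zeros that the negative-binomial part produces on its own):

```{r zi_prior_sketch}
# Implied prior on the zero-inflation probability: the inverse-logit of
# draws from the zi intercept prior, normal(0, 0.5).
quantile(plogis(rnorm(1e4, 0, 0.5)), probs = c(0.05, 0.5, 0.95))
```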
Our model decided that the maximum observed value is most likely somewhere below 50k issues. That is a bit much, but at least we know that if we ever see such data, our model will allow us to see it.
```{r}
sim_max <- ppc_stat(y = d$y, yrep, stat = "max")
sim_max
```
Scaling to more reasonable values shows that our model actually expects somewhat lower maximum values, on average - but at least it puts some probability on the observed maximum value.
```{r}
sim_max + scale_x_continuous(limits = c(0,1000))
```
The rootogram does not work for this set of priors, unfortunately - R runs out of memory before the rootogram completes. Probably some chains do not converge appropriately.
```{r rootogram, eval=FALSE}
rootogram <- ppc_rootogram(y = d$y, yrep, style="suspended")
rootogram
```
Rootogram, sized according to reasonable (observed) values.
```{r, eval=FALSE}
rootogram + scale_x_continuous(limits=c(0,50))
```
The rootogram indicates that we overestimate the number of observed duplicates - possibly because our zero-inflation is centered around 0.75, rather than the observed ~0.95. While the maximum simulated outcome is large (about 50k), it is no longer unreasonably large, as it was with the default shape prior.
```{r model_execution}
M_intercepts_only <-
  brm(data = d,
      family = zero_inflated_negbinomial,
      file = ".cache/added-M_intercepts_only",
      formula = formula,
      prior = priors,
      warmup = 1000,
      iter = ITERATIONS,
      chains = CHAINS,
      cores = CORES,
      backend = "cmdstanr",
      file_refit = "on_change",
      threads = threading(THREADS),
      save_pars = SAVE_PARS,
      adapt_delta = ADAPT_DELTA)
```
```{r loo}
M_intercepts_only <- add_criterion(M_intercepts_only, criterion = "loo")
```
```{r}
m <- M_intercepts_only
```
We have 6 observations with high Pareto k values, most likely due to sparse data for some observations.
We can compute exact values for these by refitting the model 6 times, each time excluding one of the problematic observations - this is what `reloo` does.
If these calculations produce reasonable Pareto k values (< 0.7), then we can still trust the results.
```{r, eval=FALSE}
reloo <- reloo(m, loo(m), chains = CHAINS)
reloo
```
Doing the reloo takes several hours, so it is best done manually (e.g. overnight). When run, it reported:
Monte Carlo SE of elpd_loo is 0.2.
All Pareto k estimates are good (k < 0.5).
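If refitting is too costly, moment matching can be a cheaper (though approximate) alternative to `reloo`. This is a sketch, and it assumes the model was fitted with all parameters saved (`save_pars = save_pars(all = TRUE)`):

```{r, eval=FALSE}
# Approximate the problematic observations via moment matching instead
# of refitting the model once per left-out observation.
loo_mm <- loo(m, moment_match = TRUE)
loo_mm
```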
## Model diagnostics
```{r}
# Plotting all traces at once is unreadable, so split the parameters
# into batches of 12 and produce one trace plot per batch.
p <- mcmc_trace(m)
pars <- levels(p[["data"]][["parameter"]])
starts <- seq(1, to = length(pars), by = 12)
lapply(starts, function(i) {
  mcmc_trace(m, pars = na.omit(pars[i:(i + 11)]))
})
```
```{r}
rhat(m) |> mcmc_rhat()
neff_ratio(m) |> mcmc_neff()
```
The MCMC trace plots, Rhat values, and Neff ratios all look good.
```{r loo_plot}
loo <- loo(m)
loo
plot(loo)
```
Six outlier values (possibly solved by reloo)
TODO reloo
## Posterior predictive checks
```{r posterior_predict}
yrep <- posterior_predict(m)
```
```{r rootogram_posterior}
rootogram <- ppc_rootogram(y = d$y, yrep, style="suspended")
rootogram
```
Rootogram, sized according to reasonable (observed) values.
```{r}
rootogram + scale_x_continuous(limits=c(0,50))
```
Proportion of zeros
```{r}
ppc_stat(y = d$y, yrep, stat = function(y) mean(y == 0))
```
The distribution of zeros is spot-on.
```{r}
sim_max <- ppc_stat(y = d$y, yrep, stat = "max")
sim_max
```
The max values seem to be underestimated; most of the samples have 50-60 as their maximum.
```{r}
ppc_stat(y = d$y, yrep, stat = "sd")
```
```{r}
ppc_stat_2d(d$y, yrep, stat = c("q95", "q99")) + ggtitle("Posterior predicted Q95 vs Q99")
```
```{r}
ppc_stat_grouped(y = d$y, yrep, stat = "max", group = d$team) + ggtitle("Posterior predictive max observation per team")
```
```{r}
ppc_stat_grouped(y = d$y, yrep, stat = "max", group = d$repo) + ggtitle("Posterior predictive max observation per repo")
```
```{r}
ppc_stat_grouped(y = d$y, yrep, stat = "q99", group = d$team) + ggtitle("Posterior predictive 99th percentile per team")
```
```{r}
ppc_stat_grouped(y = d$y, yrep, stat = "q99", group = d$repo) + ggtitle("Posterior predictive 99th percentile per repo")
```
```{r}
heatmap_by_team_and_repo(
  posterior_predict_by_team_and_repo(m,
                                     added = q95(data$ADD),
                                     removed = q95(data$DEL),
                                     complexity = q95(data$COMPLEX),
                                     duplicates = q95(data$DUP),
                                     summary = function(x) q99(x)),
  "Quantile 99%", decimals = 0)
```
## Conclusion
We will not investigate the model summary, as we can already see that the model is not a particularly good fit to our data.
While the zero-inflation part looks OK (the distribution is centered around the observed value, between 0.93 and 0.94), our model seems to underestimate both the maximum expected value and the variation in the data.
Can we do better?