project_draft_2.Rmd

---
title: "Relative age effect in Japan Professional Football League"
author: "Connor Demorest, Gabrielle Lemire, and James B. Wilson"
fontsize: 12 pt
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


# Introduction

The Relative Age Effect (RAE) is a term used to describe how those born early in the academic year tend to have an advantage both athletically and academically. An earlier birth is typically associated with increased physical ability and this advantage may occur because those who are older are typically more physically, emotionally and cognitively developed than their younger counterparts. Much research has been done to look at RAEs in North American and European athletes. However very little research has attempted to extend RAEs to other parts of the world. In this paper, we attempt to replicate the work done in Hideaki Ishigami's paper *Relative age and birthplace effect in Japanese professional sports: a quantitative evaluation using a Bayesian hierarchical Poisson model* (published in the Journal of Sports Sciences in 2016). He specifically notes his intention to expand the RAE findings to Asia. Note when we are using the term "Relative Age Effect" or RAE, we are using terminology established in the literature to describe this phenomenon even though our analysis is establishing association and not necessarily correlation

Our simulation and modeling process will simplify the author's study design in which he uses data sourced from the Japan Professional Football League (J. League) from the 2012 soccer season. The J. League data consist of 40 teams, representing a total of 1013 players registered in the 2012 season and focusing on players between the ages of 23 and 25. The author characterized "becoming a professional sports player" as an event and thus we are dealing with discrete count data. Using a Poisson regression model it is possible to estimate the magnitude of the RAE on the likelihood of becoming a professional athlete. More details on the Poisson model can be found below. The author then ran an MCMC algorithm\footnote{The MCMC ran for 25,000 iterations with five chains where the first 5,000 samples were discarded as burn-in. Only every 100 iteration was saved for a total of 1,000 MCMC samples.} using JAGS (Plummer, 2012) and confirming with Stan (Stan Development Team, 2013).

The work replicated in our paper is an important step in extending the RAE's scope of inference. Additionally this paper and our analyses allow the magnitude of relative age to be quantified. In contrast, many other analyses have simply stated whether an association is statistically significant using, for example, a $\chi^2$ test. Finally this paper also includes an offset term in our model to capture the differences in birth rates by month which has also often been left out in other analyses. All of these factors make this topic and methodology valuable in the literature on RAE. 

We are interested in the magnitude of the RAE sizes between months, and in particular, the disadvantage for each month as compared to April and relative risk between each month and March for soccer players in Japan between the ages of 23 and 25 in the 2012 season. Disadvantage evaluates the extent to which children have a relative disadvantage across months. For example, if the disadvantage is measured at 10\% between two months, then the chance of becoming a professional athlete for a child born one month later will be 90\% of that born in the previous month. Relative risk on the other hand, is the ratio of the probabilities of becoming a professional athlete. The larger the relative risk, the greater the advantage children born earlier have over children born later in the year in terms of becoming a professional athlete.

## Model

### Author's Model
The author used a Bayesian hierarchical Poisson model to estimate the association between month and becoming a professional athlete where, 

$$y_{ijk} \sim \text{Poisson}(\lambda_{ijk}), \text{   for   } i=1,\dots,12; j=1\dots,47; k=1,2;$$
Where the author uses a link function for $\lambda_{ijk}$:
$$\lambda_{ijk} = \theta_{ij}\text{exp}\Big(\alpha_j + \beta_kRA_i +\gamma\text{Weather}_j + \delta_1\text{Pop}_j + \delta_2\text{Pop}^2 + \epsilon_i +\eta I[k=1]\Big)$$
In this model, $y_{ijk}$ is the number of professional players, and subscripts $i,j$ and $k$ indicate birth month, prefecture, and sport (soccer and baseball). $\theta_{ij}$ represents the total number of male children born in month $i$ in prefecture $j$. The intercept term, $\alpha_j$, aims to capture the differences in the likelihoods between the birthplaces that remain after controlling for the two factors of population size and weather conditions. $RA_i$ is the relative age of those born in month $i$ and the coefficients on $\beta_k$, measure the RAE. 

### Simplified Model
In this paper, we simplify this model by removing the hierarchical component given to prefecture, along with the factors of population and weather conditions, and measure only the RAE between months for J. League soccer players. Thus, we are interested in estimating the total number of professional J. League players born in a given month. The school year in Japan begins April and ends on March of the following calendar year, which corresponds with the competitive season of most professional sports. Since school year and competitive season both begin in April in Japan, relative age was coded as 0 (April) to 11 (March). 

#### Likelihood
Our data are denoted by $y_{i}$, which represents the number of professional sports players between ages 23 and 25 in the year 2012 who are born in month $i$. Our simplified likelihood model is therefore\footnote{With a likelihood funtion $L(\lambda; y_0, ... y_{11}) = \prod_{i=1}^{12} e^{-\lambda}\frac{\lambda^{y_i}}{y_i!}$, and a log-likelihood function of $l(\lambda; y_0, ... y_{11}) = -11\lambda - \sum_{i=1}^{12}ln(y_i!)+ ln(\lambda)\sum_{i=0}^{11}y_i$.}:

$$y_{i} \sim \text{Poisson}(\lambda_{i}) \text{ for months } i=1,\dots,12 \text{ where our link function is } \lambda_{i} = \theta_{i}\text{exp}\Big(\alpha + \beta RA_i \Big)$$

#### Priors
Mimicking the suggestions in the paper, we put uninformative priors on our regression coefficients where

$$\alpha \sim \text{Normal}(\mu_\alpha,\sigma_\alpha^2) \text{ with } \mu_\alpha \sim \text{Normal}(0,100^2) \text{ and } \sigma_\alpha \sim \text{Uniform}(0,100)$$
$$\beta \sim \text{Normal}(\mu_\beta, \sigma_\beta^2) \text{ with } \mu_\beta \sim \text{Normal}(0,100^2) \text{ and } \sigma_\beta \sim \text{half-Cauchy(5)}$$

#### Offset Term 
The author points out that many analyses of RAE assume that births are uniformly distributed across the year (i.e. that the same number of children are born in every month of the year). However we know that this is not true and we want our model to take into account the number of total births in a given month when assessing whether the number of professional players born in that same month are over-represented. One way to do this is to compare the monthly rates for becoming a professional athlete rather than comparing the number of professional athletes. We can accomplish this by using an offset term in our model. Note for a Poisson regression model for counts we have $log(\lambda_i) = \alpha + \beta y_i$. However if we want a Poisson regression model for rates we can use $log(\frac{\lambda_i}{\theta_i}) = \alpha + \beta y_i$, where $\theta_i$ is the number of Japanese males born in that month in the years of 1987-1989 (which corresponds to professional male athletes who are 23-25 in the year 2012). This is equivalent to $log(\lambda_i)-log(\theta_i) = \alpha + \beta y_i$ or $\lambda_i=\theta_i \times e^{\alpha + \beta y_i}$ (which is how we see the offset term represented in Ishigami's paper). To put it even more plainly, our rate $\frac{\lambda_i}{\theta_i}$ is exponential. Note that interpreting $\alpha$ and $\beta$ is the same as for a Poisson regression model for counts except we multiply the expected counts by $\theta_i$.

Our initial model from which we simulated data, we used the 227 professional J. League players as reported in the paper. Unfortunately, the author did not provide the data used in the offset term and therefore we sought out data sources which would provide values for $\theta_i$, the monthly number of Japanese males born in 1987-1989. We used birth data by country and year provided by United Nations Statistics Division <!-- http://data.un.org/Data.aspx?d=POP&f=tableCode%3A55 --> and the sex ratio for the year 2015 provided by Statista. <!-- https://www.statista.com/statistics/612108/japan-sex-ratio/ --> This data is reported in Table 1.  Additionally, we treat $RA$ as a continuous variable and check to see if there is a linear relationship between the count of players (logged) and relative age (RA). As shown in Figure 1, this linearity assumption seems reasonable.  

```{r echo =F, message=F, ref.label= "initial_data"}

```

```{r echo =F, message=F, ref.label= "initial_linearity", fig.cap="By visualizing the relationship between the count of players logged against their relative age (where RA is 0, 1, ... 11), and using a loess smoothing function to interpolate between values, we can confirm that our assumption of linearity is reasonable.", fig.height=3}

```

## Simulation of Artificial Data

Using the player data in the paper and our estimates for total male births, we fit our model from which to simulate professional players per month. Our JAGS model generates a posterior distribution for the $\alpha$ and $\beta$ coefficients. First, we throw away half of our iterations as burn in and we are left with 12,000. Of these we drew 1,000 $\alpha$ and 1,000 $\beta$ for each month $i$. We then randomly sampled one $\alpha$ and one $\beta$. Using these values we derived $\lambda_i$ with the link function $\lambda_{i} = \theta_{i}\text{exp}\Big(\alpha + \beta RA_i \Big)$. With these 12 $\lambda$ values, we randomly sampled a $y_i$ using the `rpois` function in \textsf{R}.

As reported in Table 2, the simulated players for each month seems reasonable, although we do see some variation between the player data reported in the paper and our simulated values. In Figure 2, we visualize the simulated count of players (logged) and relative age (RA), and see an even more linear relationship than in Figure 1 as we would expect. Note that the simulated data appears more disperse for January through March. 

```{r echo =F, message=F, ref.label= "sim_players", fig.cap="Simulated Data"}
```

```{r echo =F, message=F, ref.label= "linearity_sim_players", fig.cap="Once we have simulated the player count data by month we again log and graph it to compare it's shape to our original data in Figure 1. Our loess smoothing function looks even more linear than Figure 1, which makes sense given this the model used to simulate our data. Overall our simulated data look similar to our original data.", fig.height=3}

```

## Model Assessment

Using this simulated player data for $y_i$, we ran our model in JAGS with 5 chains, each with 25,000 iterations. Figures 3 and 4 show the trace plots for $\alpha$, $\beta$, and $\lambda_i$. The chains appear to be mixing well and we do not see any issues or concerns. 

```{r echo =F, message=F, ref.label= "trace_alpha_beta", fig.cap="We can see excellent mixing and stationarity in our traceplots for $\\alpha$ and $\\beta$. Although in our results we drop the first half of our iterations, all are included here for visual inspection.", fig.height=3}

```

```{r echo =F, message=F, ref.label= "trace_plots_lambda", fig.cap="Trace Plots for $\\lambda$"}

```

As suggested in Gelman et. al., we chose a burn in of 12,000, with model diagnostics reported in Table 3. We obtain $\hat{R}$'s of approximately 1 for all parameters, suggesting convergence and see healthy $n_{eff}$ for each parameter. However, when checking full model diagnostics for all parameters, including our priors, we found $n_{neff}$'s greater than the number of iterations performed, for the $\sigma_\alpha$ and $\sigma_\beta$ hyperparameters, which we found odd but were unable to make sense of. Despite this, the chains for our parameters of interest appear to be mixing well and the $\hat{R}$'s do not indicate any convergence concerns. 


```{r echo =F, message=F, ref.label= "post_burn_samps"}

```


\newpage 

## Posterior Predictive Check and Results

In our JAGS model, we kept track of the posterior predicted values for $y_i$, using the posterior generated values of $\lambda_i$. In Figure 5, we report a posterior predictive check, to ensure that our model can accurately capture the simulated data. We noted unusual simulated data points, in relation to the player data reported in the paper, but it appears that our simplified model realistically capture these features and find no need for modification. We also confirmed these results using Stan. 

```{r echo =F, message=F, ref.label= "post_predict", fig.cap="Each of our graphs contains 1000 posterior predictive draws for the number of professional players born in that month. Our blue line indicates the simulated value for y.", fig.height=6, fig.width=8}

```

Since we cannot directly assess the magnitude of the RAE using our simulated $y_i$'s from the Poisson regression model, we report the disadvantage ratio and relative risk. To calculate these RAE sizes, we found the probability, $p_i$, of becoming a professional J. League player by dividing the posterior generated $y_i$'s by the total number of male births in month $i$. Letting $p_{i}$ be the probability of becoming a professional player in month $i$ and $p_{i+1}$, the probability of becoming a professional athlete in the next month, the disadvantage is calculated as $(p_i - p_{i+1}/p_i)$ and relative risk is calculated as $p_i/p{i+1}$. 

The mean and 95\% uncertainty interval for the disadvantage of each month relative to April and for the relative risk of each month compared to March in the simulated data is shown in Figure 6. Since being born in April is the most advantageous month for becoming a professional athlete and March is the least, all months after April have a positive relationship with the disadvantage, but November is the first month that had a 95% uncertainty interval that was exclusively greater than 0. The relative risk of each month decreases from April to February. We found a 95% posterior probability that the mean disadvantage of being born in March relative to April for Japanese J. League players in 2012 is between 18.5% and 81.0%. 

The relative risk in this context could be interpreted for each month as the relative risk times more likely than March to produce a professional soccer player in the J League. The last month to have a 95% uncertainty interval that was strictly greater than 1 for the relative risk is July.  We also found a 95% posterior probability that the mean relative risk of being born in April is between 1.2 and 5.3 times more likely relative to March to become a professional player. Using the hierarchical Poisson model in the original paper, Ishigami found a 95% posterior probability of 44 and 78 for disadvantage and 1.72 and 4.29 for relative risk.

```{r echo =F, message=F, ref.label= "Dis_RR_visual", fig.cap="Disadvantage and relative risk by month", fig.height=4.8}

```
<!-- fig.height=6, fig.width=8 -->
\newpage

# Conclusions

The original paper found a significant relationship between month of birth and probability of becoming a J. League soccer player in Japan for 23-25 year olds in 2012 who were playing soccer. We followed a similar methodology, but estimated the number of professional players born in each month from a graph in the original paper given the data were not available. Similarly due to unavailable data, we sourced the number of births for males born in 1987-1989 from published United Nations data. Our modeling assumptions mirror those of the author to generate a distribution of parameters $\alpha$ and $\beta$ which was used to create simulated data. Using the simulated data, we reproduced the work the author did. Although we forewent a hierarchical model, the original paper also allowed for random effects by prefecture as well as prefecture level covariates `population` and `weather.`

Using this methodology we also found that the birth month of a professional soccer player born between 1987 and 1989 were associated with their probability of becoming a professional. For example in the case of relative risk (as seen in Figure 6) we see a number of our credible intervals do not overlap with 1 (up to July or August). Similarly a number of our credible intervals for disadvantage do not overlap with 0 (October onwards). In general looking at the consistent shift from month to month we have strong evidence that a birth in April (or soon after April) increases your probability of becoming a professional soccer player in Japan.

In considering the scope of inference of our findings as well as the findings in the original paper, we want to break down the question of "scope" into a number of dimensions: temporal, sport-type, and geographical. Given what we understand about the data generation process, we feel confident that there is some temporal generalizability. Specifically we expect that professional athletes in Japan in the years surrounding 2012 were probably recruited in similar ways. Unfortunately the author did not specify whether he expected this recruitment process to be the same for any given number of years, which would have been helpful in determining the temporal scope of inference. In terms of sport type, since the authors found differences in the RAE between baseball and soccer, it is not clear how readily this analysis would apply to other sports. In terms of geographical scope, the authors describe one of their initial aims as expanding the research of RAEs from only Western countries to Asia. Although we cannot say that the magnitude of the RAE would be the same for all Asian countries, the author has provided evidence future researchers can use in analyzing other Asian countries, or other areas of the world. In general, anyone attempting to use these results in their own analysis should consider what the birthday cut-offs are in their given geographical area since it differs by country (ex: it is January for Canadian students, September in America, April in Japan, etc). 

# Reflection of Data Analysis in Paper and What We Learned

We believe the chosen paper offers a more sophisticated analytical process with which to assess the question of RAE in comparison to many other previous analyses. As the paper mentions, past work assumed uniform births across all months and therefore estimating a rate by month (and including $\theta_i$'s) is a definite improvement on past methodologies. Although we did not include weather and location in our model these were valuable inclusions on the part of the author. Finally, by providing a Bayesian analysis the author allows the magnitude of our parameters of interest to be investigated in comparison to a Frequentist analysis (such as the $\chi^2$ test) which would simply detect there is a difference but not estimate it.

Despite these improvements, we have some critiques of the paper. The paper was not clear that they are only investigating association and not causation, and thus they ought to have provided a disclaimer in using the term "effect." Despite having access to data on athletes between the ages of 19 and 33, we are unclear why the author limited his analysis to only those between ages 23 and 25. We see this as a limitation given the author could have expanded his scope of inference. Similarly we expect that the original model, which includes weather and athlete location as covariates, would have limited relevance for sports that are not outdoors. Finally, although the author listed previous studies investigating RAE, the author has a very vague prior. We see this as a limitation given Bayesian methodologies can incorporate what was learned in past studies. This is a missed opportunity on the part of the author's original model.

# Group Work Collaboration Statement

When we first read the paper, the model itself was a bit confusing. Since we did not have much experience with hierarchical models or with JAGS and Stan, we worked together to figure out how to simplify the model for our purposes. James wrote the JAGS model and Connor confirmed the results with Stan. Gabby spent at least 3 hours setting a seed to get reproducible results\footnote{Temporal liberties were taken with this sentence. Also, I was successful but it broke everything else.}. Connor pointed out spelling mistakes mostly\footnote{His most important role}. James was the only one to show up on time for meetings\footnote{Can confirm.}. James wrote a lot of the code and made a lot of the plots, and Connor helped edit and improve them. James and Gabby wrote the Introduction, Model, Simulation and Critique sections. Connor and James wrote the Model Fitting sections. All three of us worked on the Conclusions but they were on Connors computer so he gets all the credit for that. Gabby helped Connor fix his merge conflicts. There were many.

\newpage

# References

Gelman, A., & Rubin, D. (1992). _Inference from iterative simulation using multiple sequences._ Statistical Sciences, 7, 457–472.

Geweke, J. (1992). _Evaluating the accuracy of sampling-based approaches to calculating posterior moments._ In J. Berdardo, J. Berger, A. David, & A. Smith (Eds.), Bayesian statistics 4. Oxford: Claredon Press.

Plummer, M. (2012). _Just another Gibbs sampler (ver. 3.4.0)._ Retrieved from http://mcmc-jags.sourceforge.net/

Stan Development Team. (2013). _Stan: A C++ library for probability and sampling (ver. 1.3.0)._ Retrieved from http://mc-stan.org/

Statista. _Number of males per 100 females in Japan from 1955 to 2035._ Accessed on Dec 15, 2021. https://www.statista.com/statistics/612108/japan-sex-ratio/

United Nations Statistics Division. _Live Births by Month of Birth._ Accessed on Dec 15, 2021. http://data.un.org/Data.aspx?d=POP&f=tableCode%3A55


\newpage

# Code Appendix

```{r initial_data, eval=FALSE, message=FALSE}
library(pander)
library(dplyr)
months <- c("April","May","June",
            "July","August","September", 
            "October", "November", "December",
            "January", "February", "March")
players <- c(35,28,30,32,27,23,25,20,21,15,12,10)

RA <- c(0,1,2,3,4,5,6,7,8,9,10,11)

#48.8% men using sex ratio of 1.05
#Read in Japan Data from CSV
japan_births <- read.csv("UNdata_Export_20211205_201842295.csv")
#Aggregate by month
agg_births <- aggregate(japan_births$Value, by=list(Category=japan_births$Month), FUN=sum)
colnames(agg_births) <- c("Month","Births")

#Find total males born 
agg_births$Births <- agg_births$Births *0.488
#Arrange April to March
agg_births <- agg_births %>% arrange(factor(Month, levels = months))
agg_births <- head(agg_births,-2)

#
births <- agg_births$Births

#make dataframe
data_df <- data.frame(months, players, RA, births)
pander(data_df, caption="Initial Data From Paper")

```

```{r initial_linearity, eval=FALSE, message=F}
library(ggplot2)
#Plot players vs RA to get a sense of linearity on continuous scale of players vs RA
# plot(x=data_df$RA, y= log(data_df$players), xlab="RA", ylab="log(players)")
# abline(lm(log(data_df$players) ~ data_df$RA))

ggplot(data = data_df, aes(x=RA, y = log(players))) +
  geom_point() +
  xlab("Relative Age (RA)") +
  ylab("Log Count of Players") +
  ggtitle("Assessing The Reasonability of Linearity between \n Log Count of Players and Relative Age") +
  geom_smooth(method = 'loess', se=FALSE, span=1) + 
  theme(plot.title = element_text(hjust = 0.5))

```

```{r sim_players, eval=F, message=F}
library(rjags)
library(coda)
library(extraDistr)
library(pander)

data_jags <- list(y = players, RA = RA, births=births)

#y_ppd is posterior prediction for players in each month
pars_to_save <- c("alpha","beta","lambda","y_ppd")
jags_out_pre <- rjags::jags.model("jags_model.txt", 
                           data = data_jags, 
                           n.adapt=1000, n.chains = 5,
                           inits = list(.RNG.name = "base::Wichmann-Hill",
                                        .RNG.seed = 1),
                           quiet=T)

jags_out <- coda.samples(jags_out_pre, pars_to_save,
                         n.iter = 25000, thin = 1)
sum_jags <- summary(jags_out)

set.seed(1213)
#Sample from the initial run to get simulated players
idx_samp_jags <- sample(12000:25000, size = 1000, rep = F)
all_samps_jags <- do.call(rbind, jags_out)
plot_samps_jags <- data.frame(all_samps_jags[idx_samp_jags, ])

alpha <- sample(plot_samps_jags$alpha, size=1)
beta <- sample(plot_samps_jags$beta, size=1)

sim_lambda <- vector(length=length(RA))
for (i in 1:length(RA)){
   sim_lambda[i] <- births[i]*exp(alpha+beta*RA[i])
}

sim_players <-vector(length=length(RA))

for (i in 1:12){
  sim_players[i] <- rpois(1, sim_lambda[i])
}

data_df$Sim.Players <- sim_players
pander(data_df, caption="Simulated Player Data")

```

```{r linearity_sim_players, eval=F,message=F}
#Plot simulated players vs RA to get a sense of linearity on continuous scale of players vs RA
# plot(x=data_df$RA, y= log(sim_players), xlab="RA", ylab="log(players)")
# abline(lm(log(sim_players) ~ data_df$RA))

ggplot(data = data_df, aes(x=RA, y = log(sim_players))) +
  geom_point() +
  xlab("Relative Age (RA)") +
  ylab("Log Count of Players") +
  ggtitle("Inspecting Simulated Data for \n Log Count of Players and their Relative Age") +
  geom_smooth(method = 'loess', se=FALSE, span = 1) + 
  theme(plot.title = element_text(hjust = 0.5))
```

```{r trace_alpha_beta, eval=F, message=F}
library(bayesplot)
set.seed(1205)

#Rerun the modeled with simulated players
data_jags_sim <- list(y = sim_players, RA = RA, births=births)

#y_ppd is posterior prediction for players in each month
pars_to_save_sim <- c("alpha","beta","lambda","y_ppd")
jags_out_pre_sim <- jags.model("jags_model.txt",
                           data = data_jags_sim ,
                           n.adapt=1000, n.chains = 5,
                           quiet=T
                           # ,
                           # inits = list(.RNG.name = "base::Wichmann-Hill",
                           #              .RNG.seed = 1)
                           )
jags_out_sim <- coda.samples(jags_out_pre_sim, pars_to_save_sim ,
                         n.iter = 25000, thin = 1)
#Rerun the modeled with simulated players
data_jags_sim <- list(y = sim_players, RA = RA, births=births)

#y_ppd is posterior prediction for players in each month
pars_to_save_sim <- c("alpha","beta","lambda","y_ppd")
jags_out_pre_sim <- jags.model("jags_model.txt",
                           data = data_jags_sim ,
                           n.adapt=1000, n.chains = 5,
                           quiet=T
                           # ,
                           # inits = list(.RNG.name = "base::Wichmann-Hill",
                           #              .RNG.seed = 1)
                           )
jags_out_sim <- coda.samples(jags_out_pre_sim, pars_to_save_sim ,
                         n.iter = 25000, thin = 1)
#Check mixing on the chains
#sum_jags_sim <- summary(jags_out_sim)

mcmc_trace(jags_out_sim, pars=c("alpha","beta"))
```

```{r trace_plots_lambda, eval=F, message=F}
#trace plots for lambda values 
mcmc_trace(jags_out_sim, pars=c("lambda[1]","lambda[2]", "lambda[3]","lambda[4]",
                                "lambda[5]","lambda[6]","lambda[7]","lambda[8]",
                             "lambda[9]","lambda[10]","lambda[11]","lambda[12]"))

```

```{r post_burn_samps, eval=F, message=F}
library(tidyverse)
library(janitor)
library(kableExtra)
#Use a 12,000 burn in although chains seems fine
jags_final <- lapply(jags_out_sim, function(x){out <- as.mcmc(x[12001:25000, ])})
jags_final <- as.mcmc.list(jags_final)

neff <- effectiveSize(jags_final)
r_hat_final <- matrix(NA, nrow=nvar(jags_final), ncol=2)
for (v in 1:nvar(jags_final)) {
  r_hat_final[v,] <- gelman.diag(jags_final[,v])$psrf
}

diag_df <- data.frame(round(r_hat_final[1],3), neff)
colnames(diag_df) <- c("R Hat", "Neff") 
kbl(diag_df[1:14,], caption = "Model Diagnostics")
```

```{r}
#geweke.diag(jags_final)

```

```{r post_predict, eval=F, message=F}

#Sample from the posterior prediction
idx_samp_jags <- sample(1:12000, size = 1000, rep = F)
all_samps_jags <- do.call(rbind, jags_final)
plot_samps_jags <- data.frame(all_samps_jags[idx_samp_jags, ])

#Function for posterior prediction
post_predict <- function(samples, length, data, label){
plot(table(samples)/length, 
     main = paste("Post Predic", label),
     type = "h", ylab = "Density", 
     xlab = "y",
     col = rgb(0.1,0.1,0.1, 0.2))
points(as.integer(names(table(samples))), table(samples)/length, pch = 20, cex = 2)
abline(v=data,col="blue",lwd=3)
}

par(mfrow=c(3,4))
post_predict(plot_samps_jags$y_ppd.1.,1000,sim_players[1], data_df$months[1])
post_predict(plot_samps_jags$y_ppd.2.,1000,sim_players[2], data_df$months[2])
post_predict(plot_samps_jags$y_ppd.3.,1000,sim_players[3], data_df$months[3])

post_predict(plot_samps_jags$y_ppd.4.,1000,sim_players[4], data_df$months[4])
post_predict(plot_samps_jags$y_ppd.5.,1000,sim_players[5], data_df$months[5])
post_predict(plot_samps_jags$y_ppd.6.,1000,sim_players[6], data_df$months[6])

post_predict(plot_samps_jags$y_ppd.7.,1000,sim_players[7], data_df$months[7])
post_predict(plot_samps_jags$y_ppd.8.,1000,sim_players[8], data_df$months[8])
post_predict(plot_samps_jags$y_ppd.9.,1000,sim_players[9], data_df$months[9])

post_predict(plot_samps_jags$y_ppd.10.,1000,sim_players[10], data_df$months[10])
post_predict(plot_samps_jags$y_ppd.11.,1000,sim_players[11], data_df$months[11])
post_predict(plot_samps_jags$y_ppd.12.,1000,sim_players[12], data_df$months[12])

```

```{r Dis_RR_visual, eval=F,message=F}
library(gridExtra)
#Calculate probabilities of become a professional athlete by month
dis_RR_df <- data.frame(April=plot_samps_jags$y_ppd.1./data_df$births[1],
                                       May= plot_samps_jags$y_ppd.2./data_df$births[2],
                                       June=plot_samps_jags$y_ppd.3./data_df$births[3],
                                       July=plot_samps_jags$y_ppd.4./data_df$births[4],
                                       August=plot_samps_jags$y_ppd.5./data_df$births[5],
                                       September=plot_samps_jags$y_ppd.6./data_df$births[6],
                                       October=plot_samps_jags$y_ppd.7./data_df$births[7],
                                       November=plot_samps_jags$y_ppd.8./data_df$births[8],
                                       December=plot_samps_jags$y_ppd.9./data_df$births[9],
                                       January=plot_samps_jags$y_ppd.10./data_df$births[10],
                                       February=plot_samps_jags$y_ppd.11./data_df$births[11],
                                       March=plot_samps_jags$y_ppd.12./data_df$births[12])

#Find disadvantage 
dis_RR_df$April_Apr_Dis <- (dis_RR_df$April-dis_RR_df$April)/dis_RR_df$April
dis_RR_df$April_May_Dis <- (dis_RR_df$April-dis_RR_df$May)/dis_RR_df$April
dis_RR_df$April_Jun_Dis <- (dis_RR_df$April-dis_RR_df$June)/dis_RR_df$April
dis_RR_df$April_Jul_Dis <- (dis_RR_df$April-dis_RR_df$July)/dis_RR_df$April
dis_RR_df$April_Aug_Dis <- (dis_RR_df$April-dis_RR_df$August)/dis_RR_df$April
dis_RR_df$April_Sept_Dis <- (dis_RR_df$April-dis_RR_df$September)/dis_RR_df$April
dis_RR_df$April_Oct_Dis <- (dis_RR_df$April-dis_RR_df$October)/dis_RR_df$April
dis_RR_df$April_Nov_Dis <- (dis_RR_df$April-dis_RR_df$November)/dis_RR_df$April
dis_RR_df$April_Dec_Dis <- (dis_RR_df$April-dis_RR_df$December)/dis_RR_df$April
dis_RR_df$April_Jan_Dis <- (dis_RR_df$April-dis_RR_df$January)/dis_RR_df$April
dis_RR_df$April_Feb_Dis <- (dis_RR_df$April-dis_RR_df$February)/dis_RR_df$April
dis_RR_df$April_March_Dis <- (dis_RR_df$April-dis_RR_df$March)/dis_RR_df$April

#Find relative risk
dis_RR_df$April_Mar_RR <- (dis_RR_df$April/dis_RR_df$March)
dis_RR_df$May_Mar_RR <- (dis_RR_df$May/dis_RR_df$March)
dis_RR_df$June_Mar_RR <- (dis_RR_df$June/dis_RR_df$March)
dis_RR_df$July_Mar_RR <- (dis_RR_df$July/dis_RR_df$March)
dis_RR_df$August_Mar_RR <- (dis_RR_df$August/dis_RR_df$March)
dis_RR_df$September_Mar_RR <- (dis_RR_df$September/dis_RR_df$March)
dis_RR_df$October_Mar_RR <- (dis_RR_df$October/dis_RR_df$March)
dis_RR_df$November_Mar_RR <- (dis_RR_df$November/dis_RR_df$March)
dis_RR_df$December_Mar_RR <- (dis_RR_df$December/dis_RR_df$March)
dis_RR_df$January_Mar_RR <- (dis_RR_df$January/dis_RR_df$March)
dis_RR_df$February_Mar_RR <- (dis_RR_df$February/dis_RR_df$March)
dis_RR_df$March_Mar_RR <- (dis_RR_df$March/dis_RR_df$March)


calc_quantiles <- function(disagvantage,rel_risk){
  return(c(quantile(disagvantage,0.025),quantile(disagvantage,0.50),quantile(disagvantage,0.975),
           quantile(rel_risk,0.025),quantile(rel_risk,0.50),quantile(rel_risk,0.975)))
}
apr <- calc_quantiles(dis_RR_df$April_Apr_Dis,dis_RR_df$April_Mar_RR)
may <- calc_quantiles(dis_RR_df$April_May_Dis, dis_RR_df$May_Mar_RR)
jun <- calc_quantiles(dis_RR_df$April_Jun_Dis, dis_RR_df$June_Mar_RR)
jul <- calc_quantiles(dis_RR_df$April_Jul_Dis, dis_RR_df$July_Mar_RR)
aug <- calc_quantiles(dis_RR_df$April_Aug_Dis, dis_RR_df$August_Mar_RR)
sep <- calc_quantiles(dis_RR_df$April_Sept_Dis, dis_RR_df$September_Mar_RR)
oct <- calc_quantiles(dis_RR_df$April_Oct_Dis, dis_RR_df$October_Mar_RR)
nov <- calc_quantiles(dis_RR_df$April_Nov_Dis, dis_RR_df$November_Mar_RR)
dec <- calc_quantiles(dis_RR_df$April_Dec_Dis, dis_RR_df$December_Mar_RR)
jan <- calc_quantiles(dis_RR_df$April_Jan_Dis, dis_RR_df$January_Mar_RR)
feb <- calc_quantiles(dis_RR_df$April_Feb_Dis, dis_RR_df$February_Mar_RR)
mar <- calc_quantiles(dis_RR_df$April_March_Dis, dis_RR_df$March_Mar_RR)

mean_dis_RR <- data.frame(April=apr,
                          May=may,
                          June=jun,
                          July=jul,
                          August=aug,
                          September=sep,
                          October=oct,
                          November=nov,
                          December=dec,
                          January=jan,
                          February=feb,
                          March=mar)

row.names(mean_dis_RR) <- c("Disadvantage_2.5", "Disadvantage_50","Disadvantage_97.5", 
                            "Relative_Risk_2.5","Relative_Risk_50","Relative_Risk_97.5")

#Relative risk and Disadvantage Data Transposed
df <- data.frame(t(mean_dis_RR))


#Plots from page 150 in paper
dis_plot <- df %>%
    ggplot(aes(x=row.names(df), y = Disadvantage_50*100, group=1)) +
    geom_line()+
  geom_point(size = 2) + 
    scale_x_discrete(limits =months)+
  geom_segment(aes(x=row.names(df), xend = row.names(df), 
                   y = Disadvantage_2.5*100, yend = Disadvantage_97.5*100), 
               inherit.aes = F, lwd = 3, alpha = 0.5, linetype = 1, color = "steelblue") + 
  xlab("") + 
  ylab("Disadvantage (%)") +
    scale_y_continuous(breaks=seq(-50,100,by=20)) +
    ggtitle("Disadvantage compared to April with 95% Posterior Intervals")


rr_plot <- df %>%
    ggplot(aes(x=row.names(df), y = Relative_Risk_50, group=1)) +
    geom_line()+
  geom_point(size = 2) +
    scale_x_discrete(limits =months)+
  geom_segment(aes(x=row.names(df), xend = row.names(df), 
                   y = Relative_Risk_2.5, yend = Relative_Risk_97.5), 
               inherit.aes = F, lwd = 3, alpha = 0.5, linetype = 1, color = "steelblue") +
  xlab("Birth Month") + 
  ylab("Relative Risk")+
    scale_y_continuous(breaks=seq(-0,12,by=1))+
    ggtitle("Relative Risk compared to March with 95% Posterior Intervals")

grid.arrange(dis_plot, rr_plot, ncol=1)
```

```{r eval=F}
# --- JAGS MODEL
model{
    ## Likelihood
    for(i in 1:length(RA)){
      y[i] ~ dpois(lambda[i])
      lambda[i] <- births[i]*exp(alpha + RA[i]*beta)
      }     
    ## Priors
    #alpha
    mu_a ~ dnorm(0,100*100)
    sig_a ~ dunif(0,100)
    alpha ~ dnorm(mu_a, sig_a*sig_a)
    
    #beta
    mu ~ dnorm(0,100*100)
    sig ~ dt(0, pow(2.5,-2), 5)T(0,)
    beta ~ dnorm(mu,sig*sig)
    
    #Posterior Predict Players
    for(i in 1:12) {
      y_ppd[i] ~ dpois(lambda[i])
    }
}
```

```{r eval=F}
# ---STAN CODE
data {
  int<lower = 0> I;
  int<lower = 0> y[I];  
  real<lower = 0> births[I];
  int<lower = 0> RA[I];
}

parameters {
  real alpha;
  real beta;
  // real<lower=0, upper = 100> sigma_a;
  // real<lower=0> sigma_b;
}

transformed parameters {
  real<lower=0> lambda[I];
  for (i in 1:I)
  lambda[i] = births[i]*exp(alpha+beta*RA[i]);
  }

model {
    alpha ~ normal(0, 1000);
    beta ~ normal(0, 1000);
    // sigma_a ~ uniform(0, 100);
    // sigma_b ~ cauchy(0,1);
        for (i in 1:I){
    y[i] ~ poisson(lambda[i]);
    }
}
```