forked from filippob/longitudinal_data_analysis
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path1.longitudinal_data.Rmd
268 lines (191 loc) · 8.2 KB
/
1.longitudinal_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
---
title: "Linear Models"
author: "Filippo Biscarini"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: html_document
---
```{r setup, include=FALSE}
library("knitr")
library("tidyverse")
library("data.table")
knitr::opts_chunk$set(echo = TRUE)
```
## Longitudinal data
In this R notebook we will get introduced to examples of longitudinal data, i.e. data with a **time component**:
$$
y = \mu + \beta_1 \text{time} + \beta_2 \text{treatment} + e
$$
## Read data
Data from [Spatiotemporally explicit model averaging for forecasting of Alaskan groundfish catch](https://onlinelibrary.wiley.com/doi/10.1002/ece3.4488)
(data repo [here](https://zenodo.org/record/4987796#.ZHcLL9JBxhE))
It's data on fish catch (multiple fish species) over time in different regions of Alaska.
```{r read-the-data}
basefolder = "/home/filippo/Dropbox/cursos/longitudinal_data_analysis/longitudinal_data_analysis"
inpfile = "data/alaska_groundfish_catch/stema_data.csv"
fname = file.path(basefolder, inpfile)
fish = fread(fname)
```
- **CPUE**: target variable, "catch per unit effort"
- **SST**: sea surface temperature
- **CV**: actually, the coefficient of variation for SST is used $\rightarrow$ the coefficient of variation is an improved measure of seasonal SST over the mean, because it standardizes scale and allows us to consider the changes in variation of SST with the changes in mean over (Hannah Correia, 2018 - Ecology and Evolution)
- **SSTcvW1-5**: CPUE is influenced by survival in the first year of life. Water temperature affects survival, and juvenile fish are more susceptible to environmental changes than adults. Therefore, CPUE for a given year is likely linked to the winter SST at the juvenile state. Since this survey targets waters during the summer and the four species covered reach maturity at 5--8 years, SST was lagged for years one through five to allow us to capture the effect of SST on the juvenile stages. All five lagged SST measures were included for modeling.
### Data preprocessing
```{r preprocess}
fish <- fish |>
select(-c(V1,Latitude,Longitude))
```
We see that, in order to accommodate variation in SST among stations, the CPUE value has been replicated multiple times. This would defeat our purpose of analysing data by group (fish species) over space and time: with only one value per group, a statistical analysis is a bit hard to be performed (no variation). Therefore, to the original CPUE values we add some random noise proportional to the average (by species, area, year):
```{r}
fish |> filter(Species == "Pacific cod", Area == "West Yakutat", Year == "1990")
```
```{r}
fish <- fish |>
group_by(Species, Area, Year) |>
mutate(avg = mean(CPUE), std = 0.1*avg)
```
```{r}
fish <- fish |>
ungroup() |>
mutate(noise = rnorm(n = n(),mean = 0,sd = std), CPUE = CPUE + noise)
```
```{r}
fish |> filter(Species == "Pacific cod", Area == "West Yakutat", Year == "1990")
```
### EDA (Exploratory Data Analysis)
Let's start by looking at the raw data. As we already saw, for each combination of species, area and year we have multiple observations; for instance, let's look at `Pacific cod` from `West Yakutat` in year `2000`. Therefore, a boxplot is a good way to plot these data:
```{r}
fish |>
filter(Species == "Pacific cod", Area == "West Yakutat", Year == 2000)
```
```{r}
p <- ggplot(fish, aes(Year, CPUE)) + geom_boxplot(aes(fill=Area))
p <- p + facet_wrap(~Species)
p
```
First, we note large variation in scale between fish species. Let's try to allow the scale to change by `Species`:
```{r}
p <- ggplot(fish, aes(Year, CPUE)) + geom_boxplot(aes(colour=Area))
p <- p + facet_wrap(~Species, scales = "free_y")
p
```
We now see CPUE oscillations overtime and between geogpraphical areas, but again this varies by fish species. What if we rescale CPUE?
```{r}
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE)
rescaled_variable = 100*(x - rng[1]) / (rng[2] - rng[1])
return(rescaled_variable)
}
fish <- fish |> group_by(Species) |>
mutate(
rescaled_cpue = rescale01(CPUE)
)
```
```{r}
fish |> group_by(Species) |>
summarise(imn = min(rescaled_cpue), max = max(rescaled_cpue))
```
```{r}
p <- ggplot(fish, aes(Year, rescaled_cpue)) + geom_boxplot(aes(colour=Area))
p <- p + facet_wrap(~Species)
p
```
### Trends
A trend is usually an average over time:
```{r eda}
dd <- fish |>
group_by(Species, Area, Year) |>
summarise(avg = round(mean(rescaled_cpue),2)) |>
spread(key = Year, value = avg)
print(dd)
```
```{r}
temp <- dd |>
gather(key = "Year", value = "CPUE", -c(Species, Area))
```
Now the variable `Year` is a `factor`; we need to take this into account when making the plot:
- `group`: we have only one observation for group (Species, Area, Year), so we must specify the grouping variable, in this case `Area`
- year is not a number now, and this is reflected in the x axis: no intervals, all values are plotted (so we can for example place them vertically and make them smaller, to avoid overlap)
```{r}
p <- ggplot(temp, aes(Year, CPUE)) + geom_line(aes(color = Area, group = Area))
p <- p + facet_wrap(~Species) + theme(axis.text.x = element_text(angle = 90, size = 6))
p
```
Alternatively, we can convert Year to numeric, then plot as usual:
```{r}
temp <- dd |>
gather(key = "Year", value = "CPUE", -c(Species, Area)) |>
mutate(Year = as.numeric(Year))
```
```{r}
p <- ggplot(temp, aes(Year, CPUE)) + geom_line(aes(color = Area))
p <- p + facet_wrap(~Species)
p
```
**Q: Do we have a trend?**
What about the standard deviation? Let's see if we have a trend there (!! remember, we introduced artificial random variation, no trend is actually expected, safe by chance !!):
```{r}
dd <- fish |>
group_by(Species, Area, Year) |>
summarise(std = round(sd(rescaled_cpue),2)) |>
spread(key = Year, value = std)
print(dd)
```
```{r}
temp <- dd |>
gather(key = "Year", value = "sd(CPUE)", -c(Species, Area)) |>
mutate(Year = as.numeric(Year))
```
```{r}
p <- ggplot(temp, aes(Year, `sd(CPUE)`)) + geom_line(aes(color = Area))
p <- p + facet_wrap(~Species)
p
```
**Q: What do you notice?**
### Model-based adjustments
```{r}
mod <- lm(rescaled_cpue ~ SST_cvW+SST_cvW1+SST_cvW2+SST_cvW3+SST_cvW4+SST_cvW5, data = fish)
summary(mod)
```
```{r}
hist(residuals(mod))
```
```{r}
fish$residuals <- residuals(mod)
fish$fitted_values <- fitted(mod)
```
```{r}
dd <- fish |>
group_by(Species, Area, Year) |>
summarise(avg = round(mean(residuals),2)) |>
spread(key = Year, value = avg)
print(dd)
```
```{r}
temp <- dd |>
gather(key = "Year", value = "CPUE", -c(Species, Area)) |>
mutate(Year = as.numeric(Year))
```
```{r}
p <- ggplot(temp, aes(Year, CPUE)) + geom_line(aes(color = Area))
p <- p + facet_wrap(~Species)
p
```
## Exercise
Data from the article "[Age-dependent trait variation: the relative contribution of within-individual change, selective appearance and disappearance in a long-lived seabird](https://besjournals.onlinelibrary.wiley.com/doi/10.1111/1365-2656.12321)"
(data repo [here](https://zenodo.org/record/5010983))
```{r read-the-cow-data}
basefolder = "/home/filippo/Dropbox/cursos/longitudinal_data_analysis"
inpfile = "data/seabird/data_traits.xlsx"
fname = file.path(basefolder, inpfile)
seabirds = readxl::read_xlsx(fname)
```
1. Within populations, the expression of phenotypic traits typically varies with age. Such age-dependent trait variation can be caused by within-individual change (improvement, senescence, terminal effects) and/or selective (dis)appearance of certain phenotypes among older age classes.
2. within-individual change and selective (dis)appearance, in a long-lived seabird, the common tern (*Sterna hirundo*).
3. At the population level, all traits, except the probability to breed, improved with age (i.e., phenology advanced and reproductive output increased). Both methods identified within-individual change as the main responsible process, and within individuals, performance improved until age 6-13, before levelling off. In contrast, within individuals, breeding probability decreased to age 10, then levelled off.
OPTIONS:
- look at males vs females across time
- look at males vs females in two/three groups: young, intermediate, old birds
Target variables can be: laying date or egg volume (approximately continuous variables)
```{r}
## your code here
```