---
title: "CW - data prep"
output: html_document
---
In this document, you will learn how to process a data layer and save it in the `layers` folder.
We'll also walk through the general steps we recommend you include in your own prep document, including:
- introducing the goal/subgoal, goal model, and data
- setting up directories and loading libraries
- loading and formatting data
- plotting
- saving as a .csv in the `layers` folder
_The following example was modified from the prep document for the [OHI-Baltic Clean Water goal](https://github.com/OHI-Science/bhi/blob/draft/baltic2015/prep/CW/eutrophication/eutrophication_prep.Rmd). We have secchi depth data, which measure water clarity. The desired data layer will provide the average summer secchi depth in each year and each region. At the moment, we don't know what the raw data look like; we may need to aggregate and manipulate large quantities of data to get the final, clean data layer._
# Introduction
_In this section you'll describe the goal/subgoal, what types of information or data are needed, data sources, goal model, and how to approach trend calculation, etc._
_For example, you can start with a general introduction of what this goal/subgoal is trying to measure, what it means in your local context, and what parameters make sense to be included or explored here._
This subgoal aims to represent the eutrophication level of the water in each region. We use summertime water clarity, measured by secchi depth, as a proxy indicator, assuming a linear relationship between water clarity and nutrient levels. More info on secchi depth can be found [here](http://www.helcom.fi/baltic-sea-trends/indicators/water-clarity).
## Goal model
_Record what the goal model and reference point should be, how to approach trend calculations, etc._
### Status
Xao = mean stock indicator value / reference point
Stock indicators = two HELCOM core indicators assessed for good environmental status (each scored between 0 and 1 by BHI)
Reference point = maximum possible good environmental status (value = 1)
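To make the model concrete, here is a minimal sketch of the status calculation. It assumes a hypothetical `stock_indicators` data frame with columns `rgn_id` and `indicator_score` (one row per region and indicator, each score already scaled 0-1); these names are illustrative, not from the actual BHI repository.
```{r status-sketch, eval = FALSE}
## minimal sketch of the status model; `stock_indicators` is a
## hypothetical data frame with columns rgn_id and indicator_score
ref_pt <- 1 # reference point: maximum possible good environmental status

status <- stock_indicators %>%
  group_by(rgn_id) %>%
  summarise(mean_indicator = mean(indicator_score, na.rm = TRUE)) %>% # mean stock indicator value
  ungroup() %>%
  mutate(status = mean_indicator / ref_pt) # Xao, between 0 and 1
```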
### Trend
_Typically we calculate trend as the linear trend of the last five years of status. In this assessment, however, that approach is not feasible, so an alternative approach is used and documented here._
CPUE time series are available for all stations used for the HELCOM coastal fish populations core indicators. These data were provided by Jens Olsson (FISH PRO II project). To calculate GES status, the full time series were used; therefore there is only one status time point, and we cannot calculate a trend of status over time. Instead, we follow the approach of Bergström et al. (2016), but focus only on the final time period for the slope (2004-2013).
Bergström et al. 2016. Long term changes in the status of coastal fish in the Baltic Sea. _Estuarine, Coastal and Shelf Science_ 169:74-84.
Method:
1. Select the final time period of the trend assessment (2004-2013)
2. Use time series from both indicators, key species and functional groups. For functional groups, include both cyprinid and piscivore time series
3. For each time series: square-root transform the data, z-score them, fit a linear regression, and extract the slope (see the sketch after this list)
4. Within each time series group (key species, cyprinid, piscivore), take the mean slope for each group within each basin
5. Within each basin, take a mean functional group indicator slope (the mean of the cyprinid mean and the piscivore mean)
6. For each basin, take the overall mean slope: the mean of the key species and functional group slopes
7. Apply the basin trend value to all BHI regions in that basin (except in the Gulf of Finland, where the Finnish site value is not applied to the Estonian and Russian regions)
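Below is a minimal, hedged sketch of steps 3-6, assuming a hypothetical `cpue` data frame with columns `basin`, `group` (values `"key_species"`, `"cyprinid"`, `"piscivore"`), `station`, `year`, and `cpue`; the column and value names are illustrative, not from the actual BHI data.
```{r trend-sketch, eval = FALSE}
## steps 1 & 3: final time period; per time series, sqrt-transform,
## z-score, fit a linear regression, and extract the slope
slopes <- cpue %>%
  filter(year %in% 2004:2013) %>%
  group_by(basin, group, station) %>%
  mutate(z = as.numeric(scale(sqrt(cpue)))) %>%
  summarise(slope = coef(lm(z ~ year))[["year"]]) %>%
  ungroup()

## step 4: mean slope for each group within each basin
group_means <- slopes %>%
  group_by(basin, group) %>%
  summarise(mean_slope = mean(slope, na.rm = TRUE)) %>%
  ungroup()

## steps 5-6: functional group slope = mean of cyprinid and piscivore means;
## overall basin slope = mean of key species and functional group slopes
basin_trend <- group_means %>%
  mutate(indicator = ifelse(group == "key_species", "key_species", "functional")) %>%
  group_by(basin, indicator) %>%
  summarise(mean_slope = mean(mean_slope)) %>%
  group_by(basin) %>%
  summarise(trend = mean(mean_slope)) %>%
  ungroup()
```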
## Data sources
_Here you can record where the data comes from, where it's stored, potential concerns with the data, why you included or excluded certain data, etc:_
**ICES**: Data extracted from the database and sent by Hjalte Parner on Feb 10 2016.
_Note from Parner: "extraction from our database classified into HELCOM Assessment Units – HELCOM sub basins with coastal WFD water bodies or water types"_
Pros and cons of using these data:
- Pros: these are the most recent published data and thus reflect the most current conditions of water quality...
- Cons: these datasets don't have full spatial coverage, and their temporal coverage is uneven, so for some regions we need to do gapfilling...
Reasons for excluding certain datasets:
Direct measurements of nutrient levels (e.g. phosphate, nitrate) were excluded from this subgoal because not every region measures these chemicals regularly, and we didn't have time-series data on nutrients from which to calculate a trend...
# Data prep process
## Setup
This section sets up directories, loads commonly used libraries, defines functions, etc., to prepare for the next steps of data prep.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
## load common libraries, directories, functions, etc
## Libraries
library(readr) # install.packages('readr')
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
library(tools)
## Directories
dir_prep = '~/github/toolbox-demo/prep'
dir_cw = file.path(dir_prep, 'CW')
dir_layers = '~/github/toolbox-demo/region2016/layers'
```
## Read in data and initial exploration
ICES and SMHI data have non-overlapping observations and were combined into one data set for our use. Some observations were not assigned to any region (i.e. no region ID attached) and were omitted.
Both data sets contain profile data (e.g. temperature at multiple depths), while secchi is only measured once per profile. We need only unique secchi records, so duplicates were removed.
_Basic functions we will encounter and you will use often: `select()`, `mutate()`, `filter()`, `rename()`._
```{r read-in-data, echo = FALSE}
#### Read in raw data files ####
## read in the ICES set
data1 = read_csv(file.path(dir_cw, 'example_data/ices_secchi.csv'))

## quick review of the data set
head(data1)
str(data1)
## other functions you could also use for data review
## colnames(data1)
## dim(data1)

#### Initial filtering ####
ices <- data1 %>% data.frame() %>%
  select(bhi_id = BHI_ID, secchi, year = Year, month = Month,
         lat = Latitude, lon = Longitude, date = Date) %>%
  mutate(date = as.Date(date, format = "%Y-%m-%d"))
head(ices)
str(ices)

## which ICES data have a BHI_ID of NA?
ices.na <- ices %>%
  filter(is.na(bhi_id))
nrow(ices.na) # counted a total of 1684

## you could explore further why these observations have no ID attached,
## but for simplicity and illustration we will ignore bhi_id = NA
ices <- ices %>%
  filter(!is.na(bhi_id))

#### Remove duplicated data within each data set ####
## is any data duplicated in ices itself?
ices.duplicated = duplicated(ices)
sum(ices.duplicated) # 180963: MANY duplicates

## keep only unique records
new_ices = unique(ices)
nrow(new_ices) # 33018 unique records
```
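The chunk above preps the ICES set only; the SMHI set would be formatted the same way. A minimal sketch of the combination step described earlier, assuming a hypothetical `new_smhi` data frame with the same columns as `new_ices`:
```{r combine-sketch, eval = FALSE}
## combine the two sources (their observations don't overlap) and
## drop any records duplicated across them; `new_smhi` is hypothetical
## here and would be prepped exactly like `new_ices`
combined <- bind_rows(new_ices, new_smhi) %>%
  distinct()
```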
## Select only summer observations
Only summer months in recent years were relevant to our use. Therefore we filtered for data in:
- Months 6-9 (June, July, August, September)
- Years 2010-2015
The plots show that some regions don't have good data coverage; some are missing data for the most recent years, such as regions 22 and 25. Water quality data appear to make more sense at the basin level, and will be aggregated to the basin level in the next section. (This is an uncommon case, but it's left here for demonstration.)
_Plotting is a good way to spot data gaps and other potential problems. Without plotting, we might have missed these temporal gaps._
```{r select-summer-data}
## select summer months in recent years
summer = new_ices %>%
  filter(month %in% c(6:9)) %>%
  filter(year %in% c(2010:2015))
head(summer)

## plot: by month
ggplot(summer) +
  geom_point(aes(month, secchi)) +
  facet_wrap(~bhi_id, scales = "free_y")

## plot: by year
ggplot(summer) +
  geom_point(aes(year, secchi)) +
  facet_wrap(~bhi_id)
```
## Calculate mean summer secchi depth by region
Here we calculated the mean summer secchi depth per year for each region.
_Plotting per region by year showed that not all regions have continuous and adequate data. There are different ways to deal with that, which we will not explore today. You can read more on gapfilling in our [OHI Manual](http://ohi-science.org/manual/#gapfilling)._
```{r mean-secchi-depth}
## calculate monthly means for each region, year, and month
mean_months = summer %>%
  select(bhi_id, year, month, secchi) %>%
  group_by(bhi_id, year, month) %>%
  summarise(mean_secchi = round(mean(secchi, na.rm = TRUE), 1)) %>%
  ungroup()
head(mean_months)

## plot monthly means
ggplot(mean_months) +
  geom_point(aes(year, mean_secchi, colour = factor(month))) +
  geom_line(aes(year, mean_secchi, colour = factor(month))) +
  facet_wrap(~bhi_id) +
  scale_y_continuous(limits = c(0, 10))

## calculate summer means by region
## region summer means = mean of region monthly mean values
mean_months_summer = mean_months %>%
  group_by(bhi_id, year) %>%
  summarise(mean_secchi = round(mean(mean_secchi, na.rm = TRUE), 1)) %>%
  ungroup() %>%
  rename(rgn_id = bhi_id) #### rgn_id is a required field in a data layer! ####

## plot summer means by region
## note: bhi_id was renamed to rgn_id above, so we facet by rgn_id
ggplot(mean_months_summer) +
  geom_point(aes(year, mean_secchi)) +
  geom_line(aes(year, mean_secchi)) +
  facet_wrap(~rgn_id) +
  scale_y_continuous(limits = c(0, 10))
```
_Congratulations! You have successfully wrangled a large dataset and produced a clean data layer. In order for the Toolbox to use it for calculation, we need to save it in the right location and register it. Let's go back to the Training page for a minute and learn from there._
## Save data layer to `layers` folder
When modifying existing or creating new data layers in the prep folder, save the ready-to-use layers in the `layers` folder. We recommend saving each as a new *.csv* file with:
- a prefix identifying the goal (e.g. `fis_`)
- a suffix identifying your assessment (e.g. `_sc2016.csv`)
Modifying the layer name provides an easy way to track which data layers have been updated regionally and which still rely on global data. The original global layers (`_gl2016.csv` and `_gl2016placeholder.csv`) can then be deleted.
_**Tip**: filenames should not contain spaces; use underscores instead. This will reduce problems when R reads the files._
```{r}
write_csv(mean_months_summer, file.path(dir_layers, "cw_mean_secchi_region2016.csv"))
```
**Last step: render this file and use it to communicate with other human beings**:
Click _Knit_ at the top of this window to create an .md file, which can be displayed online as a webpage. Pull/commit/push this file as usual, then click on the .md file in your repo on GitHub.com to view the rendered file.
_Now that we have saved the layer in the `layers` folder, there is only one thing left to do: register it in the data registry. Let's switch back to where we left off in Training._