---
title: 'Example data use: Continuous surface temperatures for 185,000 lakes in the
Contiguous United States estimated using deep learning (1980-2020)'
author: "Jordan S. Read"
date: "8/25/2021"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## download files
This example assumes you have downloaded files from the "Continuous surface temperatures for 185,000 lakes in the Contiguous United States estimated using deep learning (1980-2020)" data release at https://doi.org/10.5066/P9CEMS0M
Those files can be placed anywhere accessible to you; for this example, they were downloaded into a directory called 'downloaded_files'.
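If you'd rather fetch the files programmatically, here is a minimal sketch using the sbtools package; it assumes the ScienceBase item ID that appears in the S3 URL later in this document (60341c3ed34eb12031172aa6) is the item holding the files you need, so confirm the item ID(s) on the landing page at the DOI above:
```r
# sketch only: sbtools downloads files attached to a ScienceBase item
# install.packages('sbtools') # if not already installed
library(sbtools)
dir.create('downloaded_files', showWarnings = FALSE)
# item ID taken from the S3 URL in the Python example below; confirm before use
item_file_download('60341c3ed34eb12031172aa6', dest_dir = 'downloaded_files')
```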
```{r loading}
library(tidyverse) # I used packageVersion 1.3.1 for this
library(ncdf4) # I used packageVersion 1.17 for this
data_dir <- 'downloaded_files'
```
```{r, echo = FALSE}
data_dir <- 'out_data' # this is where the data actually are for me
```
## find data for _one_ lake
Here, we're looking for information on a single lake. For this example, I used Smith et al. (2021) to find a lake identifier for Lake Mendota, WI.
Smith, N.J., K.E. Webster, L.K. Rodriguez, K.S. Cheruvelil, and P.A. Soranno. 2021. LAGOS-US LOCUS v1.0: Data module of location, identifiers, and physical characteristics of lakes and their watersheds in the conterminous U.S. ver 1. Environmental Data Initiative. https://doi.org/10.6073/pasta/e5c2fb8d77467d3f03de4667ac2173ca (Accessed 2021-08-24).
```{r lake_id}
mendota_lake_nhdid <- "nhdhr_143249470" # from LAGOS-US lake_nhdid, for lagoslakeid = 5371
```
Read in the metadata file:
```{r}
all_metadata <- read_csv(file.path(data_dir, 'lake_metadata.csv'))
all_metadata
```
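If you want a quick look at the available columns before filtering, `glimpse()` (loaded with the tidyverse) prints one row per column:
```r
glimpse(all_metadata)
```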
Use this lake ID to filter the metadata to just Lake Mendota:
```{r}
mendota_info <- filter(all_metadata, site_id == mendota_lake_nhdid)
mendota_info
```
```{r, echo = FALSE}
data_dir <- 'tmp' # this is where the data actually are for me
```
Now, use the information in the metadata table to construct file names for the prediction and weather NetCDF files that contain Lake Mendota. Since the predictions and weather data are each split across three files, you'll need to read from the right one (this assumes you have downloaded the data release files to your local directory, which in this case is "downloaded_files/"):
```{r}
predict_fl <- file.path(data_dir, sprintf("%s_predicted_temp_%s.nc", mendota_info$group_id, mendota_info$group_bbox))
weather_fl <- file.path(data_dir, sprintf("%s_weather_%s.nc", mendota_info$group_id, mendota_info$group_bbox))
# check ?sprintf() if you are unfamiliar with it
# make sure you have this file locally:
file.exists(predict_fl)
```
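The constructed paths follow the release's naming pattern of group ID, variable, and bounding box (e.g., `01_predicted_temp_N24-53_W98-126.nc`). It's worth confirming the weather file is present as well:
```r
file.exists(weather_fl)
```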
Open the NetCDF file, access the time information, and use that to pull predicted temperatures for Lake Mendota.
```{r}
nc <- nc_open(predict_fl)
# look at the format of this file:
nc
```
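The printed summary is long; if you only want the variable names, the ncdf4 object stores its variables in the `var` element:
```r
# list just the variable names in the file
names(nc$var)
```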
The "site_id" (also a value in the lake metadata table) is used to uniquely identify each lake. We need to figure out which index (shortened to `idx` here) Lake Mendota's site_id is located in:
```{r}
mendota_predict_idx <- ncvar_get(nc, 'site_id') %>% {. == mendota_lake_nhdid} %>% which()
```
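If the magrittr-style pipeline above is unfamiliar, an equivalent in base R is:
```r
# same result without pipes
mendota_predict_idx <- which(ncvar_get(nc, 'site_id') == mendota_lake_nhdid)
```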
We're using a subset of the time period here to show that you don't always have to access all of the data at once.
```{r}
time_origin <- ncdf4::ncatt_get(nc, 'time', attname = 'units')$value %>% str_remove("days since ")
time <- ncvar_get(nc, 'time') + as.Date(time_origin)
# Find the time index when 2017 starts. Since the dataset goes all the way back to 1980, we'll use this index to access a subset of the data
time_start <- which(time == as.Date('2017-01-01'))
time_end <- length(time)
# see ?ncdf4::ncvar_get if you need help with what these argument values mean
mendota_surftemp <- ncvar_get(nc, 'surftemp', start = c(time_start, mendota_predict_idx), count = c(-1, 1))
nc_close(nc)
```
The same access pattern can be used to access the weather data. But since weather is organized according to lat/lon cells, and some cells cover more than one lake, the `weather_id` (also a value in the lake metadata table) is the unique identifier for weather locations.
```{r}
nc <- nc_open(weather_fl)
mendota_weather_idx <- ncvar_get(nc, 'weather_id') %>% {. == mendota_info$weather_id} %>% which()
mendota_shortwave <- ncvar_get(nc, 'dswrfsfc', start = c(time_start, mendota_weather_idx), count = c(-1, 1))
nc_close(nc)
```
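At this point it can be handy (though not part of the release workflow) to collect the extracted vectors into a single tibble keyed by date:
```r
# combine the subset of the time vector with the two extracted series
mendota_df <- tibble(
  date = time[time_start:time_end],
  surftemp_degC = mendota_surftemp,
  shortwave_Wm2 = mendota_shortwave
)
head(mendota_df)
```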
Now plot the data for surface temperature and shortwave radiation:
```{r, plot_single}
plot(time[time_start:time_end], mendota_surftemp, xlim = as.Date(c('2017-01-01','2020-12-31')),
pch = 16, xlab = '', ylab = 'EA-LSTM-predicted surface water temperature (°C)')
plot(time[time_start:time_end], mendota_shortwave, xlim = as.Date(c('2017-01-01','2020-12-31')),
pch = 16, xlab = '', ylab = 'NLDAS downwelling shortwave radiation (W/m2)')
```
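If you prefer ggplot2 (loaded with the tidyverse) over base graphics, here is a sketch of the same surface temperature plot using the `mendota_df` tibble built above:
```r
ggplot(mendota_df, aes(date, surftemp_degC)) +
  geom_point(size = 0.8) +
  labs(x = NULL, y = 'EA-LSTM-predicted surface water temperature (°C)')
```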
## access data for all lakes (within a single file) at once
Load the sf package and specify a single time of interest, July 1st, 2017 in this example:
```{r}
library(sf) # I used packageVersion 0.9.8 for this
time_idx <- which(time == as.Date('2017-07-01'))
```
Open the prediction files back up. Keep in mind the predictions are split across three files, so to cover the complete Contiguous U.S. you need to combine data from all of them.
Set the data up to be ready to plot for the whole dataset:
```{r}
pred_fls <- all_metadata %>% mutate(filepath = file.path(data_dir, sprintf("%s_predicted_temp_%s.nc", group_id, group_bbox))) %>%
pull(filepath) %>% unique()
plot_data <- purrr::map(pred_fls, function(pred_fl){
nc <- nc_open(pred_fl)
# get temperature for a single day but for all lakes at once:
all_surftemp <- ncvar_get(nc, 'surftemp', start = c(time_idx, 1), count = c(1, -1))
lake_lat <- ncvar_get(nc, 'lat')
lake_lon <- ncvar_get(nc, 'lon')
nc_close(nc)
tibble(surftemp = all_surftemp, lat = lake_lat, lon = lake_lon)
}) %>% bind_rows() %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
st_transform("+proj=lcc +lat_1=30.7 +lat_2=29.3 +lat_0=28.5 +lon_0=-91.33333333333333 +x_0=999999.9999898402 +y_0=0 +ellps=GRS80 +datum=NAD83 +to_meter=0.3048006096012192 +no_defs")
```
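Before plotting, a quick, optional sanity check that all three files contributed lakes:
```r
nrow(plot_data)
summary(plot_data$surftemp)
```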
Plot July 1st temperatures estimated for these lakes using ggplot2:
```{r, plot_all}
ggplot() +
geom_sf(data = plot_data, size = 0.5, aes(col=surftemp)) +
scale_colour_viridis_c(option = "magma")
```
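To save the map to disk, `ggsave()` writes the most recently drawn ggplot; the output filename here is just an example:
```r
ggsave('lake_surftemp_2017-07-01.png', width = 8, height = 5, dpi = 300)
```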
### Python
This example uses Python 3. I installed fsspec, xarray, hvplot, s3fs, h5netcdf, dask, pykdtree, scipy, pyproj, datashader, and GeoViews. This example was based on Rich Signell's notebook.
```python
import fsspec
import xarray as xr
import numpy as np
import hvplot.pandas
import geoviews as gv
from holoviews.operation.datashader import rasterize
url = 's3://prod-is-usgs-sb-prod-publish/60341c3ed34eb12031172aa6/01_predicted_temp_N24-53_W98-126.nc'
```
verify we have access:
```python
fs = fsspec.filesystem('s3', anon=True)
fs.size(url)
1180271102
```
open the dataset and request chunks:
```python
ds = xr.open_dataset(fs.open(url), chunks={'time':24})
ds.surftemp
<xarray.DataArray 'surftemp' (site_id: 62560, time: 14976)>
dask.array<open_dataset-75f138412ff0f0130631cfe4fbaae841surftemp, shape=(62560, 14976), dtype=float64, chunksize=(62560, 24), chunktype=numpy.ndarray>
Coordinates:
alt (site_id) float64 dask.array<chunksize=(62560,), meta=np.ndarray>
lat (site_id) float64 dask.array<chunksize=(62560,), meta=np.ndarray>
lon (site_id) float64 dask.array<chunksize=(62560,), meta=np.ndarray>
* site_id (site_id) |S44 b'nhdhr_{C0902D8E-5338-4023-B399-F6758BDDD6A1}' ....
* time (time) datetime64[ns] 1980-01-01 1980-01-02 ... 2020-12-31
Attributes:
units: °C
long_name: Surface water temperature [°C]
least_significant_digit: [2]
```
define a plotting function for use later:
```python
def plot_surftemp(da, label=None):
# Convert Xarray to Pandas dataframe so we can use hvplot.points for visualization
df = da.to_pandas().to_frame()
    # The dataframe just has surftemp, so add longitude and latitude as columns
df = df.assign(latitude=ds['lat'])
df = df.assign(longitude=ds['lon'])
df.rename(columns={0: "surftemp"}, inplace=True)
p = df.hvplot.points('longitude', 'latitude', geo=True,
c='surftemp', colorbar=True, size=14, label=label)
return (p * gv.tile_sources.OSM)
```
select a time slice and plot:
```python
select_time = '2017-06-01'
var = 'surftemp'
da = ds[var].sel(time=select_time)
plot_surftemp(da, label=f'{var}:{select_time}')
```