---
title: "Part I: Standardized dataset creation"
author: "Phil Savoy, Mike Vlah, Lauren Koenig"
date: "Last updated on `r format(Sys.time(), '%d %B, %Y')`"
output: html_document
knit: (function(inputFile, encoding) {
rmarkdown::render(inputFile, encoding = encoding, output_dir = "documentation") })
---
# Description
<font size="5">Summary of major workflow components</font>
This workflow is part of a series that describes the preparation of data used in [Bernhardt et al. (2022)](https://doi.org/10.1073/pnas.2121976119). The creation of this dataset consists of several sub-workflows, as described below.
* **Part I: Standardized dataset creation:** Part I is the preparation of standardized datasets of stream metabolism from the [StreamPULSE data portal](https://data.streampulse.org/) and terrestrial ecosystem fluxes from the [FLUXNET2015 dataset](https://fluxnet.org/data/fluxnet2015-dataset/). Everything generated in Part I is included in `output_data` so that other users could use this larger dataset if they choose to work with a different subset of data than we used.
* **Part II. Filtering, gap-filling, and calculating metrics:** In Part II the standardized dataset of lotic and terrestrial metabolism data was filtered down to the subset of sites used in Bernhardt et al. (2022). After filtering, additional descriptive metrics were calculated to use in analysis.
* **Part III. Plotting and export of stats dataset:** Datasets from part II are minimally subsetted and recast for plotting convenience. Figures 1-6 are generated. An analysis-ready dataset is compiled.
* **Part IV. Structural equation modeling:** Code used for structural equation model (SEM) analysis of the annual metabolism dataset. We used an observed variables model to estimate the effect of light (PAR reaching the stream surface) and hydrologic variability (skewness of daily discharge) on annual river GPP.
<font size="5">In this document</font>
* **Preparation of standardized lotic metabolism data:** The preparation of a standardized dataset of stream metabolism available through the [StreamPULSE data portal](https://data.streampulse.org/). The metabolism data, along with ancillary data and additional model estimates, are placed into a standardized format to facilitate investigation by other researchers.
* **Preparation of standardized terrestrial metabolism data:** The preparation of a standardized dataset of terrestrial eddy covariance derived GPP and ER from the [FLUXNET2015 dataset](https://fluxnet.org/data/fluxnet2015-dataset/).
<font size="5">Key</font>
Throughout this document certain terms are indicated with changes in the color or emphasis of text. Please refer to the following key for their meaning:
***
* <span style="color:orange">**R functions**</span>
* **R Package**
* <span style="color:SlateBlue">**R executable scripts**</span>
* **<span style="color:#009688">"Column name"</span>**
* <i><span style="color:#00cc99">Output dataset</span></i>
* <i><span style="color:#BB1AA5">Output included in **BernhardtMetabolism** package</span></i>
***
Note: this R Markdown file is provided because some may find it instructive to review the code; however, because of the size and complexity of this portion of the workflow, the underlying scripts and data it relies on are not provided, and users will not be able to independently reproduce the results.
```{r Setup environment, message = FALSE, include = FALSE, eval = FALSE}
#Before proceeding, load the necessary packages
library("StreamLightUtils")
library("StreamLight")
library("plyr")
library("stringr")
library("dplyr")
library("mgcv")
library("here")
```
```{r Create here file, message = FALSE, include = FALSE, eval = FALSE}
#Create a here file in the project location (first time only)
if (!file.exists(".here")) here::set_here()
```
# {.tabset .tabset-pills}
---
## Preparation of standardized lotic metabolism data
---
<font size="6">Preparation of standardized lotic metabolism data</font>
### 1. Acquiring stream metabolism data and generating basic information for lotic sites
Data used for this study come from the [StreamPULSE data portal](https://data.streampulse.org/). This portal houses data available from several sources. The steps used to acquire the data are described below.
#### 1.1 Acquiring Appling et al. (2018) data
The [StreamPULSE data portal](https://data.streampulse.org/) includes stream metabolism data generated by [Appling et al. (2018)](https://www.nature.com/articles/sdata2018292). The data from [Appling et al. (2018)](https://www.nature.com/articles/sdata2018292) are also available as part of a USGS [ScienceBase data release](https://www.sciencebase.gov/catalog/item/59bff507e4b091459a5e0982) and the metabolism estimates from this data release are also available [here](https://www.sciencebase.gov/catalog/item/59eb9c0ae4b0026a55ffe389).
#### 1.2 Acquiring and unpacking StreamPULSE data
From the [StreamPULSE data portal](https://data.streampulse.org/) two sets of data were downloaded:
* Download a table of site information titled "All basic site data" (all_basic_site_data.csv.zip) from https://data.streampulse.org/download_bulk and place it in <i>.../data/StreamPULSE</i>
* Bulk download all site data ("All public model objects in RDS format") from https://data.streampulse.org/download_bulk and place in <i>.../data/StreamPULSE</i>
Data was acquired on 6/15/2021, and once downloaded the site information and metabolism data were unpacked.
```{r Unpack StreamPULSE, include = FALSE, eval = FALSE}
#===============================================================================
#Unpacking the StreamPULSE site information and metabolism data
#===============================================================================
#Unzip the downloaded StreamPULSE site information. For ease and transparency,
#the acquisition date of the data pull is recorded in the workflow.
#Date of StreamPULSE data acquisition
acquire_date <- "6_15_2021"
#Unzip the StreamPULSE site information
unzip(
here::here("data", "StreamPULSE", paste0("all_basic_site_data_", acquire_date, ".zip")),
exdir = here::here("data", "StreamPULSE")
)
#Unzip the StreamPULSE metabolism data (all_sp_model_objects.zip)
#*Note, rename files for CO_Cameo2, CO_WindyGap2, NC_MC751 because they had incorrect formatting.
unzip(
here::here("data", "StreamPULSE", paste0("all_sp_model_objects_", acquire_date, ".zip")),
exdir = here::here("data", "StreamPULSE")
)
```
```{r Rename, include = FALSE, eval = FALSE}
#Note: before proceeding, manually rename the files for CO_Cameo2,
#CO_WindyGap2, and NC_MC751, which had incorrect formatting in the bulk download
```
#### 1.3 Generate site information for the CONUS
Basic site information was generated for all sites with modeled metabolism data in the continental United States using <span style="color:SlateBlue">**Lotic_site_basic.R**</span>. A few steps are done at this stage including:
##### Export a set of basic site information
A set of basic site information (site ID, site name, data source, latitude, longitude) was created in a standardized format.
```{r Fix latitudes, include = FALSE, eval = FALSE}
#Note: the latitudes for CO_Forest4, CO_Forestsite, and CO_Alpine3 were
#manually changed from 29 to 39; the original values would place these sites
#in northern Mexico and appear to be data entry errors
```
```{r Basic lotic site information, include = FALSE, eval = FALSE}
#Get a formatted table of basic site information for all lotic sites
#At this stage nwis_05406457 was removed because it and WI_BEC are duplicate sites
source(here::here("R", "executables", "Lotic_site_basic.R"))
#Output file: <i>../output/lotic_site_basic.rds</i>
```
##### Extract information on the timeframes of data availability
At each site, the start and end years of available metabolism data were calculated using <span style="color:SlateBlue">**Lotic_date_ranges.R**</span>. Additionally, the availability of data for individual site-years was determined. This information was largely used to inform the downloading of other data sources later in the workflow.
```{r Calculate date ranges, include = FALSE, eval = FALSE}
#Find the date ranges for each site
source(here::here("R", "executables", "Lotic_date_ranges.R"))
#Output file: <i>../output/lotic_timeframes.rds</i>
#Output file: <i>../output/sp_site_years.rds</i>
#Output file: <i>../output/powell_site_years.rds</i>
```
##### Get more detailed site information for CONUS sites
A more detailed set of site information for CONUS sites was obtained using <span style="color:SlateBlue">**Lotic_site_info.R**</span>.
Sites were filtered for the CONUS so they could be associated with various standardized products. After CONUS sites were selected a few things were done at this step:
- Where possible, each site was linked to the nearest NHDPlus_v2 stream reach. We identified the NHDPlus_v2 catchment id (COMID) closest to the latitude and longitude of the site using the **nhdplusTools** package in R, followed by manual inspection of flowlines using a combination of maps and imagery to ensure that each site was associated with the appropriate NHDPlus_v2 stream reach. Because NHDPlus_v2 flowlines are medium resolution, several smaller stream sites do not overlap with a flowline. In these cases where manual inspection indicated that small streams could not reliably be linked to a NHDPlus_v2 flowline, we indicated COMID as "NA." Once sites were assigned a COMID, they could be cross-referenced with other datasets.
- Stream order information for each reach was pulled from the NHDplus_v2 flowline value-added attributes based on COMID.
```{r Advanced site information, include = FALSE, eval = FALSE}
source(here::here("R", "executables", "Lotic_site_info.R"))
#Output file: <i>../output/lotic_site_info.rds</i>
```
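The COMID lookup step can be sketched with the **nhdplusTools** package. The coordinates below are illustrative, and as noted above, manual inspection is still required before accepting an assigned COMID:

```{r COMID lookup sketch, eval = FALSE}
#Minimal sketch of linking one site to its nearest NHDPlus_v2 reach
#(illustrative coordinates; small streams may still need COMID set to NA)
library(nhdplusTools)
library(sf)

site_pt <- sf::st_sfc(sf::st_point(c(-79.05, 35.90)), crs = 4326)

#Nearest NHDPlus_v2 catchment COMID for the site location
comid <- nhdplusTools::discover_nhdplus_id(site_pt)

#Cross-reference the flowline value-added attributes to pull stream order
vaa <- nhdplusTools::get_vaa(atts = c("comid", "streamorde"))
stream_order <- vaa$streamorde[vaa$comid == comid]
```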
### 2. Downloading incoming light and LAI data
Timeseries of downwelling shortwave radiation and leaf area index (LAI) were downloaded for two purposes. First, these are two of the inputs used by the [**StreamLight** R package](https://github.com/psavoy/StreamLight) to generate estimates of light reaching the stream surface. Second, they are valuable in their own right for assessing patterns of metabolism relative to broad patterns in light availability and canopy cover.
#### 2.1 NLDAS downwelling shortwave radiation data
North American Land Data Assimilation System (NLDAS) downwelling shortwave radiation (W m^-2^) data were downloaded via the [data rods service](https://disc.gsfc.nasa.gov/information/tools?title=Hydrology%20Data%20Rods). The download and processing of NLDAS data was handled by the <span style="color:orange">**NLDAS_DL_bulk**</span> and <span style="color:orange">**NLDAS_proc**</span> functions from the [**StreamLightUtils** R package](https://github.com/psavoy/StreamLightUtils).
```{r Downloading and processing NLDAS data, include = FALSE, eval = FALSE}
source(here::here("R", "executables", "NLDAS_data_prep.R"))
#Output file: <i>../output/NLDAS_processed.rds</i>
```
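As a rough sketch of this step (argument names follow the **StreamLightUtils** documentation and should be treated as assumptions to verify against the installed version):

```{r NLDAS sketch, eval = FALSE}
#Bulk download NLDAS shortwave radiation and process it for all sites
#site_locs is assumed to be a table with columns Site_ID, Lat, and Lon
library(StreamLightUtils)

NLDAS_DL_bulk(
  save_dir = here::here("data", "NLDAS"),
  site_locs = site_locs
)

NLDAS_processed <- NLDAS_proc(
  read_dir = here::here("data", "NLDAS"),
  Site_IDs = site_locs$Site_ID
)
```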
#### 2.2 MODIS leaf area index data
The MODIS leaf area index (LAI) (m^2^ m^-2^) product (MCD15A2H-006) (500 m, 8-day) was acquired through [AppEEARS](https://lpdaacsvc.cr.usgs.gov/appeears/) by generating a set of site locations and requesting data from 01-01-2005 to 09-01-2020.
```{r AppEEARS request sites, include = FALSE, eval = FALSE}
#Get AppEEARS request sites
source(here::here("R", "executables", "appeears_request_sites.R"))
```
After submitting and downloading the AppEEARS request the zipped files were unpacked and processed using the <span style="color:orange">**AppEEARS_unpack**</span> and <span style="color:orange">**AppEEARS_proc**</span> functions from **StreamLightUtils**.
Not all sites had LAI data in the AppEEARS request, and for those sites data was then downloaded using the **MODISTools** package for a 1km buffer around the site location. The timeseries for each pixel was processed separately and then a representative median timeseries of filtered LAI was returned for the site.
Timeseries of LAI were smoothed and gap-filled to extract the seasonal phenological signal following the approach of [Gu et al. (2009)](https://link.springer.com/chapter/10.1007/978-1-4419-0026-5_2). The **StreamLightUtils** package implements this fitting method by using the **phenofit** R package [found here](https://github.com/kongdd/phenofit) (Kong 2020).
```{r Unpacking and processing LAI data, include = FALSE, eval = FALSE}
#After getting the AppEEARS request, download data for missing sites
source(here::here("R", "executables", "LAI_get_missing.R"))
#Unpack & process AppEEARS data, combine with single pixel data
source(here::here("R", "executables", "LAI_data_prep.R"))
LAI_data_prep(fit_method = "Gu")
#Output file: <i>../output/MOD_LAI_processed.rds</i>
```
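The single-pixel fallback for sites missing from the AppEEARS request can be sketched with **MODISTools**. The site location and date range below are illustrative; the product scale factor is applied from the metadata returned with the subset:

```{r LAI fallback sketch, eval = FALSE}
#Download MCD15A2H LAI pixels in a 1 km buffer around an example site
library(MODISTools)

lai_pixels <- MODISTools::mt_subset(
  product = "MCD15A2H",
  band = "Lai_500m",
  lat = 35.90, lon = -79.05,
  start = "2010-01-01", end = "2015-12-31",
  km_lr = 1, km_ab = 1,
  site_name = "example_site"
)

#Apply the scale factor, then take a representative median across pixels
lai_pixels$LAI <- lai_pixels$value * as.numeric(lai_pixels$scale)
lai_median <- aggregate(LAI ~ calendar_date, data = lai_pixels, FUN = median)
```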
### 3. Estimates of light at the stream surface
Estimates of light at the stream surface were generated using the [**StreamLight** package](https://github.com/psavoy/StreamLight), specifically the <span style="color:orange">**stream_light**</span> function. The model used in the <span style="color:orange">**stream_light**</span> function is described in [Savoy et al. (2021)](https://www.journals.uchicago.edu/doi/10.1086/714270) and an in-depth tutorial on how to use the model can be found in the [**StreamLight** documentation](https://psavoy.github.io/StreamLight/).
Running the model requires the creation of a parameter file for each site as well as driver files. The basic process used to produce each of these files is described below.
#### 3.1 Generate a parameter file for each site
There are several site parameters required to run **<span style="color:DarkOrange">stream_light</span>**:
- *<span style="color:#009faf">Lat</span>* The site latitude
- *<span style="color:#009faf">Lon</span>* The site longitude
- *<span style="color:#009faf">channel_azimuth</span>* Channel azimuth
- *<span style="color:#009faf">bottom_width</span>* Bottom width (m)
- *<span style="color:#009faf">BH</span>* Bank height (m)
- *<span style="color:#009faf">BS</span>* Bank slope
- *<span style="color:#009faf">WL</span>* Water level (m)
- *<span style="color:#009faf">TH</span>* Tree height (m)
- *<span style="color:#009faf">overhang</span>* Maximum canopy overhang (m)
- *<span style="color:#009faf">overhang_height</span>* Height of the maximum canopy overhang (m)
- *<span style="color:#009faf">x_LAD</span>* Leaf angle distribution
##### Channel azimuth (*<span style="color:#009faf">channel_azimuth</span>*)
Channel azimuth was calculated as the circular mean of azimuths for each stream reach based on latitude and longitude of the site location using NHDPlus_v2 hydrography data.
##### Width (*<span style="color:#009faf">bottom_width</span>*)
Channel width was estimated for each stream reach using either field measurements or regional empirical equations that relate cumulative upstream drainage area to width. For the majority of sites, channel width was calculated as the median of width records available during the date range encompassed by the metabolism estimates, acquired either from the USGS (for sites co-located at USGS gages) or from the StreamPULSE data portal. For sites where no width records were available, we estimated channel width based on region-specific empirical equations, as in [Gomez-Velez et al. (2015)](https://doi.org/10.1038/ngeo2567).
##### Bank height (*<span style="color:#009faf">BH</span>*)
Without detailed information on bank heights, a default value of 0.1 m was used for all sites.
##### Bank slope (*<span style="color:#009faf">BS</span>*)
Without detailed information on bank slopes, a default value of 100 was used for all sites.
##### Water level (*<span style="color:#009faf">WL</span>*)
Without detailed information on water levels, a default value of 0.1 m was used for all sites.
##### Tree height (*<span style="color:#009faf">TH</span>*)
Tree heights were derived using 30 m resolution global canopy height estimates from [Potapov et al. (2021)](https://www.sciencedirect.com/science/article/abs/pii/S0034425720305381?via%3Dihub). Heights were extracted using the site coordinates and a 60 m buffer extending into the riparian zone, and the 90th percentile of canopy heights within the buffer was returned for each site. The 90th percentile was used to capture tall canopy elements that could shade the channel.
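This extraction can be sketched with the **terra** package; the raster file name below is a placeholder, not the actual Potapov et al. (2021) file:

```{r Canopy height sketch, eval = FALSE}
#Sketch of extracting the 90th percentile canopy height in a 60 m buffer
#(placeholder file name and illustrative coordinates)
library(terra)

canopy <- terra::rast("global_canopy_height_30m.tif")
site_pt <- terra::vect(cbind(-79.05, 35.90), crs = "EPSG:4326")
site_buf <- terra::buffer(site_pt, width = 60)

TH <- terra::extract(
  canopy, site_buf,
  fun = function(x) quantile(x, probs = 0.9, na.rm = TRUE)
)
```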
```{r Generate parameter file, include = FALSE, eval = FALSE}
#Generate light model parameter files
source(here::here("R", "executables", "StreamLight_params.R"))
#Output file: <i>../output/lotic_streamlight_params.rds</i>
```
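Assembled together, a one-row parameter table of the general form used by **stream_light** might look like the following. All values are the defaults or placeholders described above, not measured values:

```{r Parameter table sketch, eval = FALSE}
#Illustrative parameter set using the defaults and placeholders above
site_params <- data.frame(
  Site_ID = "example_site",
  Lat = 35.90, Lon = -79.05,
  channel_azimuth = 180, #Circular mean of reach azimuths (degrees)
  bottom_width = 8.5,    #Median of available width records (m)
  BH = 0.1,              #Default bank height (m)
  BS = 100,              #Default bank slope
  WL = 0.1,              #Default water level (m)
  TH = 22,               #90th percentile riparian canopy height (m)
  overhang = 2.2,        #Placeholder maximum canopy overhang (m)
  overhang_height = NA,  #Placeholder height of maximum overhang (m)
  x_LAD = 1              #Spherical leaf angle distribution
)
```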
#### 3.2 Generate driver files
Driver files for **StreamLight** were created using NLDAS downwelling shortwave radiation and MODIS LAI data processed earlier. The driver files were made using the <span style="color:orange">**make_driver**</span> function in the **StreamLightUtils** package.
The structure of the model driver is as follows:
- **<span style="color:#009688">"local_time"</span>**: A POSIXct object in local time
- **<span style="color:#009688">"offset"</span>**: The UTC offset for local time (hours)
- **<span style="color:#009688">"jday"</span>**: A unique identifier for each day that combines year and day of year information in the format YYYYddd
- **<span style="color:#009688">"DOY"</span>**: The day of year (1-365 or 366 for leap years)
- **<span style="color:#009688">"Hour"</span>**: Hour of the day (0-23)
- **<span style="color:#009688">"SW_inc"</span>**: Total incoming downwelling shortwave radiation (W m^-2^). **StreamLightUtils** provides tools to get hourly data from NLDAS.
- **<span style="color:#009688">"LAI"</span>**: MODIS leaf area index (m^2^ m^-2^). **StreamLightUtils** provides tools to generate interpolated to daily values using the **<span style="color:DarkOrange">AppEEARS_proc</span>** function.
```{r Generate driver files, include = FALSE, eval = FALSE}
#Generate driver files
source(here::here("R", "executables", "StreamLight_drivers.R"))
#Output file: Individual driver files were output in the form "Site_ID"_driver.rds in <i>../output/driver files</i>
```
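A sketch of that call (argument names follow the **StreamLightUtils** documentation and should be verified against the installed version):

```{r Driver file sketch, eval = FALSE}
#Combine processed NLDAS radiation and MODIS LAI into per-site driver files
#site_locs is assumed to be a table with columns Site_ID, Lat, and Lon
library(StreamLightUtils)

make_driver(
  site_locs = site_locs,
  NLDAS_processed = NLDAS_processed,
  MOD_processed = MOD_LAI_processed,
  write_output = TRUE,
  save_dir = here::here("output", "driver files")
)
```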
#### 3.3 Run **<span style="color:DarkOrange">stream_light</span>** to estimate light at the stream surface
The **StreamLight** package was used to estimate photosynthetically active radiation reaching the stream surface at hourly timesteps. Hourly estimates at the stream surface were then aggregated to daily sums (mol m^-2^ d^-1^) so they could be incorporated into the metabolism synthesis data.
```{r Run StreamLight model, include = FALSE, eval = FALSE}
#Estimate light reaching the stream surface
source(here::here("R", "executables", "StreamLight_estimates.R"))
```
```{r Daily light estimates, include = FALSE, eval = FALSE}
#Generate daily results of modeled light
source(here::here("R", "executables", "StreamLight_daily.R"))
#Output file: <i>../output/modeled_light.rds</i>
```
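The daily aggregation amounts to summing the hourly surface estimates. A sketch, assuming the model output column `PAR_surface` is in µmol m^-2^ s^-1^ (an assumption about the output format):

```{r Daily PAR sketch, eval = FALSE}
#Aggregate hourly surface PAR (umol m-2 s-1) to daily sums (mol m-2 d-1)
#Each hourly value represents 3600 s; dividing by 1e6 converts umol to mol
library(dplyr)

daily_light <- hourly_light %>%
  dplyr::group_by(Year, DOY) %>%
  dplyr::summarize(
    Stream_PAR_sum = sum(PAR_surface * 3600 / 1e6, na.rm = TRUE),
    .groups = "drop"
  )
```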
### 4. Creating the standardized lotic dataset (<i><span style="color:#BB1AA5">lotic_standardized_full</span></i>)
<font size="5">Standardized timeseries</font>
Stream metabolism, riparian LAI, incoming light, and modeled estimates of light at the stream surface were compiled into a single set of standardized files using <span style="color:SlateBlue">**Lotic_standardized_metabolism.R**</span>, which leverages the <span style="color:orange">**synthesis_format**</span> function. At this point all negative GPP and positive ER values were replaced with NA (columns GPP and ER) but the raw data was retained (columns GPP_raw and ER_raw). In addition to stream data, incoming PAR, predicted stream surface PAR, and processed LAI generated in earlier steps are also included. To ensure that all files are formatted in the same manner, if any datasets are not present then the corresponding columns were populated with NA values.
```{r Compile StreamPULSE data, include = FALSE, eval = FALSE}
#Compile StreamPULSE data so a unified workflow can be used on StreamPULSE and Powell datasets
source(here::here("R", "executables", "Compile_StreamPULSE.R"))
```
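The sign-based screening described above amounts to a simple masking step (sketched here on a hypothetical data frame `df` holding the raw estimates):

```{r Sign screening sketch, eval = FALSE}
#Keep raw estimates, but mask physically implausible values:
#GPP cannot be negative, and ER (an oxygen-consuming flux) cannot be positive
df$GPP <- ifelse(df$GPP_raw < 0, NA, df$GPP_raw)
df$ER  <- ifelse(df$ER_raw  > 0, NA, df$ER_raw)
```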
```{r Standardized files, include = FALSE, eval = FALSE}
#Generate standardized files
source(here::here("R", "executables", "Lotic_standardized_metabolism.R"))
#Output file: <i>../output/lotic_standardized_metabolism.rds</i>
```
The final dataset contains the following columns:
* **<span style="color:#009688">"Site_ID":</span>** Unique site identifier
* **<span style="color:#009688">"Date":</span>** Date in YYYY-MM-DD format
* **<span style="color:#009688">"U_ID":</span>** Unique date identifier (format as year + DOY)
* **<span style="color:#009688">"Year":</span>** Year
* **<span style="color:#009688">"DOY":</span>** Day of year (1 to 365 or 366)
* **<span style="color:#009688">"GPP_raw":</span>** Stream GPP estimates (g O2 m^-2^ d^-1^) from Appling et al. (2018), raw data
* **<span style="color:#009688">"ER_raw":</span>** Stream ER estimates (g O2 m^-2^ d^-1^) from Appling et al. (2018), raw data
* **<span style="color:#009688">"GPP":</span>** Stream GPP estimates (g O2 m^-2^ d^-1^) from Appling et al. (2018), negative values replaced with NA
* **<span style="color:#009688">"ER":</span>** Stream ER estimates (g O2 m^-2^ d^-1^) from Appling et al. (2018), positive values replaced with NA
* **<span style="color:#009688">"K600":</span>** Model estimate of K600, the mean reaeration rate coefficient, scaled to a Schmidt number of 600, for this date. Value is the median of the post warmup MCMC distribution
* **<span style="color:#009688">"DO.obs":</span>** Mean dissolved oxygen concentration (mg O2 L^-1^) for the date (4am to 3:59am)
* **<span style="color:#009688">"DO.sat":</span>** Mean theoretical saturation concentration (mg O2 L^-1^) for the date (4am to 3:59am)
* **<span style="color:#009688">"temp.water":</span>** Mean water temperature (degreees C) for the date (4am to 3:59pm)
* **<span style="color:#009688">"discharge":</span>** Mean discharge (m^3^ s^-1^) for the date (4am to 3:59pm)
* **<span style="color:#009688">"PAR_sum":</span>** Daily sum of incoming above canopy PAR (mol m^-2^ d^-1^)
* **<span style="color:#009688">"Stream_PAR_sum":</span>** Daily sum of PAR estimated at the stream surface (mol m^-2^ d^-1^)
* **<span style="color:#009688">"LAI_proc":</span>** MODIS LAI data that has been processed and gap-filled (m^2^ m^-2^)
<font size="5">Basic site information (<i><span style="color:#BB1AA5">lotic_site_info_full</span></i>)</font>
```{r, include = FALSE, eval = FALSE}
#===============================================================================
#Create a final set of basic site information
#===============================================================================
full_final_table <- function() {
#Read in the standardized metabolism
lotic_standardized_metabolism <- readRDS(here::here("output", "lotic_standardized_metabolism.rds"))
#Read in the lotic site information and make sure it matches the standardized dataset
lsi1 <- readRDS(here::here("output", "lotic_site_info.rds"))
lsi <- lsi1[lsi1[, "Site_ID"] %in% names(lotic_standardized_metabolism), ]
#Read in the table of StreamLight parameters
params <- readRDS(here::here("output", "lotic_streamlight_params.rds"))
#Merge with select site parameters to create a set of basic site information
lotic_site_info_full2 <- merge(
lsi,
params[, c("Site_ID", "Azimuth", "TH")],
by = "Site_ID",
all.x = TRUE
)
#Reorder and rename the columns
lotic_site_info_full <- lotic_site_info_full2[, c("Site_ID", "Name", "Source",
"Lat", "Lon", "epsg_crs", "COMID", "VPU", "StreamOrde","Azimuth", "TH", "Width", "Width_src",
"WS_area_km2", "WS_area_src")]
#Export the final output
saveRDS(lotic_site_info_full, here::here("output", "lotic_site_info_full.rds"))
} #End full_final_table wrapper
full_final_table()
#Output file <i>../output/lotic_site_info_full.rds</i>
```
* **<span style="color:#009688">"Site_ID":</span>** Unique site identifier
* **<span style="color:#009688">"Name":</span>** Site long name
* **<span style="color:#009688">"Source":</span>** Site data source
* **<span style="color:#009688">"Lat":</span>** Site Latitude
* **<span style="color:#009688">"Lon":</span>** Site Longitude
* **<span style="color:#009688">"epsg_crs":</span>** Site coordinate reference system as EPSG code
* **<span style="color:#009688">"COMID":</span>** NHDPlus_v2 reach COMID
* **<span style="color:#009688">"VPU":</span>** Vector processing unit
* **<span style="color:#009688">"StreamOrde":</span>** Stream order (from NHDPlus_v2 flowlines)
* **<span style="color:#009688">"Azimuth":</span>** Channel azimuth calculated as the circular mean of azimuths for each stream reach based on latitude and longitude of the site location using NHDPlus_v2 hydrography data.
* **<span style="color:#009688">"TH":</span>** Tree height (m). Estimates were derived from Tree heights were derived using 30m resolution global canopy height estimates from [Potapov et al. (2021)](https://www.sciencedirect.com/science/article/abs/pii/S0034425720305381?via%3Dihub)
* **<span style="color:#009688">"Width":</span>** Channel width (m)
* **<span style="color:#009688">"Width_src":</span>** Source of channel width estimates. Values include: (NWIS field measurements, Regional geomorphic scaling coeff, StreamPULSE estimates)
* **<span style="color:#009688">"WS_area_km2":</span>** Watershed area (km^2^)
* **<span style="color:#009688">"WS_area_src":</span>** Source of watershed area estimates. Values include: (Appling2018_USGS2013, Appling2018_StreamStats, nwis_site_description, StreamStats, localuser_HBFLTER, localuser_UNHWQAL)
---
## Preparation of standardized terrestrial metabolism data
---
<font size="6">Preparation of standardized terrestrial metabolism data</font>
The FLUXNET2015 dataset contains data at multiple temporal resolutions. We compiled both daily data and annual data provided within the FLUXNET2015 dataset.
### 1. Compile daily FLUXNET data (<i><span style="color:#BB1AA5">fluxnet_standardized_full</span></i>)
FLUXNET2015 daily data were compiled into a standardized format similar to that of the lotic sites using <span style="color:SlateBlue">**fluxnet_daily_compile.R**</span>. At this point all negative GPP and positive ER values were replaced with NA (columns GPP and ER) but the raw data were retained (columns GPP_raw and ER_raw). Note, however, that some associated environmental variables are reported as daily means rather than sums (for example, precipitation or SW), so we advise against comparing the environmental data from the lotic and terrestrial sites. Please see the individual column descriptions below for more detailed information.
```{r Standardized FLUXNET daily files, include = FALSE, eval = FALSE}
#Compile all of the daily FLUXNET data
source(here::here("R", "executables", "fluxnet_daily_compile.R"))
#Output file: <i>../output/fluxnet_standardized.rds</i>
```
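A sketch of reading one FLUXNET2015 FULLSET daily (DD) file and mapping its variables into the standardized columns. The file name is an example, and the sign convention (respiration stored as a negative flux, to match the lotic dataset) follows the column descriptions below:

```{r FLUXNET daily sketch, eval = FALSE}
#Illustrative mapping of one FLUXNET2015 daily file (example file name)
flx <- read.csv("FLX_US-Ha1_FLUXNET2015_FULLSET_DD_1991-2012_1-4.csv")

daily <- data.frame(
  Date    = as.Date(as.character(flx$TIMESTAMP), format = "%Y%m%d"),
  GPP_raw = flx$GPP_NT_VUT_REF,
  ER_raw  = -1 * flx$RECO_NT_VUT_REF, #RECO sign-flipped: respiration negative
  Net     = -1 * flx$NEE_VUT_REF      #NEE sign-flipped to give NEP
)
daily$Year <- as.numeric(format(daily$Date, "%Y"))
daily$DOY  <- as.numeric(format(daily$Date, "%j"))

#Mask implausible values, retaining the raw columns
daily$GPP <- ifelse(daily$GPP_raw < 0, NA, daily$GPP_raw)
daily$ER  <- ifelse(daily$ER_raw  > 0, NA, daily$ER_raw)
```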
The final dataset contains the following columns:
* **<span style="color:#009688">"Date":</span>** Date in YYYY-MM-DD format
* **<span style="color:#009688">"U_ID":</span>** Unique date identifier (format as year + DOY)
* **<span style="color:#009688">"Year":</span>** Year
* **<span style="color:#009688">"DOY":</span>** Day of year (1:365 or 366)
* **<span style="color:#009688">"GPP_raw":</span>** FLUXNET GPP estimates (g C m^-2^ d^-1^) "GPP_NT_VUT_REF", raw data
* **<span style="color:#009688">"ER_raw":</span>** FLUXNET ER estimates (g C m^-2^ d^-1^) "RECO_NT_VUT_REF, raw data
* **<span style="color:#009688">"GPP":</span>** FLUXNET GPP estimates (g C m^-2^ d^-1^) "GPP_NT_VUT_REF", negative values replaced with NA
* **<span style="color:#009688">"ER":</span>** FLUXNET ER estimates (g C m^-2^ d^-1^) "RECO_NT_VUT_REF, positive values replaced with NA
* **<span style="color:#009688">"Net":</span>** FLUXNET NEE (changed to NEP by * -1) "NEE_VUT_REF"
* **<span style="color:#009688">"Temp":</span>** Average air temperature from half-hourly data (degrees C)
* **<span style="color:#009688">"Precip":</span>** Average precipitation from half-hourly data (mm)
* **<span style="color:#009688">"VPD":</span>** Average vapor pressure deficit from half-hourly data (hPa)
* **<span style="color:#009688">"SW":</span>** Average incoming shortwave radiation from half-hourly data (W m^-2)
### 2. Compile annual FLUXNET data (<i><span style="color:#BB1AA5">fluxnet_annual</span></i>)
Annual productivity data are already compiled and provided by FLUXNET. These data were collated here using <span style="color:SlateBlue">**fluxnet_annual_compile.R**</span>.
```{r Compile annual FLUXNET, include = FALSE, eval = FALSE}
#Compile all of the annual FLUXNET data
source(here::here("R", "executables", "fluxnet_annual_compile.R"))
#Output file: <i>../output/fluxnet_annual_compiled.rds</i>
```
The final dataset contains the following columns:
* **<span style="color:#009688">"Year":</span>** Year
* **<span style="color:#009688">"GPP":</span>** FLUXNET annual GPP (sum from daily estimates) (g C m^-2^ y^-1^) "GPP_NT_VUT_REF". Gross Primary Production, from Nighttime partitioning method, reference version selected from GPP versions using a model efficiency approach.
* **<span style="color:#009688">"ER":</span>** FLUXNET annual ER (sum from daily estimates) (g C m^-2^ y^-1^) "RECO_NT_VUT_REF". Ecosystem Respiration, from Nighttime partitioning method, reference selected from RECO versions using a model efficiency approach.
* **<span style="color:#009688">"Net":</span>** FLUXNET NEE (changed to NEP by * -1) "NEE_VUT_REF". I derived this from data of this description:Net Ecosystem Exchange, using Variable Ustar Threshold (VUT) for each year, reference selected on the basis of the model efficiency.
* **<span style="color:#009688">"Temp":</span>** Average air temperature from daily data (degrees C)
* **<span style="color:#009688">"Precip":</span>** Average precipition from daily data (mm)
* **<span style="color:#009688">"VPD":</span>** Average vapor pressure deficit from daily data (hPa)
* **<span style="color:#009688">"SW":</span>** Average incoming shortwave radiation from daily data (W m^-2^)
### 3. Basic FLUXNET site information (<i><span style="color:#BB1AA5">fluxnet_basic_info</span></i>)
```{r, include = FALSE, eval = FALSE}
#Output a table of basic site information for fluxnet sites
fluxnet_sites <- data.table::fread("C:/research/Datasets/FLUXNET/FLUXNET sites.csv", data.table = FALSE)
fluxnet_basic <- setNames(fluxnet_sites[, c("Site_ID", "Site_name", "Lat", "Lon")], c("Site_ID", "Name", "Lat", "Lon"))
saveRDS(fluxnet_basic, here::here("output", "fluxnet_basic.rds"))
#Output file: <i>../output/fluxnet_basic.rds</i>
```
### 4. Get annual MODIS NPP data (<i><span style="color:#BB1AA5">modis_annual_NPP</span></i>)
Annual MODIS NPP data (g C m^-2^ y^-1^) (MOD17A3HGF v006) was collected for each site-year of data through a combination of an AppEEARS request and downloading of single pixels. Similar to the LAI data, if a site was missing NPP data from the AppEEARS request, then annual NPP was downloaded for a 1km buffer around the site using the **MODISTools** package and the mean value was calculated. This was done for each site-year of missing data in the AppEEARS request. The annual MODIS NPP data was also converted from kg C m^-2^ y^-1^ to g C m^-2^ y^-1^ to provide consistent units across datasets.
```{r MODIS NPP data, include = FALSE, eval = FALSE}
#Unpack MODIS NPP (MOD17A3HGF v006) data from AppEEARS, download data for
#sites without AppEEARS data, and combine into annual NPP for each site-year
source(here::here("R", "executables", "NPP_data_prep.R"))
#Output file: <i>../output/MODIS_annual_NPP.rds</i>
```
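The single-pixel fallback and unit conversion can be sketched with **MODISTools**. The site and date range are illustrative; after applying the scale factor from the returned metadata, MOD17A3HGF NPP is in kg C m^-2^ y^-1^ and is converted to g C m^-2^ y^-1^ as described above:

```{r NPP fallback sketch, eval = FALSE}
#Download annual NPP (MOD17A3HGF) pixels in a 1 km buffer around a site
library(MODISTools)

npp_pixels <- MODISTools::mt_subset(
  product = "MOD17A3HGF",
  band = "Npp_500m",
  lat = 35.90, lon = -79.05,
  start = "2010-01-01", end = "2015-12-31",
  km_lr = 1, km_ab = 1,
  site_name = "example_site"
)

#Apply the scale factor (-> kg C m-2 y-1), average across the buffer,
#then convert to g C m-2 y-1 for consistency across datasets
npp_pixels$NPP_kg <- npp_pixels$value * as.numeric(npp_pixels$scale)
npp_annual <- aggregate(NPP_kg ~ calendar_date, data = npp_pixels, FUN = mean)
npp_annual$NPP_g <- npp_annual$NPP_kg * 1000
```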