diff --git a/case-studies/ancillary_data/Base Map/Boundary_administration.cpg b/case-studies/ancillary_data/Base Map/Boundary_administration.cpg
new file mode 100644
index 0000000..3ad133c
--- /dev/null
+++ b/case-studies/ancillary_data/Base Map/Boundary_administration.cpg
@@ -0,0 +1 @@
+UTF-8
\ No newline at end of file
diff --git a/case-studies/ancillary_data/Base Map/Boundary_administration.dbf b/case-studies/ancillary_data/Base Map/Boundary_administration.dbf
new file mode 100644
index 0000000..496b118
Binary files /dev/null and b/case-studies/ancillary_data/Base Map/Boundary_administration.dbf differ
diff --git a/case-studies/ancillary_data/Base Map/Boundary_administration.prj b/case-studies/ancillary_data/Base Map/Boundary_administration.prj
new file mode 100644
index 0000000..f45cbad
--- /dev/null
+++ b/case-studies/ancillary_data/Base Map/Boundary_administration.prj
@@ -0,0 +1 @@
+GEOGCS["GCS_WGS_1984",DATUM["D_WGS_1984",SPHEROID["WGS_1984",6378137.0,298.257223563]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]]
\ No newline at end of file
diff --git a/case-studies/ancillary_data/Base Map/Boundary_administration.sbn b/case-studies/ancillary_data/Base Map/Boundary_administration.sbn
new file mode 100644
index 0000000..a0c7780
Binary files /dev/null and b/case-studies/ancillary_data/Base Map/Boundary_administration.sbn differ
diff --git a/case-studies/ancillary_data/Base Map/Boundary_administration.sbx b/case-studies/ancillary_data/Base Map/Boundary_administration.sbx
new file mode 100644
index 0000000..72ac7a6
Binary files /dev/null and b/case-studies/ancillary_data/Base Map/Boundary_administration.sbx differ
diff --git a/case-studies/ancillary_data/Base Map/Boundary_administration.shp b/case-studies/ancillary_data/Base Map/Boundary_administration.shp
new file mode 100644
index 0000000..be5639c
Binary files /dev/null and b/case-studies/ancillary_data/Base Map/Boundary_administration.shp differ
diff --git a/case-studies/ancillary_data/Base Map/Boundary_administration.shp.xml b/case-studies/ancillary_data/Base Map/Boundary_administration.shp.xml
new file mode 100644
index 0000000..d80f7ca
--- /dev/null
+++ b/case-studies/ancillary_data/Base Map/Boundary_administration.shp.xml
@@ -0,0 +1,3 @@
+
+
In this notebook, we use Sentinel-2 data to extract a time series of total suspended sediments (TSS) for the purpose of monitoring water quality. The region of interest here is centred over Lake Tempe in South Sulawesi, Indonesia.

In addition to providing a general overview of TSS estimation, this notebook also demonstrates a number of technical and computational aspects of working in an open data cube (ODC) environment such as EASI. In particular, it touches on the following aspects:

- lazy (Dask-backed) loading of satellite data and the use of `.persist()`;
- pixel quality (QA) masking and scaling of the Sentinel-2 measurements;
- masking with a rasterised shapefile and filtering of non-water pixels using the MNDWI;
- temporal aggregation of the results (resampling and grouping);
- parallelised application of custom functions with `xr.apply_ufunc()`;
- extraction of per-pixel time series into a Pandas data frame.
In order to derive an estimate of TSS concentration (in mg/L) from the remote sensing data, we need a specific formula (algorithm) that characterises the relationship between the Sentinel-2 spectral band values and the true TSS measurements. Such a formula needs to be derived and calibrated against, among other data, TSS measurements sampled during one or more field campaigns on the lake of interest.
+In this demonstration notebook, we rely on pre-existing research published in the following manuscript:
E. Pandhadha et al., 2020. Total Suspended Solid (TSS) Estimation in Lake Tempe, South Sulawesi Using Sentinel-2B Imagery. *Journal of Engineering Technology and Applied Physics*, Special Issue on Remote Sensing for Sustainable Environment, no. 1 (2020). DOI: 10.33093/jetap.2020.x1.4
In this paper, the remote-sensing-based formula used to derive values for the $\text{TSS}$ parameter of interest is as follows:
$$
\text{TSS} = \alpha \cdot \text{NSMI} + \beta
$$

$$
\text{NSMI} = \frac{\text{red} + \text{green} - \text{blue}}{\text{red} + \text{green} + \text{blue}}
$$

$$
\alpha = 775.98, \qquad \beta = -93.606
$$

where $\text{red}$, $\text{green}$ and $\text{blue}$ correspond to the respective Sentinel-2 bands, and $\text{NSMI}$ represents the Normalised Suspended Material Index.

Note also that in the above paper, the authors implement a pre-processing step aiming to remove pixels affected by sun-glint in the time series of Sentinel-2 data. This step is not implemented in this notebook.
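As a side note, the formula above translates directly into a small helper function. The sketch below is illustrative only – the function name `tss_from_nsmi` and its signature are ours, not from the paper – and the notebook itself computes TSS inline further below.

```python
import xarray as xr

def tss_from_nsmi(red: xr.DataArray, green: xr.DataArray, blue: xr.DataArray,
                  alpha: float = 775.98, beta: float = -93.606) -> xr.DataArray:
    """Estimate TSS [mg/L] from Sentinel-2 reflectance via the NSMI.

    Implements TSS = alpha * NSMI + beta, with coefficients calibrated for
    Lake Tempe in Pandhadha et al. (2020).
    """
    nsmi = (red + green - blue) / (red + green + blue)  # Normalised Suspended Material Index
    return alpha * nsmi + beta
```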
First, let's import the key Python packages and supporting functions required in this notebook.

+### System
+import sys
+
+### Datacube
+import datacube
+from datacube.utils import masking
+from odc.algo import enum_to_bool
+
+### Data tools
+import numpy as np
+import xarray as xr
+import pandas as pd
+from astropy.stats import sigma_clip
+import geopandas as gpd
+import rasterio.features
+
+### Plotting
+%matplotlib inline
+import matplotlib.pyplot as plt
+
+### EASI tools
+sys.path.append('../tools/')
+from datacube_utils import display_map, mostcommon_crs, xarray_object_size
+
And let's now also connect to the EASI database:
+dc = datacube.Datacube(app="Lake_Tempe_TSS")
+
### Region of interest
+min_longitude, max_longitude = (119.87, 120.04) # Lake Tempe
+min_latitude, max_latitude = (-4.03, -4.198)        # note: listed north-to-south for this region
+
display_map( x = (min_longitude,max_longitude),
+ y = (min_latitude,max_latitude) )
+
A quick look at the Sentinel-2 dataset on the ODC Explorer for the current EASI deployment indicates that data is available from 2017 onwards. For reasonable loading times, we will here only use three years' worth of satellite data over the region of interest.
+### Dates of interest:
+min_date = '2018-01-01'
+max_date = '2021-01-01'
+
The Sentinel-2 product used in this notebook to calculate TSS is labelled `s2_l2a`.
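If in doubt about the exact product name, the products indexed in the connected datacube can be listed directly. A minimal, illustrative snippet (the columns returned may vary between ODC versions):

```python
### Illustrative only: list the products indexed in this datacube instance
products = dc.list_products()
print(products.head())   # product names, descriptions, etc.
```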
product = 's2_l2a' # Sentinel-2 product
+
We can now initialise the main parameters of a datacube query, which we can then use to check the dataset's native projection (`mostcommon_crs`) – as we don't need to re-project the dataset to another coordinate reference system, we will simply load up the data in its native projection.

We also set the `dask_chunks` query parameter to ensure that the loading process makes use of Dask, which will return a (lazy-loaded) dataset that can be processed in a parallelised manner. The use of the `.persist()` directive throughout this notebook essentially "forces" the loading / computation of the data contained in these Dask arrays.
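As a brief aside, the lazy-loading behaviour described above can be illustrated with a small synthetic example (purely illustrative; it uses a toy array rather than the Sentinel-2 data loaded below):

```python
import dask.array as da
import xarray as xr

# A toy dask-backed DataArray standing in for a satellite band.
toy = xr.DataArray(da.ones((4, 100, 100), chunks=(1, 100, 100)), dims=("time", "y", "x"))

print(type(toy.data))      # dask.array.core.Array -- nothing has been computed yet
toy = toy.persist()        # triggers the computation; results are kept in (worker) memory
print(float(toy.mean()))   # subsequent operations reuse the persisted chunks
```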
# This code cell may generate some 'SQLAlchemy' warnings, but they can be safely ignored.
+
+query = { 'product': product,
+ 'lat': (min_latitude, max_latitude),
+ 'lon': (min_longitude, max_longitude),
+ 'time': (min_date, max_date) }
+
+### Dataset's native projection
+native_crs = mostcommon_crs(dc, query)
+print(f"The dataset's native CRS is \"{native_crs}\".")
+
+query.update({ 'output_crs': native_crs,
+ 'resolution': (30, 30),
+ 'group_by': 'solar_day',
+ 'dask_chunks': {'x': 1024, 'y': 1024} })
+
The dataset's native CRS is "EPSG:32750".
Finally, loading up all Sentinel-2 bands would lead to excessive memory requirements and computational overheads. We will thus only select the bands relevant to this analysis.
+The list of measurements (i.e. satellite bands and derived products) for the current product of interest can be displayed as follows.
+ +dc.list_measurements().loc[product]
+

| measurement | name | dtype | units | nodata | aliases | flags_definition |
|---|---|---|---|---|---|---|
| B01 | B01 | uint16 | 1 | 0 | [band_01, coastal_aerosol] | NaN |
| B02 | B02 | uint16 | 1 | 0 | [band_02, blue] | NaN |
| B03 | B03 | uint16 | 1 | 0 | [band_03, green] | NaN |
| B04 | B04 | uint16 | 1 | 0 | [band_04, red] | NaN |
| B05 | B05 | uint16 | 1 | 0 | [band_05, red_edge_1] | NaN |
| B06 | B06 | uint16 | 1 | 0 | [band_06, red_edge_2] | NaN |
| B07 | B07 | uint16 | 1 | 0 | [band_07, red_edge_3] | NaN |
| B08 | B08 | uint16 | 1 | 0 | [band_08, nir, nir_1] | NaN |
| B8A | B8A | uint16 | 1 | 0 | [band_8a, nir_narrow, nir_2] | NaN |
| B09 | B09 | uint16 | 1 | 0 | [band_09, water_vapour] | NaN |
| B11 | B11 | uint16 | 1 | 0 | [band_11, swir_1, swir_16] | NaN |
| B12 | B12 | uint16 | 1 | 0 | [band_12, swir_2, swir_22] | NaN |
| SCL | SCL | uint8 | 1 | 0 | [mask, qa] | {'qa': {'bits': [0, 1, 2, 3, 4, 5, 6, 7], 'val... |
| AOT | AOT | uint16 | 1 | 0 | [aerosol_optical_thickness] | NaN |
| WVP | WVP | uint16 | 1 | 0 | [scene_average_water_vapour] | NaN |
According to the TSS formula provided at the beginning of this notebook, we can select only those bands needed for the analysis. In addition, we also load the layer `SCL` (or its known alias `qa`) of pixel QA data, which will allow us to clean up the dataset, as well as the `swir_2` band, which will be used further below to filter out non-water pixels.
query.update( {'measurements': ['red', 'green', 'blue', 'swir_2', 'qa']} )
+query
+
{'product': 's2_l2a',
 'lat': (-4.03, -4.198),
 'lon': (119.87, 120.04),
 'time': ('2018-01-01', '2021-01-01'),
 'output_crs': 'EPSG:32750',
 'resolution': (30, 30),
 'group_by': 'solar_day',
 'dask_chunks': {'x': 1024, 'y': 1024},
 'measurements': ['red', 'green', 'blue', 'swir_2', 'qa']}
%%time
+
+data = dc.load(**query)
+data = data.persist() # Dask data processing
+data
+
CPU times: user 57 s, sys: 30.9 s, total: 1min 27s
Wall time: 1min 53s
<xarray.Dataset>
Dimensions:      (time: 198, y: 623, x: 632)
Coordinates:
  * time         (time) datetime64[ns] 2018-02-09T02:26:33 ... 2020-12-30T02:...
  * y            (y) float64 9.535e+06 9.535e+06 ... 9.554e+06 9.554e+06
  * x            (x) float64 8.187e+05 8.187e+05 ... 8.376e+05 8.376e+05
    spatial_ref  int32 32750
Data variables:
    red          (time, y, x) uint16 dask.array<chunksize=(1, 623, 632), meta=np.ndarray>
    green        (time, y, x) uint16 dask.array<chunksize=(1, 623, 632), meta=np.ndarray>
    blue         (time, y, x) uint16 dask.array<chunksize=(1, 623, 632), meta=np.ndarray>
    swir_2       (time, y, x) uint16 dask.array<chunksize=(1, 623, 632), meta=np.ndarray>
    qa           (time, y, x) uint8 dask.array<chunksize=(1, 623, 632), meta=np.ndarray>
Attributes:
    crs:           EPSG:32750
    grid_mapping:  spatial_ref
As usual, we need to filter out various pixels from the remote sensing dataset. This includes the removal of invalid (`nodata`) pixels, as well as those affected by various pixel quality issues. In the next cell, we create the various masks required for this clean-up operation.
### Valid mask (i.e. not 'nodata'), for each data layer
+valid_mask = masking.valid_data_mask(data).persist()
+
+### Mask of clear pixels
+bad_pixel_flags = {'no data', 'saturated or defective', 'cloud shadows', 'cloud high probability', 'cloud medium probability'}
+good_pixel_mask = ~enum_to_bool(data['qa'], bad_pixel_flags).persist()
+
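Before applying these masks, it can also be instructive to look at the QA layer itself. The following is an illustrative sketch (not part of the original workflow): it prints the flag definitions attached to the `qa` measurement (using `masking.describe_variable_flags`, available in recent datacube versions) and counts the pixels in each SCL class for the first time slice.

```python
### Illustrative only: inspect the QA flag definitions and class frequencies
print(masking.describe_variable_flags(data['qa']))

import numpy as np  # already imported above
values, counts = np.unique(data['qa'].isel(time=0).values, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))   # SCL class value -> pixel count
```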
Note that some residual quality issues may remain after this step, e.g. related to pixels classified as `cloud medium probability` or `thin cirrus`, etc. A more in-depth analysis of the pixel QA information should be performed to ensure that such issues are fixed, and/or do not substantially bias the results further below.

The Sentinel-2 masking and scaling operations are subsequently applied on a band-by-band basis, as done in the next cell.
+### Scaling factor for Sentinel-2 data
+scale = 0.0001 # divide by 10000
+offset = 0.0
+
+### Apply valid mask, good pixel mask, and scaling to each layer
+data['red'] = ( data['red'].where(valid_mask['red'] & good_pixel_mask) * scale + offset ).persist()
+data['blue'] = ( data['blue'].where(valid_mask['blue'] & good_pixel_mask) * scale + offset ).persist()
+data['green'] = ( data['green'].where(valid_mask['green'] & good_pixel_mask) * scale + offset ).persist()
+data['swir_2'] = ( data['swir_2'].where(valid_mask['swir_2'] & good_pixel_mask) * scale + offset ).persist()
+
### Dimensions of dataset
+print( xarray_object_size(data) )
+
Dataset size: 2.40 GB
Finally, we can remove any time slices (if any) that, as a result of the above operations, no longer contain at least one valid (non-`NaN`) pixel.
data = data.dropna('time', how='all')
+data = data.persist() # Dask data processing
+
We can now inspect the resulting data object further, e.g. by displaying the `xarray.DataArray` for one of the bands:
data.red
+
<xarray.DataArray 'red' (time: 198, y: 623, x: 632)>
dask.array<add, shape=(198, 623, 632), dtype=float64, chunksize=(1, 623, 632), chunktype=numpy.ndarray>
Coordinates:
  * time         (time) datetime64[ns] 2018-02-09T02:26:33 ... 2020-12-30T02:...
  * y            (y) float64 9.535e+06 9.535e+06 ... 9.554e+06 9.554e+06
  * x            (x) float64 8.187e+05 8.187e+05 ... 8.376e+05 8.376e+05
    spatial_ref  int32 32750
From this, we can gather that the pre-processed Sentinel-2 data is available as a Dask array over a region of about 600-by-600 pixels, with about 200 time steps. In this data object, each Dask chunk has a size of (1, 623, 632) in the `time`, `y` and `x` dimensions.
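As a quick, illustrative check, the chunk layout and nominal memory footprint of an individual band can be inspected directly from its underlying Dask array:

```python
### Illustrative only: inspect the Dask chunking of the 'red' band
print(data.red.data.chunksize)                    # e.g. (1, 623, 632): one chunk per time slice
print(f"{data.red.data.nbytes / 2**30:.2f} GiB")  # nominal (uncompressed) size of this band
```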
To improve the plots further below in this notebook, we will use a mask of the lake area in the region of interest. This mask can be derived, for instance, from an existing shapefile.

For this example, we use a polygon of the Lake Tempe boundary line, which can be accessed from the shapefile provided in the `ancillary_data` folder in this repository.
shape_file = './ancillary_data/Base Map/Boundary_administration.shp'
+
+### Load the shapefile
+shp = gpd.read_file(shape_file)
+display(shp)
+shp.crs
+

| | Zona | Province | Regency | District | Village | geometry |
|---|---|---|---|---|---|---|
| 0 | Zona 50S | SULAWESI SELATAN | SIDENRENG RAPPANG | WATANG SIDENRENG | SIDENRENG | POLYGON Z ((119.85504 -3.91274 0.00000, 119.85... |
| 1 | Zona 50S | SULAWESI SELATAN | SIDENRENG RAPPANG | WATANG SIDENRENG | AKA-AKAE | POLYGON Z ((119.87492 -3.91134 0.00000, 119.87... |
| 2 | Zona 50S | SULAWESI SELATAN | SIDENRENG RAPPANG | WATANG SIDENRENG | EMPAGAE | POLYGON Z ((119.87786 -3.92084 0.00000, 119.87... |
| 3 | Zona 50S | SULAWESI SELATAN | SIDENRENG RAPPANG | WATANG SIDENRENG | TALUMAE | POLYGON Z ((119.90144 -3.92196 0.00000, 119.89... |
| 4 | Zona 50S | SULAWESI SELATAN | SIDENRENG RAPPANG | WATANG SIDENRENG | MOJONG | POLYGON Z ((119.91333 -3.95609 0.00000, 119.91... |
| ... | ... | ... | ... | ... | ... | ... |
| 553 | Zona 50S | SULAWESI SELATAN | GOWA | TOMBOLO PAO | ERELEMBANG | MULTIPOLYGON Z (((119.87627 -5.09185 0.00000, ... |
| 554 | Zona 51S | SULAWESI SELATAN | SIDENRENG RAPPANG | DUAPITUE | KALOSI | POLYGON Z ((120.00922 -3.88076 0.00000, 120.00... |
| 555 | Zona 50S | SULAWESI SELATAN | SIDENRENG RAPPANG | DUAPITUE | KALOSI | MULTIPOLYGON Z (((120.00000 -3.89772 0.00000, ... |
| 556 | Zona 51S | SULAWESI SELATAN | SIDENRENG RAPPANG | DUAPITUE | KAMPALE | MULTIPOLYGON Z (((120.00000 -3.90325 0.00000, ... |
| 557 | Zona 50S | SULAWESI SELATAN | SIDENRENG RAPPANG | DUAPITUE | KAMPALE | MULTIPOLYGON Z (((119.99819 -3.89662 0.00000, ... |

558 rows × 6 columns
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
We can see here that the vector data within that shapefile is in the projection `EPSG:4326`, which is different from that of our main Sentinel-2 dataset (`EPSG:32750`). For compatibility, we can here re-project the shapefile data to the CRS of the Sentinel-2 dataset.

Subsequently, we will also filter the shapefile contents to only select those polygons associated with Lake Tempe.
+### Reproject to current coordinate reference system
+shp = shp.to_crs(native_crs)
+
+### Remove unwanted polygons
+print("Selected polygons are:")
+drop_list = []
+for ff in shp.iterrows():
+ tmp = ff[1].Village.lower()
+ if 'tempe' in tmp and 'danau' in tmp:
+ print(ff[0], ff[1].Village)
+ else:
+ drop_list.append(ff[0])
+
+shp.drop(drop_list, inplace=True)
+
+### Plot
+shp.boundary.plot(figsize=(8,8))
+plt.xlabel("x [metre]"); plt.ylabel("y [metre]")
+plt.title("Lake Tempe boundary");
+
Selected polygons are:
114 DANAU TEMPE
115 DANAU TEMPE
We can now create a raster mask from the vector data. The code below iterates over the polygons in the shapefile (in case multiple polygons are available), setting the raster mask values to `1` for all the pixels located within the footprint of each polygon, and `0` otherwise.
### Rasterise
+mask = rasterio.features.rasterize( ((feature['geometry'], 1) for feature in shp.iterfeatures()),
+ out_shape = (data.dims['y'],data.dims['x']),
+ transform = data.affine )
+
+### Convert the mask (numpy array) to an Xarray DataArray
+mask = xr.DataArray(mask, coords=(data.y, data.x))
+mask
+
<xarray.DataArray (y: 623, x: 632)>
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)
Coordinates:
  * y        (y) float64 9.535e+06 9.535e+06 9.535e+06 ... 9.554e+06 9.554e+06
  * x        (x) float64 8.187e+05 8.187e+05 8.187e+05 ... 8.376e+05 8.376e+05
### Plot
+mask.plot(size=8).axes.set_aspect('equal')
+
Finally, we can use the mask we just created, apply it to the time series of Sentinel-2 data, and plot the result.
+### Masking
+data = data.where(mask).persist()
+
time_ind3 = np.linspace(1, data.sizes['time'], 3, dtype='int') - 1 # select some time slices to display
+
+### Plot the selected time slices (true-colour display)
+image_array = data[['red', 'green', 'blue']].isel(time=time_ind3).to_array()
+tmp = image_array.plot.imshow(robust=True, col='time', col_wrap=3, size=5)
+for ax in tmp.axes.flatten():
+ ax.set_aspect('equal')
+
We now have a cropped data time series containing only the pixels of interest over Lake Tempe.
+However, due to the varying extents of the lake over time (during wet / dry conditions), some time slices in the time series will contain a certain number of non-water (i.e. land) pixels. Further below, the TSS algorithm would thus also be applied to these land / vegetation pixels, thereby leading to some bias in the results.
To address this issue, we can try to use the modified normalised difference water index (MNDWI) to filter out the non-water pixels.

The MNDWI is calculated on the basis of the Sentinel-2 bands as per the following equation, with MNDWI values greater than 0 indicating water pixels:

$$
\text{MNDWI} = \frac{\text{green} - \text{SWIR}}{\text{green} + \text{SWIR}}.
$$

So let's apply this formula to the Sentinel-2 dataset, and save the resulting MNDWI data back into the `data` object as an additional band.
data['MNDWI'] = ( (data.green - data.swir_2) / (data.green + data.swir_2) ).persist()
+data
+
<xarray.Dataset>
Dimensions:      (time: 198, y: 623, x: 632)
Coordinates:
  * time         (time) datetime64[ns] 2018-02-09T02:26:33 ... 2020-12-30T02:...
  * y            (y) float64 9.535e+06 9.535e+06 ... 9.554e+06 9.554e+06
  * x            (x) float64 8.187e+05 8.187e+05 ... 8.376e+05 8.376e+05
    spatial_ref  int32 32750
Data variables:
    red          (time, y, x) float64 dask.array<chunksize=(1, 623, 632), meta=np.ndarray>
    green        (time, y, x) float64 dask.array<chunksize=(1, 623, 632), meta=np.ndarray>
    blue         (time, y, x) float64 dask.array<chunksize=(1, 623, 632), meta=np.ndarray>
    swir_2       (time, y, x) float64 dask.array<chunksize=(1, 623, 632), meta=np.ndarray>
    qa           (time, y, x) float64 dask.array<chunksize=(1, 623, 632), meta=np.ndarray>
    MNDWI        (time, y, x) float64 dask.array<chunksize=(1, 623, 632), meta=np.ndarray>
Attributes:
    crs:           EPSG:32750
    grid_mapping:  spatial_ref
As shown above, we now have the MNDWI band integrated as part of the `data` (Dask) array. For insight, we can also plot a few MNDWI time slices to investigate the results further.
time_ind9 = np.linspace(1, data.sizes['time'], 9, dtype='int') - 1 # select some time slices to display
+tmp = data.MNDWI[time_ind9].plot(col='time', col_wrap=3, size=4)
+for ax in tmp.axes.flatten():
+ ax.set_aspect('equal')
+
This approach appears to provide good results for the current Sentinel-2 dataset, with various regions on the edge of the lake clearly identified as being non-water (MNDWI values below 0.0).
+We can now use the MNDWI information to remove the non-water pixels from the time series.
+data = data.where(data.MNDWI>0.0).persist()
+
### Plot some selected time slices (true-colour display)
+image_array = data[['red', 'green', 'blue']].isel(time=time_ind3).to_array()
+tmp = image_array.plot.imshow(robust=True, col='time', col_wrap=3, size=5);
+for ax in tmp.axes.flatten():
+ ax.set_aspect('equal')
+
### TSS calculation
+tmp = data.red + data.green
+nsmi = (tmp - data.blue) / (tmp + data.blue)
+data_tss = (775.98 * nsmi - 93.606).persist()
+data_tss
+
<xarray.DataArray (time: 198, y: 623, x: 632)>
dask.array<sub, shape=(198, 623, 632), dtype=float64, chunksize=(1, 623, 632), chunktype=numpy.ndarray>
Coordinates:
  * time         (time) datetime64[ns] 2018-02-09T02:26:33 ... 2020-12-30T02:...
  * y            (y) float64 9.535e+06 9.535e+06 ... 9.554e+06 9.554e+06
  * x            (x) float64 8.187e+05 8.187e+05 ... 8.376e+05 8.376e+05
    spatial_ref  int32 32750
tmp = data_tss[time_ind9].plot( col='time', col_wrap=3, cmap='rainbow', size=4, robust=True,
+ cbar_kwargs = dict(label="TSS [mg/L]") )
+for ax in tmp.axes.flatten():
+ ax.set_aspect('equal')
+
We can clearly see here that some pixel quality issues are still affecting the dataset, e.g. with residual areas of cloud shadow not having been removed successfully during the data clean-up process.

While these "daily" TSS plots can be insightful, the abundance of missing (and corrupt) data, as well as the many time slices in the time series, makes for a difficult assessment of the results. One approach to circumvent this is to first aggregate the data over coarser time spans, and then display the average TSS over these periods.

`Xarray` allows for a straightforward aggregation of the data according to Pandas indexing using the dateoffset functionality. For instance, we could calculate the temporal mean of the data over yearly quarters (with quarters starting in January):
data_tss_quarter = data_tss.resample(time="QS-JAN").mean() # aggregation over each successive quarter
+data_tss_quarter = data_tss_quarter.persist()
+data_tss_quarter
+
<xarray.DataArray (time: 12, y: 623, x: 632)>
dask.array<stack, shape=(12, 623, 632), dtype=float64, chunksize=(1, 623, 632), chunktype=numpy.ndarray>
Coordinates:
  * time         (time) datetime64[ns] 2018-01-01 2018-04-01 ... 2020-10-01
  * y            (y) float64 9.535e+06 9.535e+06 ... 9.554e+06 9.554e+06
  * x            (x) float64 8.187e+05 8.187e+05 ... 8.376e+05 8.376e+05
    spatial_ref  int32 32750
As we can see from the resulting Xarray object, there is a total of 12 quarters in the current time series, leading to 12 "time slices" in the resulting array.
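As an aside, the same resampling pattern works with any other Pandas offset alias; for instance (illustrative only, not used further below):

```python
### Illustrative only: other temporal aggregation periods via Pandas offset aliases
tss_monthly  = data_tss.resample(time="1MS").mean()     # calendar-month means
tss_seasonal = data_tss.resample(time="QS-DEC").mean()  # quarters aligned with DJF/MAM/JJA/SON seasons
```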
+Then, as done before for the daily time series, we can select a few of the resulting quarters and plot the respective (temporally averaged) TSS maps.
+time_ind = np.linspace(1, data_tss_quarter.sizes['time'], 9, dtype='int') - 1 # some selected time slices to display
+
+### Main plot
+tmp = data_tss_quarter[time_ind].plot( col='time', col_wrap=3, cmap='rainbow', size=4, robust=True,
+ cbar_kwargs = dict(label="quarterly-averaged TSS [mg/L]") )
+for ax in tmp.axes.flatten():
+ ax.set_aspect('equal')
+
Here we can see that a clearer picture of the temporal TSS dynamics / characteristics is starting to emerge for each averaging period (quarter) along the selected time series.
Another potential way to temporally aggregate the time series data is to calculate the average TSS values for all time slices falling in specific periods such as months, seasons, quarters, etc. In other words, while the previous plots present averaged results over successive quarters (resulting in a total of 12 quarters for the current time series), we could now average the data from all the time slices within, e.g., a given month or season, regardless of the year. This would provide an overview of the average TSS concentration in the various months or seasons of a typical year.

For illustration, let's apply this approach to calculate the average TSS maps for each individual month in our time series – this is done by essentially collecting all the January time slices to calculate the January average, and so forth for all other months.
+data_tss_group = data_tss.groupby("time.month").mean().persist() # aggregation over each month
+data_tss_group
+
<xarray.DataArray (month: 12, y: 623, x: 632)>
dask.array<stack, shape=(12, 623, 632), dtype=float64, chunksize=(1, 623, 632), chunktype=numpy.ndarray>
Coordinates:
  * y            (y) float64 9.535e+06 9.535e+06 ... 9.554e+06 9.554e+06
  * x            (x) float64 8.187e+05 8.187e+05 ... 8.376e+05 8.376e+05
    spatial_ref  int32 32750
  * month        (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
As can be seen here, the result of this operation is a DataArray with a new dimension (`month`) and 12 coordinates along it – one "slice" for each individual month.
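As an aside, the same grouping approach extends to other calendar groupings; for instance (illustrative only, not used further below), seasonal averages could be computed with:

```python
### Illustrative only: average TSS per climatological season instead of per month
data_tss_season = data_tss.groupby("time.season").mean().persist()
print(data_tss_season.season.values)   # e.g. ['DJF' 'JJA' 'MAM' 'SON']
```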
tmp = data_tss_group.plot( col='month', col_wrap=4, cmap='rainbow', size=4, robust=True,
+ cbar_kwargs = dict(label="monthly average TSS [mg/L]") )
+for ax in tmp.axes.flatten():
+ ax.set_aspect('equal')
+
This plot seems to indicate that, over the whole time series under current consideration, the months between January and April generally exhibit elevated levels of TSS over most of Lake Tempe. In contrast, lower values of TSS are recorded, on average, during the months of May, June and July.
+The overall TSS average across the entire time series, for each pixel, can be easily calculated as follows.
+data_tss_mean = data_tss.mean('time') # standard mean over entire time series
+data_tss_mean = data_tss_mean.persist()
+
### Plot
+fig = plt.figure(figsize=(11,7))
+data_tss_mean.plot(robust=True, cmap='rainbow', cbar_kwargs={'label':'mean TSS [mg/L]'})
+plt.gca().set_aspect('equal','box');
+
This map provides an overview of the overall TSS concentrations for the region of interest and over the entire specified time span.
+From the previous results in this notebook, however, we should expect that the above plot integrates a number of pixels for which the pixel QA clean-up process has failed. For instance, some pixels affected by cloud shadows may not have been successfully masked out, leading to erroneously low TSS values in some time slices. These problematic TSS values have subsequently been used in the computation of the above average plot, leading to potentially biased results.
In order to minimise the impact from such outliers, we could make use of a more robust metric of temporal averaging instead of the simple `.mean()` operation used earlier. For instance, one such metric would be to calculate the median TSS instead.
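As a quick illustration (not used further below), such a per-pixel temporal median could be computed in one line – noting that an exact median requires the full time axis to sit in a single Dask chunk, which can be memory-hungry:

```python
### Illustrative only: per-pixel median TSS over the whole time series
data_tss_median = data_tss.chunk({'time': -1}).median('time').persist()
```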
Here, we will use another approach, which is to define our own (custom) function and apply it to the Xarray / Dask array of TSS data (`data_tss`). The aim here is to instruct `Xarray` to take each pixel, apply the function to the corresponding TSS time series (along the `time` dimension), and aggregate the results to produce a map with two dimensions (`x` and `y`). And given the Dask arrays at hand, we would also like this process to occur in a parallelised fashion on all CPUs available in this JupyterLab environment.
In the next cell, we start by defining a `robust_mean()` function, which operates as follows on a Numpy vector `z` of input data:

1. filter out the `NaN`s from the data;
2. remove outliers using the `sigma_clip` function;
3. return the mean of the filtered data (`NaN`s and outliers removed) – if the filtered vector contains fewer than 10 data points, simply return `NaN` instead.

Further below, this function will be applied, in a parallelised fashion, to the Dask array of TSS values – essentially, `robust_mean()` will receive as input `z` the time series of TSS values for each pixel in turn. The returned value of the pixel's (robust) average TSS will then be used to build the resulting map of mean TSS, calculated in a robust way.

In addition, we also define another function (`robust_cv()`) to calculate the coefficient of variation (CV) for the same time series of TSS data (at each pixel). As shown below, the CV is simply defined as the standard deviation of the input values divided by the mean – here again, calculated in a robust manner on the basis of the filtered vector of data.
def robust_mean(z):
+ zf = z[~np.isnan(z)] # filter out NaNs
+ if len(zf)<10: return np.nan # use at least 10 values to compute the mean
+ zf = sigma_clip( zf, masked=False ) # remove outliers
+ return np.mean(zf) # mean of data without outliers
+
+def robust_cv(z):
+ zf = z[~np.isnan(z)] # filter out NaNs
+ if len(zf)<10: return np.nan # use at least 10 values to compute the CV
+ zf = sigma_clip( zf, masked=False ) # remove outliers
+ return np.std(zf) / np.mean(zf) # CV of data without outliers
+
In the following cell, the `robust_mean` function is applied in a parallelised fashion to the dataset by making use of the `apply_ufunc()` function in `Xarray` – another way to carry out this operation would be to use `xr.map_blocks()`.
### Re-chunk Dask array for efficient time-series processing
+data_tss = data_tss.chunk({'time':-1, 'x':32, 'y':32}).persist()
+
+### Parallelised processing
+data_tss_robMean = xr.apply_ufunc( robust_mean, data_tss, input_core_dims=[["time"]],
+ dask='parallelized', vectorize=True ) # robust mean, whole time series
+data_tss_robMean = data_tss_robMean.persist()
+
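To make the mechanics of `apply_ufunc()` a little more concrete, here is a small toy example (illustrative only, unrelated to the TSS data) showing how a function that maps a 1-D vector to a scalar is broadcast over all remaining dimensions:

```python
### Toy example: reduce the 'time' dimension pixel by pixel with apply_ufunc
import numpy as np
import xarray as xr

toy = xr.DataArray(np.random.rand(5, 2, 3), dims=("time", "y", "x"))
toy_mean = xr.apply_ufunc(
    lambda z: np.nanmean(z),      # any function mapping a 1-D vector to a scalar
    toy,
    input_core_dims=[["time"]],   # the dimension consumed by the function
    vectorize=True,               # loop the function over every remaining (y, x) pixel
)
print(toy_mean.dims)              # ('y', 'x')
```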
Our second custom function can now be applied in the same way to the TSS data.
+data_tss_robCV = xr.apply_ufunc( robust_cv, data_tss, input_core_dims=[["time"]],
+                                 dask='parallelized', vectorize=True ) # robust CV, whole time series
+data_tss_robCV = data_tss_robCV.persist()
+
### Plots
+fig, (ax1,ax2) = plt.subplots(1, 2, figsize=(20,7))
+
+data_tss_robMean.plot(robust=True, cmap='rainbow', cbar_kwargs={'label':'TSS mean'}, ax=ax1)
+ax1.set_title(f"Robust mean of TSS")
+ax1.set_aspect('equal','box')
+
+data_tss_robCV.plot(robust=True, cmap="rainbow", cbar_kwargs={'label':'TSS C.V.'}, ax=ax2)
+ax2.set_title(f"Robust C.V. of TSS")
+ax2.set_aspect('equal','box');
+
The plot of robust mean values (left-hand side) has not changed significantly compared to the previous plot of the (standard) average values. However, the fact that we have used a robust metric for the temporal averaging operation gives us more confidence in the validity of the plotted results, in particular with respect to the residual pixel quality issues affecting the considered TSS dataset.
+On the right-hand side, the CV map provides further insight into the temporal dynamics in the TSS dataset, highlighting a number of "hot spots" with elevated TSS variability near the edge of the lake (perhaps as a result of regular sediment contribution from specific rivers / tributaries to the lake waters).
A final note here concerns the range of (average) TSS values displayed in the above plot (left-hand side). In the Pandhadha et al. paper cited at the beginning of this notebook, the authors report that field measurements of TSS in Lake Tempe range between 115 and 203 mg/L, which is (roughly) of the same order of magnitude as the values displayed in the plot. We can thus have some confidence that the TSS algorithm used in this notebook provides data that at least appears sensible, though further validation work is required to determine whether it yields accurate results over the range of values encountered here.
+At this point, a practitioner might want to extract the time series of TSS values at selected locations, display them, and potentially write them to file to be used as input to further processing or modelling work.
+We demonstrate this by first selecting a few points of interest, e.g. along a transect line across the region of interest.
+### Some pixels along a transect
+n_points = 5
+pixloc_y = np.linspace(9547000, 9545000, n_points)
+pixloc_x = np.linspace(825000, 832000, n_points)
+
### Plot
+fig = plt.figure(figsize=(8,8))
+plt.plot(pixloc_x, pixloc_y, marker='o', color='black', linestyle='none')
+shp.boundary.plot(ax=fig.axes[0], color='black');
+[plt.text(x,y,f"{p:4d}") for p,(y,x) in enumerate(zip(pixloc_y,pixloc_x))];
+
We can now extract the pixels' TSS data using `Xarray`'s vectorised indexing functionality, which will retrieve data at the grid cells nearest to the target `x` and `y` coordinates. For illustration purposes, here we make use of the time series of quarterly averaged TSS values.
points_x = xr.DataArray(pixloc_x, dims="points")
+points_y = xr.DataArray(pixloc_y, dims="points")
+
+### Extract data (quarterly dataset)
+points_dat = data_tss_quarter.sel(x=points_x, y=points_y, method="nearest")
+points_dat = points_dat.dropna('time', how='all')
+points_dat = points_dat.persist()
+
+### Plot
+points_dat.plot.line(x='time', marker='.', figsize=(15,5));
+plt.gca().set_title("Quarterly TSS values, selected pixels");
+
From this plot, we can clearly identify a period of decreased TSS levels at all points of interest, preceded by a series of higher TSS concentrations.
+The following code cell will save the sampled (quarterly) TSS data into a Pandas data frame:
+### Extract to Pandas
+tss_df = pd.DataFrame( data = points_dat.values,
+ index = points_dat.time.values,
+ columns = points_dat.points.values )
+
+### (Re)set DF index
+tss_df['month'] = tss_df.index.month
+tss_df['year'] = tss_df.index.year
+tss_df = tss_df.set_index(['year','month'])
+
+tss_df
+

| year | month | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|
| 2018 | 1 | 257.328187 | 255.746486 | 243.432876 | 258.141784 | 257.270748 |
| | 4 | 239.850518 | 245.384545 | 234.216527 | 228.154045 | 233.387142 |
| | 7 | 243.093750 | 227.865623 | 241.973659 | 243.599509 | 243.826203 |
| | 10 | 234.700149 | 235.814439 | 241.310443 | 234.036187 | 242.029936 |
| 2019 | 1 | 246.057517 | 244.055116 | 237.870309 | 253.824961 | 244.962170 |
| | 4 | 243.122051 | 231.340612 | 243.874545 | 249.701082 | 252.403634 |
| | 7 | 240.431704 | 246.273665 | 242.032411 | 245.729081 | 235.939214 |
| | 10 | 241.107097 | 235.867842 | 232.526281 | 235.926385 | 234.887090 |
| 2020 | 1 | 257.720366 | 260.185931 | 257.203648 | 248.134618 | 254.689527 |
| | 4 | 227.229722 | 226.268307 | 223.584198 | 223.522187 | 220.059563 |
| | 7 | 196.855836 | 222.331840 | 197.541845 | 175.431366 | 180.029425 |
| | 10 | 240.256243 | 223.326796 | 233.247971 | 249.078397 | 227.405634 |
If desired, we can also "re-format" the data frame into a long (as opposed to wide) format:
+tss_df = tss_df.stack(dropna=False).rename_axis(['year','month','point'])
+tss_df = pd.DataFrame(tss_df).rename(columns={0:'TSS'})
+tss_df
+

| year | month | point | TSS |
|---|---|---|---|
| 2018 | 1 | 0 | 257.328187 |
| | | 1 | 255.746486 |
| | | 2 | 243.432876 |
| | | 3 | 258.141784 |
| | | 4 | 257.270748 |
| | 4 | 0 | 239.850518 |
| | | 1 | 245.384545 |
| | | 2 | 234.216527 |
| | | 3 | 228.154045 |
| | | 4 | 233.387142 |
| | 7 | 0 | 243.093750 |
| | | 1 | 227.865623 |
| | | 2 | 241.973659 |
| | | 3 | 243.599509 |
| | | 4 | 243.826203 |
| | 10 | 0 | 234.700149 |
| | | 1 | 235.814439 |
| | | 2 | 241.310443 |
| | | 3 | 234.036187 |
| | | 4 | 242.029936 |
| 2019 | 1 | 0 | 246.057517 |
| | | 1 | 244.055116 |
| | | 2 | 237.870309 |
| | | 3 | 253.824961 |
| | | 4 | 244.962170 |
| | 4 | 0 | 243.122051 |
| | | 1 | 231.340612 |
| | | 2 | 243.874545 |
| | | 3 | 249.701082 |
| | | 4 | 252.403634 |
| | 7 | 0 | 240.431704 |
| | | 1 | 246.273665 |
| | | 2 | 242.032411 |
| | | 3 | 245.729081 |
| | | 4 | 235.939214 |
| | 10 | 0 | 241.107097 |
| | | 1 | 235.867842 |
| | | 2 | 232.526281 |
| | | 3 | 235.926385 |
| | | 4 | 234.887090 |
| 2020 | 1 | 0 | 257.720366 |
| | | 1 | 260.185931 |
| | | 2 | 257.203648 |
| | | 3 | 248.134618 |
| | | 4 | 254.689527 |
| | 4 | 0 | 227.229722 |
| | | 1 | 226.268307 |
| | | 2 | 223.584198 |
| | | 3 | 223.522187 |
| | | 4 | 220.059563 |
| | 7 | 0 | 196.855836 |
| | | 1 | 222.331840 |
| | | 2 | 197.541845 |
| | | 3 | 175.431366 |
| | | 4 | 180.029425 |
| | 10 | 0 | 240.256243 |
| | | 1 | 223.326796 |
| | | 2 | 233.247971 |
| | | 3 | 249.078397 |
| | | 4 | 227.405634 |
And finally, saving the data (e.g. to a `.csv` or `.pkl` file) can be achieved with the following code if desired.
### Uncomment if needed:
+# tss_df.to_csv('./TSS_pixel_data.csv')
+# tss_df.to_pickle(path='./TSS_pixel_data.pkl')
+
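For completeness, the saved files could later be read back as follows (an illustrative sketch, assuming the file names used in the cell above):

```python
### Uncomment if needed (assumes the file names from the cell above):
# tss_df = pd.read_pickle('./TSS_pixel_data.pkl')
# tss_df = pd.read_csv('./TSS_pixel_data.csv', index_col=['year', 'month', 'point'])
```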
### End notebook
+