Merge pull request #118 from esciencecenter-digital-skills/odc-stac
Add data cube episode
fnattino authored Jun 25, 2024
2 parents e90143e + c8dc566 commit 5f7e9bd
Showing 5 changed files with 195 additions and 0 deletions.
195 changes: 195 additions & 0 deletions episodes/12-data-cube.md
---
title: "Data cubes with ODC-STAC"
teaching: 45
exercises: 0
---

:::questions
- How can I load satellite images for an area of interest that spans multiple scenes?
- How can I stack images acquired at different times into a single data cube and analyse them together?
:::

:::objectives
- Use `odc-stac` to load a STAC item collection into an Xarray data cube.
- Mask clouds and water, compute the NDVI, and extract a time series for a point of interest.
:::

## Introduction

In the previous episodes we worked with satellite images whose extent is fixed by how they were collected. In many cases, however, your area of interest does not align with the boundaries of those images. If the phenomenon you are interested in spans two images you could mosaic them manually, but often you need several images that overlap only partially.

ODC-STAC offers functionality to obtain a mosaicked image based on a bounding box or a polygon covering the area of interest. In this lesson we show how [odc-stac](https://odc-stac.readthedocs.io/en/latest/?badge=latest) can be employed to re-tile and stack satellite images into what are sometimes referred to as "data cubes".

## Create a data cube with ODC-STAC

As you might have noticed in the previous episodes, the satellite images we have used so far do not actually cover the whole island of Rhodes: they miss the southern part of the island. Using ODC-STAC we can obtain an image for the whole island. We use the administrative boundary of Rhodes to define our area of interest (AoI), which guarantees that the whole island is included.

More importantly, ODC-STAC can also lazily load multiple images into a single data cube, allowing you to perform all kinds of interesting analyses, as will be demonstrated below.

But first we need to load the geometry of Rhodes. To do so we use geopandas and read the geometry we previously stored in a geopackage.

```python
import geopandas
rhodes = geopandas.read_file('rhodes.gpkg')
bbox = rhodes.total_bounds
```
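
Note that the STAC API we will query expects the bounding box in longitude/latitude (EPSG:4326). We assume here that the geopackage saved earlier already uses that CRS; if yours does not, a reprojection step along these lines (a minimal sketch) would be needed before taking the bounds:

```python
# Hypothetical extra step: only needed if 'rhodes.gpkg' uses a projected CRS.
# STAC APIs expect bbox coordinates as (min_lon, min_lat, max_lon, max_lat).
rhodes_latlon = rhodes.to_crs("EPSG:4326")
bbox = rhodes_latlon.total_bounds
```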

Next, we search for satellite images that cover our AoI (i.e. Rhodes) in the [Sentinel-2 L2A](https://radiantearth.github.io/stac-browser/#/external/earth-search.aws.element84.com/v1/collections/sentinel-2-l2a) collection indexed in the [Earth Search STAC API](https://radiantearth.github.io/stac-browser/#/external/earth-search.aws.element84.com/v1/). Since we are interested in the period right before and after the wildfire, we search from the 1st of July until the 31st of August 2023:

```python
import pystac_client

api_url = "https://earth-search.aws.element84.com/v1"
collection_id = "sentinel-2-c1-l2a"

client = pystac_client.Client.open(api_url)
search = client.search(
collections=[collection_id],
datetime="2023-07-01/2023-08-31",
bbox=bbox
)

item_collection = search.item_collection()
```
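
As a quick sanity check (not part of the original workflow), you can inspect how many scenes the search returned and when the first few were acquired:

```python
print(f"Found {len(item_collection)} scenes")
for item in list(item_collection)[:3]:
    print(item.id, item.datetime)
```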

[odc-stac](https://odc-stac.readthedocs.io/en/latest/?badge=latest) can directly ingest our search results and create an Xarray Dataset object from the STAC metadata present in the `item_collection`. By specifying `groupby='solar_day'`, odc-stac automatically groups and merges images acquired on the same date. `chunks={...}` sets up the resulting data cube using Dask arrays, thus enabling lazy loading (and lazy evaluation of further operations). `use_overviews=True` tells odc-stac to directly load lower-resolution versions of the images from the overviews, if these are available in the Cloud Optimized GeoTIFFs (COGs). We set the resolution of the data cube using the `resolution` argument and define the AoI using the bounding box (`bbox`). We set the resolution to 20 m in order to limit the size of the images somewhat.

```python
import odc.stac
ds = odc.stac.load(
item_collection,
groupby='solar_day',
chunks={'x': 2048, 'y': 2048},
use_overviews=True,
resolution=20,
bbox=rhodes.total_bounds,
)
```

odc-stac builds a data cube representation from all the relevant files linked in `item_collection` as an Xarray Dataset. Let us have a look at it:

```python
print(ds)
```

```output
<xarray.Dataset> Size: 7GB
Dimensions: (y: 3255, x: 2567, time: 25)
Coordinates:
* y (y) float64 26kB 4.035e+06 4.035e+06 ... 3.97e+06 3.97e+06
* x (x) float64 21kB 5.613e+05 5.613e+05 ... 6.126e+05 6.126e+05
spatial_ref int32 4B 32635
* time (time) datetime64[ns] 200B 2023-07-01T09:10:15.805000 ... 20...
Data variables: (12/18)
red (time, y, x) uint16 418MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
green (time, y, x) uint16 418MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
blue (time, y, x) uint16 418MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
visual (time, y, x) float32 836MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
nir (time, y, x) uint16 418MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
swir22 (time, y, x) uint16 418MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
... ...
scl (time, y, x) uint8 209MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
aot (time, y, x) uint16 418MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
coastal (time, y, x) uint16 418MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
nir09 (time, y, x) uint16 418MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
cloud (time, y, x) uint8 209MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
snow (time, y, x) uint8 209MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
```
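
If you already know which variables you need, you can optionally restrict the load to those bands via the `bands` argument of `odc.stac.load`, which keeps the data cube smaller. A minimal sketch, assuming the same `item_collection` and settings as above:

```python
# Optional: only load the bands used later in this episode.
ds_subset = odc.stac.load(
    item_collection,
    bands=["red", "nir", "scl"],
    groupby="solar_day",
    chunks={"x": 2048, "y": 2048},
    use_overviews=True,
    resolution=20,
    bbox=rhodes.total_bounds,
)
```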

## Working with the data cube

Like we did in the previous episode, let us calculate the NDVI for our study area. To do so we need three variables: the red band (`red`), the near-infrared band (`nir`) and the scene classification map (`scl`). We will use the former two to calculate the NDVI for the AoI. The latter we use as [a classification mask](https://sentinels.copernicus.eu/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm-overview) provided together with Sentinel-2 L2A products. In this mask, each pixel is classified according to a set of labels (see Figure 3 in the [classification mask documentation](https://sentinels.copernicus.eu/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm-overview)).

First we define the bands that we are interested in:

```python
red = ds['red']
nir = ds['nir']
scl = ds['scl']
```

Next we will drop the pixels that are labeled as clouds or water. For this we use the classification map to mask out pixels recognized by the [Sentinel-2 processing algorithm](https://custom-scripts.sentinel-hub.com/custom-scripts/sentinel-2/scene-classification/) as cloud or water:

```python
# generate mask ("True" for pixels that are cloud or water)
mask = scl.isin([
3, # CLOUD_SHADOWS
6, # WATER
8, # CLOUD_MEDIUM_PROBABILITY
9, # CLOUD_HIGH_PROBABILITY
10 # THIN_CIRRUS
])
red_masked = red.where(~mask)
nir_masked = nir.where(~mask)
```
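
If you are curious how much of each scene is being discarded, you could optionally compute the fraction of masked pixels per time step (this triggers a small Dask computation):

```python
# Fraction of cloud/water pixels for each acquisition date.
masked_fraction = mask.mean(dim=["x", "y"]).compute()
print(masked_fraction.values)
```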

Then, we calculate the NDVI:

```python
ndvi = (nir_masked - red_masked) / (nir_masked + red_masked)
```

We can visualize the calculated NDVI for the AoI at two given dates (before and after the wildfire) by selecting the corresponding date along the time dimension:

```python
ndvi_before = ndvi.sel(time="2023-07-13")
ndvi_before.plot()
```

![NDVI before the wildfire](fig/E12/NDVI-before.png){alt="NDVI before the wildfire"}

```python
ndvi_after = ndvi.sel(time="2023-08-27")
ndvi_after.plot()
```

![NDVI after the wildfire](fig/E12/NDVI-after.png){alt="NDVI after the wildfire"}
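
Since both maps live in the same data cube and share the same grid, you could also, for instance, difference the two dates to highlight the burnt area. A minimal sketch (not shown in the figures above):

```python
# Squeeze out any leftover length-one time dimension before differencing.
# Negative values indicate a drop in NDVI between the two dates.
ndvi_change = ndvi_after.squeeze(drop=True) - ndvi_before.squeeze(drop=True)
ndvi_change.plot(cmap="RdBu", vmin=-1, vmax=1)
```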

Another advantage of having the data available in a data cube is that you can query across multiple layers at once. If, for instance, you want to see how the NDVI changed over time at a specific point, you can do the following. Let us first define a point in a region that we know was affected by the wildfire. To check its location relative to the burnt area, we plot it on top of the post-fire NDVI map:

```python
x = 585_000
y = 3_995_000

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ndvi_after.plot(ax=ax)
ax.scatter(x, y, marker="o", c="k")
```

![NDVI plot with selected point](fig/E12/NDVI-after_point.png){alt="NDVI plot with selected point"}

Now let us extract the NDVI values computed at that point for the full time series:

```python
ndvi_xy = ndvi.sel(x=x, y=y, method="nearest")
print(ndvi_xy)
```

```output
<xarray.DataArray (time: 25)> Size: 100B
dask.array<getitem, shape=(25,), dtype=float32, chunksize=(1,), chunktype=numpy.ndarray>
Coordinates:
y float64 8B 3.995e+06
x float64 8B 5.85e+05
spatial_ref int32 4B 32635
* time (time) datetime64[ns] 200B 2023-07-01T09:10:15.805000 ... 20...
```

We now trigger the computation. Note that we run it in parallel across multiple threads (probably not much of an effect here, but definitely helpful for larger calculations):

```python
%%time
ndvi_xy = ndvi_xy.compute(scheduler="threads", num_workers=4)
```

```output
CPU times: user 15.6 s, sys: 4.9 s, total: 20.5 s
Wall time: 1min 8s
```

The result is a time series representing the NDVI value computed for the selected point for all the available scenes in the time range. We drop the NaN values, and plot the final result:

```python
ndvi_xy.dropna(dim="time").plot()
```

![NDVI time series](fig/E12/NDVI-time-series.png){alt="NDVI time series"}
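
If you want to keep the extracted time series for later use, you could, for example, export it to a CSV file (the file name here is arbitrary):

```python
# Convert the 1-D DataArray to a pandas DataFrame and write it to disk.
ndvi_xy.dropna(dim="time").to_dataframe(name="ndvi").to_csv("ndvi_point.csv")
```
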
Binary file added episodes/fig/E12/NDVI-after.png
Binary file added episodes/fig/E12/NDVI-after_point.png
Binary file added episodes/fig/E12/NDVI-before.png
Binary file added episodes/fig/E12/NDVI-time-series.png
