Data retrieval

Eduardo Bezerra edited this page Sep 4, 2023 · 2 revisions

Introduction

In the context of a machine learning pipeline, data retrieval is the step of collecting the data used to fit machine learning models. In AtmoSeer, after retrieving data for a weather station of interest (WSoI), we can execute the other steps of the pipeline to produce a forecasting model trained only on the variables observed by that weather station. In that case, the only data source used to train the forecasting model is the weather station of interest. However, AtmoSeer also allows merging one or more additional data sources before training the forecasting model. This is useful when other data sources have the potential to provide useful features for training.

At the moment, AtmoSeer is able to collect data from the following data sources (besides meteorological and gauge stations):

  • SBGL sounding station
  • ERA5 reanalysis data

The following sections describe how data from these sources can be retrieved (ingested) to be used downstream in an AtmoSeer pipeline.

SBGL sounding station

As an example, we are going to retrieve observations from another data source, the SBGL sounding station. The SBGL airport has an upper-air (or sounding) station that produces atmospheric soundings twice a day. Each sounding is an atmospheric profile of pressure, air and dewpoint temperature, relative humidity, and wind (direction and speed), from the surface up to more than 25 km. The command to retrieve observations from this station is the following:

python src/retrieve_as.py --station_id SBGL --start_year 2023 --end_year 2023

A graphical perspective of the retrieve_as.py script is shown below.

graph TD;
    retrieve_as.py-->./data/as/SBGL_2023_2023.parquet.gzip;
    start_year=2023-->retrieve_as.py;
    end_year=2023-->retrieve_as.py;
    station_id=SBGL-->retrieve_as.py
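The output file name in the example above appears to follow the pattern station_id, start_year, end_year. The sketch below reconstructs that path so a later step can check whether the retrieval already ran. The helper name sounding_output_path is hypothetical, and the naming pattern is inferred from the single example on this page rather than taken from the retrieve_as.py source.

```python
import os


def sounding_output_path(station_id: str, start_year: int, end_year: int) -> str:
    """Return the path where retrieve_as.py is assumed to write its output.

    Assumption: the pattern ./data/as/<station>_<start>_<end>.parquet.gzip
    is inferred from the example on this page, not from the script itself.
    """
    return f"./data/as/{station_id}_{start_year}_{end_year}.parquet.gzip"


if __name__ == "__main__":
    path = sounding_output_path("SBGL", 2023, 2023)
    # Report whether the file is already present in the working directory.
    status = "found" if os.path.exists(path) else "missing"
    print(f"{path}: {status}")
```

Run from the repository root, this prints the expected path and whether the file exists.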

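After the sounding data is retrieved, some pre-processing is needed. In particular, we have to run the gen_sounding_indices.py script, which computes several instability indices from the retrieved observations. The input file must be the file produced by the retrieval step; the example below uses a file covering 1997 to 2023. The AtmoSeer command to generate these indices is the following: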

python src/gen_sounding_indices.py --input_file ./data/as/SBGL_1997_2023.parquet.gzip --output_file ./data/as/SBGL_indices_1997_2023.parquet.gzip
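Because the indices script takes the retrieval output as its input, the two commands need to agree on the year range. As a minimal sketch, the two steps could be chained so that the file names are derived from a single set of arguments. The function names below are hypothetical, and the file naming pattern is an assumption inferred from the examples on this page; only the script names and flags come from the commands shown above.

```python
import subprocess
from typing import List


def build_sounding_commands(station_id: str, start_year: int, end_year: int) -> List[List[str]]:
    """Build the retrieval and index-generation commands for one station.

    Assumption: file names follow the pattern seen in the examples on this page.
    """
    raw = f"./data/as/{station_id}_{start_year}_{end_year}.parquet.gzip"
    indices = f"./data/as/{station_id}_indices_{start_year}_{end_year}.parquet.gzip"
    retrieve = [
        "python", "src/retrieve_as.py",
        "--station_id", station_id,
        "--start_year", str(start_year),
        "--end_year", str(end_year),
    ]
    gen_indices = [
        "python", "src/gen_sounding_indices.py",
        "--input_file", raw,
        "--output_file", indices,
    ]
    return [retrieve, gen_indices]


def run_sounding_steps(station_id: str, start_year: int, end_year: int) -> None:
    """Run both steps in order, stopping at the first failure."""
    for command in build_sounding_commands(station_id, start_year, end_year):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    for command in build_sounding_commands("SBGL", 1997, 2023):
        print(" ".join(command))
```

Calling run_sounding_steps("SBGL", 1997, 2023) from the repository root would execute both scripts in sequence.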

ERA5 reanalysis data

In meteorology, reanalysis is the combination of observations with model information to reconstruct past weather and climate. The ERA5 atmospheric reanalysis provides hourly estimates of a large number of atmospheric, land, and oceanic climate variables from 1940 to the present. It covers the Earth on a 30 km grid and resolves the atmosphere using 137 levels from the surface up to a height of 80 km. In AtmoSeer, reanalysis data can be used as a data source to generate features for training forecasting models.

AtmoSeer provides the script retrieve_ERA5.py, which can be used to retrieve ERA5 data programmatically through the CDS API (https://cds.climate.copernicus.eu/api-how-to). At the moment, this script retrieves ERA5 data for only three pressure levels: 200 hPa, 700 hPa, and 1000 hPa. For each pressure level, values for the following variables are retrieved: "geopotential", "relative_humidity", "temperature", "u_component_of_wind", "v_component_of_wind".

For example, the command below retrieves hourly reanalysis data from 2021 to 2023.

python src/retrieve_ERA5.py -b 2021 -e 2023

The retrieve_ERA5.py script generates a single NetCDF file containing the retrieved data.
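To illustrate what a CDS API request for the pressure levels and variables listed above might look like, the sketch below builds a request dictionary and passes it to the cdsapi client. It is not taken from retrieve_ERA5.py: the function names and the target file name are hypothetical, the request keys follow the CDS API conventions in use when this page was written (newer CDS versions may name some keys differently), and the sketch omits any spatial subsetting the real script may apply.

```python
from typing import Dict, List

# Dataset name used by the CDS for ERA5 on pressure levels.
DATASET = "reanalysis-era5-pressure-levels"
PRESSURE_LEVELS: List[str] = ["200", "700", "1000"]  # hPa, as listed on this page
VARIABLES: List[str] = [
    "geopotential",
    "relative_humidity",
    "temperature",
    "u_component_of_wind",
    "v_component_of_wind",
]


def build_era5_request(begin_year: int, end_year: int) -> Dict[str, object]:
    """Build an hourly request covering begin_year through end_year inclusive."""
    return {
        "product_type": "reanalysis",
        "format": "netcdf",
        "variable": VARIABLES,
        "pressure_level": PRESSURE_LEVELS,
        "year": [str(year) for year in range(begin_year, end_year + 1)],
        "month": [f"{month:02d}" for month in range(1, 13)],
        "day": [f"{day:02d}" for day in range(1, 32)],
        "time": [f"{hour:02d}:00" for hour in range(24)],
    }


def download_era5(begin_year: int, end_year: int, target: str) -> None:
    """Submit the request; needs the cdsapi package and CDS credentials."""
    import cdsapi  # imported lazily so the request can be built without it

    client = cdsapi.Client()
    client.retrieve(DATASET, build_era5_request(begin_year, end_year), target)


if __name__ == "__main__":
    request = build_era5_request(2021, 2023)
    print(request["year"], request["pressure_level"])
```

A call such as download_era5(2021, 2023, "era5_sketch.nc") would write a NetCDF file to the hypothetical target path, provided CDS credentials are configured.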