
Data augmentation pipeline

Eduardo Bezerra edited this page Sep 5, 2023 · 25 revisions

Data augmentation is a technique used to artificially increase the size of a dataset by creating new data points from existing data. This technique is typically used to improve the performance of a machine learning model by making it more robust to noise and variations in the data.

In fact, one of the challenges in developing reliable precipitation nowcasting models is the inherent data imbalance between precipitation and non-precipitation instances. Data imbalance occurs when the number of samples in one class significantly outweighs the number of samples in another class, leading to biased model training and reduced prediction accuracy. In the context of precipitation nowcasting, the majority of observations for the precipitation variable will be zero (i.e., no observed rain). As a consequence, the great majority of instances in the corresponding dataset used to train a nowcasting model will typically represent non-precipitation events.

The pie chart below illustrates the problem of unbalanced observations. It presents the percentages of different levels of precipitation measured by a system of 33 rain gauge stations over approximately two decades in the municipality of Rio de Janeiro.

```mermaid
pie title Rainfall statistics
    "No rain/Light rain (less than 5mm/h)" : 97.31
    "Moderate rain (between 5mm/h and 25mm/h)" : 2.19
    "Heavy rain (between 25mm/h and 50mm/h)" : 0.33
    "Extreme rain (more than 50 mm/h)" : 0.17
```
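The class shares in the chart come from binning hourly gauge readings into the four precipitation levels. The sketch below illustrates that binning on a small set of hypothetical readings (the variable names and sample values are made up for illustration; the real data comes from the 33 stations mentioned above):

```python
import pandas as pd

# Hypothetical hourly precipitation readings (mm/h) from a rain gauge.
readings = pd.Series([0.0, 0.0, 1.2, 7.5, 0.0, 30.0, 0.0, 55.0, 3.0, 0.0])

# Bin each reading into the four classes used in the chart above.
bins = [-float("inf"), 5, 25, 50, float("inf")]
labels = ["No rain/Light rain", "Moderate rain", "Heavy rain", "Extreme rain"]
classes = pd.cut(readings, bins=bins, labels=labels)

# Percentage of observations falling in each class.
percentages = classes.value_counts(normalize=True).sort_index() * 100
print(percentages)
```

On real gauge data, this computation produces the heavily skewed distribution shown in the chart.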

If we disregard the "No rain/Light rain" observations, the resulting pie chart is the following:

```mermaid
pie title Rainfall statistics
    "Moderate rain (between 5mm/h and 25mm/h)" : 2.19
    "Heavy rain (between 25mm/h and 50mm/h)" : 0.33
    "Extreme rain (more than 50 mm/h)" : 0.17
```
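The chart renderer renormalizes the three remaining shares so they sum to 100%. The same conditional percentages can be computed directly from the raw values:

```python
# Raw percentages of the full dataset, taken from the first chart.
raw = {"Moderate rain": 2.19, "Heavy rain": 0.33, "Extreme rain": 0.17}

# Renormalize so the three rainy classes sum to 100%.
total = sum(raw.values())  # rainy observations: only ~2.69% of the data
conditional = {k: round(100 * v / total, 1) for k, v in raw.items()}
print(conditional)
```

Even among rainy observations, moderate rain dominates: heavy and extreme events together make up less than a fifth of the rainy cases.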

Data imbalance makes it difficult for the model to learn the complex patterns associated with precipitation accurately. Consequently, this data imbalance issue poses a significant obstacle to the development of robust and reliable nowcasting models.
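One concrete consequence is that plain accuracy becomes a misleading training signal. With the class shares above, a degenerate model that always predicts "no rain" appears highly accurate while having no skill at all on the rain events we care about:

```python
# Class shares (in %) from the first pie chart above.
shares = {"no/light": 97.31, "moderate": 2.19, "heavy": 0.33, "extreme": 0.17}

# The accuracy of a constant "no rain" predictor equals the majority share.
majority_accuracy = max(shares.values()) / 100
print(f"{majority_accuracy:.2%}")  # high accuracy, zero skill on rain events
```

This is why imbalanced nowcasting datasets call for countermeasures such as the augmentation strategy described next.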

In AtmoSeer, data augmentation is an attempt to mitigate the imbalance between precipitation and non-precipitation events. Concretely, given a set of weather stations near the weather station of interest (WSoI), the script augment_datasets.py can be used to merge observations from these neighboring weather stations with observations made by the WSoI. Hence, the model trained for the WSoI's location will be fit using data from both the WSoI and its neighboring stations.
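The core idea can be sketched as a union of per-station observation tables. Note this is only an illustration of the merge concept with made-up column names and values; the actual augment_datasets.py operates on the pickled datasets produced earlier in the pipeline and its internals may differ:

```python
import pandas as pd

# Hypothetical per-station observation tables (illustrative only).
wsoi = pd.DataFrame({"station": ["A652", "A652"], "precip_mm_h": [0.0, 12.0]})
neighbor = pd.DataFrame({"station": ["A621", "A621"], "precip_mm_h": [6.0, 0.0]})

# Augmentation by union: stack the neighbor's observations onto the
# WSoI's data, so the model sees more precipitation examples.
augmented = pd.concat([wsoi, neighbor], ignore_index=True)
print(len(augmented))
```

Because the neighboring stations contribute additional precipitation observations, the augmented dataset is somewhat less skewed toward the non-precipitation class.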

As an example, the command below builds an augmented dataset for the station identified by A652. This augmented dataset is the union of that WSoI's data with data from three of its neighboring stations (identified by the codes A621, A636, and A627).

```shell
python src/augment_datasets.py -s A652 -p A652 -i A621 A636 A627
```

When building a nowcasting model with AtmoSeer, be aware that the data augmentation step is optional. If you choose to execute it, it should run after preprocessing the WSoI and its neighboring weather stations, and before the model training step. This is depicted in the diagram below.

```mermaid
graph TD;
    subgraph 1
        A621.parquet.gzip-->prepA[preprocess_ws.py]-->A621_preprocessed.parquet.gzip-->bldA[build_datasets.py]-->./data/datasets/A621.pickle
    end

    subgraph 2
        A636.parquet.gzip-->prepB[preprocess_ws.py]-->A636_preprocessed.parquet.gzip-->bldB[build_datasets.py]-->./data/datasets/A636.pickle
    end

    subgraph 3
        A627.parquet.gzip-->prepC[preprocess_ws.py]-->A627_preprocessed.parquet.gzip-->bldC[build_datasets.py]-->./data/datasets/A627.pickle
    end

    subgraph 4
        A652.parquet.gzip-->prepD[preprocess_ws.py]-->A652_preprocessed.parquet.gzip-->bldD[build_datasets.py]-->./data/datasets/A652.pickle
    end

    ./data/datasets/A621.pickle-->augment_datasets.py;
    ./data/datasets/A636.pickle-->augment_datasets.py;
    ./data/datasets/A627.pickle-->augment_datasets.py;
    ./data/datasets/A652.pickle-->augment_datasets.py;

    augment_datasets.py-->./data/datasets/A652_A621_A636_A627.pickle;

    ./data/datasets/A652_A621_A636_A627.pickle-->train_model.py
```