Average cumulative mass balance of "reference" glaciers worldwide from 1945-2014, sourced from the US EPA and the World Glacier Monitoring Service (WGMS). This dataset tracks the cumulative change in mass balance of a set of "reference" glaciers worldwide beginning in 1945. The values represent the average across all measured glaciers. Negative values indicate a net loss of ice and snow compared with the base year of 1945. For consistency, measurements are in meters of water equivalent, which represent changes in the average thickness of a glacier.
- Extract: The `extract` function in the ETL pipeline downloads the CSV file containing average cumulative mass balance data for "reference" glaciers worldwide from 1945-2014. The function retrieves the data from the specified URL, saves it to the Databricks File System (DBFS), and then reads it into a PySpark DataFrame for further processing.
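The extract step might be sketched as follows. The URL and DBFS path below are hypothetical placeholders (the pipeline's actual source URL is not shown here), so treat both as assumptions:

```python
import requests


def extract(url, dbfs_path="/dbfs/glaciers.csv"):
    """Download the glaciers CSV and save it to DBFS.

    The URL and path are hypothetical placeholders, not the
    pipeline's actual values.
    """
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail fast on HTTP errors
    with open(dbfs_path, "wb") as f:
        f.write(response.content)
    return dbfs_path
```

On Databricks, the saved file can then be read back into a DataFrame with `spark.read`, as the loading code later in this document shows.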
- Transform: The `transform_data` function filters the original DataFrame (`df`) into two subsets: 'nintys', covering data from the 1990s, and 'modern', covering data from the 2000s onwards. These subsets are registered as temporary views for further analysis. The function then returns the corresponding DataFrames (`nintys_df` and `modern_df`), making the relevant time periods available for downstream processing in the ETL pipeline.
- Load: The `load` function reads glacier data from a CSV file located in the Databricks File System (DBFS). It uses PySpark to read the file, applying a predefined schema. After loading the data into a DataFrame, the function displays it and saves it as a table named 'glaciers' in Spark SQL, providing a structured, queryable representation of the glacier dataset for further analysis.
The ETL pipeline utilizes Delta Lake for data storage. The provided code defines and overwrites a Delta table named 'glaciers' with the specified schema, including columns for 'Year,' 'Meancumulativemassbalance,' and 'Numberofobservations.' This Delta table serves as a persistent and versioned storage solution, enabling efficient data management and querying capabilities for glacier-related information in a structured format.
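A minimal sketch of that Delta write, assuming the table name 'glaciers' and an overwrite policy that permits schema changes; the original code is not reproduced verbatim, and a Delta-enabled Spark session (as on Databricks) is assumed:

```python
def load_to_delta(df, table_name="glaciers"):
    """Overwrite a Delta table with the glacier DataFrame.

    Sketch only; assumes a Databricks/Delta-enabled Spark session.
    """
    (
        df.write.format("delta")
        .mode("overwrite")  # replace any existing table contents
        .option("overwriteSchema", "true")  # allow schema changes on overwrite
        .saveAsTable(table_name)
    )
```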
The following code uses the Spark session to define a schema for the CSV file stored in DBFS on Databricks and load it into a DataFrame.
from pyspark.sql.types import StructType, StructField, StringType

# Define the schema for the CSV file
schema = StructType(
    [
        StructField("Year", StringType(), True),
        StructField("Meancumulativemassbalance", StringType(), True),
        StructField("Numberofobservations", StringType(), True),
    ]
)

# Read the CSV file from DBFS with the specified schema
df = (
    spark.read.format("csv")
    .option("header", "true")
    .schema(schema)
    .load("file:/dbfs/glaciers.csv")
)

# Show the contents of the DataFrame
display(df)
The following is the Mean Cumulative Mass Balance for the glaciers in the dataset, sorted in ascending order by year. Note that the values are negative and grow more negative over time, reflecting a cumulative loss of glacier mass relative to the 1945 base year.
- Prioritize Mitigation Strategies: Concentrate mitigation efforts on glaciers showing persistent negative 'Meancumulativemassbalance', indicating sustained ice and snow loss. Allocate resources strategically based on the severity of mass balance trends.
- Temporal Strategy Planning: Analyze temporal patterns in the 'Year' data to identify periods of accelerated mass loss. Tailor conservation strategies to address specific temporal trends, allowing for more effective intervention.
- Enhance Data Collection in Critical Years: Intensify data collection during years with extreme changes in 'Meancumulativemassbalance'. This targeted approach gives a more detailed understanding of the factors behind significant fluctuations and supports informed decision-making.