This repository provides a sample Apache Airflow project that orchestrates a daily ETL pipeline: it fetches weather data from OpenWeather's free API, cleans and validates it, and loads it into a PostgreSQL database. The result is a reproducible, extensible example that can feed downstream work such as training ML models on weather-related datasets or powering dashboard visualizations.
To run this project:
- Follow the steps in the Setup Instructions section to configure your environment, database, and connections.
- Download or clone the DAGs and related code, preserving the layout described in the Project Structure section.
- Use the Airflow UI to trigger and monitor the pipeline execution.
Below are some screenshots showcasing the Airflow UI and the `weather_data_pipeline` DAG in action. These visuals provide an overview of how the pipeline is orchestrated within Apache Airflow.
| Airflow Dashboard | Pipeline Execution Detail |
|---|---|
| Pipeline Graph View | Pipeline XCom |
- Airflow Dashboard (AirFlow01): Displays the main Airflow UI with an active `weather_data_pipeline` DAG.
- Pipeline Execution Logs (AirFlow02): Showcases task logs and data retrieved from the OpenWeather API.
- Pipeline Graph View (AirFlow03): Visualizes the flow of tasks in the pipeline, including `load_locations`, `fetch_weather_data`, and `insert_weather_data`.
- Pipeline DAG Details (AirFlow04): Summarizes DAG execution details such as run duration, task statuses, and DAG configuration.
- Automated Orchestration: Daily scheduled runs controlled by Airflow.
- Data Cleaning & Validation: Ensures consistent, reliable data for analysis (see the sketch after this list).
- Modular Design: Separate files for DAGs, database connection, and data processing.
- Easily Extensible: Add more cities, transformations, or ML tasks as needed.
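As a concrete illustration of the cleaning and validation step, the sketch below shows the kind of checks a pipeline like this typically applies. The column names (`city`, `temperature`, `fetched_at`) and thresholds are illustrative assumptions, not the project's actual schema.

```python
# Illustrative cleaning/validation sketch; column names and rules are assumed,
# not taken from this project's actual schema.
import pandas as pd


def clean_weather_records(records: list[dict]) -> pd.DataFrame:
    df = pd.DataFrame(records)
    df = df.dropna(subset=["city", "temperature"])                 # drop incomplete rows
    df["temperature"] = pd.to_numeric(df["temperature"], errors="coerce")
    df = df[df["temperature"].between(-90, 60)]                    # plausibility check (deg C)
    df["fetched_at"] = pd.to_datetime(df["fetched_at"], utc=True)  # normalize timestamps
    return df.drop_duplicates(subset=["city", "fetched_at"])       # one row per city and fetch time
```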
┌─────────────────┐
│ OpenWeather API │
└──────┬──────────┘
│ (Fetch JSON)
v
┌───────────────────┐
│ Airflow DAG │
│ (weather_pipeline)│
└──────┬────────────┘
│ (Cleanup & Transform)
v
┌─────────────────┐
│ DataProcessor │
└──────┬──────────┘
│ (Insert into DB)
v
┌─────────────────┐
│ DatabaseEngine │
└──────┬──────────┘
│ (SQLAlchemy engine)
v
┌─────────────────┐
│ PostgreSQL DB │
└─────────────────┘
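The diagram above corresponds to a three-task DAG. The sketch below is a minimal, hedged outline of what `weather_data_pipeline` could look like, assuming Airflow 2.4+ and the task names shown in the screenshots; the task bodies are placeholders, and the real logic lives in `dags/`, `data_processing/`, and `database_engine/`.

```python
# Minimal sketch of a weather_data_pipeline DAG (illustrative only; task bodies
# are placeholders, not this project's actual implementation).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_locations():
    """Placeholder: load the list of cities/locations to query."""


def fetch_weather_data():
    """Placeholder: call the OpenWeather API for each location and return the JSON payloads."""


def insert_weather_data():
    """Placeholder: clean the payloads and insert the rows into PostgreSQL."""


with DAG(
    dag_id="weather_data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # daily scheduled runs, as described above (Airflow 2.4+ syntax)
    catchup=False,
) as dag:
    load = PythonOperator(task_id="load_locations", python_callable=load_locations)
    fetch = PythonOperator(task_id="fetch_weather_data", python_callable=fetch_weather_data)
    insert = PythonOperator(task_id="insert_weather_data", python_callable=insert_weather_data)

    load >> fetch >> insert
```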
Important:
Apache Airflow is not fully supported on Windows. It is recommended to run Airflow on a Linux-based system.
- If you're using Windows, you can install and run Airflow using the Windows Subsystem for Linux (WSL) feature.
- Alternatively, you can use a Linux virtual machine (VM) or Docker for running Airflow on Windows.
To set up the project, please follow the instructions in the Setup folder in the order listed below.
- Airflow Setup: Step-by-step instructions for installing and configuring Apache Airflow.
- Database Setup: Guide for setting up the PostgreSQL database, including creating schemas and tables.
- Folder Setup: Instructions for verifying and setting up the required folder structure for your Airflow project.
- IDE Setup: Guide for installing and configuring Visual Studio Code (or another IDE) for developing Airflow DAGs and helper scripts.
- Airflow Database Connection Setup: Detailed steps for securely setting up database connections within Airflow.
- API Connection Setup: Instructions for obtaining an OpenWeather API key and securely adding it as an Airflow connection (a usage sketch follows this list).
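Once the database and API connections exist in Airflow, task code can read them through hooks instead of hard-coding credentials. The sketch below assumes connection IDs `postgres_weather_db` and `openweather_api` (use whatever IDs you created), that the API key is stored in the connection's password field, and that the `apache-airflow-providers-postgres` package is installed.

```python
# Hedged sketch: reading the connections configured above from task code.
# The connection IDs and the password-field convention are assumptions.
from airflow.hooks.base import BaseHook
from airflow.providers.postgres.hooks.postgres import PostgresHook


def get_openweather_api_key() -> str:
    # Assumes the API key was saved in the connection's password field.
    return BaseHook.get_connection("openweather_api").password


def get_db_engine():
    # Builds a SQLAlchemy engine from the Postgres connection, so no
    # credentials appear in the DAG or helper code.
    return PostgresHook(postgres_conn_id="postgres_weather_db").get_sqlalchemy_engine()
```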
The repository is organized as follows:
AirFlow-ML-Data-Integration/
├── dags/ # Airflow DAGs for ETL orchestration
├── data_processing/ # Scripts for cleaning and transforming data
├── database_engine/ # Database connection and query helpers
├── Setup/ # Setup instructions (Airflow, database, etc.)
│ ├── Airflow-Setup.md # Airflow setup steps
│ ├── Database-Setup.md # Database setup steps
├── tests/ # Unit tests for the pipeline
├── README.md # Main project readme
└── requirements.txt # Python dependencies
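To show how the `data_processing/` and `database_engine/` pieces meet, here is a hedged sketch of an insert step that appends a cleaned DataFrame to PostgreSQL. The table name, schema, connection ID, and function name are assumptions for illustration, not the project's actual code or DDL.

```python
# Hypothetical insert step: hand cleaned rows to PostgreSQL via SQLAlchemy.
# Table/schema names and the connection ID are assumptions.
import pandas as pd
from airflow.providers.postgres.hooks.postgres import PostgresHook


def insert_weather_dataframe(df: pd.DataFrame) -> int:
    engine = PostgresHook(postgres_conn_id="postgres_weather_db").get_sqlalchemy_engine()
    df.to_sql("weather", con=engine, schema="public", if_exists="append", index=False)
    return len(df)  # number of rows written
```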
Contributions are welcome and appreciated! To contribute to this project, please follow these steps:
- Fork the Repository: Click the "Fork" button on the GitHub page of this repository to create a copy under your own account.
- Create a New Branch: `git checkout -b feature/your-feature-name`. Choose a clear, descriptive name for your branch that reflects the changes you’re making.
- Make Your Changes:
  - Add or modify code, tests, or documentation as needed.
  - Ensure that your code adheres to the style and format defined by this project (PEP 8 for Python).
  - If you are adding new features, include tests or update existing tests to maintain coverage and confirm that your additions work as intended.
- Run Tests: for example, `pytest tests/`. Make sure all tests pass and there are no regressions.
- Commit Your Changes: `git add .` followed by `git commit -m "Add your commit message here"`. Write clear and concise commit messages that explain what your changes do.
- Push and Open a Pull Request: `git push origin feature/your-feature-name`, then go to your forked repository on GitHub and open a Pull Request (PR) against the main branch of this repository. Describe your changes, why they’re needed, and how to test them.
- Code Review and Feedback: Be open to feedback and make the requested changes where applicable.
- Merge: Once your PR is approved, it will be merged into the main branch.
Note: If you’re unsure about any aspect of your contribution or would like to propose an idea before coding, feel free to open an issue first. Constructive discussion helps ensure we move in a direction that benefits the entire community.