⚽ Real-Time Data Warehouse Streaming

This repository contains the code and configuration for the EURO 2024 Real-Time Data Streaming and Visualization project. The project leverages a real-time data pipeline to stream football data (teams, players, matches, groups, and events) from an API into Kafka. The data is processed and stored in Apache Pinot for fast analytics and visualized using Apache Superset.

🛠 Project Overview

The primary goal of this project is to demonstrate how to build and manage a real-time data warehouse using modern tools and technologies. The data includes dynamic football statistics like live match events, player performance, and group standings for EURO 2024.

🚀 Key Components

  • Apache Kafka: Message broker used for real-time streaming of data.
  • Apache Pinot: OLAP data store for fast querying and analytics.
  • Apache Superset: Data visualization platform to create dashboards and monitor statistics.
  • Apache Airflow: Workflow orchestration for automated data extraction and streaming.

📋 Architecture

The pipeline consists of the following stages:

  1. API Extraction: Data is fetched from a JSON-based API.
  2. Kafka Streaming: Extracted data is streamed into Kafka topics.
  3. Processing with Pinot: Kafka feeds the data into Apache Pinot for storage and querying.
  4. Visualization in Superset: Real-time dashboards are created to visualize football statistics.
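
As a rough illustration of stages 1–2, below is a minimal sketch of pulling JSON from an API and producing it into a Kafka topic. The endpoint URL, broker address, and topic name are placeholders, not values taken from this repository:

    import json

    import requests
    from kafka import KafkaProducer  # kafka-python client

    # Assumed broker address -- inside docker-compose this is usually the Kafka service name.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda record: json.dumps(record).encode("utf-8"),
    )

    # Placeholder endpoint; the real API URL and headers are defined in utils/constants.py.
    response = requests.get("https://example-euro2024-api.example.com/players", timeout=30)
    response.raise_for_status()

    for player in response.json():
        # Each JSON record becomes one message on the 'players' Kafka topic.
        producer.send("players", value=player)

    producer.flush()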

📂 Repository Structure

📦 Real-Time Data Warehouse Streaming
├── dags/                    # Airflow DAGs for orchestration
│   ├── euro2024_data_streaming.py
├── pipelines/               # ETL pipeline definition scripts
│   ├── euro2024_function.py
│   ├── create_table_schema.py
├── schemas/                 # Schema definitions 
│   ├── groups_schema.json          # Schema for group data
│   ├── matches_schema.json         # Schema for match data
│   ├── players_schema.json         # Schema for player data
│   └── teams_schema.json           # Schema for team data
├── table-configs/           # Table definitions 
│   ├── groups_table.json           # Table configs for group data
│   ├── matches_table.json          # Table configs for match data
│   ├── players_table.json          # Table configs for player data
│   └── teams_table.json            # Table configs for team data
├── utils/                   # Utility files (constants, helper functions)
│   ├── constants.py
├── superset/                # Apache Superset setup files
│   ├── dockerfile                  # Dockerfile for the Superset image
│   ├── superset_config.py          # Superset configuration file
│   └── superset-init.sh            # Initialization script for Superset
├── Dockerfile               # Docker setup for containerized deployment
├── requirements.txt         # Python dependencies
├── .env                     # Environment variables
├── airflow.cfg              # Airflow configuration
└── README.md                # Project documentation

⚙️ Installation and Setup

🛑 Prerequisites

  • Create a RapidAPI Account
    • Visit the EURO 2024 API on RapidAPI.
    • Sign up or log in to your RapidAPI account.
    • Subscribe to the API Gateway and obtain your API key.
  • Add the RapidAPI Key
    • Navigate to the constants.py file in the utils/ directory.
    • Add your RapidAPI key in the following format (a short usage sketch follows this list):
      RAPIDAPI_KEY = "your-rapidapi-key"
      
  • Docker and Docker Compose installed.
  • Apache Kafka, Pinot, and Superset configured.
  • Airflow environment set up with Python 3.10.
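
As referenced above, here is a minimal sketch of how the key in utils/constants.py is typically passed to a RapidAPI endpoint. The X-RapidAPI-Key/X-RapidAPI-Host headers are RapidAPI's standard convention; the host and path below are placeholders, not this project's actual endpoint:

    import requests

    from utils.constants import RAPIDAPI_KEY

    # Placeholder host and path -- substitute the EURO 2024 API values from RapidAPI.
    headers = {
        "X-RapidAPI-Key": RAPIDAPI_KEY,
        "X-RapidAPI-Host": "example-euro-2024-api.p.rapidapi.com",
    }
    response = requests.get(
        "https://example-euro-2024-api.p.rapidapi.com/matches",
        headers=headers,
        timeout=30,
    )
    print(response.json())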

Steps to Run the Project

  1. Clone the repository:

    git clone https://github.com/evanmathew/euro-2024-kafka-pinot-pipeline.git
    cd euro-2024-kafka-pinot-pipeline
  2. Install dependencies:

    pip install -r requirements.txt
  3. Build the Docker image:

    docker build -t euro-2024-kafka-pinot-pipeline .
  4. Start the Docker containers:

    docker-compose up
  5. Launch Airflow:

    • Visit http://localhost:8080 to view and manage DAGs.
  6. Connect Pinot Database with Superset:

    • Visit http://localhost:8088 in your browser.
    • Log in with admin as both the username and password.
    • Add a database and choose Apache Pinot as the database type.
    • Set the SQLAlchemy URI to pinot://pinot-broker:8099/query/sql?controller=http://pinot-controller:9000
    • Then start designing charts and dashboards.
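
Before building dashboards, it can help to confirm that data has actually landed in Pinot. Below is a minimal sketch using the broker's standard /query/sql REST endpoint; the players table name comes from this project's topics, and the localhost port mapping is an assumption about your docker-compose setup:

    import requests

    # Pinot brokers accept SQL over HTTP at /query/sql (port 8099 in this setup).
    query = {"sql": "SELECT * FROM players LIMIT 10"}
    response = requests.post("http://localhost:8099/query/sql", json=query, timeout=30)
    response.raise_for_status()

    # Result rows are returned under resultTable.rows.
    print(response.json()["resultTable"]["rows"])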

💡 Usage

  • Airflow DAGs orchestrate the extraction of data from the API and stream it to Kafka.
  • Pinot stores the streamed data for fast querying.
  • Superset Dashboards visualize the real-time statistics, offering insights into player performance, match events, and group standings.
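
To make the orchestration concrete, here is a minimal sketch of what dags/euro2024_data_streaming.py might look like. The callable name, import path, and schedule are assumptions based on the repository layout, not the exact implementation:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Assumed import: the extraction/streaming logic lives in pipelines/euro2024_function.py.
    from pipelines.euro2024_function import stream_euro2024_data

    with DAG(
        dag_id="euro2024_data_streaming",
        start_date=datetime(2024, 6, 1),
        schedule_interval="@hourly",  # assumed cadence
        catchup=False,
    ) as dag:
        # Pulls data from the API and produces it to the Kafka topics consumed by Pinot.
        stream_task = PythonOperator(
            task_id="stream_euro2024_data",
            python_callable=stream_euro2024_data,
        )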

📊 Topics and Schemas

  • players: Contains player statistics like goals, assists, and appearances.
  • teams: Stores team details such as coach, captain, and championships.
  • matches: Holds match details including scores, lineups, and winners.
  • groups: Tracks group standings, points, and goal differences.
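
Each topic has a matching Pinot schema (schemas/) and table config (table-configs/) that are registered through the Pinot controller's REST API. Below is a minimal sketch of that registration step, assuming the controller is reachable on localhost:9000; pipelines/create_table_schema.py presumably does something similar, but this is not the repository's exact code:

    import json

    import requests

    CONTROLLER = "http://localhost:9000"  # assumed Pinot controller address

    # Upload the schema first, then the table config that references it.
    with open("schemas/players_schema.json") as f:
        requests.post(f"{CONTROLLER}/schemas", json=json.load(f), timeout=30).raise_for_status()

    with open("table-configs/players_table.json") as f:
        requests.post(f"{CONTROLLER}/tables", json=json.load(f), timeout=30).raise_for_status()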

🏗 Architecture

Below is the high-level architecture of the real-time data streaming pipeline:

Architecture Diagram


Airflow DAG Pipeline

The following diagram shows the Airflow DAG for orchestrating the data pipeline:

DAG Pipeline Diagram


Contributing

Feel free to open issues or submit pull requests for any feature improvements or bug fixes.

License

This project is licensed under the MIT License - see the LICENSE file for details.


Happy Streaming!
