This repository contains the code and configuration for the EURO 2024 Real-Time Data Streaming and Visualization project. The project leverages a real-time data pipeline to stream football data (teams, players, matches, groups, and events) from an API into Kafka. The data is processed and stored in Apache Pinot for fast analytics and visualized using Apache Superset.
The primary goal of this project is to demonstrate how to build and manage a real-time data warehouse using modern tools and technologies. The data includes dynamic football statistics like live match events, player performance, and group standings for EURO 2024.
- Apache Kafka: Message broker used for real-time streaming of data.
- Apache Pinot: OLAP data store for fast querying and analytics.
- Apache Superset: Data visualization platform to create dashboards and monitor statistics.
- Apache Airflow: Workflow orchestration for automated data extraction and streaming.
The project consists of the following pipeline:
- API Extraction: Data is fetched from a JSON-based API.
- Kafka Streaming: Extracted data is streamed into Kafka topics.
- Processing with Pinot: Kafka feeds the data into Apache Pinot for storage and querying.
- Visualization in Superset: Real-time dashboards are created to visualize football statistics.
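The first two stages can be sketched as follows. This is a rough illustration only — the endpoint paths, RapidAPI header usage, topic names, and field names are assumptions, not taken from the repo's code; `pipelines/euro2024_function.py` is the authoritative implementation.

```python
import json
import urllib.request

# Entity type -> Kafka topic (topic names are illustrative assumptions).
TOPICS = {
    "teams": "euro2024_teams",
    "players": "euro2024_players",
    "matches": "euro2024_matches",
    "groups": "euro2024_groups",
}

def build_api_request(endpoint: str, api_key: str, host: str) -> urllib.request.Request:
    """Build an authenticated request to the RapidAPI gateway.

    RapidAPI authenticates via the X-RapidAPI-Key / X-RapidAPI-Host headers;
    the concrete host and endpoint paths come from the API's RapidAPI page.
    """
    return urllib.request.Request(
        f"https://{host}/{endpoint}",
        headers={"X-RapidAPI-Key": api_key, "X-RapidAPI-Host": host},
    )

def to_kafka_record(entity: dict, id_field: str = "id") -> tuple:
    """Serialize one API record into a (key, value) byte pair for Kafka.

    Keying by the entity id keeps every update for the same entity in the
    same partition, so downstream consumers see them in order.
    """
    key = str(entity[id_field]).encode("utf-8")
    value = json.dumps(entity).encode("utf-8")
    return key, value
```

With kafka-python, each pair would then be produced with something like `producer.send(TOPICS["players"], key=key, value=value)`.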
```
📦 Real Time Data Warehouse Streaming
├── dags/                     # Airflow DAGs for orchestration
│   ├── euro2024_data_streaming.py
├── pipelines/                # ETL pipeline definition scripts
│   ├── euro2024_function.py
│   ├── create_table_schema.py
├── schemas/                  # Schema definitions
│   ├── groups_schema.json    # Schema for group data
│   ├── matches_schema.json   # Schema for match data
│   ├── players_schema.json   # Schema for player data
│   └── teams_schema.json     # Schema for team data
├── table-configs/            # Table definitions
│   ├── groups_table.json     # Table config for group data
│   ├── matches_table.json    # Table config for match data
│   ├── players_table.json    # Table config for player data
│   └── teams_table.json      # Table config for team data
├── utils/                    # Utility files (constants, helper functions)
│   ├── constants.py
├── superset/                 # Apache Superset setup
│   ├── dockerfile            # Dockerfile for the Superset image
│   ├── superset_config.py    # Superset configuration
│   └── superset-init.sh      # Superset initialization script
├── Dockerfile                # Docker setup for containerized deployment
├── requirements.txt          # Python dependencies
├── .env                      # Environment variables
├── airflow.cfg               # Airflow configuration
└── README.md                 # Project documentation
```
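The JSON files under `schemas/` and `table-configs/` are registered with Pinot through the controller's REST API (`POST /schemas` and `POST /tables` respectively), which is presumably what `pipelines/create_table_schema.py` automates. A minimal sketch, assuming the controller's default port 9000:

```python
import urllib.request

PINOT_CONTROLLER = "http://localhost:9000"  # assumption: default controller port

def build_upload_request(payload: bytes, endpoint: str = "schemas",
                         controller: str = PINOT_CONTROLLER) -> urllib.request.Request:
    """Build the POST that registers a schema (or, with endpoint="tables",
    a table config) with the Pinot controller."""
    return urllib.request.Request(
        f"{controller}/{endpoint}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Example (network call commented out so the sketch stays self-contained):
# with open("schemas/teams_schema.json", "rb") as f:
#     urllib.request.urlopen(build_upload_request(f.read()))
```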
- Create a RapidAPI account:
  - Visit the EURO 2024 API on RapidAPI.
  - Sign up or log in to your RapidAPI account.
  - Subscribe to the API gateway and obtain your API key.
- Add the RapidAPI key:
  - Open `constants.py` in the `utils/` directory.
  - Add your RapidAPI key in the following format:
    `RAPIDAPI_KEY = "your-rapidapi-key"`
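A slightly safer variant (a sketch — the repo's `constants.py` may look different) reads the key from an environment variable first, so the secret never has to be committed:

```python
import os

# utils/constants.py (sketch): prefer the RAPIDAPI_KEY environment variable
# and fall back to the hard-coded placeholder only for local experiments.
RAPIDAPI_KEY = os.getenv("RAPIDAPI_KEY", "your-rapidapi-key")
```

The `.env` file at the project root is a natural place to set such variables.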
- Docker and Docker Compose installed.
- Apache Kafka, Pinot, and Superset configured.
- Airflow environment set up with Python 3.10.
- Clone the repository:
  `git clone https://github.com/evanmathew/euro-2024-kafka-pinot-pipeline.git`
  `cd euro-2024-kafka-pinot-pipeline`
- Install the dependencies:
  `pip install -r requirements.txt`
- Build the Docker image:
  `docker build -t euro-2024-kafka-pinot-pipeline .`
- Start the Docker containers:
  `docker-compose up`
- Launch Airflow:
  - Visit http://localhost:8080 to view and manage the DAGs.
- Visit
- Connect the Pinot database to Superset:
  - Visit http://localhost:8088 in your browser.
  - Log in with `admin` as both the username and password.
  - Go to Add Database and choose Apache Pinot.
  - Enter the SQLAlchemy URI:
    `pinot://pinot-broker:8099/query/sql?controller=http://pinot-controller:9000`
  - You can then start designing charts and dashboards.
- Airflow DAGs orchestrate the extraction of data from the API and its streaming into Kafka.
- Pinot stores the streamed data for fast querying.
- Superset Dashboards visualize the real-time statistics, offering insights into player performance, match events, and group standings.
- players: Contains player statistics like goals, assists, and appearances.
- teams: Stores team details such as coach, captain, and championships.
- matches: Holds match details including scores, lineups, and winners.
- groups: Tracks group standings, points, and goal differences.
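Once these tables are loaded, Superset's SQL Lab (or the Pinot query console) can run analytics on them directly. For instance, a group-standings query might look like the following — the column names are illustrative assumptions, so check `schemas/groups_schema.json` for the real ones:

```python
# Hypothetical standings query; column names are assumptions.
STANDINGS_QUERY = (
    "SELECT team, points, goal_difference "
    "FROM groups "
    "ORDER BY points DESC, goal_difference DESC"
)
```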
Below is the high-level architecture of the real-time data streaming pipeline:
The following diagram shows the Airflow DAG for orchestrating the data pipeline:
Feel free to open issues or submit pull requests for any feature improvements or bug fixes.
This project is licensed under the MIT License - see the LICENSE file for details.
Happy Streaming!