The Real-Time Data Pipeline uses Kafka, Zookeeper, Spark, and AWS services (S3, Glue, Athena, and RedShift) to simulate real-time IoT data processing. The project models a taxi driving from Birmingham to London, streaming telemetry throughout the journey. The data is ingested, processed, and stored for further analysis, demonstrating an end-to-end data pipeline.
- Real-Time Data Streaming: Simulates IoT data generation from a moving taxi.
- Data Ingestion: Utilizes Kafka and Zookeeper for reliable data transmission.
- Data Storage and Processing (see the Spark sketch after this list):
  - S3: Stores raw and processed data.
  - AWS Glue: Extracts and transforms data for analysis.
  - RedShift: Supports querying and visualizing the data.
- Scalability: Modular and scalable architecture orchestrated with Docker Compose.
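As a sketch of the processing stage, the snippet below shows how a Spark Structured Streaming job could consume the simulated taxi events from Kafka and land them in S3 as raw Parquet. It is illustrative only: the topic name `taxi_data`, the bucket `your-pipeline-bucket`, the broker address, and the record schema are assumptions rather than the repository's actual configuration, and the job needs the `spark-sql-kafka` and `hadoop-aws` packages on its classpath.

```python
# Illustrative Spark Structured Streaming job: Kafka -> S3 (Parquet).
# Topic, bucket, broker address, and schema are assumptions, not the
# repository's actual configuration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("taxi-pipeline").getOrCreate()

# Assumed shape of each simulated taxi reading.
schema = StructType([
    StructField("id", StringType()),
    StructField("timestamp", StringType()),
    StructField("latitude", DoubleType()),
    StructField("longitude", DoubleType()),
    StructField("speed_kmh", DoubleType()),
])

# Subscribe to the topic the simulator writes to and parse the JSON payloads.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "taxi_data")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Land the parsed events in S3 as Parquet for Glue and RedShift downstream.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://your-pipeline-bucket/raw/")
    .option("checkpointLocation", "s3a://your-pipeline-bucket/checkpoints/")
    .start()
)
query.awaitTermination()
```

Keeping the landing zone as raw Parquet lets Glue catalog and transform the data without touching the stream.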
Before you begin, ensure you have the following:

- AWS Credentials: An AWS account with access keys for S3, Glue, and RedShift.
- Docker: Installed and configured on your system.
- Python: Required to run the real-time data simulation.
- Clone the repository:

  ```bash
  git clone https://github.com/ChihTsungLu/Real-Time-Data-Pipeline.git
  ```

- Set up Docker Compose:
  - Ensure Docker is running on your machine.
  - Configure docker-compose.yml as needed.

- Provide your AWS credentials:

  ```python
  AWS_ACCESS_KEY = "YOUR_AWS_ACCESS_KEY"
  AWS_SECRET_KEY = "YOUR_AWS_SECRET_KEY"
  ```

  Ensure your AWS IAM role has the appropriate permissions for S3, Glue, and RedShift.
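  A quick way to confirm the keys work before starting the pipeline is a boto3 sanity check. This is a minimal sketch; the region is an assumption:

  ```python
  # Sketch: verify the configured keys resolve to a valid identity and can
  # reach S3. Region is an assumption; adjust for your account.
  import boto3

  session = boto3.Session(
      aws_access_key_id="YOUR_AWS_ACCESS_KEY",
      aws_secret_access_key="YOUR_AWS_SECRET_KEY",
      region_name="us-east-1",  # assumption
  )

  # Raises an error if the keys are invalid.
  print(session.client("sts").get_caller_identity()["Arn"])

  # Lists buckets only if the IAM policy grants S3 access.
  for bucket in session.client("s3").list_buckets()["Buckets"]:
      print(bucket["Name"])
  ```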
- Start the services:

  ```bash
  docker-compose up
  ```

- Run the data simulation (a sketch of what the simulator does follows these steps):

  ```bash
  python main.py
  ```
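For a sense of what `main.py` simulates, here is a minimal sketch using the kafka-python client. The topic name, message fields, step count, and send rate are illustrative assumptions rather than the repository's actual code:

```python
# Illustrative taxi simulator: interpolate a Birmingham -> London journey
# and stream one JSON reading per second to Kafka. Topic name, fields,
# and rate are assumptions, not the repo's actual implementation.
import json
import time
import uuid
from datetime import datetime, timezone

from kafka import KafkaProducer

BIRMINGHAM = (52.4862, -1.8904)  # approximate latitude/longitude
LONDON = (51.5074, -0.1278)
STEPS = 100  # assumption: number of simulated readings

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for i in range(STEPS + 1):
    t = i / STEPS
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Linear interpolation stands in for a real GPS trace.
        "latitude": BIRMINGHAM[0] + t * (LONDON[0] - BIRMINGHAM[0]),
        "longitude": BIRMINGHAM[1] + t * (LONDON[1] - BIRMINGHAM[1]),
        "speed_kmh": 90.0,
    }
    producer.send("taxi_data", value=record)
    time.sleep(1)

producer.flush()
```

Linear interpolation between the two cities is the simplest stand-in for a GPS trace; a real route would follow road geometry.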
- Monitor the pipeline:
  - Check Kafka topics for incoming messages (see the sketch after this list).
  - Verify data in S3 and RedShift.
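Both checks can be scripted. The sketch below tails the Kafka topic and lists the S3 prefix, reusing the illustrative names from the sketches above; RedShift can be verified from its query editor once data is loaded:

```python
# Sketch: confirm messages are flowing and objects are landing in S3.
# Topic and bucket names match the illustrative values used above.
import boto3
from kafka import KafkaConsumer

# Tail the Kafka topic for incoming taxi events.
consumer = KafkaConsumer(
    "taxi_data",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # stop after 10 seconds of silence
)
for message in consumer:
    print(message.value.decode("utf-8"))

# List the raw objects the streaming job has written to S3.
s3 = boto3.client("s3")
objects = s3.list_objects_v2(Bucket="your-pipeline-bucket", Prefix="raw/")
for obj in objects.get("Contents", []):
    print(obj["Key"])
```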
Technologies used:

- Kafka: Real-time data streaming.
- Zookeeper: Manages Kafka clusters.
- Spark: Processes streamed data.
- AWS S3: Stores raw and processed data.
- AWS Glue: Data transformation and ETL.
- AWS RedShift: Data warehousing and analytics.
- Docker Compose: Orchestrates the environment.
Planned improvements:

- Add support for multiple IoT devices.
- Integrate monitoring tools like Prometheus and Grafana.
- Implement data quality checks during ETL.