The Real-Time Data Pipeline uses Kafka, Zookeeper, Spark, and AWS services (S3, Glue, Athena, and RedShift) to simulate real-time IoT data processing. The project models a taxi driving from Birmingham to London, streaming telemetry throughout the journey. The data is ingested, processed, and stored for further analysis, demonstrating an end-to-end data pipeline.
- Real-Time Data Streaming: Simulates IoT data generation from a moving taxi.
- Data Ingestion: Utilizes Kafka and Zookeeper for reliable data transmission.
- Data Storage and Processing (see the Spark sketch after this list):
  - S3: Stores raw and processed data.
  - AWS Glue: Extracts and transforms data for analysis.
  - RedShift: Supports querying and visualizing the data.
- Scalability: Modular and scalable architecture orchestrated with Docker Compose.
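As a sketch of the processing stage, the snippet below shows how a Spark Structured Streaming job could consume the simulated taxi events from Kafka and land them in S3 as raw Parquet. It is illustrative only: the topic name `taxi_data`, the bucket `your-pipeline-bucket`, the broker address, and the record schema are assumptions rather than the repository's actual configuration, and the job needs the `spark-sql-kafka` and `hadoop-aws` packages on its classpath.

```python
# Illustrative Spark Structured Streaming job: Kafka -> S3 (Parquet).
# Topic, bucket, broker address, and schema are assumptions, not the
# repository's actual configuration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("taxi-pipeline").getOrCreate()

# Assumed shape of each simulated taxi reading.
schema = StructType([
    StructField("id", StringType()),
    StructField("timestamp", StringType()),
    StructField("latitude", DoubleType()),
    StructField("longitude", DoubleType()),
    StructField("speed_kmh", DoubleType()),
])

# Subscribe to the topic the simulator writes to and parse the JSON payloads.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "taxi_data")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Land the parsed events in S3 as Parquet for Glue and RedShift downstream.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://your-pipeline-bucket/raw/")
    .option("checkpointLocation", "s3a://your-pipeline-bucket/checkpoints/")
    .start()
)
query.awaitTermination()
```

Keeping the landing zone as raw Parquet lets Glue catalog and transform the data without touching the stream.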
Before you begin, ensure you have the following:

- AWS Credentials: An AWS account with access keys for S3, Glue, and RedShift.
- Docker: Installed and configured on your system.
- Python: Required to run the real-time data simulation.
- Clone the repository:

  ```bash
  git clone https://github.com/ChihTsungLu/Real-Time-Data-Pipeline.git
  ```

- Set up Docker Compose:
  - Ensure Docker is running on your machine.
  - Configure docker-compose.yml as needed.

- Provide your AWS credentials:

  ```python
  AWS_ACCESS_KEY = "YOUR_AWS_ACCESS_KEY"
  AWS_SECRET_KEY = "YOUR_AWS_SECRET_KEY"
  ```

  Ensure your AWS IAM role has the appropriate permissions for S3, Glue, and RedShift.
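  A quick way to confirm the keys work before starting the pipeline is a boto3 sanity check. This is a minimal sketch; the region is an assumption:

  ```python
  # Sketch: verify the configured keys resolve to a valid identity and can
  # reach S3. Region is an assumption; adjust for your account.
  import boto3

  session = boto3.Session(
      aws_access_key_id="YOUR_AWS_ACCESS_KEY",
      aws_secret_access_key="YOUR_AWS_SECRET_KEY",
      region_name="us-east-1",  # assumption
  )

  # Raises an error if the keys are invalid.
  print(session.client("sts").get_caller_identity()["Arn"])

  # Lists buckets only if the IAM policy grants S3 access.
  for bucket in session.client("s3").list_buckets()["Buckets"]:
      print(bucket["Name"])
  ```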
- Start the services:

  ```bash
  docker-compose up
  ```

- Run the data simulation (a sketch of what the simulator does follows these steps):

  ```bash
  python main.py
  ```
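For a sense of what `main.py` simulates, here is a minimal sketch using the kafka-python client. The topic name, message fields, step count, and send rate are illustrative assumptions rather than the repository's actual code:

```python
# Illustrative taxi simulator: interpolate a Birmingham -> London journey
# and stream one JSON reading per second to Kafka. Topic name, fields,
# and rate are assumptions, not the repo's actual implementation.
import json
import time
import uuid
from datetime import datetime, timezone

from kafka import KafkaProducer

BIRMINGHAM = (52.4862, -1.8904)  # approximate latitude/longitude
LONDON = (51.5074, -0.1278)
STEPS = 100  # assumption: number of simulated readings

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for i in range(STEPS + 1):
    t = i / STEPS
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Linear interpolation stands in for a real GPS trace.
        "latitude": BIRMINGHAM[0] + t * (LONDON[0] - BIRMINGHAM[0]),
        "longitude": BIRMINGHAM[1] + t * (LONDON[1] - BIRMINGHAM[1]),
        "speed_kmh": 90.0,
    }
    producer.send("taxi_data", value=record)
    time.sleep(1)

producer.flush()
```

Linear interpolation between the two cities is the simplest stand-in for a GPS trace; a real route would follow road geometry.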
- Monitor the pipeline:
  - Check Kafka topics for incoming messages (see the sketch after this list).
  - Verify data in S3 and RedShift.
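Both checks can be scripted. The sketch below tails the Kafka topic and lists the S3 prefix, reusing the illustrative names from the sketches above; RedShift can be verified from its query editor once data is loaded:

```python
# Sketch: confirm messages are flowing and objects are landing in S3.
# Topic and bucket names match the illustrative values used above.
import boto3
from kafka import KafkaConsumer

# Tail the Kafka topic for incoming taxi events.
consumer = KafkaConsumer(
    "taxi_data",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # stop after 10 seconds of silence
)
for message in consumer:
    print(message.value.decode("utf-8"))

# List the raw objects the streaming job has written to S3.
s3 = boto3.client("s3")
objects = s3.list_objects_v2(Bucket="your-pipeline-bucket", Prefix="raw/")
for obj in objects.get("Contents", []):
    print(obj["Key"])
```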
Technologies used:

- Kafka: Real-time data streaming.
- Zookeeper: Manages Kafka clusters.
- Spark: Processes streamed data.
- AWS S3: Stores raw and processed data.
- AWS Glue: Data transformation and ETL.
- AWS RedShift: Data warehousing and analytics.
- Docker Compose: Orchestrates the environment.
Planned improvements:

- Add support for multiple IoT devices.
- Integrate monitoring tools like Prometheus and Grafana.
- Implement data quality checks during ETL.