# Microservices for Kafka and Spark Stream Processing

## Overview

This project implements a microservices-based architecture for stream processing using Kafka and Spark. It provides endpoints for real-time data processing, analysis, and transformation. The microservices are containerized using Docker and deployed using Kubernetes, with support for distributed data pipelines.

---

## Features
- Real-time stream processing using Apache Spark and Kafka.
- Comprehensive logging for monitoring and debugging.
- Containerized microservices for portability and scalability.
- Load testing using Locust to ensure reliability and stability.
- Quantitative assessment of system performance (latency, throughput).

---

## Requirements
1. Install **Docker** and **Kubernetes**:
- Follow the official [Docker installation guide](https://docs.docker.com/get-docker/) and [Kubernetes installation guide](https://kubernetes.io/docs/tasks/tools/).
2. Install **Helm** to deploy Kafka and Spark:
- Kafka: [Bitnami Kafka Chart](https://github.com/bitnami/charts/tree/main/bitnami/kafka)
- Spark: [Bitnami Spark Chart](https://github.com/bitnami/charts/tree/main/bitnami/spark)
3. Install **Locust** for load testing:
```bash
pip install locust
```

---

## Setting up the Kubernetes Cluster
1. Start Minikube:
```bash
minikube start
```
2. Deploy Kafka and Spark using Helm:
```bash
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install kafka bitnami/kafka
helm install spark bitnami/spark
```
3. Verify deployment:
```bash
kubectl get all
```
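4. (Optional) Produce sample data to a Kafka topic. The snippet below is a minimal sketch rather than part of the repository: it assumes the `kafka-python` client (`pip install kafka-python`), a broker reachable at `localhost:9092` (e.g., via `kubectl port-forward svc/kafka 9092:9092`), and a hypothetical `emp_data` topic named after the sample file, which is assumed to hold a JSON array of records.
```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Assumed broker address; adjust to your cluster or port-forward setup.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The data/ files feed same-named topics here, so the sample employee
# records go to an assumed "emp_data" topic.
with open("data/emp_data.json") as f:
    for record in json.load(f):
        producer.send("emp_data", record)

producer.flush()
```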

---

## Running the Microservices
1. Build the Docker image:
```bash
docker build -t microservice .
```
2. Run the container:
```bash
docker run -p 5000:5000 microservice
```
3. Test the `/process` endpoint:
```bash
curl -X POST http://localhost:5000/process \
-H "Content-Type: application/json" \
-d '[{"id": 1, "gender": "M", "salary": 5000}, {"id": 2, "gender": "F", "salary": 6000}]'
```
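
For reference, a minimal sketch of what such a `/process` endpoint in `src/app.py` could look like (the aggregation shown is illustrative; the actual service may apply different transformations):
```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/process", methods=["POST"])
def process():
    """Aggregate posted records by gender (illustrative logic only)."""
    records = request.get_json()
    summary = {}
    for record in records:
        gender = record.get("gender")
        entry = summary.setdefault(gender, {"count": 0, "total_salary": 0})
        entry["count"] += 1
        entry["total_salary"] += record.get("salary", 0)
    return jsonify(summary)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```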

---

## Load Testing
1. Run the load test using Locust:
```bash
locust -f src/load_test.py --headless -u 100 -r 10 --run-time 1m --host http://localhost:5000
```
2. Sample results (excerpt):
```
Requests: 727
Avg Latency: 7121ms
Min Latency: 4906ms
Max Latency: 15603ms
Median Latency: 6400ms
95th Percentile: 11000ms
```
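
A `src/load_test.py` compatible with the command above could look roughly like this sketch (it assumes the test exercises the `/process` endpoint with the sample payload; the repository's script may differ):
```python
from locust import HttpUser, between, task


class ProcessUser(HttpUser):
    # Simulated users pause 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task
    def process_records(self):
        # Same sample payload as the curl test above.
        payload = [
            {"id": 1, "gender": "M", "salary": 5000},
            {"id": 2, "gender": "F", "salary": 6000},
        ]
        self.client.post("/process", json=payload)
```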

---

## Quantitative Assessment
The system was tested with 100 concurrent users and a ramp-up rate of 10 users per second. Below are the key metrics from the load tests:

| Metric | Value |
|---------------------|----------------------|
| Total Requests | 727 |
| Average Latency | 7121ms |
| Minimum Latency | 4906ms |
| Maximum Latency | 15603ms |
| Median Latency | 6400ms |
| 95th Percentile | 11000ms |
| Error Rate | 0% (0 failures) |

### Observations
- Average latency (7121ms) increased with the number of concurrent requests but remained within acceptable limits for most cases; over the one-minute run, the system sustained roughly 12 requests per second (727 requests / 60s).
- No failures were recorded, indicating good reliability under this load.
- Optimization opportunities exist to reduce peak latencies (e.g., refactoring Spark jobs or tuning Kafka configurations).

---

## Limitations
1. **Latency**: Average latency increases with high concurrency, especially for complex Spark jobs.
2. **Scalability**: Currently limited to a single-node Kafka and Spark setup.
3. **Monitoring**: Requires integration with tools like Prometheus or Grafana for better performance visualization.

---

## Potential Areas for Improvement
1. **Scaling**: Move to a multi-node cluster to improve scalability and reduce bottlenecks.
2. **Caching**: Use distributed caching (e.g., Redis) to speed up frequently accessed computations.
3. **Advanced Metrics**: Collect more detailed performance metrics using monitoring tools.
4. **CI/CD**: Extend the GitHub Actions pipeline to include integration tests and deployment to Kubernetes.

---

## AI Pair Programming Tools Used
1. **GitHub Copilot**:
- Assisted in generating initial code for Kafka-Spark integration.
- Suggested efficient ways to group and aggregate data streams.
2. **TabNine**:
- Provided code completions for Flask APIs and Spark transformations.
- Enhanced the quality of SQL-like Spark operations.

---

## Directory Structure
```
project-root/
├── .devcontainer/              # Docker configuration for GitHub Codespaces
│   ├── Dockerfile
│   ├── devcontainer.json
├── src/                        # Source code for microservices
│   ├── app.py                  # Flask API for data processing
│   ├── load_test.py            # Locust script for load testing
│   ├── count_gender.py         # Spark job: Count records by gender
│   ├── count_month.py          # Spark job: Count records by birth month
│   ├── final_analysis.py       # Spark job: Comprehensive analysis
│   ├── read_emp_data.py        # Spark job: Read employee data
│   ├── read_json_from_topic.py # Spark job: Read data from Kafka
├── data/                       # Sample data for testing
│   ├── emp_data.json
│   ├── json_topic.json
├── images/                     # Helm chart specifications for Kafka and Spark
│   ├── kafka.yml
│   ├── spark.yml
├── requirements.txt            # Python dependencies
├── README.md                   # Project documentation
```
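
As an illustration of the Spark jobs listed above, a structured-streaming job in the spirit of `src/count_gender.py` might look like the sketch below (the bootstrap server, topic name, and schema are assumptions; the actual job may differ):
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("count_gender").getOrCreate()

# Assumed record schema, matching the sample payload used elsewhere.
schema = StructType([
    StructField("id", IntegerType()),
    StructField("gender", StringType()),
    StructField("salary", IntegerType()),
])

# Read the employee topic as a stream; service and topic names are assumed.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "emp_data")
    .load()
)

# Parse the JSON payload and count records per gender.
counts = (
    stream.select(from_json(col("value").cast("string"), schema).alias("rec"))
    .groupBy("rec.gender")
    .count()
)

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```

Jobs are submitted from inside the Spark pod's terminal with the Kafka connector package:
```bash
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 [PATH_TO_FILE]
```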
