# Microservices for Kafka and Spark Stream Processing

## Overview

This project implements a microservices-based architecture for stream processing using Kafka and Spark. It provides endpoints for real-time data processing, analysis, and transformation. The microservices are containerized using Docker and deployed using Kubernetes, with support for distributed data pipelines.

---

## Features
- Real-time stream processing using Apache Spark and Kafka.
- Comprehensive logging for monitoring and debugging.
- Containerized microservices for portability and scalability.
- Load testing using Locust to ensure reliability and stability.
- Quantitative assessment of system performance (latency, throughput).

---

## Requirements
1. Install **Docker** and **Kubernetes**:
- Follow the official [Docker installation guide](https://docs.docker.com/get-docker/) and [Kubernetes installation guide](https://kubernetes.io/docs/tasks/tools/).
2. Install **Helm** to deploy Kafka and Spark:
- Kafka: [Bitnami Kafka Chart](https://github.com/bitnami/charts/tree/main/bitnami/kafka)
- Spark: [Bitnami Spark Chart](https://github.com/bitnami/charts/tree/main/bitnami/spark)
3. Install **Locust** for load testing:
```bash
pip install locust
```

---

## Setting up the Kubernetes Cluster
1. Start Minikube:
```bash
minikube start
```
2. Deploy Kafka and Spark using Helm:
```bash
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install kafka bitnami/kafka
helm install spark bitnami/spark
```
3. Verify deployment:
```bash
kubectl get all
```
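4. (Optional) Produce sample data to a Kafka topic. The snippet below is a minimal sketch rather than part of the repository: it assumes the `kafka-python` client (`pip install kafka-python`), a broker reachable at `localhost:9092` (e.g., via `kubectl port-forward svc/kafka 9092:9092`), and a hypothetical `emp_data` topic named after the sample file, which is assumed to hold a JSON array of records.
```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Assumed broker address; adjust to your cluster or port-forward setup.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The data/ files feed same-named topics here, so the sample employee
# records go to an assumed "emp_data" topic.
with open("data/emp_data.json") as f:
    for record in json.load(f):
        producer.send("emp_data", record)

producer.flush()
```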

---

## Running the Microservices
1. Build the Docker image:
```bash
docker build -t microservice .
```
2. Run the container:
```bash
docker run -p 5000:5000 microservice
```
3. Test the `/process` endpoint:
```bash
curl -X POST http://localhost:5000/process \
-H "Content-Type: application/json" \
-d '[{"id": 1, "gender": "M", "salary": 5000}, {"id": 2, "gender": "F", "salary": 6000}]'
```
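
For reference, a minimal sketch of what such a `/process` endpoint in `src/app.py` could look like (the aggregation shown is illustrative; the actual service may apply different transformations):
```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/process", methods=["POST"])
def process():
    """Aggregate posted records by gender (illustrative logic only)."""
    records = request.get_json()
    summary = {}
    for record in records:
        gender = record.get("gender")
        entry = summary.setdefault(gender, {"count": 0, "total_salary": 0})
        entry["count"] += 1
        entry["total_salary"] += record.get("salary", 0)
    return jsonify(summary)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```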

---

## Load Testing
1. Run the load test using Locust:
```bash
locust -f src/load_test.py --headless -u 100 -r 10 --run-time 1m --host http://localhost:5000
```
2. Sample results (excerpt):
```
Requests: 727
Avg Latency: 7121ms
Min Latency: 4906ms
Max Latency: 15603ms
Median Latency: 6400ms
95th Percentile: 11000ms
```
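
A `src/load_test.py` compatible with the command above could look roughly like this sketch (it assumes the test exercises the `/process` endpoint with the sample payload; the repository's script may differ):
```python
from locust import HttpUser, between, task


class ProcessUser(HttpUser):
    # Simulated users pause 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task
    def process_records(self):
        # Same sample payload as the curl test above.
        payload = [
            {"id": 1, "gender": "M", "salary": 5000},
            {"id": 2, "gender": "F", "salary": 6000},
        ]
        self.client.post("/process", json=payload)
```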

---

## Quantitative Assessment
The system was tested with 100 concurrent users and a ramp-up rate of 10 users per second. Below are the key metrics from the load tests:

| Metric | Value |
|---------------------|----------------------|
| Total Requests | 727 |
| Average Latency | 7121ms |
| Minimum Latency | 4906ms |
| Maximum Latency | 15603ms |
| Median Latency | 6400ms |
| 95th Percentile | 11000ms |
| Error Rate | 0% (0 failures) |

### Observations
- Average latency (7121ms) increased with the number of concurrent requests but remained within acceptable limits for most cases; over the one-minute run, the system sustained roughly 12 requests per second (727 requests / 60s).
- No failures were recorded, indicating good reliability under this load.
- Optimization opportunities exist to reduce peak latencies (e.g., refactoring Spark jobs or tuning Kafka configurations).

---

## Limitations
1. **Latency**: Average latency increases with high concurrency, especially for complex Spark jobs.
2. **Scalability**: Currently limited to a single-node Kafka and Spark setup.
3. **Monitoring**: Requires integration with tools like Prometheus or Grafana for better performance visualization.

---

## Potential Areas for Improvement
1. **Scaling**: Move to a multi-node cluster to improve scalability and reduce bottlenecks.
2. **Caching**: Use distributed caching (e.g., Redis) to speed up frequently accessed computations.
3. **Advanced Metrics**: Collect more detailed performance metrics using monitoring tools.
4. **CI/CD**: Extend the GitHub Actions pipeline to include integration tests and deployment to Kubernetes.

---

## AI Pair Programming Tools Used
1. **GitHub Copilot**:
- Assisted in generating initial code for Kafka-Spark integration.
- Suggested efficient ways to group and aggregate data streams.
2. **TabNine**:
- Provided code completions for Flask APIs and Spark transformations.
- Enhanced the quality of SQL-like Spark operations.

---

## Directory Structure
```
project-root/
├── .devcontainer/              # Docker configuration for GitHub Codespaces
│   ├── Dockerfile
│   ├── devcontainer.json
├── src/                        # Source code for microservices
│   ├── app.py                  # Flask API for data processing
│   ├── load_test.py            # Locust script for load testing
│   ├── count_gender.py         # Spark job: Count records by gender
│   ├── count_month.py          # Spark job: Count records by birth month
│   ├── final_analysis.py       # Spark job: Comprehensive analysis
│   ├── read_emp_data.py        # Spark job: Read employee data
│   ├── read_json_from_topic.py # Spark job: Read data from Kafka
├── data/                       # Sample data for testing
│   ├── emp_data.json
│   ├── json_topic.json
├── images/                     # Helm chart specifications for Kafka and Spark
│   ├── kafka.yml
│   ├── spark.yml
├── requirements.txt            # Python dependencies
├── README.md                   # Project documentation
```
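
As an illustration of the Spark jobs listed above, a structured-streaming job in the spirit of `src/count_gender.py` might look like the sketch below (the bootstrap server, topic name, and schema are assumptions; the actual job may differ):
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("count_gender").getOrCreate()

# Assumed record schema, matching the sample payload used elsewhere.
schema = StructType([
    StructField("id", IntegerType()),
    StructField("gender", StringType()),
    StructField("salary", IntegerType()),
])

# Read the employee topic as a stream; service and topic names are assumed.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "emp_data")
    .load()
)

# Parse the JSON payload and count records per gender.
counts = (
    stream.select(from_json(col("value").cast("string"), schema).alias("rec"))
    .groupBy("rec.gender")
    .count()
)

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```

Jobs are submitted from inside the Spark pod's terminal with the Kafka connector package:
```bash
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 [PATH_TO_FILE]
```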
