This project aims to establish a distributed training platform running on a Kubernetes cluster, utilizing the technology stack of Ray, MLflow, React, and Flask to implement an end-to-end machine learning workflow.
- Ray: Used for distributed training of PyTorch models, achieving efficient utilization of computational resources.
- MLflow: Employed for model saving and hyperparameter logging, providing model management and experiment tracking functionalities.
- React: Utilized for building a user-friendly frontend web interface, offering an intuitive user experience.
- Flask: Utilized for constructing the backend API, handling user requests and communicating with MLflow and Ray.
This project allows users to submit training jobs through a simple frontend interface; the system automatically allocates resources on the Kubernetes cluster, uses Ray for distributed training, and relies on MLflow for experiment tracking and model saving. Through the user interface provided by React, users can monitor training progress and view experiment results.
In short, the project offers a comprehensive, scalable machine learning development and deployment solution that makes model training and management easier.
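As a rough illustration of this end-to-end workflow, the sketch below uses Ray Train to run a PyTorch training loop across multiple workers and then logs the hyperparameters to MLflow. The model, data, and MLflow tracking URI are placeholders, not the project's actual code.

```python
import mlflow
import torch
import torch.nn as nn
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model

def train_loop_per_worker(config):
    # Placeholder model and data standing in for the project's stock model.
    model = prepare_model(nn.Linear(8, 1))  # wraps the model for DDP
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = nn.MSELoss()
    for _ in range(config["epochs"]):
        x, y = torch.randn(32, 8), torch.randn(32, 1)  # dummy batch
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

config = {"lr": 0.01, "epochs": 5}
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config=config,
    scaling_config=ScalingConfig(num_workers=2),  # CPU workers for now
)
result = trainer.fit()

# Record the run in MLflow; the tracking URI is an assumed in-cluster address.
mlflow.set_tracking_uri("http://mlflow-server:5000")
with mlflow.start_run():
    mlflow.log_params({**config, "num_workers": 2})
```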
Before running the project, ensure you have the necessary dependencies and Kubernetes components installed. Follow the steps below for installation:
- Install kubeadm: Refer to the official Kubernetes documentation for detailed instructions on installing kubeadm.
- Install cri-o: Follow the installation guide provided by the cri-o project on GitHub. Note: If you intend to enable GPU usage within containers, make sure to also install the NVIDIA k8s-device-plugin (deployed below).
To configure the NVIDIA runtime for cri-o, execute the following commands on each Kubernetes node:

```bash
sudo nvidia-ctk runtime configure --runtime=crio
sudo systemctl restart crio
```
Additionally, on the master node, execute the following command to deploy the NVIDIA device plugin:

```bash
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
```
Deploy Calico for networking by applying the manifest:

```bash
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.2/manifests/calico.yaml
```
Deploy the nginx-ingress controller using the provided manifest:

```bash
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.10.0/deploy/static/provider/cloud/deploy.yaml
```
Once you have installed these components, you can proceed with running the project on your Kubernetes cluster.
To run the project, follow the steps below:
Use the following command to deploy the backend server:

```bash
kubectl apply -f stable/k8s_yaml/backend.yaml
```
Run the following command to deploy the frontend:

```bash
kubectl apply -f stable/k8s_yaml/frontend.yaml
```
Deploy the MLflow server using the command:

```bash
kubectl apply -f stable/k8s_yaml/mlflow.yaml
```
Start port-forwarding to access the application:

```bash
kubectl port-forward --namespace=ingress-nginx service/ingress-nginx-controller 8080:80
```
To connect to the frontend webpage, use the following URL:
http://third-party-platform.localdev.me:8080
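If you prefer to verify connectivity from a script rather than a browser, a plain HTTP request against the same host should succeed once port-forwarding is active (a minimal check, assuming the frontend serves its root path):

```python
import requests

# Hits the frontend through the nginx-ingress port-forward set up above.
resp = requests.get("http://third-party-platform.localdev.me:8080/", timeout=5)
print(resp.status_code)  # expect 200 once the frontend Pod is ready
```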
Upon logging in, users are directed to the dashboard interface, where they can access different functionalities based on their user status:
- Functionality: Users can enter their username and click the login button to access the dashboard page.
- Image: Image A depicts the login interface.
- Functionality: The dashboard is divided into two sections, displaying different content based on the user's status:
- New User: The left section displays the "Create Distributed Training Environment" feature, which creates a Ray distributed environment in the Kubernetes cluster dedicated to the new user.
- Existing User: The right section displays the "Submit Training" feature and links to the Ray Dashboard (personal) and the MLflow Dashboard (shared). Users can train with the provided default model or upload their own model data when submitting. Submitted training jobs are sent to Ray, and upon completion the model and hyperparameters are stored in MLflow (a sketch of such a backend endpoint follows this list).
- Images:
- Image B shows the "Create Distributed Training Environment" feature on the left side of the dashboard interface.
- Image C shows the "Submit Training" feature on the right side of the dashboard interface.
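As referenced above, here is a minimal sketch of what the backend's submission endpoint could look like; the route, payload fields, service addresses, and the trivial remote task are assumptions rather than the project's actual API:

```python
import mlflow
import ray
from flask import Flask, jsonify, request

app = Flask(__name__)
ray.init(address="ray://ray-head:10001")              # assumed Ray head service
mlflow.set_tracking_uri("http://mlflow-server:5000")  # assumed MLflow service

@ray.remote
def train_remote(hyperparams):
    # Placeholder for the real distributed training routine.
    return {"loss": 0.0}

@app.route("/api/train", methods=["POST"])  # hypothetical route
def submit_training():
    hyperparams = request.get_json() or {}
    result = ray.get(train_remote.remote(hyperparams))
    # Persist hyperparameters and results to MLflow, as described above.
    with mlflow.start_run():
        mlflow.log_params(hyperparams)
        mlflow.log_metric("loss", result["loss"])
    return jsonify(result)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5001)
```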
All training tasks use the same basic stock prediction model. If you wish to modify the model, see here.
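The actual model definition lives in the repository; purely as a hypothetical stand-in, a basic stock predictor might look like an LSTM that maps a window of past prices to the next price:

```python
import torch
import torch.nn as nn

class StockPredictor(nn.Module):
    """Hypothetical stand-in: predicts the next price from a price window."""

    def __init__(self, input_size=1, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):              # x: (batch, window, input_size)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])   # last step -> next-price estimate

model = StockPredictor()
print(model(torch.randn(4, 30, 1)).shape)  # torch.Size([4, 1])
```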
- Improve Ingress Configuration: Configure Ingress to use a real domain name so the service can be reached beyond local environments, enhancing its availability and accessibility.
- Optimize GPU Resource Management Strategy: Future plans include more effective management of GPU resources. Currently, the plan is to use nodeSelector to restrict GPU worker Pods to nodes with GPUs, and to use pod anti-affinity to reduce the chance of multiple GPU worker Pods landing on the same node, improving GPU utilization and allocation efficiency. This enhancement will be verified and implemented in future development (see the sketch below).
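A sketch of those planned constraints, expressed with the Kubernetes Python client; the node label, Pod labels, and image tag are assumptions:

```python
from kubernetes import client

# Spread GPU workers across nodes: prefer not to schedule two Pods with the
# assumed label app=ray-gpu-worker onto the same hostname.
anti_affinity = client.V1Affinity(
    pod_anti_affinity=client.V1PodAntiAffinity(
        preferred_during_scheduling_ignored_during_execution=[
            client.V1WeightedPodAffinityTerm(
                weight=100,
                pod_affinity_term=client.V1PodAffinityTerm(
                    label_selector=client.V1LabelSelector(
                        match_labels={"app": "ray-gpu-worker"}
                    ),
                    topology_key="kubernetes.io/hostname",
                ),
            )
        ]
    )
)

pod_spec = client.V1PodSpec(
    node_selector={"gpu": "true"},  # assumed label marking GPU nodes
    affinity=anti_affinity,
    containers=[
        client.V1Container(
            name="ray-gpu-worker",
            image="rayproject/ray:2.9.0-gpu",  # illustrative image tag
            resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
        )
    ],
)
```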
Please note that the current setup is for CPU testing. Future iterations of the project will include support for GPU-based distributed training.