This project aims to establish a distributed training platform running on a Kubernetes cluster, utilizing the technology stack of Ray, MLflow, React, and Flask to implement an end-to-end machine learning workflow.
- Ray: Used for distributed training of PyTorch models, achieving efficient utilization of computational resources.
- MLflow: Employed for model saving and hyperparameter logging, providing model management and experiment tracking functionalities.
- React: Utilized for building a user-friendly frontend web interface, offering an intuitive user experience.
- Flask: Utilized for constructing the backend API, handling user requests and communicating with MLflow and Ray.
This project allows users to submit training jobs through a simple frontend interface; the system automatically allocates resources on the Kubernetes cluster, uses Ray for distributed training, and relies on MLflow for experiment tracking and model saving. Through the user interface provided by React, users can monitor training progress and view experiment results.
In short, the project offers a comprehensive, scalable machine learning development and deployment solution that makes model training and management easier.
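As a rough illustration of this end-to-end workflow, the sketch below uses Ray Train to run a PyTorch training loop across multiple workers and then logs the hyperparameters to MLflow. The model, data, and MLflow tracking URI are placeholders, not the project's actual code.

```python
import mlflow
import torch
import torch.nn as nn
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model

def train_loop_per_worker(config):
    # Placeholder model and data standing in for the project's stock model.
    model = prepare_model(nn.Linear(8, 1))  # wraps the model for DDP
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = nn.MSELoss()
    for _ in range(config["epochs"]):
        x, y = torch.randn(32, 8), torch.randn(32, 1)  # dummy batch
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

config = {"lr": 0.01, "epochs": 5}
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config=config,
    scaling_config=ScalingConfig(num_workers=2),  # CPU workers for now
)
result = trainer.fit()

# Record the run in MLflow; the tracking URI is an assumed in-cluster address.
mlflow.set_tracking_uri("http://mlflow-server:5000")
with mlflow.start_run():
    mlflow.log_params({**config, "num_workers": 2})
```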
Before running the project, ensure you have the necessary dependencies and Kubernetes components installed. Follow the steps below for installation:
- Install kubeadm: Refer to the official Kubernetes documentation for detailed instructions on installing kubeadm.
- Install cri-o: Follow the installation guide provided by the cri-o project on GitHub. Note: If you intend to enable GPU usage within containers, make sure to also install the NVIDIA k8s-device-plugin (deployed below).
To configure the NVIDIA runtime for cri-o, execute the following commands on each Kubernetes node:

```bash
sudo nvidia-ctk runtime configure --runtime=crio
sudo systemctl restart crio
```
Additionally, on the master node, execute the following command to deploy the NVIDIA device plugin:

```bash
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
```
Deploy Calico for networking by applying the manifest:

```bash
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.2/manifests/calico.yaml
```
Deploy the nginx-ingress controller using the provided manifest:

```bash
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.10.0/deploy/static/provider/cloud/deploy.yaml
```
Once you have installed these components, you can proceed with running the project on your Kubernetes cluster.
To run the project, follow the steps below:
Use the following command to deploy the backend server:

```bash
kubectl apply -f stable/k8s_yaml/backend.yaml
```
Run the following command to deploy the frontend:

```bash
kubectl apply -f stable/k8s_yaml/frontend.yaml
```
Deploy the MLflow server using the command:

```bash
kubectl apply -f stable/k8s_yaml/mlflow.yaml
```
Start port-forwarding to access the application:

```bash
kubectl port-forward --namespace=ingress-nginx service/ingress-nginx-controller 8080:80
```
To connect to the frontend webpage, use the following URL:
http://third-party-platform.localdev.me:8080
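If you prefer to verify connectivity from a script rather than a browser, a plain HTTP request against the same host should succeed once port-forwarding is active (a minimal check, assuming the frontend serves its root path):

```python
import requests

# Hits the frontend through the nginx-ingress port-forward set up above.
resp = requests.get("http://third-party-platform.localdev.me:8080/", timeout=5)
print(resp.status_code)  # expect 200 once the frontend Pod is ready
```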
Upon logging in, users are directed to the dashboard interface, where they can access different functionalities based on their user status:
- Functionality: Users can enter their username and click the login button to access the dashboard page.
- Image: Image A depicts the login interface.
- Functionality: The dashboard is divided into two sections, displaying different content based on the user's status:
- New User: The left section displays the "Create Distributed Training Environment" feature, which creates a Ray distributed environment in the Kubernetes cluster dedicated to the new user.
- Existing User: The right section displays the "Submit Training" feature and links to the Ray Dashboard (personal) and the MLflow Dashboard (shared). Users can train with the provided default model or upload their own model data when submitting. Submitted training jobs are sent to Ray, and upon completion the model and hyperparameters are stored in MLflow (a sketch of such a backend endpoint follows this list).
- Images:
- Image B shows the "Create Distributed Training Environment" feature on the left side of the dashboard interface.
- Image C shows the "Submit Training" feature on the right side of the dashboard interface.
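As referenced above, here is a minimal sketch of what the backend's submission endpoint could look like; the route, payload fields, service addresses, and the trivial remote task are assumptions rather than the project's actual API:

```python
import mlflow
import ray
from flask import Flask, jsonify, request

app = Flask(__name__)
ray.init(address="ray://ray-head:10001")              # assumed Ray head service
mlflow.set_tracking_uri("http://mlflow-server:5000")  # assumed MLflow service

@ray.remote
def train_remote(hyperparams):
    # Placeholder for the real distributed training routine.
    return {"loss": 0.0}

@app.route("/api/train", methods=["POST"])  # hypothetical route
def submit_training():
    hyperparams = request.get_json() or {}
    result = ray.get(train_remote.remote(hyperparams))
    # Persist hyperparameters and results to MLflow, as described above.
    with mlflow.start_run():
        mlflow.log_params(hyperparams)
        mlflow.log_metric("loss", result["loss"])
    return jsonify(result)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5001)
```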
All training tasks use the same basic stock prediction model. If you wish to modify the model, see here.
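The actual model definition lives in the repository; purely as a hypothetical stand-in, a basic stock predictor might look like an LSTM that maps a window of past prices to the next price:

```python
import torch
import torch.nn as nn

class StockPredictor(nn.Module):
    """Hypothetical stand-in: predicts the next price from a price window."""

    def __init__(self, input_size=1, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):              # x: (batch, window, input_size)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])   # last step -> next-price estimate

model = StockPredictor()
print(model(torch.randn(4, 30, 1)).shape)  # torch.Size([4, 1])
```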
- Improve Ingress Configuration: Configure Ingress to use a real domain name so the service can be reached beyond local environments, enhancing its availability and accessibility.
- Optimize GPU Resource Management Strategy: Future plans include more effective management of GPU resources. Currently, the plan is to use nodeSelector to restrict GPU worker Pods to nodes with GPUs, and to use pod anti-affinity to reduce the chance of multiple GPU worker Pods landing on the same node, improving GPU utilization and allocation efficiency. This enhancement will be verified and implemented in future development (see the sketch below).
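A sketch of those planned constraints, expressed with the Kubernetes Python client; the node label, Pod labels, and image tag are assumptions:

```python
from kubernetes import client

# Spread GPU workers across nodes: prefer not to schedule two Pods with the
# assumed label app=ray-gpu-worker onto the same hostname.
anti_affinity = client.V1Affinity(
    pod_anti_affinity=client.V1PodAntiAffinity(
        preferred_during_scheduling_ignored_during_execution=[
            client.V1WeightedPodAffinityTerm(
                weight=100,
                pod_affinity_term=client.V1PodAffinityTerm(
                    label_selector=client.V1LabelSelector(
                        match_labels={"app": "ray-gpu-worker"}
                    ),
                    topology_key="kubernetes.io/hostname",
                ),
            )
        ]
    )
)

pod_spec = client.V1PodSpec(
    node_selector={"gpu": "true"},  # assumed label marking GPU nodes
    affinity=anti_affinity,
    containers=[
        client.V1Container(
            name="ray-gpu-worker",
            image="rayproject/ray:2.9.0-gpu",  # illustrative image tag
            resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
        )
    ],
)
```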
Please note that the current setup is for CPU testing. Future iterations of the project will include support for GPU-based distributed training.