Welcome! By completing this workshop you will learn how to run distributed data parallel model training on AWS EKS using PyTorch. The only prerequisite for this workshop is access to an AWS account. The steps included here will walk you through creating an AWS EKS cluster and a shared data volume, building a model training container image, downloading and pre-processing data, running distributed training of an image classification model, and finally running the model against new images to test it.
The workshop architecture at a high level can be visualized by the diagram below.
Fig. 1 - Workshop Infrastructure Architecture
The workshop is designed to introduce the concepts of deploying this architecture and running small-scale distributed training for educational purposes; however, the same architecture can be applied to training at large scale by adjusting the number and type of nodes in the EKS cluster, using accelerators (NVIDIA GPUs, AWS Trainium, Intel Habana Gaudi), and using high-performance shared storage such as FSx for Lustre. Further information and scripts that help deploy distributed training on EKS using GPUs and FSx can be found in the aws-do-eks open-source project.
This workshop is organized in a number of sequential steps. The scripts that belong to each step are organized in folders with corresponding names. To execute a step, we will change the current directory accordingly and execute scripts in their designated order. The prerequisites section is required, but there are no scripts associated with it. We will complete setting up prerequisites by following instructions. Steps 1 through 6 are required to complete the workshop. Step 7-Cleanup is optional.
Before we get started, we need to set up an AWS account and Cloud9 IDE from which we will execute all the steps in the workshop. You will not be required to install anything on your computer. All of the steps in the workshop will be completed on the cloud through your browser. To set up your account and IDE, please follow the instructions in SETUP.md.
Fig. 1.0 - Step 1 - Create EKS cluster
In this step we will execute scripts to create a managed Kubernetes cluster using the Amazon Elastic Kubernetes Service (EKS). Later we will use this cluster to run our distributed model training job.
In the last part of your prerequisites setup, you cloned the workshop code into your Cloud9 IDE. To build our distributed training infrastructure on EKS, we will start by changing the current directory to 1-create-cluster.
cd 1-create-cluster
Many of the scripts provided in the workshop use the AWS CLI to access the AWS APIs in the account. That is why the AWS CLI needs to be configured with the credentials (access key ID and secret access key) we saved previously. The configuration of the EKS cluster is specified by a .yaml file, which we will also generate in this step.
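Under the hood, a configure step like this typically amounts to a few AWS CLI calls. The sketch below is an approximation (the template file name eks.yaml.template is hypothetical), not the workshop's exact script:

# Approximate contents of a configure step (a sketch, not the workshop's exact code):
aws configure                             # prompts for access key ID, secret access key, region, output format
aws sts get-caller-identity               # verify that the credentials work
envsubst < eks.yaml.template > eks.yaml   # render the cluster configuration from a template (hypothetical file name)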
Execute:
./1-1-configure.sh
Output:
The config profile (workshop) could not be found
Configuring AWS client ...
AWS Access Key ID [None]: ************
AWS Secret Access Key [None]: ****************************************
Default region name [None]: us-west-2
Default output format [None]: json
Generating cluster configuration eks.yaml ...
By default, Cloud9 uses AWS managed temporary credentials, which we override with the script. If the managed temporary credentials setting has not been disabled, as soon as the script completes, Cloud9 will display the following dialog.
Fig. 1.1 Cloud9 credentials dialogs
Please click Cancel in this dialog; immediately afterwards, another dialog appears. Please click Permanently disable in the second dialog. If these dialogs do not appear, then AWS managed temporary credentials have already been disabled in your Cloud9 IDE and you may proceed to the next step.
The Cloud9 IDE comes with Docker pre-installed. In order to provision an EKS cluster, we will install eksctl. To be able to execute commands against Kubernetes, we will install kubectl. We will also install other miscellaneous utilities like kubectx, kubetail, jq, yq, and will set up some shorthand command aliases (ll='ls -alh', k=kubectl, kc=kubectx, kn=kubens, kt=kubetail, ks=kubeshell) for convenience.
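Should you ever need to reproduce the tool installation by hand, the commands below are one common way to do it, using the standard published download locations (not necessarily the script's exact steps):

# Manual installation of eksctl and kubectl (a sketch, assuming a Linux amd64 host):
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin/
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl && sudo mv kubectl /usr/local/bin/
echo "alias k=kubectl" >> ~/.bashrc       # one of the shorthand aliases set up by the script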
Execute:
./1-2-install-tools.sh
Output:
Installing eksctl ...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 18.6M 100 18.6M 0 0 19.1M 0 --:--:-- --:--:-- --:--:-- 31.5M
0.66.0
Installing kubectl ...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 57.4M 100 57.4M 0 0 93.6M 0 --:--:-- --:--:-- --:--:-- 93.6M
Client Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.6-eks-49a6c0", GitCommit:"49a6c0bf091506e7bafcdb1b142351b69363355a", GitTreeState:"clean", BuildDate:"2020-12-23T22:13:28Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
...
Setting up aliases ...
Done setting up tools.
We will use eksctl and the generated eks.yaml configuration to launch a new EKS cluster.
Execute:
./1-3-create-cluster.sh
Output:
Creating EKS cluster ...
... using configuration from ./eks.yaml ...
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: do-eks
  version: "1.26"
  region: us-west-2
availabilityZones:
  - us-west-2a
  - us-west-2b
iam:
  withOIDC: true
managedNodeGroups:
  - name: wks-node
    instanceType: c5.4xlarge
    instancePrefix: workshop
    privateNetworking: true
    availabilityZones: ["us-west-2a","us-west-2b"]
    efaEnabled: false
    minSize: 0
    desiredCapacity: 2
    maxSize: 10
    volumeSize: 900
    iam:
      withAddonPolicies:
        cloudWatch: true
        autoScaler: true
        ebs: true
Sat Jun 4 06:06:16 UTC 2022
eksctl create cluster -f ./eks.yaml
2022-06-04 06:06:16 [ℹ] eksctl version 0.66.0
2022-06-04 06:06:16 [ℹ] using region us-west-2
2022-06-04 06:06:16 [ℹ] subnets for us-west-2a - public:192.168.0.0/19 private:192.168.64.0/19
2022-06-04 06:06:16 [ℹ] subnets for us-west-2b - public:192.168.32.0/19 private:192.168.96.0/19
2022-06-04 06:06:16 [ℹ] nodegroup "wks-node" will use "" [AmazonLinux2/1.21]
2022-06-04 06:06:16 [ℹ] using Kubernetes version 1.21
2022-06-04 06:06:16 [ℹ] creating EKS cluster "do-eks" in "us-west-2" region with managed nodes
2022-06-04 06:06:16 [ℹ] 1 nodegroup (wks-node) was included (based on the include/exclude rules)
2022-06-04 06:06:16 [ℹ] will create a CloudFormation stack for cluster itself and 0 nodegroup stack(s)
2022-06-04 06:06:16 [ℹ] will create a CloudFormation stack for cluster itself and 1 managed nodegroup stack(s)
2022-06-04 06:06:16 [ℹ] if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=do-eks'
2022-06-04 06:06:16 [ℹ] CloudWatch logging will not be enabled for cluster "do-eks" in "us-west-2"
2022-06-04 06:06:16 [ℹ] you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=us-west-2 --cluster=do-eks'
2022-06-04 06:06:16 [ℹ] Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "do-eks" in "us-west-2"
2022-06-04 06:06:16 [ℹ] 2 sequential tasks: { create cluster control plane "do-eks", 3 sequential sub-tasks: { 4 sequential sub-tasks: { wait for control plane to become ready, associate IAM OIDC provider, 2 sequential sub-tasks: { create IAM role for serviceaccount "kube-system/aws-node", create serviceaccount "kube-system/aws-node" }, restart daemonset "kube-system/aws-node" }, 1 task: { create addons }, create managed nodegroup "wks-node" } }
2022-06-04 06:06:16 [ℹ] building cluster stack "eksctl-do-eks-cluster"
2022-06-04 06:06:16 [ℹ] deploying stack "eksctl-do-eks-cluster"
...
2022-06-04 06:27:59 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-nodegroup-wks-node"
2022-06-04 06:27:59 [ℹ] waiting for the control plane availability...
2022-06-04 06:27:59 [✔] saved kubeconfig as "/home/ec2-user/.kube/config"
2022-06-04 06:27:59 [ℹ] no tasks
2022-06-04 06:27:59 [✔] all EKS cluster resources for "do-eks" have been created
2022-06-04 06:30:01 [ℹ] kubectl command should work with "/home/ec2-user/.kube/config", try 'kubectl get nodes'
2022-06-04 06:30:01 [✔] EKS cluster "do-eks" in "us-west-2" region is ready
Sat Jun 4 06:30:01 UTC 2022
Done creating EKS cluster
Updating kubeconfig ...
Added new context arn:aws:eks:us-west-2:620266777012:cluster/do-eks to /home/ec2-user/.kube/config
Displaying cluster nodes ...
NAME STATUS ROLES AGE VERSION
ip-192-168-111-138.us-west-2.compute.internal Ready <none> 3m3s v1.21.12-eks-5308cf7
ip-192-168-90-82.us-west-2.compute.internal Ready <none> 3m3s v1.21.12-eks-5308cf7
The eksctl command uses CloudFormation behind the scenes. In addition to the command output, provisioning progress can be seen in the CloudFormation console. Please expect that creation of the cluster may take up to 30 minutes.
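If you prefer the CLI over the console, the stack status can also be polled directly; the stack names below match the eksctl output shown above:

aws cloudformation describe-stacks --stack-name eksctl-do-eks-cluster --query 'Stacks[0].StackStatus' --output text
aws cloudformation describe-stacks --stack-name eksctl-do-eks-nodegroup-wks-node --query 'Stacks[0].StackStatus' --output text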
We are going to use the Kubeflow Training Operator to launch a distributed training job using a PyTorchJob custom resource. We will also use the Kubernetes Metrics Server to monitor node resource utilization in the cluster during training, and an etcd service as the rendezvous endpoint for the elastic workers. To deploy these to the EKS cluster, execute:
./1-4-deploy-packages.sh
Output:
Deploying Kubernetes Metrics Server ...
serviceaccount/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created
service/metrics-server created
deployment.apps/metrics-server created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
Deploying Kubeflow Training Operator ...
~/update-workshop/1-create-cluster/kubeflow-training-operator ~/update-workshop/1-create-cluster
namespace/kubeflow created
customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/mxjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/pytorchjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/tfjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/xgboostjobs.kubeflow.org created
serviceaccount/training-operator created
clusterrole.rbac.authorization.k8s.io/training-operator created
clusterrolebinding.rbac.authorization.k8s.io/training-operator created
service/training-operator created
deployment.apps/training-operator created
clusterrole.rbac.authorization.k8s.io/hpa-access created
clusterrolebinding.rbac.authorization.k8s.io/training-operator-hpa-access created
~/update-workshop/1-create-cluster
Deploying etcd ...
service/etcd-service created
deployment.apps/etcd created
The EKS cluster is now provisioned and prepared to run distributed training jobs.
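Before moving on, a few standard kubectl commands can confirm that everything deployed in this step is healthy (the namespaces below are assumptions based on the typical defaults for these components):

kubectl get nodes                                      # both workers should show STATUS Ready
kubectl -n kubeflow get pods                           # the training-operator pod should be Running
kubectl -n kube-system get deployment metrics-server   # assuming metrics-server was deployed to kube-system
kubectl get svc etcd-service                           # the rendezvous service, assuming the default namespace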
Fig. 2.0 - Step 2 - Create shared volume
With distributed data parallel training, all workers need to have access to the training data. We can achieve that by creating a shared volume which can be mounted in each of the worker pods.
To create a shared volume, we will use the scripts in the directory for step 2.
cd ../2-create-volume
First we will use the AWS CLI to provision an EFS file system.
Execute:
./2-1-create-efs.sh
Output:
Cluster name do-eks
VPC vpc-0ecd59e0bf1426491
Creating security group ...
{
"GroupId": "sg-0ab73460e1a1b3e67"
}
eks-efs-group NFS access to EFS from EKS worker nodes sg-0ab73460e1a1b3e67
...
Creating EFS volume ...
fs-0b15155937d1c6b83
subnet-07767ca17e93fe901 subnet-04859dc111ed82685
Creating mount target in subnet-07767ca17e93fe901 in security group sg-0ab73460e1a1b3e67 for efs fs-0b15155937d1c6b83
...
Done.
The EFS file system is now created and configured so that it can be accessed from the EKS cluster.
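For reference, the provisioning performed by the script boils down to a handful of AWS CLI calls along these lines (a sketch under assumptions, not the exact workshop code):

# Sketch of EFS provisioning; cluster name and VPC CIDR taken from the configuration above
vpc=$(aws eks describe-cluster --name do-eks --query 'cluster.resourcesVpcConfig.vpcId' --output text)
sg=$(aws ec2 create-security-group --group-name eks-efs-group \
    --description "NFS access to EFS from EKS worker nodes" --vpc-id $vpc --query 'GroupId' --output text)
aws ec2 authorize-security-group-ingress --group-id $sg --protocol tcp --port 2049 --cidr 192.168.0.0/16   # NFS port
fs=$(aws efs create-file-system --creation-token eks-efs --query 'FileSystemId' --output text)
for subnet in $(aws eks describe-cluster --name do-eks --query 'cluster.resourcesVpcConfig.subnetIds[]' --output text); do
  aws efs create-mount-target --file-system-id $fs --subnet-id $subnet --security-groups $sg
done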
In order to create a Kubernetes persistent volume claim (PVC) against the EFS file system, we need to deploy the EFS Container Storage Interface (CSI) driver to the cluster, then create a storage class and a persistent volume (PV). To do that, execute:
./2-2-create-pvc.sh
Output:
Checking EFS File System ...
EFS volume id fs-0b15155937d1c6b83
Deploying EFS CSI Driver ...
serviceaccount/efs-csi-controller-sa created
serviceaccount/efs-csi-node-sa created
clusterrole.rbac.authorization.k8s.io/efs-csi-external-provisioner-role created
clusterrolebinding.rbac.authorization.k8s.io/efs-csi-provisioner-binding created
deployment.apps/efs-csi-controller created
daemonset.apps/efs-csi-node created
csidriver.storage.k8s.io/efs.csi.aws.com configured
efs-csi-controller-66fcf64846-4dcbv 0/3 ContainerCreating 0 6s
efs-csi-controller-66fcf64846-df6p9 0/3 ContainerCreating 0 6s
efs-csi-node-7cnkt 0/3 ContainerCreating 0 6s
efs-csi-node-9ljw2 0/3 ContainerCreating 0 6s
Generating efs-sc.yaml ...
Applying efs-sc.yaml ...
storageclass.storage.k8s.io/efs-sc created
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
efs-sc efs.csi.aws.com Delete Immediate false 0s
gp2 (default) kubernetes.io/aws-ebs Delete WaitForFirstConsumer false 94m
Generating efs-pv.yaml ...
Applying efs-pv.yaml ...
persistentvolume/efs-pv created
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
efs-pv 5Gi RWX Retain Available efs-sc 11s
Creating persistent volume claim efs-pvc ...
persistentvolumeclaim/efs-pvc created
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
efs-pvc Bound efs-pv 5Gi RWX efs-sc 1s
Done.
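A quick way to verify the result of this step is to confirm that the storage class, volume, and claim are all in place:

kubectl get sc efs-sc
kubectl get pv efs-pv
kubectl get pvc efs-pvc    # STATUS should read Bound before proceeding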
Fig. 3.0 - Step 3 - Build deep learning container
In this step, we will build a container that has code to train our PyTorch model.
To do that we need to change the current directory to 3-build-container.
cd ../3-build-container
Please note that this folder contains a Dockerfile, Python scripts, and shell scripts. We will only need to execute the scripts that start with 3-*.
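The build step essentially wraps a docker build tagged for your account's ECR registry, similar to the sketch below (the image name matches the output that follows):

# Sketch of the build step; derives the registry URI from the configured account and region
region=$(aws configure get region)
account=$(aws sts get-caller-identity --query Account --output text)
docker build -t ${account}.dkr.ecr.${region}.amazonaws.com/pytorch-cpu:latest .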
To build the container image, execute:
./3-1-build.sh
Output:
inflating: aws/dist/awscli/data/dax/2017-04-19/completions-1.json
creating: aws/dist/awscli/data/health/2016-08-04/
16650K .......... .......... .......... .......... .......... 98% 29.1M 0s
16700K .......... .......... .......... .......... .......... 99% 23.6M 0s
16750K .......... .......... .......... .......... .......... 99% 16.3M 0s
16800K .......... .......... .......... .......... .......... 99% 25.4M 0s
16850K .......... .......... ..... 100% 268M=1.3s
2022-06-04 07:56:41 (12.3 MB/s) - '/tmp/etcd-v3.4.3/etcd-v3.4.3-linux-amd64.tar.gz' saved [17280028/17280028]
------------------------
etcdctl version: 3.4.3
API version: 3.4
------------------------
Finished installing etcd v3.4.3. To use: /usr/local/bin/(etcd | etcdctl)
Removing intermediate container 71951321d43d
...
Step 12/15 : ADD cifar10-model-train.py /workspace/
---> 622630ffa5b7
Step 13/15 : ADD cifar10-model-test.py /workspace/
---> 33974972d759
Step 14/15 : ADD cnn_model.py /workspace/
---> 8d1492e4f0a1
Step 15/15 : ADD data-prep.sh /workspace/
---> b1ec9d533050
Successfully built b1ec9d533050
Successfully tagged 620266777012.dkr.ecr.us-west-2.amazonaws.com/pytorch-cpu:latest
After it is built, the image needs to be pushed to ECR so it can be used by Kubernetes nodes.
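A typical ECR push sequence looks like the following, reusing the region and account variables from the build sketch above (the script may differ in details):

# Sketch of an ECR push: ensure the repository exists, log in, push
aws ecr create-repository --repository-name pytorch-cpu 2>/dev/null || true   # no-op if the repository already exists
aws ecr get-login-password | docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com
docker push ${account}.dkr.ecr.${region}.amazonaws.com/pytorch-cpu:latest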
Execute:
./3-2-push.sh
Output:
Logging in to 620266777012.dkr.ecr.us-west-2.amazonaws.com/ ...
WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
Pushing pytorch-cpu:latest to registry ...
The push refers to repository [620266777012.dkr.ecr.us-west-2.amazonaws.com/pytorch-cpu]
85fb7c19f7ba: Pushed
1915f933c51f: Pushed
69f193e41d27: Pushed
fac272423a4b: Pushed
3c8419b41ef5: Pushed
0f550fa492fc: Pushed
ff0f8f83e19d: Pushed
11c114e08199: Pushed
e9b65af3368a: Pushed
bf8cedc62fb3: Layer already exists
latest: digest: sha256:a7bc0842b2681a84ebbfeda35096d8d8f09baffdb0e8ce9d42d6b3f9d983ac6d size: 3459
Fig. 4.0 - Step 4 - Download data
In this step we will run a pod which mounts the persistent volume and downloads the CIFAR-10 dataset on it.
We will execute the scripts from directory 4-get-data.
cd ../4-get-data
The CIFAR-10 dataset consists of 32x32-pixel images grouped in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), with 6,000 images per class. To download this dataset and save it to the shared volume, execute:
./4-1-get-data.sh
Output:
Generating pod manifest ...
Creating efs-data-prep pod ...
pod/efs-data-prep-pod created
efs-data-prep-pod 0/1 ContainerCreating 0 3s
The data-prep pod status changes from ContainerCreating to Running to Completed. To show the current status, execute:
./4-2-show-status.sh
Output:
Describing data prep pod ...
Name: efs-data-prep-pod
Namespace: default
Priority: 0
...
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m23s default-scheduler Successfully assigned default/efs-data-prep-pod to ip-192-168-111-138.us-west-2.compute.internal
Normal Pulling 3m16s kubelet Pulling image "620266777012.dkr.ecr.us-west-2.amazonaws.com/pytorch-cpu:latest"
Normal Pulled 2m57s kubelet Successfully pulled image "620266777012.dkr.ecr.us-west-2.amazonaws.com/pytorch-cpu:latest" in 19.458971841s
Normal Created 2m43s kubelet Created container efs-data-prep-pod
Normal Started 2m43s kubelet Started container efs-data-prep-pod
Showing status of data prep pod ...
efs-data-prep-pod 0/1 Completed 0 3m23s
When the pod enters the Running or Completed status, you can display its log by executing:
./4-3-show-log.sh
Output:
Shared path - /efs-shared
--2022-06-05 06:50:53-- https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
Resolving www.cs.toronto.edu (www.cs.toronto.edu)... 128.100.3.30
Connecting to www.cs.toronto.edu (www.cs.toronto.edu)|128.100.3.30|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 170498071 (163M) [application/x-gzip]
Saving to: 'cifar-10-python.tar.gz'
0K .......... .......... .......... .......... .......... 0% 350K 7m56s
50K .......... .......... .......... .......... .......... 0% 695K 5m58s
100K .......... .......... .......... .......... .......... 0% 693K 5m18s
150K .......... .......... .......... .......... .......... 0% 18.6M 4m1s
200K .......... .......... .......... .......... .......... 0% 21.5M 3m14s
250K .......... .......... .......... .......... .......... 0% 732K 3m20s
300K .......... .......... .......... .......... .......... 0% 66.4M 2m51s
350K .......... .......... .......... .......... .......... 0% 63.0M 2m30s
400K .......... .......... .......... .......... .......... 0% 18.6M 2m15s
450K .......... .......... .......... .......... .......... 0% 60.7M 2m1s
500K .......... .......... .......... .......... .......... 0% 80.4M 1m50s
550K .......... .......... .......... .......... .......... 0% 745K 2m0s
600K .......... .......... .......... .......... .......... 0% 78.2M 1m51s
...
166250K .......... .......... .......... .......... .......... 99% 118M 0s
166300K .......... .......... .......... .......... .......... 99% 129M 0s
166350K .......... .......... .......... .......... .......... 99% 4.00M 0s
166400K .......... .......... .......... .......... .......... 99% 100M 0s
166450K .......... .......... .......... .......... .......... 99% 137M 0s
166500K .. 100% 3858G=4.8s
2022-06-05 06:50:59 (33.9 MB/s) - 'cifar-10-python.tar.gz' saved [170498071/170498071]
The last message, showing that the dataset was saved, indicates a successful download.
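If you want to inspect the shared volume directly, a throwaway pod that mounts the same PVC will do. The sketch below uses kubectl run with a pod-spec override and assumes the busybox image is acceptable in your cluster:

kubectl run efs-ls --rm -i --restart=Never --image=busybox \
  --overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"efs-ls","image":"busybox","command":["ls","-la","/efs-shared"],"volumeMounts":[{"name":"efs-pv","mountPath":"/efs-shared"}]}],"volumes":[{"name":"efs-pv","persistentVolumeClaim":{"claimName":"efs-pvc"}}]}}'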
Fig. 5.0 - Step 5 - Distributed data-parallel model training
Next we will execute the model training scripts from directory 5-train-model.
cd ../5-train-model
The Kubernetes manifests in this workshop are generated from templates, based on the configuration stored in the .env file. To generate the PyTorchJob manifest for our distributed training, execute:
./5-1-generate-pytorchjob.sh
Output:
Generating PyTorchJob manifest ...
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
name: cifar10-train
spec:
elasticPolicy:
rdzvBackend: etcd
rdzvHost: etcd-service
rdzvPort: 2379
minReplicas: 1
maxReplicas: 128
maxRestarts: 100
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 80
pytorchReplicaSpecs:
Worker:
replicas: 2
restartPolicy: OnFailure
template:
spec:
containers:
- name: pytorch
image: 999701187340.dkr.ecr.us-west-2.amazonaws.com/pytorch-cpu:latest
imagePullPolicy: IfNotPresent
env:
- name: PROCESSOR
value: "cpu"
command:
- python3
- -m
- torch.distributed.run
- /workspace/cifar10-model-train.py
- "--epochs=10"
- "--batch-size=128"
- "--workers=15"
- "--model-file=/efs-shared/cifar10-model.pth"
- "/efs-shared/cifar-10-batches-py/"
volumeMounts:
- name: efs-pv
mountPath: /efs-shared
# The following enables the worker pods to use increased shared memory
# which is required when specifying more than 0 data loader workers
- name: dshm
mountPath: /dev/shm
volumes:
- name: efs-pv
persistentVolumeClaim:
claimName: efs-pvc
- name: dshm
emptyDir:
medium: Memory
The manifest specifies an elastic job named cifar10-train. The job is configured to communicate with the rendezvous endpoint etcd-service:2379, which is the etcd service we launched in the same namespace. It is also configured to run two workers, each of them on a separate node. Each worker will execute the torchrun command and run training for 10 epochs.
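For orientation, the command in the manifest is roughly equivalent to invoking torchrun directly with explicit rendezvous flags. In the actual job the training operator injects these settings from elasticPolicy as environment variables, so the flags shown here are illustrative:

torchrun --rdzv_backend=etcd --rdzv_endpoint=etcd-service:2379 --rdzv_id=cifar10-train \
  /workspace/cifar10-model-train.py --epochs=10 --batch-size=128 --workers=15 \
  --model-file=/efs-shared/cifar10-model.pth /efs-shared/cifar-10-batches-py/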
Next we will launch the PyTorchJob by applying the generated manifest.
Execute:
./5-2-launch-pytorchjob.sh
Output:
Launching PyTorchJob ...
pytorchjob.kubeflow.org/cifar10-train created
Each launched worker is represented by a pod in the cluster. To see the status of the worker pods, execute:
./5-3-show-status.sh
Output:
cifar10-train-worker-0 1/1 Running 0 47s 192.168.109.172 ip-192-168-111-138.us-west-2.compute.internal <none> <none>
cifar10-train-worker-1 1/1 Running 0 47s 192.168.93.104 ip-192-168-90-82.us-west-2.compute.internal <none> <none>
Once the training starts, you will be able to see the CPU utilization of the two nodes rise.
Execute:
./5-4-show-utilization.sh
Output:
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-192-168-111-138.us-west-2.compute.internal 18246m 50% 2306Mi 3%
ip-192-168-90-82.us-west-2.compute.internal 17936m 50% 2322Mi 3%
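To keep an eye on utilization while the job runs, you can refresh these views continuously:

watch -n 10 kubectl top nodes    # refresh node utilization every 10 seconds
kubectl top pods                 # per-pod view, served by the metrics server we deployed in step 1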
After the worker pods have been created, we can see their combined logs using the kubetail tool.
Execute:
./5-5-show-logs.sh
Output:
Will tail 2 logs...
cifar10-train-worker-0
cifar10-train-worker-1
[cifar10-train-worker-0] INFO 2022-06-06 21:38:22,775 Keep-alive key /torchelastic/p2p/run_cifar10-train/rdzv/v_1/rank_0 is not renewed.
[cifar10-train-worker-0] INFO 2022-06-06 21:38:22,775 Rendevous version 1 is incomplete.
[cifar10-train-worker-0] INFO 2022-06-06 21:38:22,775 Attempting to destroy it.
[cifar10-train-worker-0] INFO 2022-06-06 21:38:22,776 Destroyed rendezvous version 1 successfully.
[cifar10-train-worker-0] INFO 2022-06-06 21:38:22,776 Previously existing rendezvous state changed. Will re-try joining.
[cifar10-train-worker-0] INFO 2022-06-06 21:38:22,776 Attempting to join next rendezvous
[cifar10-train-worker-0] INFO 2022-06-06 21:38:22,780 New rendezvous state created: {'status': 'joinable', 'version': '2', 'participants': []}
[cifar10-train-worker-0] INFO 2022-06-06 21:38:22,869 Joined rendezvous version 2 as rank 0. Full state: {'status': 'joinable', 'version': '2', 'participants': [0]}
[cifar10-train-worker-0] INFO 2022-06-06 21:38:22,869 Rank 0 is responsible for join last call.
[cifar10-train-worker-1] INFO 2022-06-06 21:38:22,776 Keep-alive key /torchelastic/p2p/run_cifar10-train/rdzv/v_1/rank_0 is not renewed.
[cifar10-train-worker-1] INFO 2022-06-06 21:38:22,776 Rendevous version 1 is incomplete.
[cifar10-train-worker-1] INFO 2022-06-06 21:38:22,777 Attempting to destroy it.
[cifar10-train-worker-1] INFO 2022-06-06 21:38:22,778 Rendezvous attempt failed, will retry. Reason: Compare failed : [{"status": "final", "version": "1", "participants": [0], "keep_alives": ["/torchelastic/p2p/run_cifar10-train/rdzv/v_1/rank_0"], "num_workers_waiting": 2} != {"status": "setup"}]
[cifar10-train-worker-1] INFO 2022-06-06 21:38:23,779 Attempting to join next rendezvous
[cifar10-train-worker-1] INFO 2022-06-06 21:38:23,784 Observed existing rendezvous state: {'status': 'joinable', 'version': '2', 'participants': [0]}
[cifar10-train-worker-1] INFO 2022-06-06 21:38:23,816 Joined rendezvous version 2 as rank 1. Full state: {'status': 'joinable', 'version': '2', 'participants': [0, 1]}
[cifar10-train-worker-1] INFO 2022-06-06 21:38:23,816 Waiting for remaining peers.
[cifar10-train-worker-0] INFO 2022-06-06 21:38:53,867 Rank 0 finished join last call.
[cifar10-train-worker-1] INFO 2022-06-06 21:38:53,869 All peers arrived. Confirming membership.
[cifar10-train-worker-0] INFO 2022-06-06 21:38:53,867 Waiting for remaining peers.
[cifar10-train-worker-0] INFO 2022-06-06 21:38:53,867 All peers arrived. Confirming membership.
[cifar10-train-worker-0] INFO 2022-06-06 21:38:53,890 Waiting for confirmations from all peers.
[cifar10-train-worker-1] INFO 2022-06-06 21:38:53,913 Waiting for confirmations from all peers.
[cifar10-train-worker-0] INFO 2022-06-06 21:38:53,913 Rendezvous version 2 is complete. Final state: {'status': 'final', 'version': '2', 'participants': [0, 1], 'keep_alives': ['/torchelastic/p2p/run_cifar10-train/rdzv/v_2/rank_0', '/torchelastic/p2p/run_cifar10-train/rdzv/v_2/rank_1'], 'num_workers_waiting': 0}
[cifar10-train-worker-1] INFO 2022-06-06 21:38:53,915 Rendezvous version 2 is complete. Final state: {'status': 'final', 'version': '2', 'participants': [0, 1], 'keep_alives': ['/torchelastic/p2p/run_cifar10-train/rdzv/v_2/rank_0', '/torchelastic/p2p/run_cifar10-train/rdzv/v_2/rank_1'], 'num_workers_waiting': 0}
[cifar10-train-worker-1] INFO 2022-06-06 21:38:53,915 Creating EtcdStore as the c10d::Store implementation
[cifar10-train-worker-0] INFO 2022-06-06 21:38:53,913 Creating EtcdStore as the c10d::Store implementation
[cifar10-train-worker-0] reading /efs-shared/cifar-10-batches-py/
[cifar10-train-worker-1] reading /efs-shared/cifar-10-batches-py/
[cifar10-train-worker-1] [1, 5] loss: 2.335
[cifar10-train-worker-0] [1, 5] loss: 2.323
[cifar10-train-worker-1] [1, 10] loss: 2.247
[cifar10-train-worker-0] [1, 10] loss: 2.225
[cifar10-train-worker-1] [1, 15] loss: 2.168
[cifar10-train-worker-0] [1, 15] loss: 2.163
[cifar10-train-worker-1] [1, 20] loss: 2.061
[cifar10-train-worker-0] [1, 20] loss: 2.077
[cifar10-train-worker-1] [1, 25] loss: 2.011
[cifar10-train-worker-0] [1, 25] loss: 2.010
[cifar10-train-worker-1] [1, 30] loss: 1.963
[cifar10-train-worker-0] [1, 30] loss: 1.938
...
[cifar10-train-worker-1] [6, 180] loss: 0.496
[cifar10-train-worker-0] [6, 185] loss: 0.499
[cifar10-train-worker-1] [6, 185] loss: 0.503
[cifar10-train-worker-1] [6, 190] loss: 0.504
[cifar10-train-worker-0] [6, 190] loss: 0.594
[cifar10-train-worker-0] [6, 195] loss: 0.536
[cifar10-train-worker-1] [6, 195] loss: 0.522
[cifar10-train-worker-0] [7, 5] loss: 0.470
[cifar10-train-worker-1] [7, 5] loss: 0.464
[cifar10-train-worker-0] [7, 10] loss: 0.510
[cifar10-train-worker-1] [7, 10] loss: 0.465
[cifar10-train-worker-0] [7, 15] loss: 0.525
[cifar10-train-worker-1] [7, 15] loss: 0.489
[cifar10-train-worker-0] [7, 20] loss: 0.479
[cifar10-train-worker-1] [7, 20] loss: 0.478
[cifar10-train-worker-0] [7, 25] loss: 0.523
[cifar10-train-worker-1] [7, 25] loss: 0.520
...
[cifar10-train-worker-0] [10, 190] loss: 0.247
[cifar10-train-worker-1] [10, 190] loss: 0.185
[cifar10-train-worker-0] [10, 195] loss: 0.200
[cifar10-train-worker-1] [10, 195] loss: 0.202
[cifar10-train-worker-0] saving model: /efs-shared/cifar10-model.pth
[cifar10-train-worker-1] saving model: /efs-shared/cifar10-model.pth
[cifar10-train-worker-1] Finished Training
[cifar10-train-worker-0] Finished Training
At the beginning of the logs you will see the workers register with the rendezvous endpoint to coordinate their work; then they train collaboratively over 10 epochs. Each epoch has about 400 iterations. Since we are training with two workers, the work is split in two and each worker executes only about 200 iterations per epoch. As the training progresses, you will see the loss decrease, which indicates that the model is converging. At the end of the 10th epoch, we save the model to the shared volume.
Press Ctrl-C to stop tailing the logs at any time.
If you wish to run another instance of the elastic job, please delete the current job first.
Execute:
./5-6-delete-pytorchjob.sh
Output:
Deleting PyTorchJob ...
pytorchjob.kubeflow.org "cifar10-train" deleted
Note: when starting a new job instance, if the workers fail to start with errors indicating failure to connect to the rendezvous service, please delete the etcd pod as well before starting the elastic job.
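Deleting the etcd pod is safe because its deployment recreates it immediately. A label selector along these lines usually works, though the app=etcd label is an assumption; verify the actual labels with kubectl get pods --show-labels:

kubectl delete pod -l app=etcd    # assuming the etcd deployment labels its pods app=etcd
kubectl get pods                  # wait for the replacement etcd pod to reach Running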
Fig. 6.0 - Step 6 - Test model
This step will be executed from directory 6-test-model.
cd ../6-test-model
We are going to use a standard Kubernetes Job manifest (as opposed to the PyTorchJob manifest we used for training), since we do not need to run the test in a distributed manner. To generate the job manifest, execute:
./6-1-generate-job.sh
Output:
Generating test job manifest ...
apiVersion: batch/v1
kind: Job
metadata:
  name: cifar10-test
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        beta.kubernetes.io/instance-type: c5.4xlarge
      containers:
        - name: test
          image: 042407962002.dkr.ecr.us-west-2.amazonaws.com/pytorch-cpu:latest
          imagePullPolicy: Always
          command: ["python3"]
          args:
            - "/workspace/cifar10-model-test.py"
            - "--model-file=/efs-shared/cifar10-model.pth"
            - "/efs-shared/cifar-10-batches-py/"
          volumeMounts:
            - name: efs-pv
              mountPath: /efs-shared
      volumes:
        - name: efs-pv
          persistentVolumeClaim:
            claimName: efs-pvc
As is evident from the manifest above, we will create a single pod named cifar10-test and execute the cifar10-model-test.py script in it, passing the model file that we saved during the training step.
The test job will take 10,000 images that were not used during training and use the model to classify them. Then it will calculate accuracy measurements.
Execute:
./6-2-launch-job.sh
Output:
Launching test job ...
job.batch/cifar10-test created
When the job manifest is applied, a pod is created and runs to completion. To see the pod status, execute:
./6-3-show-status.sh
Output:
Showing test job status ...
cifar10-test-tlnjn 1/1 Running 0 4s
The results from the test are written to the pod log.
Execute:
./6-4-show-log.sh
Output:
Showing cifar10-test log ...
reading /efs-shared/cifar-10-batches-py/
loading model /efs-shared/cifar10-model.pth
Accuracy of the network on the 10000 test images: 74 %
Accuracy for class: plane is 77.5 %
Accuracy for class: car is 85.2 %
Accuracy for class: bird is 64.4 %
Accuracy for class: cat is 53.3 %
Accuracy for class: deer is 71.3 %
Accuracy for class: dog is 66.6 %
Accuracy for class: frog is 85.9 %
Accuracy for class: horse is 80.5 %
Accuracy for class: ship is 82.8 %
Accuracy for class: truck is 82.4 %
Finished Testing
As we can see, the model classified the images into 10 different categories with an overall accuracy of 74%.
In the event that the test job needs to be run again for a different model, the old job needs to be deleted first.
Execute:
./6-5-delete-job.sh
Output:
Deleting test job ...
job.batch "cifar10-test" deleted
We have run distributed training on two nodes. Edit the autoscaling group to set the desired number of nodes to 4, then modify the configuration file .env to reflect the new number of nodes and re-run the training job. You will notice that the time to run 10 epochs decreases as the workload gets distributed among more nodes.
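Instead of editing the autoscaling group in the console, the managed nodegroup can also be scaled with eksctl:

eksctl scale nodegroup --cluster=do-eks --name=wks-node --nodes=4 --nodes-max=10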
Fig. 7.0 - Step 7 - Cleanup
Optionally you can execute the scripts in the cleanup folder to delete the shared storage volume and the EKS cluster you created for this workshop.
cd ../7-cleanup
The EFS file system needs to be deleted first since it is associated with subnets within the VPC used by the EKS cluster.
Execute:
./7-1-delete-efs.sh
Output:
Deleting EFS mount targets for File System fs-070041b9153fa56b8 ...
Deleting mount target fsmt-02128c3560394ce31
Deleting mount target fsmt-0f80225b4ba7580b0
Deleting EFS file system fs-070041b9153fa56b8 ...
Done.
Note: If an error occurs during the deletion of the file system, please wait for a minute and run the script again. The EFS file system can only be deleted after the mount targets are fully deleted.
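You can confirm that the mount targets are gone before retrying (substitute your own file system id):

aws efs describe-mount-targets --file-system-id fs-070041b9153fa56b8 --query 'MountTargets[].LifeCycleState' --output text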
Performing this step deletes all of the remaining infrastructure that was used in this workshop. This includes the node groups, cluster, NAT gateways, subnets, and VPC.
Execute:
./7-2-delete-cluster.sh
Output:
Deleting cluster do-eks. Proceed? [Y/n]: Y
Confirmed ...
2022-06-07 02:03:19 [ℹ] eksctl version 0.66.0
2022-06-07 02:03:19 [ℹ] using region us-west-2
2022-06-07 02:03:19 [ℹ] deleting EKS cluster "do-eks"
2022-06-07 02:03:20 [ℹ] deleted 0 Fargate profile(s)
2022-06-07 02:03:20 [✔] kubeconfig has been updated
2022-06-07 02:03:20 [ℹ] cleaning up AWS load balancers created by Kubernetes objects of Kind Service or Ingress
2022-06-07 02:03:27 [ℹ] 3 sequential tasks: { delete nodegroup "wks-node", 2 sequential sub-tasks: { 2 sequential sub-tasks: { delete IAM role for serviceaccount "kube-system/aws-node", delete serviceaccount "kube-system/aws-node" }, delete IAM OIDC provider }, delete cluster control plane "do-eks" [async] }
2022-06-07 02:03:27 [ℹ] will delete stack "eksctl-do-eks-nodegroup-wks-node"
2022-06-07 02:03:27 [ℹ] waiting for stack "eksctl-do-eks-nodegroup-wks-node" to get deleted
2022-06-07 02:03:27 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-nodegroup-wks-node"
...
2022-06-07 02:07:17 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-nodegroup-wks-node"
2022-06-07 02:07:17 [ℹ] will delete stack "eksctl-do-eks-addon-iamserviceaccount-kube-system-aws-node"
2022-06-07 02:07:17 [ℹ] waiting for stack "eksctl-do-eks-addon-iamserviceaccount-kube-system-aws-node" to get deleted
2022-06-07 02:07:17 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-addon-iamserviceaccount-kube-system-aws-node"
2022-06-07 02:07:34 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-addon-iamserviceaccount-kube-system-aws-node"
2022-06-07 02:07:34 [ℹ] deleted serviceaccount "kube-system/aws-node"
2022-06-07 02:07:34 [ℹ] will delete stack "eksctl-do-eks-cluster"
2022-06-07 02:07:34 [✔] all cluster resources were deleted
Please note that the cluster will be fully deleted only when the CloudFormation stack completes its removal. Only after the process in CloudFormation has finished will you be able to create a new cluster with the same name.
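Deletion progress can be confirmed from the CLI as well; once the stack is fully removed, the describe call fails with a "does not exist" error:

aws cloudformation describe-stacks --stack-name eksctl-do-eks-cluster --query 'Stacks[0].StackStatus' --output text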
Congratulations on completing this Distributed Model Training workshop! You now have experience with building and running a distributed model training architecture on AWS EKS. The techniques demonstrated here are generic and can be applied to your own distributed model training needs and at larger scale.
This repository is released under the MIT-0 License. See the LICENSE file for details.