This document explains how to perform distributed training on Amazon EKS using TensorFlow and Horovod with a synthetic dataset.
- Create an EKS cluster with GPU nodes
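For example, a GPU-backed cluster can be created with eksctl. This is only a sketch: the cluster name, region, instance type, and node count are illustrative, though two p3.8xlarge nodes (4 GPUs each) line up with the `WORKERS=2` and `GPU=4` values used later.

```
# Sketch: GPU-backed EKS cluster via eksctl (name, region, instance type, and
# node count are assumptions; pick values that match your account and quota).
eksctl create cluster \
  --name dist-train \
  --region us-west-2 \
  --node-type p3.8xlarge \
  --nodes 2
# GPU nodes also need the NVIDIA device plugin DaemonSet
# (see the NVIDIA/k8s-device-plugin repository for the manifest).
```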
- Install Kubeflow
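The rest of this guide uses the ksonnet-era Kubeflow tooling, so the install mirrors the registry-add pattern used for the openmpi package below. A minimal sketch, assuming the `kubeflow/core` package and `core` prototype names from the v0.2-era docs (verify against the Kubeflow release you are using):

```
# Sketch: ksonnet-era Kubeflow install. Package, prototype, and version are
# assumptions from the v0.2-era docs; adjust to the release you actually use.
KF_VERSION=v0.2.5
ks init kubeflow-install && cd kubeflow-install
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${KF_VERSION}/kubeflow
ks pkg install kubeflow/core@${KF_VERSION}
ks generate core kubeflow-core --name=kubeflow-core
ks apply default -c kubeflow-core
```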
- Create the namespace:

```
NAMESPACE=kubeflow-dist-train
kubectl create namespace ${NAMESPACE}
```
- Create the ksonnet app:

```
APP_NAME=kubeflow-tf-hvd
ks init ${APP_NAME}
cd ${APP_NAME}
```
- Set it as the default namespace for the ksonnet environment:

```
ks env set default --namespace ${NAMESPACE}
```
- Create a secret for SSH access between the nodes:

```
SECRET=openmpi-secret
mkdir -p .tmp
yes | ssh-keygen -N "" -f .tmp/id_rsa
kubectl delete secret ${SECRET} -n ${NAMESPACE} || true
kubectl create secret generic ${SECRET} -n ${NAMESPACE} --from-file=id_rsa=.tmp/id_rsa --from-file=id_rsa.pub=.tmp/id_rsa.pub --from-file=authorized_keys=.tmp/id_rsa.pub
```
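Optionally, confirm that the secret was created in the right namespace (an extra check, not part of the original steps):

```
kubectl describe secret ${SECRET} -n ${NAMESPACE}
```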
- Install the Kubeflow openmpi component:

```
VERSION=master
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
ks pkg install kubeflow/openmpi@${VERSION}
```
- Build a Docker image for Horovod using the Dockerfile at training/distributed_training/Dockerfile:

```
docker image build -t arungupta/horovod .
```

Alternatively, use an image that already exists on Docker Hub:

```
IMAGE=rgaut/horovod:latest
```

or

```
IMAGE=armandmcqueen/horovod_benchmark:v1
```
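If you build the image yourself, the worker nodes must be able to pull it, so push it to a registry you control and point `IMAGE` at that name (a sketch; the repository name is a placeholder):

```
# Sketch: tag and push the locally built image to your own Docker Hub account,
# then set IMAGE so the ks generate step below picks it up.
# <your-dockerhub-user> is a placeholder, not a real account.
docker tag arungupta/horovod <your-dockerhub-user>/horovod:latest
docker push <your-dockerhub-user>/horovod:latest
IMAGE=<your-dockerhub-user>/horovod:latest
```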
- Define the number of workers (i.e. the number of machines) and the number of GPUs available per machine:

```
WORKERS=2
GPU=4
```
- Formulate the MPI command based on the official Horovod documentation:

```
EXEC="mpiexec -np 8 --hostfile /kubeflow/openmpi/assets/hostfile --allow-run-as-root --display-map --tag-output --timestamp-output -mca btl_tcp_if_exclude lo,docker0 --mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib sh -c 'NCCL_SOCKET_IFNAME=eth0 NCCL_MIN_NRINGS=8 NCCL_DEBUG=INFO python3.6 /examples/official-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_batches=100 --model vgg16 --batch_size 64 --variable_update horovod --horovod_device gpu --use_fp16'"
```

If more than one GPU is used, you may have to replace `NCCL_SOCKET_IFNAME=eth0` with `NCCL_SOCKET_IFNAME=^docker0`.

A few parts of the MPI command deserve explanation: `-np` is the total number of processes, which equals `WORKERS * GPU`.
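Rather than hard-coding the process count, it can be derived in the shell from the variables defined earlier (a small sketch; `NP` is an illustrative variable name):

```
# Sketch: derive the total MPI process count from workers and GPUs per worker.
# With WORKERS=2 and GPU=4 this yields -np 8, matching the EXEC command above.
NP=$((WORKERS * GPU))
EXEC="mpiexec -np ${NP} ..."   # remaining mpiexec flags as shown above
```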
- Generate the config:

```
COMPONENT=openmpi
ks generate openmpi ${COMPONENT} --image ${IMAGE} --secret ${SECRET} --workers ${WORKERS} --gpu ${GPU} --exec "${EXEC}"
```
- Deploy the config to your cluster:

```
ks apply default
```
- Check the pod status:

```
kubectl get pod -n ${NAMESPACE} -o wide
```
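The worker pods can take a little while to schedule and pull the image; optionally watch until every pod reports `Running` (an extra check, not part of the original steps):

```
kubectl get pod -n ${NAMESPACE} -w
```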
- Save the log:

```
mkdir -p results
kubectl logs -n ${NAMESPACE} -f ${COMPONENT}-master > results/benchmark_1.out
```

Here is a sample output.
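Once the run finishes, the headline throughput can be pulled out of the saved log (a sketch; it assumes the `total images/sec:` line that tf_cnn_benchmarks prints at the end of a run):

```
# Sketch: extract the aggregate throughput reported by tf_cnn_benchmarks.
grep "total images/sec" results/benchmark_1.out
```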
- To iterate quickly, remove the pods and the openmpi component, then recreate them starting from the `ks generate openmpi` command above:

```
ks delete default
ks component rm openmpi
```
- [Optional] EXEC command for 4 workers with 8 GPUs each:

```
WORKERS=4
GPU=8
EXEC="mpiexec -np 32 --hostfile /kubeflow/openmpi/assets/hostfile --allow-run-as-root --display-map --tag-output --timestamp-output -mca btl_tcp_if_exclude lo,docker0 --mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib sh -c 'NCCL_SOCKET_IFNAME=eth0 NCCL_MIN_NRINGS=8 NCCL_DEBUG=INFO python3.6 /examples/official-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_batches=100 --model vgg16 --batch_size 64 --variable_update horovod --horovod_device gpu --use_fp16'"
```

Then repeat the steps from generating the config through saving the log.