Pre-Training Llama 3.1 on AWS Trainium using Ray and PyTorch Lightning #725
base: main
Commits: 6b03601, eedcded, 694b777, b69c12f

New file (+149 lines): RayCluster manifest for the Llama 3.1 training environment
```yaml
# This RayCluster configuration deploys a distributed training environment for Llama3
# using the AWS Neuron SDK and RayTrain on Amazon EKS.

# ----------------------------------------------------------------------
# NOTE: For detailed deployment instructions, refer to the DoEKS website
# (https://awslabs.github.io/data-on-eks/docs/category/training-on-eks).
# ----------------------------------------------------------------------

# ----------------------------------------------------------------------
# NOTE: We are using the default namespace for this deployment because the
# fsx-claim PVC is created under the default namespace by the Terraform blueprint.
# If you want to deploy the cluster in a dedicated namespace, ensure that the
# FSx for Lustre file system is also created in the same namespace, since PVCs
# are namespace-bound.
# ----------------------------------------------------------------------

# Docs for Volcano with KubeRay: https://docs.ray.io/en/master/cluster/kubernetes/k8s-ecosystem/volcano.html
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: llama3-training-queue
  namespace: default
spec:
  weight: 1
  capability:
    cpu: '4000'
    memory: 12000Gi

---
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: kuberay-trn1
  namespace: default
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: llama3-training-queue
spec:
  rayVersion: 2.22.0
  headGroupSpec:
    # Head Node Configuration
    # This section defines the specification for the Ray head pod.
    # The head node manages the cluster and provides services like the dashboard and GCS.
    template:
      spec:
        containers:
          - name: ray-head
            image: <AWS_ACCOUNT_ID>.dkr.ecr.us-east-2.amazonaws.com/rayptl_pretrain_llama3.1:latest # Replace AWS Account ID
            imagePullPolicy: Always # Pull the latest image each time
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh", "-c", "ray stop"] # Graceful shutdown of Ray processes
            ports:
              - containerPort: 8265
                name: dashboard # Expose Ray dashboard
              - containerPort: 6379
                name: redis # Expose Redis port
              - containerPort: 10001
                name: object-manager # Expose object manager port
            resources:
              requests:
                cpu: 6
                memory: 30Gi
            volumeMounts:
              - mountPath: /tmp/ray
                name: log-volume # Mount for Ray logs
              - name: persistent-storage # Mount shared filesystem (FSx for Lustre)
                mountPath: /shared
        # Node Selector for Karpenter
        # Karpenter will provision this head pod on a node with the specified labels.
        nodeSelector:
          instanceType: mixed-x86
          provisionerType: Karpenter
        volumes:
          - name: log-volume
            emptyDir: {}
          - name: persistent-storage
            persistentVolumeClaim:
              claimName: fsx-claim # Reference the PVC for shared storage
    rayStartParams:
      dashboard-host: 0.0.0.0 # Make dashboard accessible
```
Review comment (on rayStartParams): please set …
```yaml
  workerGroupSpecs:
    # Worker Node Configuration
    # This section defines the specification for the Ray worker pods.
    # Worker nodes execute tasks and participate in distributed training.
    - groupName: workergroup
      replicas: 16 # Number of worker replicas
      minReplicas: 16 # Minimum number of worker replicas
      maxReplicas: 16 # Maximum number of worker replicas (no scaling in this case)
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: <AWS_ACCOUNT_ID>.dkr.ecr.us-east-2.amazonaws.com/rayptl_pretrain_llama3.1:latest # Replace AWS Account ID
              imagePullPolicy: Always # Pull the latest image each time
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              ports:
                - containerPort: 8265
                  name: dashboard
                - containerPort: 6379
                  name: redis
                - containerPort: 10001
                  name: object-manager
              resources:
                limits:
                  aws.amazon.com/neuron: '16' # Request AWS Neuron cores
                  vpc.amazonaws.com/efa: '8' # Request AWS EFA devices
                  memory: 440Gi
                requests:
                  aws.amazon.com/neuron: '16'
                  vpc.amazonaws.com/efa: '8'
                  cpu: '120'
                  memory: 440Gi
              volumeMounts:
                - name: persistent-storage
                  mountPath: /shared # Mount shared filesystem (FSx for Lustre)
                - name: dshm
                  mountPath: /dev/shm # Mount for shared memory
                - mountPath: /tmp/ray
                  name: log-volume # Mount for Ray logs
          # Node Selector for Managed Node Group (with Cluster Autoscaler)
          # These workers will run on Trn1 instances provisioned by the cluster autoscaler.
          # This is necessary as Karpenter doesn't currently support EFA (required for Neuron distributed training).
```
Review comment: Is this true? See aws/karpenter-provider-aws#5068; I think you can request the resource. (A NodePool sketch responding to this appears after this file.)
```yaml
          nodeSelector:
            instance-type: trn1-32xl
            provisioner: cluster-autoscaler

          # Tolerations for Trn1 and Dedicated Nodes
          tolerations:
            - key: "aws.amazon.com/neuron"
              operator: "Exists"
              effect: "NoSchedule"
            - key: "hub.jupyter.org/dedicated"
              operator: "Equal"
              value: "user"
              effect: "NoSchedule"
```

Review comment (on the hub.jupyter.org/dedicated toleration): If we are trying to use the jupyter taints to keep other pods off of jupyter nodes, we shouldn't add a toleration for it.
```yaml
          volumes:
            # Persistent Volume Claim (PVC) to access the FSx for Lustre filesystem
            - name: persistent-storage
              persistentVolumeClaim:
                claimName: fsx-claim
            - name: dshm
              emptyDir:
                medium: Memory
            - name: log-volume
              emptyDir: {}
```
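Not part of the diff, but a quick way to verify that the queue and the cluster come up, assuming kubectl is pointed at the EKS cluster and the manifest above has been applied:

```bash
# Check the Volcano queue, the RayCluster, and its pods
kubectl get queue llama3-training-queue
kubectl get raycluster kuberay-trn1 -n default
kubectl get pods -n default -l ray.io/cluster=kuberay-trn1
```

On the Karpenter review comment above: a minimal sketch of a Trainium NodePool, assuming Karpenter's v1 NodePool API and an existing EC2NodeClass named trainium (both names are hypothetical). Per aws/karpenter-provider-aws#5068, recent Karpenter releases can provision nodes for pods that request vpc.amazonaws.com/efa, so the cluster-autoscaler node group may not be strictly required:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: trainium # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: trainium # assumed to exist already
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["trn1.32xlarge"]
      taints:
        - key: aws.amazon.com/neuron # matches the toleration on the worker pods
          effect: NoSchedule
```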
New file (+34 lines): RayJob manifest that downloads and tokenizes the FineWeb dataset
```yaml
# ----------------------------------------------------------------------------
# RayJob: llama3-generate-pretraining-test-data
#
# Description:
# This RayJob is responsible for generating pre-training test data required for
# the Llama3 model training. It sources data from the specified dataset, processes
# it, and prepares it for use in subsequent training stages. The job runs a Python
# script (`get_dataset.py`) that performs these data preparation steps.

# Usage:
# Apply this configuration to your Kubernetes cluster using `kubectl apply -f download_fineweb_dataset.yaml`.
# Ensure that the Ray cluster (`kuberay-trn1`) is running and accessible in the specified namespace.
# ----------------------------------------------------------------------------

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: grammarly-generate-pretraining-test-data
  namespace: default
spec:
  submissionMode: K8sJobMode
  entrypoint: "mkdir -p /shared/llama3.1_config/ /shared/fineweb_llama3.1_tokenized /shared/checkpoint /shared/tblogs && cp config.json /shared/llama3.1_config/ && python3 get_dataset.py"
  runtimeEnvYAML: |
    working_dir: /grammarly_pretraining
    env_vars:
      PYTHONUNBUFFERED: '0'
  resources:
    requests:
      cpu: "6"
      memory: "30Gi"
  clusterSelector:
    ray.io/cluster: kuberay-trn1
    rayClusterNamespace: default # Replace with the namespace where your RayCluster is deployed
  ttlSecondsAfterFinished: 60 # Time to live for the pod after completion (in seconds)
```
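Not part of the diff: a quick way to submit and watch this job. The jobStatus field assumes a recent KubeRay CRD, and K8sJobMode creating a submitter Job with the RayJob's name is my assumption:

```bash
kubectl apply -f download_fineweb_dataset.yaml
# Poll the RayJob status
kubectl get rayjob grammarly-generate-pretraining-test-data -n default \
    -o jsonpath='{.status.jobStatus}{"\n"}'
# Tail the submitter Job's logs (name assumed to match the RayJob)
kubectl logs -n default job/grammarly-generate-pretraining-test-data -f
```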
New file (+28 lines): RayJob manifest for the Neuron parallel-compile step
```yaml
# ----------------------------------------------------------------------------
# RayJob: llama3.1-parallel-compile-job
#
# Description:
# This RayJob runs the Neuron parallel-compile step for the Llama3 model.
# This step is a prerequisite for training the language model with the prepared dataset.

# Usage:
# Apply this configuration to your Kubernetes cluster using `kubectl apply -f 3-parallel-compile-trn1-rayjob.yaml`.
# Ensure that the Ray cluster (`kuberay-trn1`) is running and accessible in the specified namespace.
# Uncomment the fields if you want the job to shut down after finishing or if you want to set a maximum runtime.
# ----------------------------------------------------------------------------

---
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: llama3.1-parallel-compile-job
spec:
  submissionMode: K8sJobMode
  entrypoint: "NEURON_NUM_DEVICES=32 bash run_llama3.1_8b.sh -r 2 -n 16 -l 4e-4 -s 8192 -p 1"
```
Review comment: It might be worth extracting these variables to env vars; that way you can do some hyperparameter tuning, or otherwise update them in one place by mounting them directly from a config file. Just a thought. (A sketch follows this file.)
```yaml
  runtimeEnvYAML: |
    working_dir: /grammarly_pretraining
  clusterSelector:
    ray.io/cluster: kuberay-trn1
    rayClusterNamespace: default # Replace with the namespace where your RayCluster is deployed
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 60 # Time to live for the pod after completion (in seconds)
```
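A sketch of the reviewer's env-var suggestion above. The variable names (STEPS, NODES, LR, SEQ_LEN) are hypothetical; Ray applies runtime-env env_vars to the job driver, so the shell entrypoint can expand them, and the hyperparameters then live in one editable place:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: llama3.1-parallel-compile-job
spec:
  submissionMode: K8sJobMode
  # Flags are read from env vars instead of being baked into the command line
  entrypoint: "NEURON_NUM_DEVICES=32 bash run_llama3.1_8b.sh -r $STEPS -n $NODES -l $LR -s $SEQ_LEN -p 1"
  runtimeEnvYAML: |
    working_dir: /grammarly_pretraining
    env_vars:
      STEPS: '2'       # hypothetical names mapped to the -r/-n/-l/-s flags
      NODES: '16'
      LR: '4e-4'
      SEQ_LEN: '8192'
  clusterSelector:
    ray.io/cluster: kuberay-trn1
  shutdownAfterJobFinishes: true
```

The same pattern would apply to the training job below; the values could then be patched per experiment with kustomize or generated from a config file.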
New file (+29 lines): RayJob manifest for the main pretraining step
```yaml
# ----------------------------------------------------------------------------
# RayJob: llama3.1-training-job
#
# Description:
# This RayJob is responsible for the main pretraining step of the Llama3 model. It runs a
# Python script (`ray_train_llama3.py`) to perform the pretraining using AWS Neuron devices.
# This step is critical for training the language model with the prepared dataset.

# Usage:
# Apply this configuration to your Kubernetes cluster using `kubectl apply -f 4-grammarly-train-trn1-rayjob.yaml`.
# Ensure that the Ray cluster (`kuberay-trn1`) is running and accessible in the specified namespace.
# Uncomment the fields if you want the job to shut down after finishing or if you want to set a maximum runtime.
# ----------------------------------------------------------------------------

---
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: llama3.1-training-job
spec:
  submissionMode: K8sJobMode
  entrypoint: "NEURON_NUM_DEVICES=32 bash run_llama3.1_8b.sh -w 500 -n 16 -l 4e-4 -s 8192"
```
Review comment: same here with the envvars (the sketch after the previous file applies here as well).
```yaml
  runtimeEnvYAML: |
    working_dir: /grammarly_pretraining
  clusterSelector:
    ray.io/cluster: kuberay-trn1
    rayClusterNamespace: default # Replace with the namespace where your RayCluster is deployed
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 60 # Time to live for the pod after completion (in seconds)
```
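Not part of the diff: once the training job is submitted, the Ray dashboard can be watched through the head service (KubeRay names it &lt;cluster&gt;-head-svc by convention):

```bash
# Forward the Ray dashboard locally, then browse http://localhost:8265
kubectl port-forward -n default svc/kuberay-trn1-head-svc 8265:8265
```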
New file (+34 lines): Dockerfile for the training image
```dockerfile
ARG REGION

# Base image: PyTorch training image for NeuronX
FROM public.ecr.aws/neuron/pytorch-training-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04

# Install Ray for distributed computing
RUN pip3 install aiohttp \
    && rm -rf /root/.cache
```

Review comment: Instead of running multiple pip installs as separate layers, please condense them into a lower layer. You can separate out dependencies you may want to update more frequently so you don't have to invalidate the entire layer, but we can probably be a little cleaner here. (A consolidated sketch follows this file.)
```dockerfile
# Install additional Python dependencies
RUN pip3 install wget awscli regex boto3 pyarrow \
    && rm -rf /root/.cache/

# Copy the Llama3.1 training code into the container
# (Separate layer to rebuild only if the code changes)
COPY ./llama3.1_pretraining /llama3.1_pretraining
```

Review comment: This can be mounted as a configmap, which makes this much faster to iterate on. Then you also don't have to recreate your image each time and can reuse it across jobs. (A sketch follows this file.)
```dockerfile
# Make shell scripts executable
RUN chmod +x /llama3.1_pretraining/run_llama3.1_8b.sh

# Set the working directory
WORKDIR /llama3.1_pretraining

# Install the requirements
RUN pip install -r requirements.txt --extra-index-url https://pip.repos.neuron.amazonaws.com

# (Optionally pin transformers)
# RUN pip install transformers==4.32.1 --no-warn-conflicts

# Install neuronx-cc 2.0+
RUN pip install neuronx-cc==2.* --extra-index-url https://pip.repos.neuron.amazonaws.com -U

# Replace libneuronxla with torch_neuronx for neuron parallel compile in the Ray config file
COPY ./config.py /usr/local/lib/python3.10/site-packages/ray/train/torch/xla/config.py
```
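A sketch of the reviewer's layer-consolidation suggestion, folding the standalone pip installs above into a single layer (the package list is copied from the existing RUN lines; requirements.txt still needs its own layer after the COPY):

```dockerfile
# One dependency layer: a single pip invocation and one cache cleanup.
# Slow-changing pins belong here; frequently-updated ones can live in a later layer.
RUN pip3 install aiohttp wget awscli regex boto3 pyarrow \
        neuronx-cc==2.* --extra-index-url https://pip.repos.neuron.amazonaws.com \
    && rm -rf /root/.cache
```

And a sketch of the ConfigMap idea for the training code, as a fragment of the worker pod template in the RayCluster manifest. The ConfigMap name and creation command are assumptions; note that ConfigMaps are capped at roughly 1 MiB and `kubectl create configmap --from-file` flattens subdirectories, so this suits small script sets:

```yaml
# Created beforehand with, e.g.:
#   kubectl create configmap llama31-pretraining-code --from-file=llama3.1_pretraining/
volumes:
  - name: training-code
    configMap:
      name: llama31-pretraining-code # hypothetical name
      defaultMode: 0755 # keeps run_llama3.1_8b.sh executable
containers:
  - name: ray-worker
    volumeMounts:
      - name: training-code
        mountPath: /llama3.1_pretraining # shadows the copy baked into the image
```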
New file (+83 lines): Build-and-push script for the training image on Amazon ECR
```bash
#!/bin/bash

# This script automates the process of building a multi-architecture Docker image
# and pushing it to an Amazon ECR (Elastic Container Registry) repository.
# The script performs the following steps:
#   1. Ensures the script is run on an x86_64-based instance.
#   2. Checks if Docker is installed and running.
#   3. Verifies that the AWS CLI is installed and configured.
#   4. Prompts the user for the desired AWS region.
#   5. Checks if the specified ECR repository exists, and creates it if it does not.
#   6. Logs in to Amazon ECR.
#   7. Creates a Docker Buildx builder instance to support multi-architecture builds.
#   8. Builds and pushes a multi-architecture Docker image to the specified ECR repository.

# Note: It is preferable to use an AWS Cloud9 IDE with at least 100GB of storage for creating
# this image, to avoid storage issues during the build process. You can use your local Mac or
# Windows machine, but ensure that Docker has enough memory and storage allocated to build it.

# Replace with your desired repository name
ECR_REPO_NAME="rayptl_pretrain_llama3.1"

# # Check that we are running on an x86_64 instance to avoid issues with docker build
# arch=$(uname -m)
# if [[ ! "$arch" = "x86_64" ]]; then
#     echo "Error: please run this script on an x86_64-based instance"
#     exit 1
# fi

# Check if docker is installed
if ! command -v docker >/dev/null 2>&1; then
    echo "Error: please install docker and try again. ex: for AL2023 you can run:"
    echo "    sudo yum install docker -y"
    echo "    sudo systemctl start docker"
    echo "    sudo usermod -aG docker ec2-user"
    echo "    newgrp docker"
    exit 1
fi

# Check that the AWS CLI is installed and configured
if ! aws sts get-caller-identity >/dev/null 2>&1; then
    echo "Error: please make sure that the AWS CLI is installed and configured using 'aws configure'."
    exit 1
fi

# Prompt user for the desired region
read -p "Enter the ECR region (ex: us-east-2): " region
echo "$region" > .eks_region

# Check if the ECR repository exists
if aws ecr describe-repositories --repository-names "$ECR_REPO_NAME" --region "$region" >/dev/null 2>&1; then
    echo "ECR repository '$ECR_REPO_NAME' already exists."

    # Get the ECR_REPO_URI for the existing repository
    ECR_REPO_URI=$(aws ecr describe-repositories --repository-names "$ECR_REPO_NAME" --query 'repositories[0].repositoryUri' --region "$region" --output text)
    echo "Repository URL: $ECR_REPO_URI"
else
    # Create the ECR repository
    aws ecr create-repository --repository-name "$ECR_REPO_NAME" --region "$region"

    # Get the ECR_REPO_URI for the newly created repository
    ECR_REPO_URI=$(aws ecr describe-repositories --repository-names "$ECR_REPO_NAME" --query 'repositories[0].repositoryUri' --region "$region" --output text)
    echo "ECR repository '$ECR_REPO_NAME' created successfully."
    echo "Repository URL: $ECR_REPO_URI"
fi

# Store the ECR repo URI for later use
echo "$ECR_REPO_URI" > .ecr_repo_uri

# Log in to ECR: the target repository and the registry hosting the Neuron base images
echo -e "\nLogging in to ECR"
aws ecr get-login-password --region "$region" | docker login --username AWS --password-stdin "$ECR_REPO_URI"
aws ecr get-login-password --region "$region" | docker login --username AWS --password-stdin "763104351884.dkr.ecr.${region}.amazonaws.com"
```
Review comment: You have the region here set dynamically, but it is hardcoded in the … (One way to template it is sketched after this file.)
```bash
# Create and use a new builder instance for multi-arch builds
docker buildx create --use --name mybuilder --driver docker-container
docker buildx inspect mybuilder --bootstrap

echo -e "\nBuilding the ${ECR_REPO_NAME} docker image" \
    && docker buildx build --platform linux/amd64 -t "$ECR_REPO_URI:latest" --build-arg REGION="$region" . --push \
    && echo -e "\nImage successfully pushed to ECR"
```
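Not part of the diff: one way to act on the region comment above is to template the manifests from the values the script already writes out. The manifest filename below is a placeholder, not a file from this PR:

```bash
# Substitute the pushed image URI into the RayCluster manifest so the
# account and region are no longer hardcoded there.
ECR_REPO_URI=$(cat .ecr_repo_uri)
sed -i "s|<AWS_ACCOUNT_ID>.dkr.ecr.us-east-2.amazonaws.com/rayptl_pretrain_llama3.1|${ECR_REPO_URI}|g" \
    raycluster.yaml # placeholder filename for the RayCluster manifest above
```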
Review comment: Are these necessary? The pod won't land on a non-CPU node due to taints, and the keys/values may be different in other deployments.