-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Added Llama2 training examples to trainium-inferentia module #388
Conversation
simplified docker build for neuronx-nemo-megatron container added scripts for cli pod launch, precompilation, training added script for tensorboard deployment
Llama updates
bug fix - always store ecr repo uri
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a few small suggestions, but aside from that, it looks great! 👍🏼
ai-ml/trainium-inferentia/main.tf
Outdated
data "external" "eks_azs" { | ||
program = ["bash", "${path.module}/get_eks_azs.sh"] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the inclusion of this shell script essential? While it's clear that az3
and az4
in us-west-2
are required, this setup might not hold for different regions, causing users to adjust the script accordingly.
My suggestion is to introduce a new azs
variable immediately after the region variable, where you can directly specify the availability zones for the Trn1 instances.
To guide users, we should detail this requirement on our website and in the blog post, emphasizing the need to update the availability zones in tandem with the region within the variables prior to deploying the solution.
Incorporating a link to the Trn1
node availability regions in our documentation would be beneficial for users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exporting the region variable and running that script as part of terraform installation is ideal than asking users to specify Availability zones.
ai-ml/trainium-inferentia/main.tf
Outdated
@@ -38,8 +46,8 @@ data "aws_ecrpublic_authorization_token" "token" { | |||
provider = aws.ecr | |||
} | |||
|
|||
locals { | |||
name = var.name | |||
/* locals { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we remove the commented lines from the code?
ai-ml/trainium-inferentia/outputs.tf
Outdated
/* output "configure_kubectl" { | ||
description = "Configure kubectl: make sure you're logged in with the correct AWS profile and run the following command to update your kubeconfig" | ||
value = "aws eks --region ${var.region} update-kubeconfig --name ${var.name}" | ||
} */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the reason for this change?
Could we remove the commented lines from the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was having issues using hardcoded eks cluster name. If I had to deploy another set of cluster I had issues because KMS would not allow creation of another AWS managed key with the same name. That is why I have used random string to the cluster name thats hardcoded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove the commented lines
@@ -1,6 +1,6 @@ | |||
variable "name" { | |||
description = "Name of the VPC and EKS Cluster" | |||
default = "trainium-inferentia" | |||
default = "tr-inf" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When making code alterations, particularly regarding renaming the name
, it's crucial to verify that these changes do not disrupt existing docs or blogs. Please ensure you check the current blueprint document for any dependencies before proceeding. you can search the name to find the trainium-inferentia
to find the occurances.
For example. you need to change this doc to reflect the new name https://github.com/awslabs/data-on-eks/blob/main/website/docs/gen-ai/inference/Llama2.md
and update this doc as well https://github.com/awslabs/data-on-eks/blob/main/website/docs/blueprints/ai-ml/trainium.md by replacing the trainium
cluster name with new name
Should this be trn1-inf2
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will update the name to reflect trn1-inf2 and also update the name in the documentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: Replace [cluster-name] with your actual EKS cluster name
Added this to the documentation because we cannot hardcode cluster name in the docs as we have added random string to the cluster name
Verify the Amazon EKS Cluster | ||
|
||
```bash | ||
aws eks --region us-west-2 describe-cluster --name <cluster-name> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change the cluster-name to the actual name defined in the varaibles
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I cannot use the same name is I am appending a random string as discussed above
|
||
```bash | ||
# Creates k8s config file to authenticate with EKS | ||
aws eks --region us-west-2 update-kubeconfig --name <cluster-name> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
…er's chosen region
ai-ml/trainium-inferentia/main.tf
Outdated
@@ -39,7 +47,7 @@ data "aws_ecrpublic_authorization_token" "token" { | |||
} | |||
|
|||
locals { | |||
name = var.name | |||
name = "${var.name}-${random_string.this.result}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this and put the name back to var.name
and keep the name as trainium-inferentia
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
ai-ml/trainium-inferentia/main.tf
Outdated
resource "random_string" "this" { | ||
length = 5 | ||
special = false | ||
upper = false | ||
lower = true | ||
numeric = true | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the random string
@@ -0,0 +1,43 @@ | |||
#!/bin/bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed this file
@@ -1,10 +1,11 @@ | |||
variable "name" { | |||
description = "Name of the VPC and EKS Cluster" | |||
default = "trainium-inferentia" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep the same name is its being used in multiple blueprints
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted
default = 0 | ||
} | ||
|
||
variable "trn1_32xl_max_size" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove all the max variables and this can be hardcoded
default = 0 | ||
} | ||
|
||
variable "inf2_24xl_max_size" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove all the max variables and this can be hardcoded
default = 0 | ||
} | ||
|
||
variable "inf2_48xl_max_size" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove all the max variables and this can be hardcoded
```bash | ||
aws eks --region us-west-2 describe-cluster --name trainium-inferentia |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put this change back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted
``` | ||
|
||
```bash | ||
# Creates k8s config file to authenticate with EKS | ||
aws eks --region us-west-2 update-kubeconfig --name trainium-inferentia |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put this change back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted
@@ -148,7 +150,7 @@ Users can also modify the Dockerfile to suit their specific requirements and pus | |||
|
|||
**Ensure the cluster is configured locally** | |||
```bash | |||
aws eks --region us-west-2 update-kubeconfig --name trainium-inferentia |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put this change back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍🏼
What does this PR do?
Showcases how to run distributed training on Trainium using Llama2
🛑 Please open an issue first to discuss any significant work and flesh out details/direction - we would hate for your time to be wasted.
Consult the CONTRIBUTING guide for submitting pull-requests.
Motivation
Llama2 distributed pre-training example using Trainium on EKS
More
website/docs
orwebsite/blog
section for this featurepre-commit run -a
with this PR. Link for installing pre-commit locallyFor Moderators
Additional Notes