feat: Added Llama2 training examples to trainium-inferentia module #388

Merged: 44 commits, Jan 24, 2024

sanjeevrg89
Contributor

What does this PR do?

Showcases how to run distributed training on Trainium using Llama2

🛑 Please open an issue first to discuss any significant work and flesh out details/direction - we would hate for your time to be wasted.
Consult the CONTRIBUTING guide for submitting pull-requests.

  1. Added a random suffix to the EKS cluster name instead of the hard-coded trainium-inferentia. Reason: it's hard to run multiple EKS clusters with the same name, and cleanup is challenging.
  2. Added Llama2 distributed training examples (Scott Perry and I worked on this module).
  3. Automated all the Llama2 training examples using shell scripts.
  4. Automated the prerequisite steps in a shell script, since we ask users to use an EC2 instance or Cloud9 for the Llama2 example.
  5. Modified the availability zones script under main.tf to properly pick trn1 instances.
  6. Added a Markdown file for Llama2 distributed training on Trainium.
  7. Modified outputs.tf, variables.tf, eks.tf, and main.tf.

Motivation

Llama2 distributed pre-training example using Trainium on EKS

More

  • [Yes] Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
  • [Yes] Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
  • [Yes] Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
  • [Yes] Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

  • E2E Test successfully complete before merge?

Additional Notes

sanjeevrg89 changed the title from "Added Llama2 training examples to trainium-inferentia module" to "feat: Added Llama2 training examples to trainium-inferentia module" on Dec 17, 2023
Collaborator

vara-bonthu left a comment

I've added a few small suggestions, but aside from that, it looks great! 👍🏼

ai-ml/trainium-inferentia/eks.tf (outdated)
Comment on lines 61 to 63
data "external" "eks_azs" {
program = ["bash", "${path.module}/get_eks_azs.sh"]
}
Collaborator

Is the inclusion of this shell script essential? While it's clear that az3 and az4 in us-west-2 are required, this setup might not hold for different regions, causing users to adjust the script accordingly.

My suggestion is to introduce a new azs variable immediately after the region variable, where you can directly specify the availability zones for the Trn1 instances.

To guide users, we should detail this requirement on our website and in the blog post, emphasizing the need to update the availability zones in tandem with the region within the variables prior to deploying the solution.

Incorporating a link to the Trn1 node availability regions in our documentation would be beneficial for users.

Contributor Author

Exporting the region variable and running that script as part of the Terraform installation is preferable to asking users to specify availability zones.
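For reference, the script-based approach discussed here could look roughly like the sketch below. It is not the get_eks_azs.sh from this PR: the function names, the `azs` key, and the trn1.32xlarge instance type in the usage line are assumptions for illustration. It relies on Terraform's external data source accepting only a flat JSON object whose values are strings, so the zone list is joined into one comma-separated string.

```shell
# Hypothetical helper functions; not the script reviewed in this PR.

# Join AZ names into the flat JSON object that Terraform's external
# data source expects (every value must be a string).
azs_to_json() {
  local joined
  joined=$(printf '%s\n' "$@" | sort | paste -sd, -)
  printf '{"azs":"%s"}\n' "$joined"
}

# Ask EC2 which availability zones in a region offer an instance type.
# Requires the AWS CLI and valid credentials.
list_azs_for_instance_type() {
  local region=$1 instance_type=$2
  aws ec2 describe-instance-type-offerings \
    --region "$region" \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=${instance_type}" \
    --query 'InstanceTypeOfferings[].Location' \
    --output text
}
```

Called as `azs_to_json $(list_azs_for_instance_type us-west-2 trn1.32xlarge)`, this prints a single JSON object that Terraform can split back into a list. A plain `azs` variable, as suggested in the review comment, avoids the AWS CLI dependency at plan time.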

@@ -38,8 +46,8 @@ data "aws_ecrpublic_authorization_token" "token" {
provider = aws.ecr
}

locals {
name = var.name
/* locals {
Collaborator

Could we remove the commented lines from the code?

Comment on lines 1 to 4
/* output "configure_kubectl" {
description = "Configure kubectl: make sure you're logged in with the correct AWS profile and run the following command to update your kubeconfig"
value = "aws eks --region ${var.region} update-kubeconfig --name ${var.name}"
} */
Collaborator

What is the reason for this change?
Could we remove the commented lines from the code?

Contributor Author

I was having issues with the hardcoded EKS cluster name. When I tried to deploy another cluster, it failed because KMS would not allow creation of another AWS managed key with the same name. That is why I appended a random string to the hardcoded cluster name.

Contributor Author

I will remove the commented lines

@@ -1,6 +1,6 @@
variable "name" {
description = "Name of the VPC and EKS Cluster"
- default = "trainium-inferentia"
+ default = "tr-inf"
Collaborator

When changing code, particularly when renaming the cluster, it's crucial to verify that the change does not break existing docs or blogs. Please check the current blueprint documents for any dependencies before proceeding. You can search for trainium-inferentia to find the occurrences.

For example, you need to change this doc to reflect the new name: https://github.com/awslabs/data-on-eks/blob/main/website/docs/gen-ai/inference/Llama2.md

Also update this doc, replacing the Trainium cluster name with the new name: https://github.com/awslabs/data-on-eks/blob/main/website/docs/blueprints/ai-ml/trainium.md

Should this be trn1-inf2 instead?
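As a rough illustration of the search described in this comment, the sketch below lists files that still mention the old name. The function name and the directory passed to it are hypothetical; it assumes a local checkout of the repository.

```shell
# List files under a directory that still mention the old cluster name.
# Both arguments are illustrative; point search_dir at your own checkout.
find_name_references() {
  local old_name=$1 search_dir=$2
  # -r: recurse, -l: print file names only, -F: match the name literally
  grep -rlF -- "$old_name" "$search_dir" | sort
}
```

For example, `find_name_references trainium-inferentia website/docs` would print each doc that needs updating, and print nothing once the rename is complete.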

Contributor Author

sanjeevrg89, Jan 2, 2024

I will update the name to reflect trn1-inf2 and also update the name in the documentation

Contributor Author

NOTE: Replace [cluster-name] with your actual EKS cluster name

Added this to the documentation because we cannot hardcode the cluster name in the docs, as we have appended a random string to the cluster name.
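With a random suffix in place, a user would need some way to discover the actual cluster name. One possible sketch, assuming the name follows a `<prefix>-<suffix>` pattern with a lowercase alphanumeric suffix (the function name is hypothetical), filters a list of cluster names by prefix:

```shell
# Read cluster names on stdin (one per line) and print those of the form
# "<prefix>-<suffix>", where the suffix is lowercase letters and digits.
filter_clusters_by_prefix() {
  local prefix=$1
  # "|| true" keeps the exit status at 0 when nothing matches
  grep -E -- "^${prefix}-[a-z0-9]+$" || true
}
```

It could be fed from the AWS CLI, for instance `aws eks list-clusters --region us-west-2 --query 'clusters[]' --output text | tr '\t' '\n' | filter_clusters_by_prefix tr-inf`.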

Verify the Amazon EKS Cluster

```bash
aws eks --region us-west-2 describe-cluster --name <cluster-name>
```
Collaborator

Change <cluster-name> to the actual name defined in the variables.

Contributor Author

The reason I cannot use the same name is that I am appending a random string, as discussed above.


```bash
# Creates k8s config file to authenticate with EKS
aws eks --region us-west-2 update-kubeconfig --name <cluster-name>
```
Collaborator

same as above

@@ -39,7 +47,7 @@ data "aws_ecrpublic_authorization_token" "token" {
}

locals {
- name = var.name
+ name = "${var.name}-${random_string.this.result}"
Collaborator

Remove this, set the name back to var.name, and keep the name as trainium-inferentia.

Contributor Author

done

Comment on lines 33 to 40
resource "random_string" "this" {
length = 5
special = false
upper = false
lower = true
numeric = true
}

Collaborator

remove this one

Contributor Author

removed the random string

@@ -0,0 +1,43 @@
#!/bin/bash
Collaborator

remove this file

Contributor Author

removed this file

@@ -1,10 +1,11 @@
variable "name" {
description = "Name of the VPC and EKS Cluster"
default = "trainium-inferentia"
Collaborator

Keep the same name, as it's being used in multiple blueprints.

Contributor Author

reverted

default = 0
}

variable "trn1_32xl_max_size" {
Collaborator

Remove all the max variables; these values can be hardcoded.

default = 0
}

variable "inf2_24xl_max_size" {
Collaborator

Remove all the max variables; these values can be hardcoded.

default = 0
}

variable "inf2_48xl_max_size" {
Collaborator

Remove all the max variables; these values can be hardcoded.

```bash
aws eks --region us-west-2 describe-cluster --name trainium-inferentia
```
Collaborator

put this change back

Contributor Author

reverted

```bash
# Creates k8s config file to authenticate with EKS
aws eks --region us-west-2 update-kubeconfig --name trainium-inferentia
```
Collaborator

put this change back

Contributor Author

reverted

@@ -148,7 +150,7 @@ Users can also modify the Dockerfile to suit their specific requirements and pus

**Ensure the cluster is configured locally**
```bash
aws eks --region us-west-2 update-kubeconfig --name trainium-inferentia
```
Collaborator

put this change back

Contributor Author

Reverted

Collaborator

vara-bonthu left a comment

LGTM 👍🏼

vara-bonthu merged commit 4097208 into awslabs:main on Jan 24, 2024
50 of 52 checks passed
3 participants