feat: Added Llama2 training examples to trainium-inferentia module #388

sanjeevrg89 · 2023-12-17T16:36:59Z

What does this PR do?

Showcases how to run distributed training on Trainium using Llama2


Added a random suffix to the EKS cluster name instead of the hard-coded trainium-inferentia. Reason: it is hard to run multiple EKS clusters with the same name, and cleanup is challenging
Added llama2 distributed training examples (Scott Perry and I worked on this module)
Automated all the llama2 training examples using shell scripts
Automated the prerequisite steps with a shell script, since we ask users to run the Llama2 example from an EC2 instance or Cloud9
Modified Availability zones script under main.tf to properly pick trn1 instances
Added MD file for Llama2 distributed training on Trainium
Modified outputs.tf, variables.tf, eks.tf and main.tf
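
As a rough illustration of the kind of tool check a prerequisite script can perform, here is a minimal sketch. It is not the script added in this PR; the function name `missing_tools` is hypothetical, and of the tools named in the usage note only `jq` is mentioned in the commit history, the rest are assumptions.

```bash
# Hypothetical helper: print each named tool that is not found on PATH,
# one per line. Prints nothing when every tool is available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

A prerequisite script could call `missing_tools jq aws kubectl terraform docker` and stop with a message if the output is non-empty.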

Motivation

Llama2 distributed pre-training example using Trainium on EKS

More

[x] Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
[x] Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
[x] Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
[x] Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

E2E Test successfully complete before merge?

Additional Notes

simplified docker build for neuronx-nemo-megatron container
added scripts for cli pod launch, precompilation, training
added script for tensorboard deployment

Llama updates

bug fix - always store ecr repo uri

vara-bonthu

I've added a few small suggestions, but aside from that, it looks great! 👍🏼

ai-ml/trainium-inferentia/eks.tf

vara-bonthu · 2023-12-21T09:27:31Z

ai-ml/trainium-inferentia/main.tf

+data "external" "eks_azs" {
+  program = ["bash", "${path.module}/get_eks_azs.sh"]
+}


Is the inclusion of this shell script essential? While it's clear that az3 and az4 in us-west-2 are required, this setup might not hold for different regions, causing users to adjust the script accordingly.

My suggestion is to introduce a new azs variable immediately after the region variable, where you can directly specify the availability zones for the Trn1 instances.

To guide users, we should detail this requirement on our website and in the blog post, emphasizing the need to update the availability zones in tandem with the region within the variables prior to deploying the solution.

Incorporating a link to the Trn1 node availability regions in our documentation would be beneficial for users.

Exporting the region variable and running that script as part of the Terraform installation is preferable to asking users to specify availability zones.
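
To make the script-based approach concrete, here is a minimal sketch of a region-aware AZ lookup. It is not the `get_eks_azs.sh` from this PR, whose contents are not shown here. The function names `lookup_azs` and `azs_to_json` and the `azs` key are hypothetical. The sketch relies on two facts: `aws ec2 describe-instance-type-offerings` can list the AZs that offer an instance type, and Terraform's external data source expects the program to print a flat JSON object of string values on stdout.

```bash
# Hypothetical sketch; not the get_eks_azs.sh from this PR.

# Query EC2 for the AZs in a region that offer a given instance type.
# Prints the AZ names separated by whitespace.
lookup_azs() {
  region="$1"
  instance_type="$2"
  aws ec2 describe-instance-type-offerings \
    --region "$region" \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=${instance_type}" \
    --query 'InstanceTypeOfferings[].Location' \
    --output text
}

# Format AZ names as the flat JSON object of string values that
# Terraform's external data source requires. The names are sorted and
# joined with commas under the (assumed) key "azs".
azs_to_json() {
  joined=$(printf '%s\n' "$@" | sort | paste -sd, -)
  printf '{"azs":"%s"}\n' "$joined"
}
```

A wrapper could run `azs_to_json $(lookup_azs "$AWS_REGION" trn1.32xlarge)`, and the Terraform side could split the `azs` string on commas. The trade-off raised in the review still applies: this adds a runtime dependency on the AWS CLI, whereas an explicit variable does not.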

vara-bonthu · 2023-12-21T09:28:37Z

ai-ml/trainium-inferentia/main.tf

@@ -38,8 +46,8 @@ data "aws_ecrpublic_authorization_token" "token" {
  provider = aws.ecr
 }

-locals {
-  name   = var.name
+/* locals {


Could we remove the commented lines from the code?

vara-bonthu · 2023-12-21T09:29:31Z

ai-ml/trainium-inferentia/outputs.tf

+/* output "configure_kubectl" {
  description = "Configure kubectl: make sure you're logged in with the correct AWS profile and run the following command to update your kubeconfig"
  value       = "aws eks --region ${var.region} update-kubeconfig --name ${var.name}"
+} */


What is the reason for this change?
Could we remove the commented lines from the code?

I was having issues with the hardcoded EKS cluster name. When I tried to deploy another cluster, KMS would not allow the creation of another AWS managed key with the same name. That is why I appended a random string to the hardcoded cluster name.

I will remove the commented lines

vara-bonthu · 2023-12-21T09:38:17Z

ai-ml/trainium-inferentia/variables.tf

@@ -1,6 +1,6 @@
 variable "name" {
  description = "Name of the VPC and EKS Cluster"
-  default     = "trainium-inferentia"
+  default     = "tr-inf"


When making code changes, particularly renaming the cluster, it's crucial to verify that they do not break existing docs or blogs. Please check the current blueprint documents for any dependencies before proceeding. You can search for trainium-inferentia to find the occurrences.

For example, you need to change this doc to reflect the new name: https://github.com/awslabs/data-on-eks/blob/main/website/docs/gen-ai/inference/Llama2.md

Also update this doc, https://github.com/awslabs/data-on-eks/blob/main/website/docs/blueprints/ai-ml/trainium.md, by replacing the trainium cluster name with the new name.

Should this be trn1-inf2 instead?

I will update the name to reflect trn1-inf2 and also update the name in the documentation

NOTE: Replace [cluster-name] with your actual EKS cluster name

I added this note to the documentation because we cannot hardcode the cluster name in the docs now that a random string is appended to it.
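
With a random suffix appended, the full cluster name is only known after deployment. One way a user could recover it is sketched below. The helper names `match_cluster` and `list_clusters` are hypothetical and not part of the PR; `aws eks list-clusters` is the standard CLI call for listing cluster names.

```bash
# Hypothetical helper: print every name from the argument list that
# starts with "<prefix>-", one per line.
match_cluster() {
  prefix="$1"
  shift
  for name in "$@"; do
    case "$name" in
      "$prefix"-*) printf '%s\n' "$name" ;;
    esac
  done
}

# List the EKS cluster names in a region, separated by whitespace.
list_clusters() {
  aws eks list-clusters --region "$1" --query 'clusters[]' --output text
}
```

For example, `match_cluster trn1-inf2 $(list_clusters us-west-2)` would print the suffixed name, which can then be passed to `aws eks update-kubeconfig --name`. If several clusters share the prefix, all of them are printed and the user has to pick one.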

vara-bonthu · 2023-12-21T09:41:40Z

website/docs/gen-ai/training/Llama2.md

+Verify the Amazon EKS Cluster
+
+```bash
+aws eks --region us-west-2 describe-cluster --name <cluster-name>


Change the cluster-name to the actual name defined in the variables.

I cannot use the same name because I am appending a random string, as discussed above.

vara-bonthu · 2023-12-21T09:41:48Z

website/docs/gen-ai/training/Llama2.md

+
+```bash
+# Creates k8s config file to authenticate with EKS
+aws eks --region us-west-2 update-kubeconfig --name <cluster-name>


same as above


vara-bonthu · 2024-01-19T15:40:16Z

ai-ml/trainium-inferentia/main.tf

@@ -39,7 +47,7 @@ data "aws_ecrpublic_authorization_token" "token" {
 }

 locals {
-  name   = var.name
+  name   = "${var.name}-${random_string.this.result}"


Remove this, set the name back to var.name, and keep the name as trainium-inferentia.

vara-bonthu · 2024-01-19T15:40:23Z

ai-ml/trainium-inferentia/main.tf

+resource "random_string" "this" {
+  length  = 5
+  special = false
+  upper   = false
+  lower   = true
+  numeric = true
+}
+


remove this one

removed the random string

vara-bonthu · 2024-01-19T15:40:30Z

ai-ml/trainium-inferentia/get_eks_azs.sh

@@ -0,0 +1,43 @@
+#!/bin/bash


remove this file

removed this file

vara-bonthu · 2024-01-19T15:43:30Z

ai-ml/trainium-inferentia/variables.tf

@@ -1,10 +1,11 @@
 variable "name" {
  description = "Name of the VPC and EKS Cluster"
-  default     = "trainium-inferentia"


Keep the same name, as it's being used in multiple blueprints.

vara-bonthu · 2024-01-19T15:44:59Z

ai-ml/trainium-inferentia/variables.tf

+  default = 0
+}
+
+variable "trn1_32xl_max_size" {


Remove all the max variables; these values can be hardcoded.

vara-bonthu · 2024-01-19T15:48:56Z

ai-ml/trainium-inferentia/variables.tf

+  default = 0
+}
+
+variable "inf2_24xl_max_size" {


Remove all the max variables; these values can be hardcoded.

vara-bonthu · 2024-01-19T15:49:05Z

ai-ml/trainium-inferentia/variables.tf

+  default = 0
+}
+
+variable "inf2_48xl_max_size" {


Remove all the max variables; these values can be hardcoded.

vara-bonthu · 2024-01-19T15:49:30Z

website/docs/gen-ai/inference/Llama2.md

 ```bash
-aws eks --region us-west-2 describe-cluster --name trainium-inferentia


put this change back

vara-bonthu · 2024-01-19T15:49:37Z

website/docs/gen-ai/inference/Llama2.md

 ```

 ```bash
 # Creates k8s config file to authenticate with EKS
-aws eks --region us-west-2 update-kubeconfig --name trainium-inferentia


put this change back

vara-bonthu · 2024-01-19T15:49:41Z

website/docs/gen-ai/inference/Llama2.md

@@ -148,7 +150,7 @@ Users can also modify the Dockerfile to suit their specific requirements and pus

 **Ensure the cluster is configured locally**
 ```bash
-aws eks --region us-west-2 update-kubeconfig --name trainium-inferentia


put this change back

vara-bonthu

LGTM 👍🏼

sanjeevrg89 and others added 30 commits October 31, 2023 13:58

MPI operator code for distributed training

1457d05

Making MPI operator optional for users

0a46f3a

added type string to mpi operator variable version

fbd3ba9

Merge branch 'awslabs:main' into main

ed0d72a

Merge branch 'awslabs:main' into main

017a0ad

llama2 examples

c415097

llama2 pretraining updates

08634ed

simplified docker build for neuronx-nemo-megatron container added scripts for cli pod launch, precompilation, training added script for tensorboard deployment

fix typo

fe66508

Merge pull request #1 from 5cp/llama_updates

ff51584

Llama updates

install pre-req script

74cbfd2

more tools to prereq shell script

de1d173

addtional tooling

92a8873

addtional tooling python

3cde526

AZ fix

79b2b6d

added jq

b1f8343

added tool checks

a8cdf82

get az script update

30b259b

az code fix

f5dbc09

az code fix

407fa49

fix az script

0c35b43

fix az script json output

289bbfd

bug fix - always store ecr repo uri

53cb92e

Merge pull request #2 from 5cp/llama_updates

feaf9db

bug fix - always store ecr repo uri

eks and main code changes

080abb5

llama2 trainium doc

401e177

initial doc updates

e57cb42

more llama doc updates

2f85081

more updates

9b76684

more updates

1e5e8da

add subheadings to docs

7b5ac67

sanjeevrg89 temporarily deployed to DoEKS Test December 17, 2023 16:37 — with GitHub Actions Inactive

sanjeevrg89 changed the title ~~Added Llama2 training examples to trainium-inferentia module~~ feat: Added Llama2 training examples to trainium-inferentia module Dec 17, 2023

missing img folder

a5f9d5b

sanjeevrg89 temporarily deployed to DoEKS Test December 19, 2023 15:15 — with GitHub Actions Inactive

vara-bonthu reviewed Dec 21, 2023


PR review requested changes

ae30478

sanjeevrg89 temporarily deployed to DoEKS Test January 2, 2024 20:14 — with GitHub Actions Inactive

Automatically select appropriate trn1/inf2-supporting AZs based on us…

3c2b71f

…er's chosen region

5cp temporarily deployed to DoEKS Test January 4, 2024 15:34 — with GitHub Actions Inactive

added variables for trn1 and inf2 instance sizes

700d5e6

sanjeevrg89 temporarily deployed to DoEKS Test January 4, 2024 16:55 — with GitHub Actions Inactive

redo instance size variables for inf2 and trn1n

4ede8eb

sanjeevrg89 temporarily deployed to DoEKS Test January 4, 2024 18:30 — with GitHub Actions Inactive

instance size variables fix

ecbe68a

sanjeevrg89 temporarily deployed to DoEKS Test January 4, 2024 20:10 — with GitHub Actions Inactive

fix trn1 default max size setting

51ef0be

sanjeevrg89 temporarily deployed to DoEKS Test January 4, 2024 20:24 — with GitHub Actions Inactive

llama2 training doc update

b71e27f

sanjeevrg89 temporarily deployed to DoEKS Test January 4, 2024 21:03 — with GitHub Actions Inactive

code changes to map AZs

3d0d674

sanjeevrg89 temporarily deployed to DoEKS Test January 16, 2024 04:51 — with GitHub Actions Inactive

AZ fetch code changes

6eb0099

sanjeevrg89 temporarily deployed to DoEKS Test January 16, 2024 13:58 — with GitHub Actions Inactive

reverted back to original AZ implementation

49cf49a

sanjeevrg89 temporarily deployed to DoEKS Test January 17, 2024 05:22 — with GitHub Actions Inactive

vara-bonthu reviewed Jan 19, 2024


addressed latest PR reviewed changes

0620075

sanjeevrg89 temporarily deployed to DoEKS Test January 19, 2024 16:31 — with GitHub Actions Inactive

vara-bonthu approved these changes Jan 24, 2024


vara-bonthu merged commit 4097208 into awslabs:main Jan 24, 2024
50 of 52 checks passed
