
Add Batch Optimization Scripts for Neuron Instances #500

Open
mattcjo wants to merge 24 commits into main

Conversation

@mattcjo (Contributor) commented Oct 25, 2024

This pull request introduces training and inference scripts, along with a supporting Dockerfile, used to find the optimal batch size for Neuron instances, ensuring efficient utilization of the instances' accelerators.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

mattcjo and others added 23 commits July 11, 2024 17:36
…ce to be consistent with the other test images
@mattcjo mattcjo changed the title Batch optimization neuron Add Batch Optimization Scripts for Neuron Instances Oct 25, 2024
@cartermckinnon (Member) commented Oct 25, 2024

Is this going to be used to tune our test cases? https://github.com/aws/aws-k8s-tester/tree/main/e2e2/test/cases/neuron

I'm not clear on the goal

Comment on lines +71 to +72
COPY train_bert_neuron.py /app/train_bert_neuron.py
COPY infer_bert_neuron.py /app/infer_bert_neuron.py
@ndbaker1 (Contributor) Oct 25, 2024

so this image supports inference and training for neuron? should we just put it under e2e2's images folder rather than hack?

these python scripts you could leave in /hack and then just volume mount them into the container
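
For example (a hypothetical invocation; the image name is made up, and it assumes the scripts stay in /hack, get mounted over the image's /app, and python is on the image's PATH):

```
docker run --rm -v "$PWD/hack:/app" neuron-batch-opt python /app/train_bert_neuron.py
```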

@mattcjo (Contributor, Author)

Correct. Yeah, honestly I struggled with where to put these, and someone recommended hack a couple weeks ago. The main use case right now is just to get the optimal batch size to support upcoming benchmarking efforts for our e2e tests.

I could see it evolving in the future to being automatically run when certain dependencies are updated, or as new instance types become available.

@ndbaker1 (Contributor) Oct 25, 2024

so IIUC we can use the neuron test for inference tuning, but you need an image for neuron here that supports training as well? I'm trying to decouple the test image from the optimization suite/framework.

@mattcjo (Contributor, Author)

@ndbaker1 The Dockerfile was just to make things more portable across instances as I did testing. Also, while it probably made no difference, there is slight overhead introduced by running in a container versus just a script. There are additional dependencies as well (e.g. the Neuron container runtime), which makes the optimization environment closer to the tests' runtime environment.

@mattcjo (Contributor, Author) Oct 25, 2024

@ndbaker1 @cartermckinnon Not sure I have a perfect answer for where these scripts/Dockerfile should go, but here's the full context...

  • The training and inference tests that are part of e2e2 currently have suboptimal values for their batch parameter.

  • A standard batch value is hardcoded for all of them, leaving many of the instance's GPUs underutilized.

  • A major goal moving forward is to be able to benchmark these tests on all instances, and to gain an understanding of what full peak performance looks like for each instance type.

  • These new optimization scripts target a single GPU on an instance (even if the instance has multiple GPUs) and determine the max batch size that a GPU of a given type can handle (see the sketch after this list).

  • The optimal batch value will then be used to determine the total batch size per instance (batch_size * num_gpus), enabling us to run benchmarks for each instance at full GPU utilization (like our customers would).

  • Separate training and inference scripts are needed because, depending on the "mode" of a model, more or less memory might be utilized.

  • Memory utilization differs significantly by mode because training requires large amounts of temporary state to be held in memory (as weights/parameters get updated during the training process), while inference does not (parameter values are static).

  • The scripts were containerized to more closely mirror the tests' runtime environment of running on Kubernetes.

  • A single Dockerfile was used for simplicity.
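
For concreteness, here is a minimal sketch of the search loop described above (not the PR's actual script; try_batch is a hypothetical hook that would run one training or inference step at the given batch size on a single device and report whether it fit in memory):

```python
from typing import Callable


def find_max_batch_size(try_batch: Callable[[int], bool], start: int = 1) -> int:
    """Find the largest batch size a single device can handle.

    Doubles the batch size until a step fails (e.g. runs out of memory),
    then binary-searches the gap between the last success and the first
    failure.
    """
    # Phase 1: exponential growth until the first failure.
    lo, hi = 0, start
    while try_batch(hi):
        lo, hi = hi, hi * 2
    # Phase 2: binary search between last success (lo) and first failure (hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if try_batch(mid):
            lo = mid
        else:
            hi = mid
    return lo  # largest batch size that succeeded


if __name__ == "__main__":
    # Stand-in limit for demonstration only; a real hook would attempt an
    # actual step on the device and catch out-of-memory errors.
    capacity = 512
    per_device = find_max_batch_size(lambda b: b <= capacity)
    num_devices = 4  # would come from the instance type
    print("total batch size per instance:", per_device * num_devices)
```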

@cartermckinnon (Member)

Makes sense. Can we include this script in our existing test images so we don't need a separate pipeline for it? It will be easier to set up a periodic job for this as well if it's all the same spec with a different command.

@mattcjo (Contributor, Author)

I like this; dependencies should be kept consistent anyway. Can't do this for Neuron yet, though. I'm just now noticing that the PR for Neuron BERT training/inference was closed and never merged. Will need to get that merged in first.

@mattcjo (Contributor, Author) commented Oct 25, 2024

> Is this going to be used to tune our test cases? https://github.com/aws/aws-k8s-tester/tree/main/e2e2/test/cases/neuron
>
> I'm not clear on the goal

@cartermckinnon Yes, these are used to determine optimal batch size for Neuron instances for both training and inference e2e tests. There's one for NVIDIA instances as well - #498
