Auto-detection of accelerator_type for aws_accelerators trn1_inf #37998

chappidim · 2023-08-01T23:47:34Z

Why are these changes needed?

This change adds auto-detection of AWS accelerators and sets the appropriate environment variables to designate the neuron_core(s) for each task/actor.
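As a rough illustration of the environment-variable part, the sketch below turns a list of assigned core IDs into the comma-separated `NEURON_RT_VISIBLE_CORES` value that shows up in the sample logs further down (for example `0,1`). The helper name `set_visible_cores` and its signature are hypothetical and are not taken from this PR's code.

```python
import os
from typing import Iterable, MutableMapping, Optional

# Environment variable printed by the test script in this PR.
NEURON_RT_VISIBLE_CORES = "NEURON_RT_VISIBLE_CORES"


def set_visible_cores(
    core_ids: Iterable[int],
    env: Optional[MutableMapping[str, str]] = None,
) -> str:
    """Hypothetical helper: expose only the given core IDs to the process.

    Writes a comma-separated, de-duplicated, sorted list of IDs into `env`
    (defaults to os.environ) and returns the string that was written.
    """
    target = os.environ if env is None else env
    value = ",".join(str(core_id) for core_id in sorted(set(core_ids)))
    target[NEURON_RT_VISIBLE_CORES] = value
    return value


# Use a plain dict so the demo does not modify the real environment.
fake_env = {}
print(set_visible_cores([1, 0], fake_env))  # prints 0,1
```

Passing a dict instead of `os.environ` keeps the helper easy to test; in a real worker the variable would have to be set before the Neuron runtime initializes.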

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Manual tests

Manual testing

Steps followed

Build a custom whl file with the changes
Compile and save a simple Neuron model on a trn1.2xlarge machine

# Compile a Hugging Face sentiment model for Neuron and save it to disk.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch, torch_neuronx

hf_model = "j-hartmann/emotion-english-distilroberta-base"
neuron_model = "sentiment_neuron.pt"

model = AutoModelForSequenceClassification.from_pretrained(hf_model)
tokenizer = AutoTokenizer.from_pretrained(hf_model)
# Pad to a fixed length so the traced model sees a static input shape.
example_inputs = tokenizer.encode_plus("When I eat pizza I always have an amazing time", return_tensors="pt", padding="max_length", truncation=True, max_length=128)
neuron_inputs = example_inputs["input_ids"], example_inputs["attention_mask"]
# Trace (compile) the model for Neuron, then save the compiled artifact.
nmod = torch_neuronx.trace(model, neuron_inputs)
nmod.save(neuron_model)
print(f"Saved Neuron-compiled model {neuron_model}")

Run the inference/Ray script with an Actor and a Task, where the Actor loads the compiled model on a neuron_core

import os
import ray
import time
import torch
import torch_neuronx

ray.init()

# Actor that reserves one neuron_core and runs inference with the compiled model.
@ray.remote(resources={"neuron_cores": 1})
class GPUActor:
    def ping(self):
        import torch, torch_neuronx
        from transformers import AutoTokenizer

        hf_model = "j-hartmann/emotion-english-distilroberta-base"

        # Print the core(s) made visible to this actor's process.
        print(f"rt_visible_cores: {os.environ['NEURON_RT_VISIBLE_CORES']}")
        model_neuron = torch.jit.load("sentiment_neuron.pt")
        tokenizer = AutoTokenizer.from_pretrained(hf_model)
        # Use the same padding and max_length as at compile time.
        encoded = tokenizer.encode_plus("hello world", return_tensors="pt", padding="max_length", truncation=True, max_length=128)
        output = model_neuron(*(encoded["input_ids"], encoded["attention_mask"]))
        print(output)

# Task that reserves one neuron_core and only reports which core it was given.
@ray.remote(resources={"neuron_cores": 1})
def use_gpu():
    print(f"rt_visible_cores: {os.environ['NEURON_RT_VISIBLE_CORES']}")
    time.sleep(10)

# Show the detected cluster resources, then run the actor and the task.
print(ray.available_resources())
gpu_actor = GPUActor.remote()
ray.get(gpu_actor.ping.remote())
ray.get(use_gpu.remote())

Test scenarios

Where the host contains two neuron_cores (trn1.2xlarge)

Each scenario lists its status, the num_neuron_cores passed to ray_init, the num_neuron_cores requested by the Actor and by the Task, and sample logs.

✅ ray_init: auto, Actor: 1, Task: 1
{'CPU': 8.0, 'accelerator_type:aws-neuron-core': 2.0, 'num_neuron_cores': 2.0, 'node:internal_head': 1.0, 'node:172.31.55.43': 1.0, 'object_store_memory': 9004382208.0, 'memory': 18008764416.0}
(GPUActor pid=3789266) ray.get_nc_ids(): [0]
(GPUActor pid=3789266) rt_visible_cores: 0
(GPUActor pid=3789266) {'logits': tensor([[-1.4126, -1.9890, -1.3332, -0.2176, 3.9735, -0.6969, 1.8381]])}
(use_gpu pid=3789087) ray.get_nc_ids(): [1]

✅ ray_init: 2, Actor: 1, Task: 1
{'node:172.31.55.43': 1.0, 'accelerator_type:aws-neuron-core': 2.0, 'memory': 17983684608.0, 'num_neuron_cores': 2.0, 'node:internal_head': 1.0, 'object_store_memory': 8991842304.0, 'CPU': 8.0}
(GPUActor pid=3792042) ray.get_nc_ids(): [0]
(GPUActor pid=3792042) rt_visible_cores: 0
(GPUActor pid=3792042) {'logits': tensor([[-1.4126, -1.9890, -1.3332, -0.2176, 3.9735, -0.6969, 1.8381]])}
(use_gpu pid=3791855) ray.get_nc_ids(): [1]

✅ ray_init: 2, Actor: 2, Task: 1
{'memory': 17954471118.0, 'accelerator_type:aws-neuron-core': 2.0, 'node:172.31.55.43': 1.0, 'CPU': 8.0, 'num_neuron_cores': 2.0, 'node:internal_head': 1.0, 'object_store_memory': 8977235558.0}
(GPUActor pid=3792731) ray.get_nc_ids(): [0, 1]
(GPUActor pid=3792731) rt_visible_cores: 0,1
(GPUActor pid=3792731) {'logits': tensor([[-1.4126, -1.9890, -1.3332, -0.2176, 3.9735, -0.6969, 1.8381]])}
(autoscaler +27s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +27s) Warning: The following resource request cannot be scheduled right now: {'num_neuron_cores': 1.0, 'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.

✅ ray_init: 3, Actor: 3, Task: 1
{'object_store_memory': 8976119808.0, 'num_neuron_cores': 3.0, 'accelerator_type:aws-neuron-core': 3.0, 'memory': 17952239616.0, 'node:172.31.55.43': 1.0, 'node:internal_head': 1.0, 'CPU': 8.0}
(GPUActor pid=3793572) ray.get_nc_ids(): [0, 1, 2]
(GPUActor pid=3793572) rt_visible_cores: 0,1,2
(GPUActor pid=3793572) 2023-Aug-07 19:21:54.0564 3793572:3793572 ERROR TDRV:tdrv_init_mla_phase1 ... Could not open the nd1
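The scheduling outcomes in these scenarios can be reproduced with a toy model. The sketch below is not Ray's scheduler; `assign_cores` is a hypothetical helper that hands out core IDs in request order and leaves a request pending (`None`) when too few cores remain, which mirrors the "cannot be scheduled right now" warning in the third scenario.

```python
from typing import List, Optional


def assign_cores(total_cores: int, requests: List[int]) -> List[Optional[List[int]]]:
    """Toy model: assign consecutive core IDs to each request in order.

    A request that does not fit in the remaining cores gets None,
    meaning it would stay pending until cores are released.
    """
    next_free = 0
    assignments: List[Optional[List[int]]] = []
    for wanted in requests:
        if next_free + wanted <= total_cores:
            assignments.append(list(range(next_free, next_free + wanted)))
            next_free += wanted
        else:
            assignments.append(None)
    return assignments


# Requests are [actor, task]; total_cores is what the cluster advertises.
print(assign_cores(2, [1, 1]))  # prints [[0], [1]]
print(assign_cores(2, [2, 1]))  # prints [[0, 1], None]
print(assign_cores(3, [3, 1]))  # prints [[0, 1, 2], None]
```

The last call corresponds to the fourth scenario: the advertised count (3) exceeds the two physical cores on a trn1.2xlarge, so the assignment succeeds on paper but the Neuron runtime reports an error when it tries to open a core that does not exist.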

Overall, the happy path confirms that the neuron_core has been used by the Actor (checked using neuron-ls)

Every 2.0s: neuron-ls --show-all-procs                                                                                                 ip-172-31-55-43: Mon Aug  7 19:24:18 2023

instance-type: trn1.2xlarge
instance-id: i-07eb09789f992275c
+--------+--------+--------+---------+---------+----------------------------+---------+
| NEURON | NEURON | NEURON |   PCI   |   PID   |          COMMAND           | RUNTIME |
| DEVICE | CORES  | MEMORY |   BDF   |         |                            | VERSION |
+--------+--------+--------+---------+---------+----------------------------+---------+
| 0      | 2      | 32 GB  | 00:1e.0 | 3794786 | ray::GPUActor              | 2.15.11 |
|        |        |        |         | 3794861 | neuron-ls --show-all-procs | NA      |
+--------+--------+--------+---------+---------+----------------------------+---------+
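For scripted checks, a table like the one above can be parsed to total the cores per host. This is a minimal sketch that assumes the exact column layout pasted here (device index in the first column, core count in the second); `count_cores_from_table` is a hypothetical helper, and other neuron-ls versions may format their output differently.

```python
def count_cores_from_table(text: str) -> int:
    """Sum the NEURON CORES column over rows that start with a device index."""
    total = 0
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line.startswith("|"):
            continue  # skip borders, blank lines, and the instance header
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Device rows have a numeric device index and a numeric core count;
        # header rows and continuation rows (extra processes) do not.
        if len(cells) >= 2 and cells[0].isdigit() and cells[1].isdigit():
            total += int(cells[1])
    return total


SAMPLE = """
+--------+--------+--------+---------+---------+----------------------------+---------+
| NEURON | NEURON | NEURON |   PCI   |   PID   |          COMMAND           | RUNTIME |
| DEVICE | CORES  | MEMORY |   BDF   |         |                            | VERSION |
+--------+--------+--------+---------+---------+----------------------------+---------+
| 0      | 2      | 32 GB  | 00:1e.0 | 3794786 | ray::GPUActor              | 2.15.11 |
|        |        |        |         | 3794861 | neuron-ls --show-all-procs | NA      |
+--------+--------+--------+---------+---------+----------------------------+---------+
"""

print(count_cores_from_table(SAMPLE))  # prints 2
```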

scv119 · 2023-08-03T20:50:25Z

Thanks for the contribution! The PR looks pretty great, modulo today's discussion (using accelerator type instead of GPU). I'll take a look once we've made the change.

github-actions · 2023-08-04T18:12:06Z

Attention: External code changed

A previous version of this PR changed code that is used or cited in external sources, e.g. blog posts.

It looks like these changes have been reverted or are otherwise not present in this PR anymore. Please still carefully review the changes to make sure code we use in external sources still works.

jjyao

In the doc, can we make sure to mention that the feature is alpha/experimental now?

rkooo567 · 2023-08-17T15:16:56Z

No more comments. LGTM. @pcmoritz should we "request change" to apply your request?

…-project#37998) This change is to support auto-detection of AWS accelerators and configuring appropriate environment variables to designate the neuron_core per task/actor. Related REP ray-project#33707 Signed-off-by: e428265 <[email protected]>

…-project#37998) This change is to support auto-detection of AWS accelerators and configuring appropriate environment variables to designate the neuron_core per task/actor. Related REP ray-project#33707 Signed-off-by: Victor <[email protected]>

matthewdeng assigned scv119 Aug 3, 2023

scv119 assigned rkooo567 and rickyyx Aug 3, 2023

scv119 added the @author-action-required label Aug 3, 2023

Refactor to support accelerator_type instead of num_gpus

faf7e9f

Signed-off-by: maheedhar reddy chappidi <[email protected]>

chappidim force-pushed the feat-trn1-accel branch from 51c1859 to faf7e9f August 4, 2023 18:11

github-actions bot added the external-code-affected label Aug 4, 2023

Fix UTs, resource regex and bug fixes

dc9472d

Signed-off-by: maheedhar reddy chappidi <[email protected]>

chappidim changed the title ~~Auto-detection of GPU/accelerator_type for aws_accelerators trn1_inf~~ Auto-detection of accelerator_type for aws_accelerators trn1_inf Aug 5, 2023

chappidim commented Aug 5, 2023

python/ray/_private/worker.py (outdated, resolved)

chappidim added 2 commits August 7, 2023 10:53

Improve UTs, add NC to custom_unit_instance_resources

d28744a

Signed-off-by: maheedhar reddy chappidi <[email protected]>

Bug fix on NoneType to fix UTs

bb41d2b

Signed-off-by: maheedhar reddy chappidi <[email protected]>

chappidim marked this pull request as ready for review August 7, 2023 19:26

chappidim requested review from wuisawesome, DmitriGekhtman, ericl and a team as code owners August 7, 2023 19:26

scv119 reviewed Aug 8, 2023

python/ray/_private/worker.py (outdated, resolved)

src/ray/common/ray_config_def.h (outdated, resolved)

python/ray/tests/test_autoscaler_yaml.py (outdated, resolved)

python/ray/_private/utils.py (outdated, resolved)

scv119 reviewed Aug 8, 2023

python/ray/_private/resource_spec.py (outdated, resolved)

Helper method, UT fixes, global var

c7818d7

Signed-off-by: maheedhar reddy chappidi <[email protected]>

chappidim requested a review from scv119 August 8, 2023 06:33

Merge branch 'ray-project:master' into feat-trn1-accel

54c00af

rkooo567 reviewed Aug 8, 2023

chappidim added 3 commits August 8, 2023 14:26

Refactor, validate on remote, UT coverage for accelerator

f2f0d06

Signed-off-by: maheedhar reddy chappidi <[email protected]>

Bug fixes on UT reg new annotations

76bea58

Signed-off-by: maheedhar reddy chappidi <[email protected]>

Merge branch 'feat-trn1-accel' of https://github.com/chappidm/ray int…

3de32aa

…o feat-trn1-accel

chappidim requested a review from rkooo567 August 9, 2023 18:20

Bug fix on neuron_core options validation

e137b57

Signed-off-by: maheedhar reddy chappidi <[email protected]>

Merge branch 'master' into feat-trn1-accel

eb51da0

jjyao approved these changes Aug 16, 2023

chappidim added 2 commits August 16, 2023 11:26

[Doc] Add experimental tag

ffe5078

Signed-off-by: maheedhar reddy chappidi <[email protected]>

Merge branch 'master' into feat-trn1-accel

471f3a7

chappidim requested a review from scv119 August 16, 2023 18:28

Merge branch 'master' into feat-trn1-accel

594ce80

matthewdeng approved these changes Aug 16, 2023

doc/source/ray-core/tasks/using-ray-with-gpus.rst (3 threads, all outdated and resolved)

scv119 and others added 3 commits August 16, 2023 12:33

Update doc/source/ray-core/tasks/using-ray-with-gpus.rst

9575f87

Co-authored-by: matthewdeng <[email protected]> Signed-off-by: Chen Shen <[email protected]>

Update doc/source/ray-core/tasks/using-ray-with-gpus.rst

774edc7

Co-authored-by: matthewdeng <[email protected]> Signed-off-by: Chen Shen <[email protected]>

Update doc/source/ray-core/tasks/using-ray-with-gpus.rst

2810ac8

Co-authored-by: matthewdeng <[email protected]> Signed-off-by: Chen Shen <[email protected]>

cadedaniel reviewed Aug 16, 2023

python/ray/_private/accelerator.py (resolved)

pcmoritz reviewed Aug 16, 2023

doc/source/ray-core/doc_code/neuron_core_accelerator.py (outdated, resolved)

Refactor num_neuron_cores to neuron_cores

29e8b63

Signed-off-by: maheedhar reddy chappidi <[email protected]>

chappidim requested a review from architkulkarni as a code owner August 17, 2023 16:31

chappidim added 2 commits August 17, 2023 09:32

Merge branch 'master' into feat-trn1-accel

ac09f78

Merge branch 'master' into feat-trn1-accel

c66c86e

scv119 merged commit 6b69524 into ray-project:master Aug 17, 2023

scv119 added the tests-ok label and removed the @author-action-required and external-code-affected labels Aug 17, 2023

allenwang28 mentioned this pull request Aug 22, 2023

[Ray Core] Adds in Google Cloud TPUs as a native Resource #38669

Merged

8 tasks

cadedaniel mentioned this pull request Aug 22, 2023

[Core] Support Intel GPU #36493

Closed

8 tasks

harborn mentioned this pull request Aug 24, 2023

[Core] Support Intel GPU #38553

Merged

8 tasks

scv119 mentioned this pull request Aug 24, 2023

[Core] holistic design for supporting accelerators in ray core #38504

Closed

chappidim deleted the feat-trn1-accel branch August 25, 2023 00:14

architkulkarni mentioned this pull request Sep 6, 2023

[Ray Core] Adds in Google Cloud TPUs as a native Resource (#38669) #39352

Merged

8 tasks
