
Auto-detection of accelerator_type for aws_accelerators trn1_inf #37998

Merged — 42 commits merged into ray-project:master on Aug 17, 2023

Conversation

chappidim (Contributor) commented Aug 1, 2023

Why are these changes needed?

This change is to support auto-detection of AWS accelerators and configuring appropriate environment variables to designate the neuron_core per task/actor.
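To illustrate the idea behind the auto-detection described above, here is a minimal, hypothetical sketch: count NeuronCores from `neuron-ls`-style JSON output and map the count to the resource that `ray.init` would register. The helper names and the `"nc_count"` field are illustrative assumptions, not Ray's actual internal API.

```python
import json

def count_neuron_cores(neuron_ls_output: str) -> int:
    """Sum NeuronCores across all Neuron devices in a neuron-ls-style
    JSON listing. The "nc_count" field name is assumed for illustration."""
    devices = json.loads(neuron_ls_output)
    return sum(dev.get("nc_count", 0) for dev in devices)

def as_ray_resources(num_cores: int) -> dict:
    """Map the detected core count to a Ray custom-resource dict."""
    if num_cores == 0:
        return {}
    return {"num_neuron_cores": float(num_cores)}

# A trn1.2xlarge exposes one Neuron device with two cores.
sample = '[{"neuron_device": 0, "nc_count": 2}]'
print(as_ray_resources(count_neuron_cores(sample)))  # {'num_neuron_cores': 2.0}
```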

Related
REP ray-project#33707

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Manual tests

Manual testing

Steps followed

  1. Built a custom whl file with the changes
  2. Compiled and saved a simple Neuron model on a trn1.2xlarge machine:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch, torch_neuronx

hf_model = "j-hartmann/emotion-english-distilroberta-base"
neuron_model = "sentiment_neuron.pt"

model = AutoModelForSequenceClassification.from_pretrained(hf_model)
tokenizer = AutoTokenizer.from_pretrained(hf_model)
example_inputs = tokenizer.encode_plus("When I eat pizza I always have an amazing time", return_tensors="pt", padding="max_length", truncation=True, max_length=128)
neuron_inputs = example_inputs["input_ids"], example_inputs["attention_mask"]
nmod = torch_neuronx.trace(model, neuron_inputs)
nmod.save(neuron_model)
print(f"Saved Neuron-compiled model {neuron_model}")
  3. Ran the inference/Ray script with an Actor and a Task, each requesting one neuron_core:
import os
import ray
import time
import torch
import torch_neuronx

ray.init()

@ray.remote(resources={"neuron_cores": 1})
class GPUActor:
    def ping(self):
        import torch, torch_neuronx
        from transformers import AutoTokenizer

        hf_model = "j-hartmann/emotion-english-distilroberta-base"

        print(f"rt_visible_cores: {os.environ['NEURON_RT_VISIBLE_CORES']}")
        model_neuron = torch.jit.load("sentiment_neuron.pt")
        tokenizer = AutoTokenizer.from_pretrained(hf_model)
        encoded = tokenizer.encode_plus("hello world", return_tensors="pt", padding="max_length", truncation=True, max_length=128)
        output = model_neuron(*(encoded["input_ids"], encoded["attention_mask"]))
        print(output)

@ray.remote(resources={"neuron_cores": 1})
def use_gpu():
    print(f"rt_visible_cores: {os.environ['NEURON_RT_VISIBLE_CORES']}")
    time.sleep(10)

print(ray.available_resources())
gpu_actor = GPUActor.remote()
ray.get(gpu_actor.ping.remote())
ray.get(use_gpu.remote())
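The `rt_visible_cores` values in the logs below are just the worker's assigned NeuronCore IDs joined into `NEURON_RT_VISIBLE_CORES`. A minimal sketch of that step (the helper is illustrative, not Ray's internal worker-setup code):

```python
import os

def set_visible_cores(core_ids, environ=None):
    """Export assigned NeuronCore IDs as NEURON_RT_VISIBLE_CORES,
    sorted and comma-joined (e.g. [1, 0] -> "0,1")."""
    env = os.environ if environ is None else environ
    env["NEURON_RT_VISIBLE_CORES"] = ",".join(str(i) for i in sorted(core_ids))
    return env["NEURON_RT_VISIBLE_CORES"]

print(set_visible_cores([1, 0], environ={}))  # 0,1
```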

Test scenarios

Where the host contains two neuron_cores (trn1.2xlarge)

Results (num_neuron_cores passed to ray_init / requested by the Actor / requested by the Task):

ray_init: auto, Actor: 1, Task: 1
Available resources: {'CPU': 8.0, 'accelerator_type:aws-neuron-core': 2.0, 'num_neuron_cores': 2.0, 'node:internal_head': 1.0, 'node:172.31.55.43': 1.0, 'object_store_memory': 9004382208.0, 'memory': 18008764416.0}
(GPUActor pid=3789266) ray.get_nc_ids(): [0]
(GPUActor pid=3789266) rt_visible_cores: 0
(GPUActor pid=3789266) {'logits': tensor([[-1.4126, -1.9890, -1.3332, -0.2176, 3.9735, -0.6969, 1.8381]])}
(use_gpu pid=3789087) ray.get_nc_ids(): [1]

ray_init: 2, Actor: 1, Task: 1
Available resources: {'node:172.31.55.43': 1.0, 'accelerator_type:aws-neuron-core': 2.0, 'memory': 17983684608.0, 'num_neuron_cores': 2.0, 'node:internal_head': 1.0, 'object_store_memory': 8991842304.0, 'CPU': 8.0}
(GPUActor pid=3792042) ray.get_nc_ids(): [0]
(GPUActor pid=3792042) rt_visible_cores: 0
(GPUActor pid=3792042) {'logits': tensor([[-1.4126, -1.9890, -1.3332, -0.2176, 3.9735, -0.6969, 1.8381]])}
(use_gpu pid=3791855) ray.get_nc_ids(): [1]

ray_init: 2, Actor: 2, Task: 1
Available resources: {'memory': 17954471118.0, 'accelerator_type:aws-neuron-core': 2.0, 'node:172.31.55.43': 1.0, 'CPU': 8.0, 'num_neuron_cores': 2.0, 'node:internal_head': 1.0, 'object_store_memory': 8977235558.0}
(GPUActor pid=3792731) ray.get_nc_ids(): [0, 1]
(GPUActor pid=3792731) rt_visible_cores: 0,1
(GPUActor pid=3792731) {'logits': tensor([[-1.4126, -1.9890, -1.3332, -0.2176, 3.9735, -0.6969, 1.8381]])}
(autoscaler +27s) Tip: use ray status to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +27s) Warning: The following resource request cannot be scheduled right now: {'num_neuron_cores': 1.0, 'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
(Expected: the Actor holds both cores, so the Task cannot be scheduled.)

ray_init: 3, Actor: 3, Task: 1
Available resources: {'object_store_memory': 8976119808.0, 'num_neuron_cores': 3.0, 'accelerator_type:aws-neuron-core': 3.0, 'memory': 17952239616.0, 'node:172.31.55.43': 1.0, 'node:internal_head': 1.0, 'CPU': 8.0}
(GPUActor pid=3793572) ray.get_nc_ids(): [0, 1, 2]
(GPUActor pid=3793572) rt_visible_cores: 0,1,2
(GPUActor pid=3793572) 2023-Aug-07 19:21:54.0564 3793572:3793572 ERROR TDRV:tdrv_init_mla_phase1 ... Could not open the nd1
(Fails at runtime: the host has only two physical NeuronCores.)

Overall, the happy path confirms that the neuron_core is used by the Actor (verified with neuron-ls):

Every 2.0s: neuron-ls --show-all-procs                                                                                                 ip-172-31-55-43: Mon Aug  7 19:24:18 2023

instance-type: trn1.2xlarge
instance-id: i-07eb09789f992275c
+--------+--------+--------+---------+---------+----------------------------+---------+
| NEURON | NEURON | NEURON |   PCI   |   PID   |          COMMAND           | RUNTIME |
| DEVICE | CORES  | MEMORY |   BDF   |         |                            | VERSION |
+--------+--------+--------+---------+---------+----------------------------+---------+
| 0      | 2      | 32 GB  | 00:1e.0 | 3794786 | ray::GPUActor              | 2.15.11 |
|        |        |        |         | 3794861 | neuron-ls --show-all-procs | NA      |
+--------+--------+--------+---------+---------+----------------------------+---------+

scv119 (Contributor) commented Aug 3, 2023

Thanks for the contribution! The PR looks pretty great modulo today's discussion (using accelerator type instead of GPU). I'll take a look once we've made the change.

@scv119 scv119 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 3, 2023
github-actions bot commented Aug 4, 2023

Attention: External code changed

A previous version of this PR changed code that is used or cited in external sources, e.g. blog posts.

It looks like these changes have been reverted or are otherwise not present in this PR anymore. Please still carefully review the changes to make sure code we use in external sources still works.

Signed-off-by: maheedhar reddy chappidi <[email protected]>
@chappidim chappidim changed the title Auto-detection of GPU/accelerator_type for aws_accelerators trn1_inf Auto-detection of accelerator_type for aws_accelerators trn1_inf Aug 5, 2023
Signed-off-by: maheedhar reddy chappidi <[email protected]>
@chappidim chappidim marked this pull request as ready for review August 7, 2023 19:26
Review comments (since resolved): python/ray/_private/worker.py, src/ray/common/ray_config_def.h, python/ray/tests/test_autoscaler_yaml.py, python/ray/_private/utils.py
Signed-off-by: maheedhar reddy chappidi <[email protected]>
@chappidim chappidim requested a review from scv119 August 8, 2023 06:33
Review comments (since resolved): python/ray/_private/accelerator.py, python/ray/_private/utils.py, python/ray/_private/ray_constants.py, src/ray/common/ray_config_def.h
@chappidim chappidim requested a review from rkooo567 August 9, 2023 18:20
Signed-off-by: maheedhar reddy chappidi <[email protected]>
jjyao (Collaborator) left a comment:
In the doc, can we make sure to mention that the feature is alpha/experimental now?

@chappidim chappidim requested a review from scv119 August 16, 2023 18:28
Review comments (since resolved): doc/source/ray-core/tasks/using-ray-with-gpus.rst
rkooo567 (Contributor) commented Aug 17, 2023

No more comments. LGTM. @pcmoritz should we "request change" to apply your request?

Signed-off-by: maheedhar reddy chappidi <[email protected]>
@scv119 scv119 merged commit 6b69524 into ray-project:master Aug 17, 2023
@scv119 scv119 added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. external-code-affected labels Aug 17, 2023
@cadedaniel cadedaniel mentioned this pull request Aug 22, 2023
8 tasks
@harborn harborn mentioned this pull request Aug 24, 2023
8 tasks
@chappidim chappidim deleted the feat-trn1-accel branch August 25, 2023 00:14
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…-project#37998)

This change is to support auto-detection of AWS accelerators and configuring appropriate environment variables to designate the neuron_core per task/actor.

Related
REP
ray-project#33707

Signed-off-by: e428265 <[email protected]>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…-project#37998)

This change is to support auto-detection of AWS accelerators and configuring appropriate environment variables to designate the neuron_core per task/actor.

Related
REP
ray-project#33707

Signed-off-by: Victor <[email protected]>
9 participants