Auto-detection of accelerator_type for aws_accelerators trn1_inf #37998
Thanks for the contribution! The PR looks pretty great, apart from the change from today's discussion (using accelerator type instead of GPU). I'll take a look once we've made the change.
Signed-off-by: maheedhar reddy chappidi <[email protected]>
51c1859 to faf7e9f
Attention: External code changed. A previous version of this PR changed code that is used or cited in external sources, e.g. blog posts. It looks like these changes have been reverted or are otherwise not present in this PR anymore. Please still carefully review the changes to make sure code we use in external sources still works.
…o feat-trn1-accel
In the doc, can we make sure to mention that the feature is alpha/experimental now?
Co-authored-by: matthewdeng <[email protected]> Signed-off-by: Chen Shen <[email protected]>
No more comments. LGTM. @pcmoritz should we "request change" to apply your request?
…-project#37998) This change is to support auto-detection of AWS accelerators and configuring appropriate environment variables to designate the neuron_core per task/actor. Related REP ray-project#33707 Signed-off-by: e428265 <[email protected]>
…-project#37998) This change is to support auto-detection of AWS accelerators and configuring appropriate environment variables to designate the neuron_core per task/actor. Related REP ray-project#33707 Signed-off-by: Victor <[email protected]>
Why are these changes needed?
This change is to support auto-detection of AWS accelerators and configuring appropriate environment variables to designate the neuron_core per task/actor.
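As a rough illustration of the second half of that description, the sketch below turns a list of assigned core IDs into the comma-separated value that the test logs in this PR print as rt_visible_cores. The helper name set_visible_cores is hypothetical and is not the function this PR adds; NEURON_RT_VISIBLE_CORES is assumed to be the Neuron runtime variable being targeted, since the description only says "appropriate environment variables".

```python
import os
from typing import Iterable, MutableMapping, Optional

# Assumption: this is the Neuron runtime variable the PR configures.
NEURON_RT_VISIBLE_CORES = "NEURON_RT_VISIBLE_CORES"


def set_visible_cores(
    core_ids: Iterable[int],
    env: Optional[MutableMapping[str, str]] = None,
) -> str:
    """Hypothetical helper: expose only the given core IDs to a worker.

    Writes a comma-separated list such as "0,1" into `env`
    (os.environ by default) and returns the value written.
    """
    if env is None:
        env = os.environ
    value = ",".join(str(core_id) for core_id in core_ids)
    env[NEURON_RT_VISIBLE_CORES] = value
    return value
```

Passing a plain dict as env keeps the sketch testable without touching the real process environment.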
Related REP: ray-project#33707
Checks
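I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
Manual testing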
Steps followed
Test scenarios
Where the host contains two neuron_cores (trn1.2xlarge)
(GPUActor pid=3789266) ray.get_nc_ids(): [0]
(GPUActor pid=3789266) rt_visible_cores: 0
(GPUActor pid=3789266) {'logits': tensor([[-1.4126, -1.9890, -1.3332, -0.2176, 3.9735, -0.6969, 1.8381]])}
(use_gpu pid=3789087) ray.get_nc_ids(): [1]
(GPUActor pid=3792042) ray.get_nc_ids(): [0]
(GPUActor pid=3792042) rt_visible_cores: 0
(GPUActor pid=3792042) {'logits': tensor([[-1.4126, -1.9890, -1.3332, -0.2176, 3.9735, -0.6969, 1.8381]])}
(use_gpu pid=3791855) ray.get_nc_ids(): [1]
(GPUActor pid=3792731) ray.get_nc_ids(): [0, 1]
(GPUActor pid=3792731) rt_visible_cores: 0,1
(GPUActor pid=3792731) {'logits': tensor([[-1.4126, -1.9890, -1.3332, -0.2176, 3.9735, -0.6969, 1.8381]])}
(autoscaler +27s) Tip: use ray status to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +27s) Warning: The following resource request cannot be scheduled right now: {'num_neuron_cores': 1.0, 'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
(GPUActor pid=3793572) ray.get_nc_ids(): [0, 1, 2]
(GPUActor pid=3793572) rt_visible_cores: 0,1,2
(GPUActor pid=3793572) 2023-Aug-07 19:21:54.0564 3793572:3793572 ERROR TDRV:tdrv_init_mla_phase1 ... Could not open the nd1
Overall, the happy path confirms that the neuron_core has been used by the Actor (verified using neuron-ls).
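The first scenarios in the logs (one core each for an actor and a task, then a request that cannot be scheduled once the cores are claimed) can be pictured with a toy model. This is only a sketch of the bookkeeping, not Ray's scheduler: the CorePool class and its methods are hypothetical names, and the lowest-free-ID policy is an assumption chosen to match the IDs printed above.

```python
from typing import Iterable, List, Optional


class CorePool:
    """Toy stand-in for one node's neuron_core resource (not Ray code)."""

    def __init__(self, total: int) -> None:
        # Free core IDs, kept in ascending order.
        self.free: List[int] = list(range(total))

    def acquire(self, count: int) -> Optional[List[int]]:
        """Return `count` core IDs, or None if the request cannot be met."""
        if count > len(self.free):
            return None  # mirrors "cannot be scheduled right now"
        granted, self.free = self.free[:count], self.free[count:]
        return granted

    def release(self, core_ids: Iterable[int]) -> None:
        """Return core IDs to the pool, e.g. when an actor exits."""
        self.free = sorted(self.free + list(core_ids))


# A host with two cores, as on trn1.2xlarge in the scenario above.
pool = CorePool(2)
actor_cores = pool.acquire(1)  # first requester gets one core
task_cores = pool.acquire(1)   # second requester gets the other
pending = pool.acquire(1)      # nothing left, so this stays pending
```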