[Train] Add accelerator ids to workers and share neuron_cores by default #39091
Conversation
… default Signed-off-by: maheedhar reddy chappidi <[email protected]>
Signed-off-by: maheedhar reddy chappidi <[email protected]>
cc: @matthewdeng
This is awesome! At a high level the logic makes sense. I'm wondering if we can use this opportunity to make this a little more generic to support even more accelerator types.
For example, support for TPUs was just merged today, and I think it would be great if we could establish a pattern here that would allow us to easily extend this to TPUs with minimal boilerplate code. 😄
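For illustration, the table-driven shape that suggestion points at might look roughly like the sketch below; the env var values are assumptions for illustration, not something this PR pins down:

# Hypothetical mapping from accelerator resource name to the visibility env var
# its runtime understands; supporting a new accelerator type would mostly mean
# adding one more entry here.
SUPPORTED_ACCELERATOR_DEVICES_TO_ENV_VAR = {
    "neuron_cores": "NEURON_RT_VISIBLE_CORES",
    "TPU": "TPU_VISIBLE_CHIPS",  # assumed value, shown only to illustrate extensibility
}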
Signed-off-by: maheedhar reddy chappidi <[email protected]>
Signed-off-by: maheedhar reddy chappidi <[email protected]>
Mostly just some minor comments around forward compatibility, I think this should be good to go soon 🙂
if self._num_gpus_per_worker > 0 and share_cuda_visible_devices_enabled:
    self._share_cuda_visible_devices()
elif self._additional_resources_per_worker:
    for (
        accelerator,
        env_var,
    ) in SUPPORTED_ACCELERATOR_DEVICES_TO_ENV_VAR.items():
        if self._share_accelerator_devices_enabled(accelerator):
            self._share_resource_ids(accelerator, env_var)
One thing I am thinking about is unifying the logic between GPUs and the other resources. That might be simpler, since that's how it's set up in the WorkerGroup now, but it doesn't need to be done in this PR.
I intentionally kept it separate for now.
Signed-off-by: maheedhar reddy chappidi <[email protected]>
…o feat-train-trn2
A few final touches. Thanks!
# 2_CPUs, 1_GPU
assert len(gpu_and_accelerator_ids) == 3
It's unclear to me what the purpose of this comment is. Also I'm not sure if we should assert the length of this, as it may change if we add more resource types in the future?
The list doesn't change because these resources are configured at the test/UT level (let's say ray_start_2_cpus_and_gpus). The comments were added during my UT debugging.
Oh, what is the output of this? I tried running a super naive script on my laptop that prints out ray.get_runtime_context().get_resource_ids(), and it showed up as {'GPU': [], 'neuron_cores': [], 'TPU': []}.
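For reference, the naive check can be reproduced with something like the following; the dict you get back depends on which resources the local Ray cluster actually advertises:

import ray

ray.init()
# Maps each resource name to the IDs assigned to the current worker/driver,
# e.g. {'GPU': [], 'neuron_cores': [], 'TPU': []} on a laptop with none of them.
print(ray.get_runtime_context().get_resource_ids())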
You are right. I was under the impression (got lost with the refactors) that I'd get CPU/GPU/neuron_cores. It's actually TPU that got added recently. Fixed the UTs by removing the length check.
@matthewdeng besides this PR, what are the other changes needed to support different accelerators? For example, currently …
@jjyao The users can do something like … Possibly we can update the APIs as …
Signed-off-by: maheedhar reddy chappidi <[email protected]>
Signed-off-by: maheedhar reddy chappidi <[email protected]>
Signed-off-by: maheedhar reddy chappidi <[email protected]>
has_resource_requested = (
    self._additional_resources_per_worker.get(resource_name, None) is not None
)
return bool(env_integer(enable_sharing_env, has_resource_requested))
Oops, I was looking over this again. Should we change this logic so that it only returns True if both are True? I don't think it makes sense to share resources if has_resource_requested is False.
This was my thinking:
- If the user requests additional resources with neuron_cores, share by default (no TRAIN_ENABLE_SHARE_NEURON_CORES_ACCELERATOR env var exists).
- If the user passes the env var (TRAIN_ENABLE_SHARE_NEURON_CORES_ACCELERATOR), check its integer value: 1/True, 0/False.
I agree that with this behavior the env var naming should be something like DISABLE_SHARE_xxx. Fixed it. Good catch.
Oh, I was thinking of just making this consistent with the GPU logic:
if self._num_gpus_per_worker > 0 and share_cuda_visible_devices_enabled:
self._share_cuda_visible_devices()
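For illustration, a check consistent with that GPU condition might look roughly like the sketch below. It is only a sketch: the enable_sharing_env parameter and the default of 1 (sharing enabled) are assumptions, and env_integer stands for the same helper already used in the diff above.

def _share_accelerator_devices_enabled(self, resource_name, enable_sharing_env):
    # Share only if the accelerator resource was actually requested per worker
    # AND sharing has not been disabled via the environment variable
    # (default to enabled, mirroring the CUDA_VISIBLE_DEVICES behavior).
    has_resource_requested = (
        self._additional_resources_per_worker.get(resource_name, 0) > 0
    )
    return has_resource_requested and bool(env_integer(enable_sharing_env, 1))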
Signed-off-by: maheedhar reddy chappidi <[email protected]>
Signed-off-by: maheedhar reddy chappidi <[email protected]>
Thanks so much!
…ult (ray-project#39091) Signed-off-by: maheedhar reddy chappidi <[email protected]> Signed-off-by: Victor <[email protected]>
Why are these changes needed?
This change enables Train workers to share neuron cores with other workers on the same node by retrieving the accelerator/neuron_core IDs from each worker's runtime context.
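As a rough illustration of that flow (not the actual Train internals: the actor and helper names below are made up, and the script only does anything on a cluster that actually advertises a neuron_cores resource), each worker reports the neuron core IDs it owns, and every co-located worker is then given the union of those IDs via the Neuron visibility env var:

import os
from collections import defaultdict

import ray


@ray.remote(resources={"neuron_cores": 1})
class TrainWorker:
    def report_ids(self):
        ctx = ray.get_runtime_context()
        # Node this worker landed on, and the neuron core IDs Ray assigned to it.
        return ctx.get_node_id(), ctx.get_resource_ids().get("neuron_cores", [])

    def set_visible_cores(self, core_ids):
        # The Neuron runtime reads this variable to decide which cores it may use.
        os.environ["NEURON_RT_VISIBLE_CORES"] = ",".join(str(c) for c in core_ids)


def share_neuron_cores(workers):
    infos = ray.get([w.report_ids.remote() for w in workers])
    cores_by_node = defaultdict(list)
    for node_id, core_ids in infos:
        cores_by_node[node_id].extend(core_ids)
    # Every worker ends up seeing the union of core IDs owned by workers on its node.
    ray.get(
        [
            w.set_visible_cores.remote(cores_by_node[node_id])
            for w, (node_id, _) in zip(workers, infos)
        ]
    )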
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.

Manual testing
workers=2, neuron_cores=32 ✅
workers=64, neuron_cores=1 ✅