-
Notifications
You must be signed in to change notification settings - Fork 305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Ray Autoscaler to the Flyte-Ray plugin #1937
Add Ray Autoscaler to the Flyte-Ray plugin #1937
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #1937 +/- ##
=======================================
Coverage 85.53% 85.53%
=======================================
Files 309 309
Lines 23460 23475 +15
Branches 3630 3630
=======================================
+ Hits 20066 20079 +13
- Misses 2752 2753 +1
- Partials 642 643 +1 ☔ View full report in Codecov by Sentry. |
db8d4c3
to
ddfd824
Compare
07ee38a
to
5de8cf7
Compare
Signed-off-by: Yicheng-Lu-llll <[email protected]>
5de8cf7
to
3e54e42
Compare
Signed-off-by: Yicheng-Lu-llll <[email protected]>
Signed-off-by: Yicheng-Lu-llll <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, could we add a small test here?
@@ -148,13 +152,22 @@ def worker_group_spec(self) -> typing.List[WorkerGroupSpec]: | |||
""" | |||
return self._worker_group_spec | |||
|
|||
@property | |||
def enable_in_tree_autoscaling(self) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qq: why in_tree
? Does Ray have other autoscaling strategies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only have one autoscaling strategy. I have changed name to enable_autoscaling
. Thank you!
Signed-off-by: Yicheng-Lu-llll <[email protected]>
Signed-off-by: Yicheng-Lu-llll <[email protected]>
Signed-off-by: Yicheng-Lu-llll <[email protected]>
@@ -178,9 +192,13 @@ def __init__( | |||
self, | |||
ray_cluster: RayCluster, | |||
runtime_env: typing.Optional[str], | |||
ttl_seconds_after_finished: typing.Optional[int] = None, | |||
shutdown_after_job_finishes: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this mean that by default the rayjob will not be reclaimed by kuberay once the job finishes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -178,9 +192,13 @@ def __init__( | |||
self, | |||
ray_cluster: RayCluster, | |||
runtime_env: typing.Optional[str], | |||
ttl_seconds_after_finished: typing.Optional[int] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a no-op if shutdown_after_job_finishes
is set to False, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
enable_autoscaling=True, | ||
shutdown_after_job_finishes=True, | ||
ttl_seconds_after_finished=20, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a new release of flyteidl.
Signed-off-by: Yicheng-Lu-llll <[email protected]>
c03ad40
to
150ab4d
Compare
Signed-off-by: Yicheng-Lu-llll <[email protected]>
Signed-off-by: Yicheng-Lu-llll <[email protected]> Signed-off-by: Jan Fiedler <[email protected]>
TL;DR
NOTE:The ray CI test failed because we need to first merge flyteorg/flyte#4363 to update flytecli.
Currently, the Flyte-Ray plugin utilizes Rayjob. However, there are cases where Rayjob may require an autoscaler.
So, this PR adds Ray Autoscaler config to the Flyte-Ray plugin. Also see flyteorg/flyte#4363.
btw, This PR adds
shutdown_after_job_finishes
andttl_seconds_after_finished
.ttl_seconds_after_finished
specifies the number of seconds after which the RayCluster will be deleted after the RayJob finishes.shutdown_after_job_finishes
specifies whether the RayCluster should be deleted after the RayJob finishes.Below is an example:
init(replicas=0), only head:
(flytekit) ubuntu@ip-172-31-2-249:~/flyte/flytekit$ kubectl get pod -n flytesnacks-development NAME READY STATUS RESTARTS AGE f50cd4d1a5ceb40bc9aa-n0-0-raycluster-pbvx9-head-szdgn 2/2 Running 0 3s
task executing(max_replicas=2):
(flytekit) ubuntu@ip-172-31-2-249:~/flyte/flytekit$ kubectl get pod -n flytesnacks-development NAME READY STATUS RESTARTS AGE f50cd4d1a5ceb40bc9aa-n0-0-raycluster-pbvx9-head-szdgn 2/2 Running 0 54s ceb40bc9aa-n0-0-raycluster-pbvx9-worker-ray-group-llzp7 1/1 Running 0 14s ceb40bc9aa-n0-0-raycluster-pbvx9-worker-ray-group-mqrb6 1/1 Running 0 14s
finsh(min_replicas=0):
(flytekit) ubuntu@ip-172-31-2-249:~/flyte/flytekit$ kubectl get pod -n flytesnacks-development NAME READY STATUS RESTARTS AGE f50cd4d1a5ceb40bc9aa-n0-0-raycluster-pbvx9-head-szdgn 2/2 Running 0 2m16s ceb40bc9aa-n0-0-raycluster-pbvx9-worker-ray-group-llzp7 1/1 Terminating 0 96s ceb40bc9aa-n0-0-raycluster-pbvx9-worker-ray-group-mqrb6 1/1 Terminating 0 96s
After TTL 120s:
Type
Are all requirements met?
Complete description
How did you fix the bug, make the feature etc. Link to any design docs etc
Tracking Issue
flyteorg/flyte#4187
Follow-up issue
NA
OR
https://github.com/flyteorg/flyte/issues/