[Regression from v1.1] RayCluster Condition should surface resource quota issue correctly #2357
Comments
@kevin85421 @rueian @MortalHappiness I think this regression was introduced in #2249?
The RayCluster status changes described in the v1.2 release notes are guarded by a new feature gate.
Hi, thanks for taking a look. I wasn't aware that there's a feature gate. Can you tell me more?
@han-steve Could you provide more detailed steps on how to reproduce this issue? I tried to reproduce the error, but it’s difficult to do so with the partial Go code you provided. For instance, I’m unsure what the value of
The most effective way for us to reproduce the issue would be a single YAML file that we can apply with
apiVersion: v1
kind: ResourceQuota
(other fields...)
---
apiVersion: ray.io/v1
kind: RayJob
(other fields...)
---
apiVersion: v1
kind: ConfigMap
(other fields...)
Additionally, I searched through the codebase and found that the "ray.io/compute-template" annotation only appears in the
Hi, apologies for the confusion. The reproduction code follows the style of the apiserver integration test suite. Here's a YAML file reproduction:
@kevin85421 Do you mean #2249 or #2258? You said 2249 but the link you provided is 2258.
@MortalHappiness Sorry, I am referring to #2258.
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
In v1.1, when a RayCluster could not spin up worker nodes because of a resource quota issue, it had the following status:
However, in v1.2, it simply says:
First, the state should not be ready. According to the doc,
But not all pods in the RayCluster are running.
Second, the resource quota error should be added as a condition. The design doc alludes to emulating ReplicaSet conditions, which include a type for resource quota errors. Right now, the only place to find this error is in the operator logs:
This makes it impossible for users to self-serve and debug this error.
As mentioned in #2182, our current way of surfacing this error to the user when deploying a RayJob is to issue a separate query to the RayCluster for the error:
However, the v1.2 update breaks this logic, so we cannot upgrade to v1.2 yet.
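For readers unfamiliar with this kind of client-side lookup, the following Go sketch shows one way a caller might pull a failure message out of a RayCluster status once it has been fetched as JSON. It is not the code referred to above. The `type`, `status`, `reason`, and `message` keys follow the standard Kubernetes condition layout, while the sample payload, the `ReplicaFailure` condition type, and the `failureMessage` helper are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition mirrors the JSON shape of a standard Kubernetes condition.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// clusterStatus holds only the status fields this sketch needs.
type clusterStatus struct {
	State      string      `json:"state"`
	Conditions []condition `json:"conditions"`
}

// failureMessage returns the message of the first condition of type condType
// whose status is "True". The boolean reports whether one was found.
func failureMessage(raw []byte, condType string) (string, bool, error) {
	var st clusterStatus
	if err := json.Unmarshal(raw, &st); err != nil {
		return "", false, err
	}
	for _, c := range st.Conditions {
		if c.Type == condType && c.Status == "True" {
			return c.Message, true, nil
		}
	}
	return "", false, nil
}

func main() {
	// Hypothetical status payload; real field values will differ.
	raw := []byte(`{"state":"ready","conditions":[{"type":"ReplicaFailure","status":"True","reason":"FailedCreate","message":"exceeded quota"}]}`)
	msg, found, err := failureMessage(raw, "ReplicaFailure")
	fmt.Println(msg, found, err)
}
```

A lookup like this only works if the operator writes the error into the status, which is the behavior this issue asks to restore.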
Reproduction script
Create a resource quota:
Deploy a Ray job:
Where the compute template is defined as
Anything else
No response
Are you willing to submit a PR?