[Regression from v1.1] RayCluster Condition should surface resource quota issue correctly #2357

Open
han-steve opened this issue Sep 6, 2024 · 8 comments
Labels: 1.3.0, bug

Comments

@han-steve

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

In v1.1, when a RayCluster could not spin up worker Pods due to a resource quota issue, it would have the following status:

status:
  ...
  reason: 'pods "rayjob-test2-raycluster-bsh5z-worker-small-wg-mh9v2" is forbidden:
    exceeded quota: low-resource-quota, requested: limits.cpu=200m,limits.memory=256Mi,
    used: limits.cpu=0,limits.memory=0, limited: limits.cpu=100m,limits.memory=107374182400m'
  state: failed

However, in v1.2, the status simply says:

status:
  ...
  lastUpdateTime: "2024-09-05T23:43:05Z"
  maxWorkerReplicas: 1
  minWorkerReplicas: 1
  observedGeneration: 1
  state: ready
  stateTransitionTimes:
    ready: "2024-09-05T23:43:05Z"

First, the state should not be ready. According to the doc,

When all Pods in the RayCluster are running for the first time, the RayCluster CR’s Status.State will transition to ready.

But not all pods in the RayCluster are running.

Second, the resource quota error should be added as a condition. The design doc alludes to emulating ReplicaSet conditions, which include a ReplicaFailure condition type for exactly this kind of resource quota error. Right now, the only place to find the error is in the operator logs:

..."error":"FailedCreateWorkerPod pods \"rayjob-test2-raycluster-x7242-small-wg-worker-rbkqx\" is forbidden: exceeded quota: low-resource-quota, requested: limits.cpu=200m,limits.memory=256Mi, ...

This makes it impossible for users to self-serve and debug this error.
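For comparison, a ReplicaSet in the same situation surfaces the failure as a ReplicaFailure condition with reason FailedCreate. A RayCluster condition following that pattern could look roughly like the sketch below (the condition type and reason names here are illustrative assumptions, not the API KubeRay currently implements; the message is the one from the v1.1 status above):

status:
  conditions:
  - type: ReplicaFailure          # assumed name, mirroring ReplicaSet conditions
    status: "True"
    reason: FailedCreate
    message: 'pods "rayjob-test2-raycluster-bsh5z-worker-small-wg-mh9v2" is forbidden:
      exceeded quota: low-resource-quota, ...'
    lastTransitionTime: "2024-09-05T23:43:05Z"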

As mentioned in #2182, our current way of surfacing this error to users when deploying a RayJob is to make a separate query to the RayCluster and read the error from its status:

rayCluster, err := r.rayClient.RayClusters(namespace).
	Get(ctx, clusterName, metav1.GetOptions{})
if err != nil {
	return "", err
}
if rayCluster.Status.State == rayv1api.Failed {
	return rayCluster.Status.Reason, nil
}

However, the v1.2 update breaks this logic, so we cannot upgrade to v1.2 yet.
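If the condition-based status is the intended replacement for .status.state and .status.reason, the workaround above could presumably be ported to read .status.conditions instead. A minimal sketch, assuming the failure is surfaced as a metav1.Condition whose type is something like "ReplicaFailure" (the exact condition type KubeRay sets is an assumption here; meta below is k8s.io/apimachinery/pkg/api/meta):

rayCluster, err := r.rayClient.RayClusters(namespace).
	Get(ctx, clusterName, metav1.GetOptions{})
if err != nil {
	return "", err
}
// Condition type is assumed for illustration; substitute whatever type KubeRay actually sets.
if cond := meta.FindStatusCondition(rayCluster.Status.Conditions, "ReplicaFailure"); cond != nil &&
	cond.Status == metav1.ConditionTrue {
	return cond.Message, nil
}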

Reproduction script

Create a resource quota:

resourceQuota := corev1.ResourceQuota{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "low-resource-quota",
		Namespace: tCtx.namespace,
	},
	Spec: corev1.ResourceQuotaSpec{
		Hard: corev1.ResourceList{
			corev1.ResourceLimitsCPU:    resource.MustParse(".1"),
			corev1.ResourceLimitsMemory: resource.MustParse(".1Gi"),
		},
	},
}

_, err := tCtx.k8sClient.CoreV1().
	ResourceQuotas(tCtx.namespace).
	Create(tCtx.ctx, &resourceQuota, metav1.CreateOptions{})

Deploy a Ray job:

const sampleJobResponseJSON = `{"name":"rayjob-test2", "namespace":"kubeflow-ml", "user":"3cp0", "entrypoint":"python -V", "clusterSpec":{"headGroupSpec":{"computeTemplate":"default-template", "image":"rayproject/ray:2.9.0", "serviceType":"NodePort", "rayStartParams":{"dashboard-host":"0.0.0.0"}, "environment":{}, "labels":{"sidecar.istio.io/inject":"false"}}, "workerGroupSpec":[{"groupName":"small-wg", "computeTemplate":"default-template", "image":"rayproject/ray:2.9.0", "replicas":1, "minReplicas":1, "maxReplicas":1, "rayStartParams":{"metrics-export-port":"8080"}, "environment":{}, "labels":{"sidecar.istio.io/inject":"false"}}]}, "createdAt":"2024-05-20T21:33:03Z"}`

The compute template referenced above is defined as:

configMap := corev1.ConfigMap{
	ObjectMeta: metav1.ObjectMeta{
		Name:      tCtx.computeTemplate,
		Namespace: tCtx.namespace,
		Labels: map[string]string{
			"ray.io/compute-template": tCtx.computeTemplate,
			"ray.io/config-type":      "compute-template",
		},
	},
	Data: map[string]string{
		"cpu":             ".1",
		"gpu":             "0",
		"gpu_accelerator": "",
		"memory":          ".2",
		"name":            tCtx.computeTemplate,
		"namespace":       tCtx.namespace,
	},
}
_, err := tCtx.k8sClient.CoreV1().
	ConfigMaps(tCtx.namespace).
	Create(tCtx.ctx, &configMap, metav1.CreateOptions{})

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@han-steve added the bug and triage labels on Sep 6, 2024
@andrewsykim
Collaborator

@kevin85421 @rueian @MortalHappiness I think this regression was introduced in #2249?

@andrewsykim
Collaborator

andrewsykim commented Sep 6, 2024

The RayCluster status changes described in the v1.2 release notes are guarded by a new feature gate, RayClusterStatusCondition. To my understanding, the existing .status.state and .status.reason APIs should be unchanged unless you enable the new feature gate. This seems like a possible bug introduced when we refactored how reconcile errors are surfaced in #2249.
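For reference, and assuming the operator is installed via the kuberay-operator Helm chart with a featureGates value (that detail, and the exact gate name, should be verified against the v1.2 release notes rather than taken from this sketch):

# kuberay-operator Helm values (sketch)
featureGates:
  - name: RayClusterStatusConditions   # verify the exact gate name for v1.2
    enabled: true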

@han-steve
Author

Hi, thanks for taking a look. I wasn't aware that there was a feature gate. Can you tell me more?
For more context, the old state machine would fail with the reason field set when there was a resource quota issue. The new state machine doesn't seem to surface the error at all.

@MortalHappiness
Member

MortalHappiness commented Sep 6, 2024

@han-steve Could you provide more detailed steps on how to reproduce this issue? I tried to reproduce the error, but it’s difficult to do so with the partial Go code you provided. For instance, I’m unsure what the value of tCtx is, and it's challenging to reconstruct sampleJob just by looking at the JSON response.

The most effective way for us to reproduce the issue would be a single YAML file that we can easily apply with kubectl, something like the following:

apiVersion: v1
kind: ResourceQuota
(other fields...)

---

apiVersion: ray.io/v1
kind: RayJob
(other fields...)

---

apiVersion: v1
kind: ConfigMap
(other fields...)

Additionally, I searched through the codebase and found that the "ray.io/compute-template" annotation only appears in the apiserver module, which hasn’t been maintained for a long time. Therefore, I’m unsure if this issue pertains solely to the apiserver module.

cc @andrewsykim @kevin85421 @rueian

@han-steve
Author

Hi, apologies for the confusion. The reproduction code follows the style of the apiserver integration test suite. Here's a YAML reproduction:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: low-resource-quota
  namespace: test-namespace
spec:
  hard:
    limits.cpu: 100m
    limits.memory: 107374182400m
---
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: rayjob-test2-raycluster-jd6g6
  namespace: test-namespace
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: 0.0.0.0
    serviceType: NodePort
    template:
      metadata:
        annotations:
          ray.io/compute-image: rayproject/ray:2.9.0
          ray.io/compute-template: test-anemone
        labels:
          sidecar.istio.io/inject: "false"
      spec:
        containers:
        - env:
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          image: rayproject/ray:2.9.0
          imagePullPolicy: IfNotPresent
          name: ray-head
          ports:
          - containerPort: 6379
            name: redis
            protocol: TCP
          - containerPort: 10001
            name: head
            protocol: TCP
          - containerPort: 8265
            name: dashboard
            protocol: TCP
          - containerPort: 8080
            name: metrics
            protocol: TCP
          resources:
            limits:
              cpu: "0"
              memory: "0"
            requests:
              cpu: "0"
              memory: "0"
  rayVersion: 2.32.0
  workerGroupSpecs:
  - groupName: small-wg
    maxReplicas: 1
    minReplicas: 1
    numOfHosts: 1
    rayStartParams:
      metrics-export-port: "8080"
    replicas: 1
    scaleStrategy: {}
    template:
      metadata:
        annotations:
          ray.io/compute-image: rayproject/ray:2.9.0
          ray.io/compute-template: test-anemone
        labels:
          sidecar.istio.io/inject: "false"
      spec:
        containers:
        - env:
          - name: RAY_DISABLE_DOCKER_CPU_WARNING
            value: "1"
          - name: TYPE
            value: worker
          - name: CPU_REQUEST
            valueFrom:
              resourceFieldRef:
                containerName: ray-worker
                divisor: "0"
                resource: requests.cpu
          - name: CPU_LIMITS
            valueFrom:
              resourceFieldRef:
                containerName: ray-worker
                divisor: "0"
                resource: limits.cpu
          - name: MEMORY_REQUESTS
            valueFrom:
              resourceFieldRef:
                containerName: ray-worker
                divisor: "0"
                resource: requests.cpu
          - name: MEMORY_LIMITS
            valueFrom:
              resourceFieldRef:
                containerName: ray-worker
                divisor: "0"
                resource: limits.cpu
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          image: rayproject/ray:2.9.0
          imagePullPolicy: IfNotPresent
          lifecycle:
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - ray stop
          name: ray-worker
          ports:
          - containerPort: 80
            protocol: TCP
          resources:
            limits:
              cpu: "0"
              memory: "0"
            requests:
              cpu: "0"
              memory: "0"

@kevin85421 added the 1.3.0 label and removed the triage label on Sep 10, 2024
@kevin85421
Member

kevin85421 commented Sep 10, 2024

@MortalHappiness
Member

@kevin85421 Do you mean #2249 or #2258? You said 2249 but the link you provided is 2258.

@kevin85421
Member

@MortalHappiness sorry, I am referring to #2258.
