
[Bug] Leader Election Lost: Kuberay pod restarts every 5mins! #2252

Closed
aviadshimoni opened this issue Jul 17, 2024 · 26 comments
Labels: bug (Something isn't working)

Comments

@aviadshimoni (Contributor) commented Jul 17, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

KubeRay keeps restarting because leader election is lost. I've raised this in Slack a couple of times with no luck; we suspect it is causing issues in some of our RayServices.
We use the KubeRay 1.1.1 Helm chart.
The main error is:
{"level":"error","ts":"2024-07-17T13:33:36.127Z","logger":"setup","msg":"problem running manager","error":"leader election lost","stacktrace":"main.exitOnError\n\t/home/runner/work/kuberay/kuberay/ray-operator/main.go:245\nmain.main\n\t/home/runner/work/kuberay/kuberay/ray-operator/main.go:228\nruntime.main\n\t/opt/hostedtoolcache/go/1.20.14/x64/src/runtime/proc.go:250"}
kuberay-logs.txt

Values configured for the Helm chart:

  env:
  - name: ENABLE_GCS_FT_REDIS_CLEANUP
    value: "false"
  nodeSelector:
    name: devops
  tolerations:
    - effect: NoSchedule
      key: CriticalAddonsOnly
      operator: Equal
      value: devops
  resources:
    limits:
      cpu: 500m
      # Anecdotally, managing 500 Ray pods requires roughly 500MB memory.
      # Monitor memory usage and adjust as needed.
      memory: 2Gi
    requests:
      cpu: 200m
      memory: 1Gi

replicas: 1
Willing to provide documentation or code if needed :)

Reproduction script

Kuberay 1.1.1, deployed on GKE v1.28.10-gke.107500.
Values configured above (nothing special IMHO).

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@aviadshimoni added the bug and triage labels Jul 17, 2024
@andrewsykim (Collaborator)

How frequently is it happening? I suggest removing the 500m CPU limit in your KubeRay pod spec; it could be causing leader election requests to time out.
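
For example, roughly this in your values (a sketch based on the values you posted: keep the requests, drop only the CPU limit):

resources:
  requests:
    cpu: 200m
    memory: 1Gi
  limits:
    # keep a memory limit if you want one; just omit the cpu key
    memory: 2Gi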

@andrewsykim (Collaborator)

Btw, since you mentioned running on GKE, consider using the official GKE addon: https://cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/how-to/enable-ray-on-gke

However, it's only available on v1.30+ at the moment (you mentioned running v1.28.10)

@aviadshimoni (Contributor Author) commented Jul 17, 2024

Thank you @andrewsykim for your quick reply. You're right, the limit is no good here.

More logs:

{"level":"info","ts":"2024-07-17T15:12:06.947Z","logger":"controllers.RayCluster","msg":"CheckAllPodsRunning: Pod is not running; Pod Name: rayservice-raycluster-v44tf-worker-ray-worker-gpu-s9k76; Pod Status.Phase: Failed","RayCluster":{"name":"clip-model-rayservice-raycluster-v44tf","namespace":"clip-model"},"reconcileID":"aa3e136c-fd9a-47ff-8b83-76cf27be3bdc"}
{"level":"info","ts":"2024-07-17T15:12:06.947Z","logger":"controllers.RayCluster","msg":"Environment variable RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV is not set, using default value of 300 seconds","RayCluster":{"name":"clip-model-rayservice-raycluster-v44tf","namespace":"clip-model"},"reconcileID":"aa3e136c-fd9a-47ff-8b83-76cf27be3bdc","cluster name":"clip-model-rayservice-raycluster-v44tf"}
{"level":"info","ts":"2024-07-17T15:12:06.947Z","logger":"controllers.RayCluster","msg":"Unconditional requeue after","RayCluster":{"name":"clip-model-rayservice-raycluster-v44tf","namespace":"clip-model"},"reconcileID":"aa3e136c-fd9a-47ff-8b83-76cf27be3bdc","cluster name":"clip-model-rayservice-raycluster-v44tf","seconds":300}
E0717 15:12:07.440182       1 leaderelection.go:369] Failed to update lock: Put "https://10.109.64.1:443/apis/coordination.k8s.io/v1/namespaces/devops-ray/leases/ray-operator-leader": context deadline exceeded
I0717 15:12:07.440241       1 leaderelection.go:285] failed to renew lease devops-ray/ray-operator-leader: timed out waiting for the condition
{"level":"error","ts":"2024-07-17T15:12:07.440Z","logger":"setup","msg":"problem running manager","error":"leader election lost","stacktrace":"main.exitOnError\n\t/home/runner/work/kuberay/kuberay/ray-operator/main.go:245\nmain.main\n\t/home/runner/work/kuberay/kuberay/ray-operator/main.go:228\nruntime.main\n\t/opt/hostedtoolcache/go/1.20.14/x64/src/runtime/proc.go:250"}

@kevin85421 suggested configuring KubeRay without leader election. Does KubeRay support an env var (RAY_DISABLE_LEADER_ELECTION or something similar) so this can be configured via values.yaml?
https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1721200745507799?thread_ts=1717181822.513479&cid=C02GFQ82JPM

@aviadshimoni (Contributor Author)

The CPU limit is set by default in the KubeRay chart; when removing the limit and keeping the request, I get this error:

2024-07-17T15:23:10+00:00 apps Deployment devops-ray kuberay-operator OutOfSync Healthy Deployment.apps "kuberay-operator" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: "500m": must be less than or equal to cpu limit of 100m

@aviadshimoni (Contributor Author)

Created this MR to remove the default in the new version of KubeRay:
#2253

@andrewsykim (Collaborator)

Does removing CPU limits resolve the leader election issue though?

@kevin85421 removed the triage label Jul 17, 2024
@aviadshimoni (Contributor Author)

@andrewsykim I can't remove the limits via Helm currently since they're set by default in the chart; I need to do it manually.
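
One thing that might work without a chart change (not tested here): Helm drops keys that are overridden to null when merging values, so something like this in our values could unset the default CPU limit:

resources:
  limits:
    cpu: null   # ask Helm to drop the chart's default cpu limit
    memory: 2Gi
  requests:
    cpu: 200m
    memory: 1Gi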

@aviadshimoni (Contributor Author)

Edited manually in stg:
image

let's see

@aviadshimoni (Contributor Author)

Edited prod too:
image

Still, the best practice is to avoid setting CPU limits, so I would remove them from the default values in the Helm chart.

@aviadshimoni (Contributor Author)

The issue persists, even without CPU limits and after downgrading to KubeRay 1.1.0.

Any help?

@aviadshimoni (Contributor Author)

image

@andrewsykim (Collaborator)

Hmm, not sure then. Since you're running a single kuberay-operator anyway, you can follow Kai-Hsun's suggestion and just disable leader election.

@aviadshimoni (Contributor Author) commented Jul 20, 2024

@andrewsykim should I increase replicas? This is prod.
@kevin85421 how do I pass an env var to disable leader election, or how do I pass this flag to the binary?

@aviadshimoni (Contributor Author)

seems related: #601

@andrewsykim (Collaborator)

Try setting --enable-leader-election=false in the kuberay-operator flags
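
Roughly, that means appending the flag to the manager command in the kuberay-operator Deployment (a sketch; adjust to your manifest):

containers:
- name: kuberay-operator
  command:
  - /manager
  - --enable-leader-election=false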

@andrewsykim (Collaborator)

can you also share the full pod YAML of the kuberay operator? kubectl get pod <kuberay-operator-pod> -o yaml

@aviadshimoni (Contributor Author)

@andrewsykim I can do it manually, but not via Helm, since the chart doesn't support this flag: https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/templates/deployment.yaml#L57

Can I create an MR for this?
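
Something like a values-driven toggle could work, e.g. (hypothetical key name, just to illustrate the idea; not necessarily what the final chart change will look like):

# values.yaml (hypothetical key)
leaderElectionEnabled: true

# templates/deployment.yaml, appended to the container args
- --enable-leader-election={{ .Values.leaderElectionEnabled }}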

And here is the YAML:

apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    creationTimestamp: "2024-07-21T09:25:06Z"
    generateName: kuberay-operator-54d77db944-
    labels:
      app.kubernetes.io/component: kuberay-operator
      app.kubernetes.io/instance: kuberay-operator
      app.kubernetes.io/name: kuberay-operator
      pod-template-hash: 54d77db944
    name: kuberay-operator-54d77db944-bsxkt
    namespace: devops-ray
    ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicaSet
      name: kuberay-operator-54d77db944
      uid: f21b2341-a8b5-4856-8816-d1b5dfdb9812
    resourceVersion: "1257112557"
    uid: a40cb305-3bb2-4a22-a69e-8d5506531f1c
  spec:
    containers:
    - command:
      - /manager
      env:
      - name: ENABLE_GCS_FT_REDIS_CLEANUP
        value: "false"
      image: quay.io/kuberay/operator:v1.1.0
      imagePullPolicy: IfNotPresent
      livenessProbe:
        failureThreshold: 5
        httpGet:
          path: /metrics
          port: http
          scheme: HTTP
        initialDelaySeconds: 10
        periodSeconds: 5
        successThreshold: 1
        timeoutSeconds: 1
      name: kuberay-operator
      ports:
      - containerPort: 8080
        name: http
        protocol: TCP
      readinessProbe:
        failureThreshold: 5
        httpGet:
          path: /metrics
          port: http
          scheme: HTTP
        initialDelaySeconds: 10
        periodSeconds: 5
        successThreshold: 1
        timeoutSeconds: 1
      resources:
        limits:
          cpu: "8"
          memory: 8Gi
        requests:
          cpu: "1"
          memory: 1Gi
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
          - ALL
        readOnlyRootFilesystem: true
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-m56k4
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    imagePullSecrets:
    - name: dv-docker-main-gcp
    nodeName: gke-gp-ms-uc1-k8s-public-mle-devops-1-dab37758-rl5l
    nodeSelector:
      name: devops
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: kuberay-operator
    serviceAccountName: kuberay-operator
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoSchedule
      key: CriticalAddonsOnly
      operator: Equal
      value: devops
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: kube-api-access-m56k4
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2024-07-21T09:26:15Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2024-07-21T10:28:56Z"
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2024-07-21T10:28:56Z"
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2024-07-21T09:26:15Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: containerd://8109abfec6a2c12c0aa452c3cd4fe997632065db3626df7f0874b669b95dc1d8
      image: quay.io/kuberay/operator:v1.1.0
      imageID: quay.io/kuberay/operator@sha256:a535757c28bcc27c06f906146970417cd88769f78df860131625d776608ec309
      lastState:
        terminated:
          containerID: containerd://bebcfb18507c4a237a14eebb85d3356c158155b8d00adaee117138769a2da929
          exitCode: 1
          finishedAt: "2024-07-21T10:28:43Z"
          reason: Error
          startedAt: "2024-07-21T10:05:33Z"
      name: kuberay-operator
      ready: true
      restartCount: 3
      started: true
      state:
        running:
          startedAt: "2024-07-21T10:28:44Z"
    hostIP: <host IP>
    phase: Running
    podIP: <pod IP>
    podIPs:
    - IP: <pod IP>
    qosClass: Burstable
    startTime: "2024-07-21T09:26:15Z"
kind: List
metadata:
  resourceVersion: ""

@aviadshimoni (Contributor Author)

Created this PR:
#2262

Contributing to this project seems like a big milestone for me, TIA! 🥇

@Irvingwangjr

Just curious, have you checked why the leader election timed out? Was KubeRay killed by the OOM killer, or was it a networking issue connecting to the api-server? If you simply disable leader election, you risk losing HA.

@aviadshimoni (Contributor Author)

@Irvingwangjr you're right, it seems like a timeout connecting to the api-server.
Any chance we hit an API limit?
In Grafana we see 20k requests per minute, while the quota in GCP is 3k per minute: https://cloud.google.com/kubernetes-engine/quotas

@andrewsykim can you share more knowledge about that?

In stg, after disabling leader election, I see no restarts.

@Irvingwangjr anyway, I'm using 1 replica for KubeRay. Should I increase it and enable leader election, or keep using 1 replica and disable leader election?
image

@aviadshimoni (Contributor Author) commented Jul 22, 2024

image (12)
Here we can see the big difference between leader election enabled and disabled.

To me it seems that we hit the K8s control plane API server limit, as we see a bunch of timeouts.
The question is why, and whether anyone else sees this in GKE? @andrewsykim tagging you as the GKE expert 🐐

@aviadshimoni (Contributor Author) commented Jul 22, 2024

GCP acknowledges that there is a limit of 3k requests per minute for the API server, so we get timeouts to the control plane API server. Credit: @dyurchanka

@andrewsykim (Collaborator) commented Jul 22, 2024

The 3K/minute quota is the default limit for the GKE API and NOT the Kubernetes API server. However, it's possible that your cluster is throttling API requests from the kuberay operator. Usually you can figure out if this is happening by looking at the apiserver logs. See https://cloud.google.com/kubernetes-engine/docs/how-to/view-logs#control_plane_logs

@andrewsykim (Collaborator)

Also see apiserver metrics that could help identify throttling from API server: https://cloud.google.com/kubernetes-engine/docs/how-to/control-plane-metrics#api-server-metrics

The ones containing flowcontrol in particular might indicate throttling issues on your cluster.

@aviadshimoni (Contributor Author)

Closing; moving to a regional cluster solved the KubeRay restarts.
