Controller stops accepting jobs from the cluster queue #302
Comments
Hi @aressem, did you discover anything with your tests where the number is set to 0?
@DrJosh9000, the pipeline works as expected with
Same issue when testing with
2024-05-21T21:31:57.923Z DEBUG limiter scheduler/limiter.go:79 max-in-flight reached {"in-flight": 1}
I saw the same issue,
🤔 Maybe the controller should periodically survey the cluster, and adjust tokens accordingly.
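The periodic survey suggested above could look roughly like the following. This is a minimal sketch, not the controller's actual code: the `reconcile` function, the in-flight set, and the list of live job UUIDs are all hypothetical names, and a real implementation would obtain the live list from the Kubernetes API on a timer.

```go
package main

import (
	"fmt"
	"sort"
)

// reconcile is a hypothetical helper: it removes from inFlight every job UUID
// that no longer appears in live (the jobs actually present in the cluster)
// and returns the removed UUIDs in sorted order. Removing an entry is what
// would hand its token back to the limiter.
func reconcile(inFlight map[string]struct{}, live []string) []string {
	liveSet := make(map[string]struct{}, len(live))
	for _, uuid := range live {
		liveSet[uuid] = struct{}{}
	}

	var released []string
	for uuid := range inFlight {
		if _, ok := liveSet[uuid]; !ok {
			released = append(released, uuid)
		}
	}
	for _, uuid := range released {
		delete(inFlight, uuid)
	}

	sort.Strings(released)
	return released
}

func main() {
	// The controller believes two jobs are in flight, but only one exists.
	inFlight := map[string]struct{}{"job-a": {}, "job-b": {}}
	live := []string{"job-b"}

	released := reconcile(inFlight, live)
	fmt.Println(released, len(inFlight)) // prints "[job-a] 1"
}
```

Running such a pass on an interval would bound how long a leaked token can stay lost, at the cost of an extra list call against the cluster per interval.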
We just had a CI outage partially caused by this behavior. Here is the gist:
The logs indicated that there were available tokens, yet it got stuck at a lower number:
2024-11-12T17:59:37.861Z DEBUG limiter scheduler/limiter.go:87 Create: job is already in-flight {"uuid": "01931db6-67ea-403c-8687-e01ab64e8e94", "num-in-flight": 93, "available-tokens": 162}
We will be adding alarms for stale Buildkite jobs in the queue, but something still seems wrong with the controller, because it should still have scheduled other K8s Jobs into the cluster.
I am also running into this issue. I have to restart my Kubernetes deployment basically every day. 😅
I'm still looking into this one. I have a new theory: k8s jobs can be successfully created, but fail without ever starting a pod. This state isn't handled properly: the job remains present until the TTL, so it can't be recreated under the same name. That remains the oldest job available, so before #427 the controller repeatedly tries and fails to recreate it. With #427 other jobs get a shot at being created instead, but this isn't much help if the jobs are failing because the cluster is very busy.
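To make the theory above concrete, here is a minimal sketch of a scheduling pass in which one job that cannot be created does not block the jobs behind it. It is an illustration only and is not taken from #427: `scheduleOnce`, `errAlreadyExists`, and the `create` callback are hypothetical names standing in for the real job-creation path.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists is a hypothetical sentinel for "a job with this name is
// still present in the cluster", e.g. a failed job waiting out its TTL.
var errAlreadyExists = errors.New("job already exists")

// scheduleOnce tries to create every pending job, oldest first. A job whose
// creation fails is recorded in failed and skipped, so later jobs still get
// a chance instead of the pass retrying the oldest job forever.
func scheduleOnce(pending []string, create func(uuid string) error) (created []string, failed map[string]error) {
	failed = make(map[string]error)
	for _, uuid := range pending {
		if err := create(uuid); err != nil {
			failed[uuid] = err
			continue
		}
		created = append(created, uuid)
	}
	return created, failed
}

func main() {
	// Pretend a failed job named "job-old" is still present in the cluster.
	create := func(uuid string) error {
		if uuid == "job-old" {
			return errAlreadyExists
		}
		return nil
	}

	created, failed := scheduleOnce([]string{"job-old", "job-new"}, create)
	fmt.Println(created, len(failed)) // prints "[job-new] 1"
}
```

As noted above, skipping ahead only helps when the failure is specific to one job; if every creation fails because the cluster is busy, the pass creates nothing.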
v0.20.0 shipped with a few improvements in this area. I'm curious to see what has helped. If possible, when trying v0.20.0 or later, please gather some Prometheus metrics to get a better sense of what else is going wrong:
# values.yaml
config:
  prometheus-port: 9216 # or some other port of your choosing
Here's an example:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: agent-stack-k8s
  labels:
    app: agent-stack-k8s
spec:
  jobLabel: app
  namespaceSelector:
    matchNames:
      - buildkite
  selector:
    matchLabels:
      app: agent-stack-k8s
  podMetricsEndpoints:
    - port: metrics # defined in the Helm chart when prometheus-port is set
      interval: 1s # feel free to tune
We have agent-stack-k8s up and running, and it works fine for a while. However, it suddenly stops accepting new jobs, and the last thing it outputs is (we turned on debug):
We currently only have a single pipeline, single cluster and single queue. When this happens there are no jobs or pods named buildkite-${UUID} in the k8s cluster. Executing kubectl -n buildkite rollout restart deployment agent-stack-k8s makes the controller happy again and it starts jobs from the queue.
I suspect that there is something that should decrement the in-flight number, but fails to do so. We are now running a test where this number is set to 0 to see if that works around the problem.
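The suspected failure mode can be illustrated with a toy max-in-flight limiter. This is a sketch under stated assumptions, not the code in `scheduler/limiter.go`: the `Limiter` type and its methods are hypothetical, and treating a limit of 0 as "unlimited" is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// Limiter is a hypothetical max-in-flight limiter keyed by job UUID.
type Limiter struct {
	mu       sync.Mutex
	max      int                 // assumption in this sketch: 0 means no limit
	inFlight map[string]struct{} // UUIDs currently holding a token
}

// NewLimiter returns a limiter that allows at most max jobs in flight.
func NewLimiter(max int) *Limiter {
	return &Limiter{max: max, inFlight: make(map[string]struct{})}
}

// TryAcquire reserves a token for uuid. It reports false if the job is
// already in flight or the limit has been reached.
func (l *Limiter) TryAcquire(uuid string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, ok := l.inFlight[uuid]; ok {
		return false
	}
	if l.max > 0 && len(l.inFlight) >= l.max {
		return false
	}
	l.inFlight[uuid] = struct{}{}
	return true
}

// Release returns the token held by uuid. Calling it for an unknown UUID,
// or more than once, is a no-op.
func (l *Limiter) Release(uuid string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.inFlight, uuid)
}

func main() {
	l := NewLimiter(1)
	fmt.Println(l.TryAcquire("job-a")) // prints "true"

	// If job-a finishes but Release is never called, the token is leaked
	// and every later job is refused, even though the cluster is idle.
	fmt.Println(l.TryAcquire("job-b")) // prints "false"

	l.Release("job-a")
	fmt.Println(l.TryAcquire("job-b")) // prints "true"
}
```

In this model, any path where a job ends without `Release` being called permanently shrinks capacity until the process restarts, which would match the observation that a rollout restart fixes the symptom.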