Flux 2.2, image-automation-controller stops reconciling; needed restart #689
Comments
This is the 2nd issue where a single automation stopped. There is some strange connection between the 2 automations that stopped today and on Tuesday: both were in the same cluster and were updating the same image name / tag, but for 2 different apps. After Tuesday's issue we decided to consolidate them down to 1 automation.

After more investigation, this still doesn't fully explain both issues.
The controller has an exponential backoff mechanism, the same as Kubernetes has for image pull errors. Every time a reconciliation fails, the backoff increases the retry interval, which can go up to hours. Is this repo an app repo? Does it have a huge .git history?
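For illustration, here is a minimal sketch of the kind of per-item exponential backoff used by Kubernetes-style controllers, built on client-go's workqueue rate limiter. The base and maximum delays and the queue key are placeholders for illustration, not the image-automation-controller's actual settings.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-item exponential backoff: each consecutive failure for the same
	// item roughly doubles the retry delay, capped at the maximum.
	// The 5s base and 1h cap here are illustrative values only.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Second, 1*time.Hour)

	item := "flux-system/my-app-automation" // hypothetical queue key

	for i := 0; i < 8; i++ {
		// Each When() call counts as another failure for this item,
		// simulating repeated reconcile errors.
		fmt.Printf("failure %d -> retry after %s\n", i+1, limiter.When(item))
	}

	// A successful reconcile would reset the backoff for the item.
	limiter.Forget(item)
}
```

With a 5s base, eight consecutive failures already push the retry delay past ten minutes, which is why a repeatedly failing automation can appear idle for long stretches.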
It's a dedicated flux repo, but there are a lot of configurations in there (4 main clusters, approx 30+ apps / image automations per cluster, each with its own interval).

I checked the size of the history; the repo clones pretty quickly on my laptop, like in just a couple of seconds.
Could networking be a potential cause? The primary errors we're seeing from the image-automation-controller (about 1-2 per minute) are SSH errors when cloning from GitLab.

We're using GitLab, and it has an SSH git operations limit of 600/min.
The controller does a full clone every time; you may want to increase the timeout in the GitRepository.
I'll give it a shot and monitor over the next day or so. Thanks!
So the errors are perhaps a red herring, and what I'm seeing doesn't seem to be associated with the exponential backoff. I turned on debug logging and looked at the final logs emitted for the automation (filtered by name).

It's hanging after the last debug log line. There was an "exponential backoff" log message plus a context deadline error about 15 minutes before this, but the later logs show the resource recovering and continuing to log "no changes made..." until it goes silent. I've so far increased CPU on the image-automation-controller and increased the GitRepository timeout.
That's strange, because the timeout for the clone operation is derived from the GitRepository's spec.timeout.
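As a rough illustration of what that implies (a sketch under assumptions, not the controller's actual code): the clone would run under a context whose deadline is derived from spec.timeout, and a context-aware git transport should abandon the operation once that deadline passes. The default fallback value below is an assumption for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// deriveCloneContext builds a context whose deadline comes from the
// GitRepository's spec.timeout, falling back to a default when it is unset.
// The 60s default is an assumption for this sketch, not Flux's real value.
func deriveCloneContext(parent context.Context, specTimeout *time.Duration) (context.Context, context.CancelFunc) {
	timeout := 60 * time.Second
	if specTimeout != nil {
		timeout = *specTimeout
	}
	return context.WithTimeout(parent, timeout)
}

func main() {
	specTimeout := 2 * time.Minute // e.g. GitRepository spec.timeout: 2m0s

	ctx, cancel := deriveCloneContext(context.Background(), &specTimeout)
	defer cancel()

	// A context-aware clone should return once this deadline passes; if the
	// transport blocks in a way that ignores the context, the goroutine can
	// still hang well past it, which matches the symptom described above.
	deadline, _ := ctx.Deadline()
	fmt.Println("clone must finish by:", deadline.Format(time.RFC3339))
}
```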
Agree, it's well beyond the timeout. It was bumped up to 2m0s with this patch:

```yaml
# kustomization.yaml
patches:
  - patch: |
      apiVersion: source.toolkit.fluxcd.io/v1
      kind: GitRepository
      metadata:
        name: flux-system
      spec:
        timeout: 2m0s
```
For the time being, we've deployed a pod that will kill/cycle the image-automation-controller pod as a workaround.
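For reference, a minimal sketch of what such a cycler could do, using client-go to delete the controller pod so its Deployment recreates it. The flux-system namespace and the app=image-automation-controller label selector are assumptions, and this is not necessarily how the poster's workaround is implemented.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs in-cluster with RBAC that allows deleting pods
	// in the flux-system namespace.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Delete the controller pod(s); the Deployment recreates them, giving
	// the same effect as the manual restart described in this issue.
	err = client.CoreV1().Pods("flux-system").DeleteCollection(
		context.Background(),
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "app=image-automation-controller"},
	)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("deleted image-automation-controller pod(s); deployment will recreate them")
}
```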
I'm wondering if this has a relationship to this issue: fluxcd/source-controller#1154

After digging some more, I found a Datadog runtime metric for the controller whose description points at stuck threads.

We do have a fair amount of SSH errors in our logs cloning from Gitlab.com, which to me seems pretty expected from the SaaS provider. Stuck threads, if they cause these controllers to stall, seem like a problem originating in FluxCD. Are there any mitigations you'd recommend that might reduce the impact of this on the image-automation-controller?
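One generic pattern that limits the blast radius of a stuck transport (a sketch only, not something this thread confirms the Flux controllers do): run the potentially blocking git/SSH call in its own goroutine and stop waiting for it once the context deadline passes. The reconcile worker is then unblocked, although the stuck goroutine itself still leaks until the underlying call returns.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runWithDeadline executes op in a separate goroutine and returns as soon as
// op finishes or the context is done. If op never returns, its goroutine
// leaks, but the caller is no longer blocked by it.
func runWithDeadline(ctx context.Context, op func() error) error {
	done := make(chan error, 1) // buffered so the goroutine can always exit
	go func() { done <- op() }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("operation abandoned: %w", ctx.Err())
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Simulate an SSH clone that hangs far beyond the deadline.
	err := runWithDeadline(ctx, func() error {
		time.Sleep(1 * time.Hour)
		return nil
	})
	fmt.Println(err) // operation abandoned: context deadline exceeded
}
```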
Making a new issue after seeing this comment in the last issue on this topic: #282 (comment)
Specs
On Flux v2.2, with the image-update-automation pod running the image image-automation-controller:v0.37.0, for image.toolkit.fluxcd.io/v1beta1 automation resources.

Description
The issue we saw was that the image-automation-controller stopped reconciling all ImageUpdateAutomation resources in the cluster for ~5 hours; then, when it resumed, it stopped reconciling only 1 ImageUpdateAutomation resource out of many in the cluster. To resolve it, we had to delete the image-automation-controller pod, and upon restart everything worked fine.

The following telemetry screenshots show CPU usage dropping off entirely, log volume dropping off entirely, and no errors produced in the logs. There were no pod restarts / crashloops or any k8s events for the pods.
I've attached a debug/pprof/goroutine?debug=2 profile to this issue as well: heap.txt
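For context on where such a dump comes from (a generic Go sketch, not the controller's actual wiring; the listen address is an arbitrary example): Go services commonly expose this endpoint by importing net/http/pprof, after which /debug/pprof/goroutine?debug=2 returns the full stack trace of every goroutine, which is what makes stuck clone goroutines visible.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Once the handlers are registered, a full goroutine dump is available at:
	//   http://localhost:6060/debug/pprof/goroutine?debug=2
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```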
Please let me know if there are any further details I can provide. Thank you!