Memory leak #785
Comments
I am running image-automation-controller with a single image update automation, and enabled metrics-server to check on the memory usage. My baseline (startup) memory usage is about 10mb and my usage after about 3 days was only 12mb. I will continue to monitor and do some more experiments on my end, if you can run
Can you say if the 80 image updates were all one ImageUpdateAutomation, if you have more than one, etc. - the flux stats output will give us some of this information. If you can characterize your environment in some more detail (e.g. single IUA, multiple tenants, etc.), that will also help.
There is also some guidance about how to extract a profile, which may help us debug this issue. Can you please try to follow these instructions from the debugging guide and let us know what you find? Specifically the section "Collecting a profile", which should be easy to obtain from the metrics port: https://fluxcd.io/flux/gitops-toolkit/debugging/
Other information which may be useful: what are the intervals on Image resources set to - the
Anything else you can tell us which differentiates your environment from the Image Update guide may prove meaningful.
Here's what I got from
No, it was a total number across all image update automations.
We are having a base repository for all of our clusters with common applications that are always needed, e.g. ingress controllers, logging, etc.
Here's the heap file, at this moment it had around 37MB of memory usage after running 42 hours:
Usual intervals are
Since there are quite a few of them, it is likely that there is one that's not small. We do however use the spec.ignore property to only whitelist a folder specific to Flux resources, like so:
# exclude all
/*
# include compacter
!/flux
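A minimal sketch of where such ignore rules sit in a GitRepository manifest. The resource name, namespace, URL, branch, and interval are placeholders and not values from this thread; only the `spec.ignore` patterns mirror the ones quoted above.

```yaml
# Hypothetical GitRepository; only spec.ignore reflects the patterns from the thread.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system            # placeholder name
  namespace: flux-system       # placeholder namespace
spec:
  interval: 1m                 # placeholder interval
  url: https://github.com/example/repo   # placeholder URL
  ref:
    branch: main               # placeholder branch
  # .gitignore-style rules: drop everything, then re-include one folder
  ignore: |
    # exclude all
    /*
    # include the flux folder
    !/flux
```

Note that these rules shape the artifact the source controller produces; as explained further down the thread, the image automation controller still performs its own clone in order to push commits.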
Nothing so far
In fact, I don't see anything that would be different to the guide (https://fluxcd.io/flux/guides/image-update)
Our resources set for the image automation update controller are like this:
resources:
  limits:
    cpu: '1'
    memory: 1Gi
  requests:
    cpu: 100m
    memory: 64Mi
Thanks for your input @stefanprodan, I will try raising the time and see if that changes the memory usage metrics.
Have you seen the controller reach 1Gi? Are you sure the restart is due to OOM? From what you've posted here I see no evidence of OOM. If the controller gets to 100Mi, that's normal, as the GC sees there is lots of free memory. An OOM is logged by kubelet; did that actually happen?
I haven't seen it reach 1Gi, and I never said (nor do I think) that the restarts are OOM-related.
A memory leak would always result in OOM, hence the issue title is confusing to me.
It cannot result in OOM when the memory is rising slowly enough for our instances to be replaced by other ones before the OOM could occur.
The extra memory you see can be reclaimed by the OS if it needs it. If you look at the memory dump you'll see this; there is no evidence of a memory leak as far as I can tell.
Adjusting the interval from 1 minute to 5 minutes has helped a lot already and decreased the memory rise by ~60% over the same period of time.
It's great to hear that lengthening the interval for
Where you should keep a short interval is in the
We appreciate the report, and I agree there might be something we can do to handle this better: perhaps a bug in the controller around the edge cases when intervals are arriving faster than the GC can process the old data out of the heap. I'm not familiar enough with the codebase to dive in and try to narrow it down. But please try setting the ImageUpdateAutomation to an even longer interval, either
You should get a better end-user experience that way, with no long delays between commits, without triggering any leak issue.
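A minimal sketch of the interval split discussed in this thread, assuming the short interval belongs on the ImageRepository (the cheap tag scan) and the long one on the ImageUpdateAutomation (the full clone). All names, the image, the path, the commit author, and the concrete interval values are placeholders chosen for illustration, not values recommended by the maintainers. The `v1beta2` API version is an assumption about the installed CRDs; older installations serve `v1beta1`.

```yaml
# Hypothetical manifests; intervals are illustrative only.
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: my-app                 # placeholder name
  namespace: flux-system
spec:
  image: ghcr.io/example/my-app   # placeholder image
  interval: 1m                 # short: listing tags is inexpensive
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageUpdateAutomation
metadata:
  name: my-app                 # placeholder name
  namespace: flux-system
spec:
  interval: 30m                # long: each run needs a full git clone
  sourceRef:
    kind: GitRepository
    name: flux-system          # placeholder GitRepository
  git:
    checkout:
      ref:
        branch: main           # placeholder branch
    commit:
      author:
        name: fluxcdbot                               # placeholder author
        email: fluxcdbot@users.noreply.github.com     # placeholder email
    push:
      branch: main
  update:
    path: ./flux               # placeholder path
    strategy: Setters
```

Because the automation is notified when an upstream image policy changes, the long interval here mainly governs how often drift in the repository is re-checked.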
To expand a little bit, I didn't want to bring up Receiver right away, but the broader characterization which I often make is that typically "apply" operations in Flux are more expensive than "check for changes" source-type operations. We can trigger appliers indirectly by making sure the source kind updates often by polling at a short interval.
When there is no change in a source, the source controller has caching opportunities so it can make that fetch inexpensive and basically a no-op. Similarly the Image Reflector Controller doesn't cost very much to fetch a list of tags and compare it to the previously observed list of tags. It's way less intensive polling a source frequently than doing a dry-run on the cluster or a full git clone operation as Image Update must do. A full clone is what's required in order to be able to push commits. So IUA's clone is a lot more expensive than the Source Controller's clone, which only needs to fetch the head commit.
The Flux resources which apply from a source (or in IUA's case, generate a commit from a list of tags) are all configured automatically to create an internal watch on their upstreams, so when your GitRepository updates finding a new revision, it automatically notifies downstream Kustomizations so they can trigger reconcile immediately instead of waiting an interval. There is a similar relationship between ImageRepository/ImagePolicy/ImageUpdateAutomation.
So when people are tuning intervals, and they haven't set up Receivers at all, but they want changes to go to the cluster fast... I say don't set Kustomization to a short interval because it will DDoS your cluster's control-plane with unnecessary dry-runs. Same for IUA and full git clones; when nothing on the source has changed and/or nothing in the target repository needs to change, the only purpose of the dry-run is for drift correction.
If you don't worry about people overwriting the tag in the git repository with an older one, then there's very little reason to reconcile IUA so frequently. You can also make this behavior of sources triggering downstream resources instant without setting any short intervals at all by setting up a Receiver - in this case configure GitHub for a
I recently revisited the receiver guide from end-to-end to ensure that it works with ImageRepository, GitRepository, OCIRepository, and also cert-manager: https://fluxcd.io/flux/guides/webhook-receivers/
But there should be basically no case even when receivers aren't configured where a short interval like
So that's where you should set your short interval, if that's the issue that you're trying to solve.
I think it is some kind of thread exhaustion issue, like Stefan suggested. I think the resources aren't getting cleaned up due to timeout or something, and that's what is causing the slow memory growth over a long period of time. If you can prevent the exhaustion/timeout from happening by setting a longer interval, then you shouldn't see any memory growth at all. My IUA controller running for several days with just one resource still uses only 12mb. I will set up more repositories, larger repositories, and shorter intervals to try to stress-test it, but you shouldn't need that configuration unless you have special circumstances.
I have set up one Receiver with the package event, so even with default intervals, new images published are committed by IUA very fast - and then they are deployed just as quickly with a second Receiver connecting the GitRepository to the
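A minimal sketch of a Receiver that reacts to GitHub's package event and triggers an ImageRepository scan, in the spirit of the setup described above. The Receiver name, the Secret name, and the ImageRepository name are hypothetical, and the exact event list is an assumption; the webhook receivers guide linked in the thread is the authoritative reference.

```yaml
# Hypothetical Receiver; names and the Secret are placeholders.
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: image-receiver         # placeholder name
  namespace: flux-system
spec:
  type: github
  events:
    - "ping"
    - "package"                # fired when a package version is published
  secretRef:
    name: receiver-token       # placeholder Secret holding the webhook token
  resources:
    - apiVersion: image.toolkit.fluxcd.io/v1beta2
      kind: ImageRepository
      name: my-app             # placeholder ImageRepository to rescan
```

With a webhook delivering events, the polling intervals on the image resources can stay at their defaults, since a new image triggers the scan directly.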
Thank you for the detailed explanation. I will adjust accordingly.
I believe I have found a memory leak in the image-automation-controller.
Here's a screenshot of my Grafana showing memory usage of the image-automation-controller over the last 7 days:
It only seems to affect the current leader (obvious, as it is the one doing the real work).
The restarts, visible as drops in memory usage, are not crashes afaik (no logs about crashes), but seem to relate to instance scaling.
Image used:
ghcr.io/fluxcd/image-automation-controller:v0.39.0
Args:
--events-addr=http://notification-controller.flux-system.svc.cluster.local./ --watch-all-namespaces=true --log-level=info --log-encoding=json --enable-leader-election
There were ~80 image updates during these days; however, they didn't seem related, as the memory also increased on days without image updates.