Dramatic increase in CPU and memory consumption / performance issues #325
Comments
@mike-larson - did you get a chance to retest with the updated Crossplane version, 1.10.2? It seems to have been fixed.
We are using Crossplane v1.11.0 and still observe this issue, both for v0.26 and later versions of this provider. I can replicate it as follows:
Snippet of the logs when the problem is occurring:
It would be wonderful if we could at least adjust the reconcile retry rate. In our case we want it to keep retrying (since the topic will eventually get created), just at a much slower pace (exponential backoff would be nice).
That explains the constant high CPU usage we've been seeing. But there is also a dramatic jump in memory usage at startup for versions above v0.26. We can't get v0.29 running at all currently, while v0.26 calms down much more quickly. I'm not sure if it's purely the increase in CRDs or something else going on.
I'm seeing high CPU usage by this provider, though in a slightly different scenario, which I will describe in a separate issue. I just wanted to drop a suggestion here, for those who are in a position to use it: If possible, try to use the native aws provider for the resources that are supported by it. I realize this can complicate the platform definition if you need to add another provider to your setup, so this might not be a solution for everyone.
@rozcietrzewiacz confirmed, the CPU usage is much lower with the community AWS provider. We've switched over to that and it's much more performant. Perhaps because it uses the AWS SDK instead of Terraform?
I also see high CPU and memory usage. We haven't used this to build too much so far, only 12 Cognito user pools. I do see that they reconcile constantly, and I see a Reconciler error in the debug logs:
1.6763083302780247e+09 ERROR Reconciler error {"controller": "managed/wafv2.aws.upbound.io/v1beta1, kind=rulegroup", "controllerGroup": "wafv2.aws.upbound.io", "controllerKind": "RuleGroup", "userPool": {"name":"matt-user-pool-jwll2-b7j52"}, "namespace": "", "name": "matt-user-pool-jwll2-b7j52", "reconcileID": "8aa0442b-060f-4fb7-871c-dee3b587199c", "error": "cannot update managed resource status: Operation cannot be fulfilled on userpools.cognitoidp.aws.upbound.io "matt-user-pool-jwll2-b7j52": the object has been modified; please apply your changes to the latest version and try again"}
I don't know that the resource was actually modified outside of the provider. I don't think this one has ever actually been used.
Thank you everyone for the discussion here and sharing your observations and data! That is definitely helpful and appreciated. I think we should take a deeper look into this issue and see if there's a recent regression or some other root cause that is not simply related to the typical performance issues we've seen from a large number of CRDs installed. Please do keep sharing your findings as we dig into this more! 🙇 🙏
Just a quick note that it does indeed look like the reconcile events are happening more often than I would expect. Timestamps were grabbed from #325 (comment) and then filtered with: grep -F 'Reconciling' | grep -F -i '/test-api-trading-activation-target-service-trading-activated-event' | awk '{print $1}' They're in a non-human-friendly scientific-notation epoch format (tracked in crossplane/crossplane-runtime#373), but converted, the timestamps look like below:
That's 11 reconcile events within about 1 minute - I would think we'd see fewer than that if they were doing exponential backoff. Perhaps the condition for this SNS topic that is causing us to requeue another reconcile isn't an error and isn't subject to exponential backoff? Also, we see 2 events when filtering with: grep -F 'events' | grep -F 'Successfully requested creation of external resource'
1.6760075409451208e+09 DEBUG events Normal {"object": {"kind":"TopicSubscription","name":"test-api-trading-activation-target-service-trading-activated-event","uid":"e7e6ee39-3dcb-40d7-b869-a6494c569188","apiVersion":"sns.aws.upbound.io/v1beta1","resourceVersion":"521434662"}, "reason": "CreatedExternalResource", "message": "Successfully requested creation of external resource"}
1.6760075807374654e+09 DEBUG events Normal {"object": {"kind":"TopicSubscription","name":"test-api-trading-activation-target-service-trading-activated-event","uid":"e7e6ee39-3dcb-40d7-b869-a6494c569188","apiVersion":"sns.aws.upbound.io/v1beta1","resourceVersion":"521435068"}, "reason": "CreatedExternalResource", "message": "Successfully requested creation of external resource"}
Timestamps translated:
So the double creation and the frequency of reconciling here does look fishy to me 🎣 🤔
I verified that the huge spike in memory doesn't occur when all managed resources in the cluster are Synced and Ready: we were able to upgrade to v0.29 normally in that state, and memory usage didn't go up at all; in fact it dropped by ~70%. The bad news is that when the cluster was in the bad state (pods getting Evicted constantly), the provider was creating duplicate resources for apparently no reason, which made the situation far worse. We went from 32 managed resources to 380+. Once provider-aws recovered it was able to successfully remove most of them; a handful required manual deletion and patching to remove the finalizer.
I have Crossplane installed on two clusters, hub and dev. On the dev cluster it manages 12 Cognito user pools; on the hub cluster it is only installed and isn't managing anything. The resource utilization between the two is vastly different. On the dev cluster, where it is managing resources, we are experiencing high CPU and memory usage on the upbound-provider-aws pod. Both clusters are running the latest version of Crossplane and the upbound-provider-aws containers. To give some numbers: it's regularly using over 5 GB of memory and sometimes over 12. The real issue is that the pod regularly causes memory pressure on its node in the dev cluster and gets evicted; in the past 24 hours it's been evicted 164 times because the node was under memory pressure. I have tried the following:
My next steps: I am going to try setting pod resource requests and limits on the upbound-provider-aws pod to see if I can keep it from continually getting evicted. I am also going to consider using the community provider instead of the upbound version.
Sample of my logs: What I think I am seeing is that it's constantly trying to reconcile the environments. I also see it getting errors on xray, which is weird because I am not building anything with xray. https://gist.github.com/mmclane/b44b933cb831d26d520169f507d96c13
I was able to set resource limits, but when I had it limited to 5.5 GB of RAM I was seeing regular restarts of the pod due to it running out of memory. Setting an 8 GB limit also results in restarts.
Even at a memory limit of 14 GB, this pod was restarted due to OOMKill 10 times in 5 hours.
One thing I have noticed this morning is that while my user pools show that they are Synced and Ready, the underlying resources in the composition are not showing as in sync. If I look at the details on those, I see that the plan is failing. I don't know why this would be, and the AWS resources have all been created. I am wondering if this is what is causing all the CPU and memory usage, as it tries to rerun a plan for everything all the time. Furthermore, I had Crossplane create an RDS database today (as I am working on an XRD around that) and it isn't able to delete the database it just created; it fails with the same errors. I am going to try to figure out why I am getting these errors if I can. Here are some log snippets I have found:
Increasing the memory limit on the provider pod to 20G forced the pod to update. The new pod was able to run the plans successfully and most resources are now showing as synced.
Hi @mmclane and all, which Kubernetes version are you running?
v1.24.8-eks-ffeb93d for us
We're running v1.23.15-eks-49d8fe8.
v1.24.8-eks-ffeb93d
Thank you all for the version info. I also did some tests yesterday on a local kind cluster to understand what's going on here. I need to do some further tests and observations on the issue, but it currently looks to me like the concurrent Terraform calls and Terraform provider forks are causing the memory and CPU spikes we are observing. As @blakebarnett mentioned above, once the managed resources get to the Synced and Ready state, resource consumption settles down. I assume that in the discussions above the provider is running with the default values of the poll and max-reconcile-rate parameters. Did anyone override these parameters?
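For reference, a minimal sketch of overriding these flags through a ControllerConfig, assuming the upjet provider flag names mentioned above; the resource name and values are illustrative assumptions, not recommendations:

```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: aws-config   # hypothetical name
spec:
  args:
    # Check individual resources for drift less frequently than the default.
    - --poll=10m
    # Cap how many reconciles may run concurrently.
    - --max-reconcile-rate=2
```

The ControllerConfig is then referenced from the Provider object via its controllerConfigRef.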
I've started using the community provider instead for my current package, at least until this gets sorted out. One thing I've noticed is that the community provider isn't constantly giving errors related to xray like I see on the upbound provider.
I know that package isn't building anything with xray! If I restart the pod I see it starting a lot of workers, or at least ~2400 log lines that include starting workers and reference xray.
I am wondering if this is related. Here is a gist with my full startup log: https://gist.github.com/mmclane/b2c7199bd4bb7eac20853fbaaf4ece4b
As for your question, I have not overridden those settings. Here is my current ControllerConfig for the provider.
I have not seen resource utilization get low even when everything is Ready and Synced; at least, not nearly as low as the second cluster or the community provider. I have also seen the provider seem to get stuck, with resources marked as out of sync until I restart the pod.
Faced an issue with Upbound AWS provider version 0.29.0: pods were getting evicted since this morning. The pod was running on a 4-CPU, 8 GB instance and getting evicted continuously. All of the resources were out of Sync. When I exec'd into the pod and checked the processes, it was running multiple terraform processes at once, maybe trying to sync all of the resources in parallel, which explains why it was using so much CPU and memory. I increased the instance size to 16 CPUs and 64 GB memory and it worked fine.
To keep everyone up to date: We are treating this as our top priority and are busy investigating the root cause and possible solutions. We'll keep folks updated here as we gather information and form plans to address it.
We are temporarily moving communication updates and requests for assistance and feedback to the following Crossplane community Slack workspace channel: #sig-provider-aws-resource-utilization. We plan to release two images of the provider tomorrow for testing and ask those who can try it to share their findings there.
In the next provider release, v0.31.0, coming out early next week, we plan to do the following:
None of the above is a definitive fix for the larger issue yet, but these are small iterative steps towards a better state. In the meantime, we continue to prioritize this work, and our team's focus is fully on improving performance and making further quality enhancements, which we will share more about in the coming weeks.
With v0.31.0, memory consumption still skyrockets from time to time. Moreover, since the deployment for
@fernandezcuesta The 0.31.0 release did not contain the CPU/memory optimizations yet; those are only coming in the next release. We are actively testing the work with the community in #sig-provider-aws-resource-utilization, and once all issues are ironed out we will release a new version of the providers.
You can set resource requests and limits on the provider pod.
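For example, a minimal sketch of carrying requests and limits on a ControllerConfig, assuming the hypothetical name aws-config from the sketch above and placeholder values rather than tuned recommendations:

```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: aws-config   # hypothetical name
spec:
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      # Placeholder ceiling; several reports in this thread needed far more.
      cpu: "4"
      memory: 8Gi
```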
We are experiencing node terminations because of memory exhaustion on nodes running this provider.
Our fix:
The controller is running without any resource limits. The pod starts with about 2Gi and dies (with the whole node) at around 22-28Gi (just for reference) in about 30 minutes.
Each consuming about 130Mi. Maybe relevant log information:
As we've lost this node, I can't tell if it was killed by the kernel or by the provider itself. This is probably not the most useful information; would there be anything more helpful to you?
@nce thank you for the information. It would be helpful if you could try the test images we referenced in the XP community Slack, https://crossplane.slack.com/archives/C04QLETDJGN/p1679072821157379, and report on how they work for you. The test images are:
I've tested it and already mentioned it in slack, reposting it here for visibility. With the new image we get the following logs:
which results in a pod restart. The released v0.30 image starts just fine without any (or much) client throttling.
Is there something we should have configured differently for v0.31?
An update for everyone: We have release candidate images for the big three providers available for testing. These images contain the latest improvements to the providers' performance. Some load and correctness tests were done and reported in issue #576. For more context about the improvements, please see this PR: crossplane/upjet#178
xpkg.upbound.io/upbound/provider-aws:v0.32.0-rc.1
We recommend trying these RC images outside of production environments. The main purpose is to collect pre-release feedback and to confirm there are no blockers to releasing new versions of the providers at the end of the week.
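For those testing, a sketch of pointing a Provider at the RC image, assuming a pre-existing ControllerConfig named aws-config (the hypothetical name used in the sketches above):

```yaml
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: upbound-provider-aws
spec:
  package: xpkg.upbound.io/upbound/provider-aws:v0.32.0-rc.1
  # Attach the ControllerConfig that carries flags and resource settings.
  controllerConfigRef:
    name: aws-config   # hypothetical name
```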
The new provider releases are available:
In addition to the performance improvements, we also have two additional resources for users to reference:
These improvements are only the start of our continued focus on improved performance and quality.
Are there any updates on why, with leader election, only one pod does the actual work? This is a big concern in terms of scalability for me.
What happened?
Tried to create 75 resources of the kinds Subnet and RouteTableAssociation, as well as 3 RouteTables and Routes. All mentioned resources are from apiVersion ec2.aws.upbound.io/v1beta1.
The resources were created using a composition.
This resulted in the provider-aws pod (version 0.23.0) being unable to process all the requests; it increased dramatically in CPU and memory usage and took 24 hours to create these resources, until an entire node with 4 CPUs was dedicated to the Upbound provider-aws. After this, resource creation was faster, but:
After finishing creation, resource use is permanently elevated at around 4000 millicores and 3000Mi of memory. It stays this way after all resources are Ready and Synced.
Without these resources, resource use fluctuates between 800-2000 millicores and around 200Mi.
This is a dramatic and unsustainable increase in resource use upon adding additional resources.
When looking into the activity inside the provider-aws pod, it's visible that it is constantly running terraform plan/apply on seemingly all resources, a very inefficient process.
Additionally, I tried increasing the replicas in the ControllerConfig to 5, both with and without leader election. With leader election, only 1 of the pods was doing any practical work; without it, all the pods competed and consumed the same total amount of resources as with just 1. In both cases, the resource use was the same.
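For reference, a minimal sketch of the replicas setting being described, assuming a hypothetical ControllerConfig name; as observed above, only the elected leader reconciles when leader election is enabled, so this does not spread the load:

```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: aws-scaling-test   # hypothetical name
spec:
  # Scales the provider Deployment horizontally; with leader election on,
  # only the leader pod does reconcile work, so this mostly adds idle pods.
  replicas: 5
```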
As requested, output of:
kubectl get composite
kubectl get managed
An optional side note, not critical to read: based on my own analysis, the Upbound AWS provider seems to do the following:
Using a Composition to create the resources had marginal benefits in terms of how quickly it was discovered that a subnet had been created, since there was a hard reference in status and patching, but otherwise made no difference whatsoever.
What I thought using a Composition would do is gather all the resources into the same TF plan (so all three subnets and associations) to promote efficiency, but it doesn't do that.
How can we reproduce it?
I've put the CompositeResourceDefinitions and a sample Helm template used to generate the 75 subnets in a public GitHub repository (simply increase the number of entries in the values file sample-values-format.yaml to create the desired number of resources):
https://github.com/mike-larson/upbound-provider-aws-issue
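For a rough feel of the resource shapes involved, a minimal sketch of one Subnet and its RouteTableAssociation, assuming typical ec2.aws.upbound.io/v1beta1 field names (check them against the provider's CRD documentation) rather than the exact contents of the linked repository:

```yaml
apiVersion: ec2.aws.upbound.io/v1beta1
kind: Subnet
metadata:
  name: example-subnet-1        # hypothetical name
spec:
  forProvider:
    region: eu-west-1           # assumed region
    availabilityZone: eu-west-1a
    cidrBlock: 10.0.1.0/24
    vpcIdSelector:
      matchLabels:
        example: network        # assumed label
---
apiVersion: ec2.aws.upbound.io/v1beta1
kind: RouteTableAssociation
metadata:
  name: example-subnet-1-rta    # hypothetical name
spec:
  forProvider:
    region: eu-west-1
    subnetIdRef:
      name: example-subnet-1
    routeTableIdSelector:
      matchLabels:
        example: network
```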
What environment did it happen in?
Kubernetes version (kubectl version): 1.22