Load tests with provider-gcp #255
Comments
First pass of provider GCP tests: Test scenario
Test results: Memory, CPU and TTR (time to readiness) were recorded for each run. The ps output consistently showed 2 processes, as expected. [Figure: memory graph in Prometheus]
Discussion: TTR is higher than with Azure, probably due to the resource used (storage bucket). I wasn't able to use container registry; it wouldn't show in the console for some reason. Interestingly, peak memory usage is significantly lower, with comparable CPU results.
Update to the above tests. The tests were run with the debug flag enabled in the |
New set of tests with the improved provider image. Test scenario
Significant improvements in CPU and memory utilization, but an interesting increase in the experiment duration. TTR remains the same. CPU metrics:
Here are the improvements in %:
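For reference, a percentage improvement of this kind can be computed as the relative reduction from the old measurement to the new one. The sketch below is only an illustration: the function name `improvement_pct` is hypothetical, and the numbers in the example call are placeholders, not results from these tests.

```python
def improvement_pct(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent.

    A positive value means the metric went down (e.g. less CPU or memory);
    a negative value means it went up (e.g. a longer experiment duration).
    """
    if before == 0:
        raise ValueError("baseline measurement must be non-zero")
    return (before - after) / before * 100.0


# Placeholder values only, NOT measurements from this issue.
print(f"{improvement_pct(200.0, 150.0):.1f}%")
```

With this convention, the reported increase in experiment duration would show up as a negative improvement.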
A new sizing guide has been published based on the findings from the performance tests: https://github.com/upbound/upjet/blob/main/docs/sizing-guide.md
We would like to perform some load tests to better understand the scaling characteristics of provider-gcp. The most recent experiments related to provider performance are here, but they were for parameter optimization and were not load tests. These tests can also help us give the community sizing & scaling guidance.
We may do a set of experiments (with the latest available version of provider-gcp) in which we gradually increase the number of MRs provisioned until we saturate the computing resources of upbound/provider-gcp. We use an EKS cluster with a worker instance type of m5.2xlarge (32 GB memory, 8 vCPUs), initially with the vanilla provider and the default parameters (especially the default `max-reconcile-rate` of 10, as suggested here), so that we can better relate our results to those of the previous experiments, and also because the current default provider parameters were chosen using the results of those experiments.

We can also make use of the existing tooling from here & here to conduct these tests. We should collect & report at least the following for each experiment:
- `Ready=True, Synced=True` state in 10 min: During an interval of 10 min, how many of the MRs could acquire these conditions and how many failed to do so?
- `helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n prometheus --set namespaceOverride=prometheus --set grafana.namespaceOverride=prometheus --set kube-state-metrics.namespaceOverride=prometheus --set prometheus-node-exporter.namespaceOverride=prometheus --create-namespace` from the `prometheus-community` Helm repository. We may include the Grafana dashboard screenshots like here.
- `kubectl get managed -o yaml` output at the end of the experiment.
- `go run github.com/upbound/uptest/cmd/ttr@fix-69` output (related to the above item).
- `ps -o pid,ppid,etime,comm,args` output from the provider container. We can do this at the end of each experiment run or, better, we can have reporting during the course of the experiment with something like `while true; do date; k exec -it <provider pod> -- ps -o pid,ppid,etime,comm,args; done` and log the output to a file. You can refer to our conversation with @mmclane here for more context on why we do this.

As long as we have not saturated the compute resources of the provider, we can iterate with a new experiment with more MRs, in increments of 5 or 10. I think we can start with 30 initially (let's start with something that has a 100% success rate, i.e., all provisioned MRs become ready in the allocated time of 10 min).
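The success-rate item above could be tallied from the managed resource dump rather than by hand. The following is a minimal sketch, assuming the input is the parsed output of `kubectl get managed -o json` (the JSON form of the YAML dump mentioned above) and that each MR reports its conditions under `status.conditions` with `type` and `status` fields, as Crossplane managed resources do. The function names and the sample data are hypothetical.

```python
from typing import Any


def condition_is_true(resource: dict[str, Any], condition_type: str) -> bool:
    """Return True if the resource has the given condition with status "True"."""
    conditions = (resource.get("status") or {}).get("conditions") or []
    return any(
        c.get("type") == condition_type and c.get("status") == "True"
        for c in conditions
    )


def success_rate(managed_list: dict[str, Any]) -> tuple[int, int, float]:
    """Count MRs that are both Ready=True and Synced=True.

    Returns (succeeded, failed, rate), where rate is in the range [0, 1].
    """
    items = managed_list.get("items") or []
    succeeded = sum(
        1
        for mr in items
        if condition_is_true(mr, "Ready") and condition_is_true(mr, "Synced")
    )
    total = len(items)
    rate = succeeded / total if total else 0.0
    return succeeded, total - succeeded, rate


# Hypothetical sample shaped like `kubectl get managed -o json` output.
sample = {
    "items": [
        {"status": {"conditions": [
            {"type": "Ready", "status": "True"},
            {"type": "Synced", "status": "True"},
        ]}},
        {"status": {"conditions": [
            {"type": "Ready", "status": "False"},
            {"type": "Synced", "status": "True"},
        ]}},
    ]
}
ok, failed, rate = success_rate(sample)
print(f"ready+synced={ok} failed={failed} rate={rate:.0%}")
```

In a real run, the dictionary would come from `json.load` on the saved `kubectl get managed -o json` output taken at the 10 min mark, so the count reflects the allocated time window.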