Load tests with provider-gcp #255

Closed
ulucinar opened this issue Mar 14, 2023 · 6 comments

@ulucinar
Collaborator

We would like to run load tests to better understand the scaling characteristics of provider-gcp. The most recent experiments on provider performance are here, but they were aimed at parameter optimization and were not load tests. These tests can also help us give the community sizing and scaling guidance.

We may do a set of experiments (with the latest available version of provider-gcp) in which we gradually increase the number of MRs provisioned until we saturate the computing resources of upbound/provider-gcp. Initially, we use an EKS cluster with a worker instance type of m5.2xlarge (32 GB memory, 8 vCPUs), the vanilla provider, and the default parameters (especially the default max-reconcile-rate of 10, as suggested here). This lets us relate our results to those of the previous experiments, and the current default provider parameters were themselves chosen using the results of those experiments.

We can also make use of the existing tooling from here & here to conduct these tests. We should collect & report at least the following for each experiment:

  • The types and number of MRs provisioned during the test
  • Success rate for Ready=True, Synced=True state in 10 min: During an interval of 10 min, how many of the MRs could acquire these conditions and how many failed to do so?
  • Using the available Prometheus metrics from the provider, what was the peak & avg. memory/CPU utilization? You can install the Prometheus and Grafana stack using something like: helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n prometheus --set namespaceOverride=prometheus --set grafana.namespaceOverride=prometheus --set kube-state-metrics.namespaceOverride=prometheus --set prometheus-node-exporter.namespaceOverride=prometheus --create-namespace from the prometheus-community Helm repository. We may include the Grafana dashboard screenshots like here.
  • kubectl get managed -o yaml output at the end of the experiment.
  • Time-to-readiness metrics as defined here. Histograms like we have there would be great but we can also derive them later.
  • go run github.com/upbound/uptest/cmd/ttr@fix-69 output (related with the above item)
  • ps -o pid,ppid,etime,comm,args output from the provider container. We can collect this at the end of each experiment run or, better, report it during the course of the experiment with something like: while true; do date; k exec -it <provider pod> -- ps -o pid,ppid,etime,comm,args; done and log the output to a file. You can refer to our conversation with @mmclane here for more context on why we do this.

As long as we have not saturated the compute resources of the provider, we can iterate with a new experiment with more MRs, in increments of 5 or 10. I think we can initially start with 30. Let's start with a count that gives a 100% success rate, i.e., all provisioned MRs become ready within the allocated 10 min.
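The success-rate item above could be computed from the managed resource dump with a small script. This is a minimal sketch, assuming the output is collected as JSON (kubectl get managed -o json rather than -o yaml) and that each resource reports its Ready and Synced conditions under status.conditions. The function names and the sample data are hypothetical and are not part of the existing tooling.

```python
import json


def has_condition(resource, cond_type):
    """Return True if the resource reports cond_type with status "True"."""
    conditions = resource.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == cond_type and c.get("status") == "True"
        for c in conditions
    )


def success_rate(managed_list):
    """Fraction of listed resources that are both Ready=True and Synced=True."""
    items = managed_list.get("items", [])
    if not items:
        return 0.0
    ok = sum(
        1 for r in items
        if has_condition(r, "Ready") and has_condition(r, "Synced")
    )
    return ok / len(items)


# Hypothetical sample shaped like `kubectl get managed -o json` output.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "bucket-1"},
     "status": {"conditions": [
       {"type": "Ready", "status": "True"},
       {"type": "Synced", "status": "True"}]}},
    {"metadata": {"name": "bucket-2"},
     "status": {"conditions": [
       {"type": "Ready", "status": "False"},
       {"type": "Synced", "status": "True"}]}}
  ]
}
""")

print(f"success rate: {success_rate(sample):.0%}")
```

Running this against a dump taken at the 10 min mark would give the share of MRs that acquired both conditions in the allotted time.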

@Piotr1215

First pass of provider GCP tests:

Test scenario

  • Bursting 1, 10, 50, and 100 storage buckets
  • Provider GCP v0.29
  • EKS with 1 m5.2xlarge node (32 GB memory, 8 vCPUs)
  • Kubernetes version 1.25

Test results

Memory, CPU, and TTR (time to readiness) were recorded for each run. The ps output consistently showed 2 processes, as expected.
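For readers who want to derive average and peak TTR themselves, here is a rough sketch. It assumes TTR is measured as the time between a resource's metadata.creationTimestamp and the lastTransitionTime of its Ready=True condition; the precise definition used in these runs is the one linked from the issue description, so treat this as an approximation. The helper names and the sample timestamps are hypothetical.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse a Kubernetes-style UTC timestamp such as 2023-03-17T10:00:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def ttr_seconds(resource):
    """Seconds from creation to the Ready=True transition, or None if not ready."""
    created = parse_ts(resource["metadata"]["creationTimestamp"])
    for cond in resource.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready" and cond.get("status") == "True":
            ready_at = parse_ts(cond["lastTransitionTime"])
            return (ready_at - created).total_seconds()
    return None


def summarize(resources):
    """Average and peak TTR over the resources that became ready."""
    values = [t for t in (ttr_seconds(r) for r in resources) if t is not None]
    if not values:
        return None
    return {"average": sum(values) / len(values), "peak": max(values)}


# Hypothetical sample: two resources created at the same time.
samples = [
    {"metadata": {"creationTimestamp": "2023-03-17T10:00:00Z"},
     "status": {"conditions": [
         {"type": "Ready", "status": "True",
          "lastTransitionTime": "2023-03-17T10:01:05Z"}]}},
    {"metadata": {"creationTimestamp": "2023-03-17T10:00:00Z"},
     "status": {"conditions": [
         {"type": "Ready", "status": "True",
          "lastTransitionTime": "2023-03-17T10:01:07Z"}]}},
]

print(summarize(samples))
```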

Image

Image

Memory graph in Prometheus

Image

Discussion

TTR is higher than with Azure, probably due to the resource used (storage bucket). I wasn't able to use a container registry; it wouldn't show up in the console for some reason. Interestingly, peak memory usage is significantly lower, with comparable CPU results.

@Piotr1215

Update to the above tests: they were run with the debug flag enabled in the ControllerConfig, which affects CPU utilization. To keep the results streamlined, the tests will be done without the debug setting. Below are the results without the debug setting (first row) and with it (second row).

Image

@Piotr1215

Piotr1215 commented Mar 17, 2023

Here are more results, pushing the GCP provider to 500 MRs on the same setup. It is interesting how little memory was consumed; CPU was definitely the bottleneck.

Image

Image

@Piotr1215

Piotr1215 commented Mar 22, 2023

New set of tests with the improved provider image ulucinar/provider-gcp-amd64:v0.29.0-e45875a and the same test conditions

Test scenario

  • Bursting 1, 10, 50, 100, and 500 storage buckets
  • Provider GCP v0.29, modified image ulucinar/provider-gcp-amd64:v0.29.0-e45875a
  • EKS with 1 m5.2xlarge node (32 GB memory, 8 vCPUs)
  • Kubernetes version 1.25

There are significant improvements in CPU and memory utilization, but also an interesting increase in the experiment duration. TTR remains the same.

Image

CPU metrics:

Image

| Provider Version | Runs | Experiment Duration | Average Time to Readiness in seconds | Peak Time to Readiness in seconds | Average Memory | Peak Memory | Average CPU % | Peak CPU % |
|---|---|---|---|---|---|---|---|---|
| v0.29.0-e45875a | 1 | 122.31 | 65 | 65 | 157.10 MB | 185.78 MB | 1.58 | 1.81 |
| v0.29.0-e45875a | 10 | 153.21 | 66 | 67 | 306.52 MB | 573.60 MB | 4.06 | 6.08 |
| v0.29.0-e45875a | 50 | 387.97 | 322.76 | 330 | 616.14 MB | 1.10 GB | 6.85 | 19.65 |
| v0.29.0-e45875a | 100 | 957.32 | 686.03 | 850 | 515.08 MB | 932.49 MB | 9.53 | 37.52 |
| v0.29.0-e45875a | 500 | 4468.2 | 3375.77 | 3993 | 597.22 MB | 1.21 GB | 16.72 | 88.38 |
| v0.29.0 | 1 | 124.83 | 67 | 67 | 122.80 MB | 171.25 MB | 2.62 | 3.05 |
| v0.29.0 | 10 | 102.72 | 72.4 | 74 | 443.88 MB | 802.28 MB | 4.77 | 10.66 |
| v0.29.0 | 50 | 417.79 | 337.32 | 350 | 728.35 MB | 1.04 GB | 15.06 | 40.3 |
| v0.29.0 | 100 | 825.02 | 661.47 | 689 | 757.08 MB | 1.09 GB | 24.36 | 71.11 |
| v0.29.0 | 500 | 3955.96 | 3240.79 | 3322 | 818.38 MB | 1.25 GB | 25.69 | 98.34 |

@Piotr1215

Here are the improvements, in percent:

Peak CPU: 37.00%
Average CPU: 60.69%
Peak Memory: 8.13%
Average Memory: 26.87%
Peak Time to Readiness in seconds: 16.37%
Average Time to Readiness in seconds: 3.07%
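The thread does not say how these percentages were aggregated across runs, so the following is only a sketch of one common way to express an improvement: the relative reduction of a metric against its baseline. It is applied here to the 500-bucket CPU figures from the results table and is not expected to reproduce the numbers above. The function name is hypothetical.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - improved) / baseline * 100.0


# (baseline v0.29.0, modified v0.29.0-e45875a) for the 500-bucket run,
# taken from the results table in this thread.
run_500 = {
    "Peak CPU %": (98.34, 88.38),
    "Average CPU %": (25.69, 16.72),
}

for metric, (baseline, improved) in run_500.items():
    print(f"{metric}: {relative_reduction(baseline, improved):.2f}% lower")
```

A negative result from the same function would indicate a regression, which is how the longer experiment duration in the modified image would show up.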

@Piotr1215

A new sizing guide has been published based on the findings from the performance tests: https://github.com/upbound/upjet/blob/main/docs/sizing-guide.md
