
Add Vegeta rates / targets to SLA in performance tests #14429

Merged
merged 10 commits into knative:main
Jan 15, 2024

Conversation

xiangpingjiang
Contributor

Fixes #14403

Proposed Changes

Release Note

Add Vegeta rates / targets to SLA in performance tests

@knative-prow

knative-prow bot commented Sep 25, 2023

Hi @xiangpingjiang. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow knative-prow bot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/test-and-release It flags unit/e2e/conformance/perf test issues for product features labels Sep 25, 2023
@skonto skonto added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 25, 2023
@dprotaso
Member

/ok-to-test

@knative-prow knative-prow bot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 25, 2023
@dprotaso
Member

/test performance-tests

@codecov

codecov bot commented Sep 25, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (81149da) 86.05% compared to head (59049d2) 86.02%.
Report is 23 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14429      +/-   ##
==========================================
- Coverage   86.05%   86.02%   -0.04%     
==========================================
  Files         197      197              
  Lines       14937    14945       +8     
==========================================
+ Hits        12854    12856       +2     
- Misses       1774     1778       +4     
- Partials      309      311       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@skonto
Contributor

skonto commented Sep 25, 2023

The SLAs do not seem to be respected (some of the failing SLAs are unrelated to this PR as well, see below).
We need to control the variance somehow, e.g. evenly distribute the pods, and make sure these tests pass.

2023/09/25 15:00:02 Cleaning up all created services
2023/09/25 15:00:07 Shutting down InfluxReporter
2023/09/25 15:00:07 SLA 1 failed. Errors occurred: 1
job.batch "real-traffic-test" deleted

503 Service Unavailable
2023/09/25 16:05:48 Shutting down InfluxReporter
2023/09/25 16:05:48 SLA 1 failed. P95 latency is not in 100-110ms time range: 177.512182ms
job.batch "rollout-probe-queue-direct" deleted

2023/09/25 15:05:18 SLA 1 passed. P95 latency is in 100000000-105000000ms time range
2023/09/25 15:05:18 Shutting down InfluxReporter
2023/09/25 15:05:18 SLA 2 failed. vegeta rate is 0.001
job.batch "dataplane-probe-deployment" deleted

2023/09/25 15:08:28 SLA 1 passed. P95 latency is in 100000000-110000000ms time range
2023/09/25 15:08:28 Shutting down InfluxReporter
2023/09/25 15:08:28 SLA 2 failed. vegeta rate is 0.001
job.batch "dataplane-probe-activator" deleted
service.serving.knative.dev "activator" deleted

2023/09/25 15:11:37 SLA 1 passed. P95 latency is in 100000000-110000000ms time range
2023/09/25 15:11:37 Shutting down InfluxReporter
Status Codes  [code:count]                      200:180000  
Error Set:
2023/09/25 15:11:37 SLA 2 failed. vegeta rate is 0.001
job.batch "dataplane-probe-queue" deleted

Error Set:
2023/09/25 15:26:47 SLA 1 passed. Amount of ready services is within the expected range. Is: 179.000000, expected: 174.000000-180.000000
2023/09/25 15:26:47 SLA 2 passed. P95 latency is in 0-25s time range
2023/09/25 15:26:50 SLA 3 failed. vegeta rate is 1253

Error Set:
2023/09/25 15:27:07 SLA 1 failed. P95 latency is not in 0-15000000ms time range: 34.532628ms
job.batch "scale-from-zero-1" deleted
service.serving.knative.dev "perftest-scalefromzero-00-bxwarrca" deleted

Error Set:
2023/09/25 15:27:26 Shutting down InfluxReporter
2023/09/25 15:27:26 SLA 1 failed. P95 latency is not in 0-15000000ms time range: 40.358438ms
job.batch "scale-from-zero-5" deleted

2023/09/25 15:40:46 SLA 1 passed. P95 latency is in 100-115ms time range
2023/09/25 15:40:46 SLA 2 passed. Max latency is below 10s
2023/09/25 15:40:46 SLA 3 passed. No errors occurred
2023/09/25 15:40:46 Shutting down InfluxReporter
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:719995  
Error Set:
2023/09/25 15:40:46 SLA 4 failed. total requests is 719995
job.batch "load-test-zero" deleted
service.serving.knative.dev "load-test-zero" deleted


Status Codes  [code:count]                      200:719997  
Error Set:
2023/09/25 15:48:00 SLA 1 passed. P95 latency is in 100-115ms time range
2023/09/25 15:48:00 SLA 2 passed. Max latency is below 10s
2023/09/25 15:48:00 SLA 3 passed. No errors occurred
2023/09/25 15:48:00 Shutting down InfluxReporter
2023/09/25 15:48:00 SLA 4 failed. total requests is 719997
job.batch "load-test-always" deleted

2023/09/25 15:55:13 SLA 1 passed. P95 latency is in 100-115ms time range
2023/09/25 15:55:13 SLA 2 passed. Max latency is below 10s
2023/09/25 15:55:13 SLA 3 passed. No errors occurred
2023/09/25 15:55:13 Shutting down InfluxReporter
2023/09/25 15:55:13 SLA 4 failed. total requests is 719998
job.batch "load-test-200" deleted

(Client.Timeout exceeded while awaiting headers)
Get "http://activator-with-cc.default.svc.cluster.local?sleep=100": dial tcp 0.0.0.0:0->10.88.7.84:80: connect: connection refused (Client.Timeout exceeded while awaiting headers)
2023/09/25 15:59:03 SLA 1 failed. P95 latency is not in 100-110ms time range: 1m6.797546265s
job.batch "rollout-probe-activator-direct" deleted
service.serving.knative.dev "activator-with-cc" deleted
=============================================

2023/09/25 16:05:48 Shutting down InfluxReporter
2023/09/25 16:05:48 SLA 1 failed. P95 latency is not in 100-110ms time range: 177.512182ms
job.batch "rollout-probe-queue-direct" deleted
service.serving.knative.dev "queue-proxy-with-cc" deleted

@xiangpingjiang xiangpingjiang marked this pull request as draft September 27, 2023 13:22
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 27, 2023
@xiangpingjiang xiangpingjiang changed the title from "Add Vegeta rates / targets to SLA in performance tests" to "[WIP] Add Vegeta rates / targets to SLA in performance tests" Oct 9, 2023
@dprotaso
Member

/ok-to-test

@dprotaso
Member

/test performance-tests

cc @ReToCode

@ReToCode ReToCode left a comment
Member

Just minor things, other than that it looks good.


knative-prow bot commented Dec 13, 2023

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: xiangpingjiang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 13, 2023
@xiangpingjiang
Contributor Author

@xiangpingjiang gentle ping for rebasing.

@skonto done

@skonto
Contributor

skonto commented Dec 13, 2023

/test performance-tests

@ReToCode
Member

Maybe we are a bit too restrictive in some of the tests:

2023/12/13 14:11:45 SLA 3 failed. total requests is 97, expected total requests is 100

2023/12/13 14:21:48 SLA 4 failed. total requests is 719998, expected total requests is 720000

2023/12/13 14:46:40 SLA 2 failed. vegeta rate is 1000.004559, expected Rate is 1000.000000

2023/12/13 14:40:03 SLA 2 failed. vegeta rate is 1000.001162, expected Rate is 1000.000000

@skonto
Contributor

skonto commented Dec 14, 2023

Looking at the errors I see:

  1. "scale-from-zero-100"

2023/12/13 14:11:45 SLA 3 failed. total requests is 97, expected total requests is 100

This one should fail because I suspect that 3 out of 100 services were not ready, given that we manually add the data point. I think this one needs further debugging, as it seems we got stuck here:

_, err := pkgTest.WaitForEndpointStateWithTimeout(

as no errors are returned:

2023/12/13 14:11:45 Shutting down InfluxReporter
Requests [total, rate, throughput] 97, 97.00, 0.00
Duration [total, attack, wait] 20.097s, 0s, 20.097s
Latencies [min, mean, 50, 90, 95, 99, max] 95.181ms, 7.725s, 6.981s, 17.895s, 19.132s, 20.04s, 20.097s
Bytes In [total, mean] 0, 0.00
Bytes Out [total, mean] 0, 0.00
Success [ratio] 0.00%
Status Codes [code:count] 0:97
Error Set:
2023/12/13 14:11:45 SLA 3 failed. total requests is 97, expected total requests is 100

  1. "load-test-zero"

2023/12/13 14:21:48 SLA 4 failed. total requests is 719998, expected total requests is 720000

Here I suspect that instead of:

	for i := 0; i < len(pacers); i++ {
		expectedRequests = expectedRequests + uint64(pacers[i].Rate(time.Second)*durations[i].Seconds())
	}

it may help to do:

	var expectedSum float64
	for i := 0; i < len(pacers); i++ {
		expectedSum = expectedSum + pacers[i].Rate(time.Second)*durations[i].Seconds()
	}
	expectedRequests = uint64(expectedSum)
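
For illustration, here is a minimal standalone sketch (using contrived rate values, not the test's real pacers) of why converting each term to uint64 before summing can undercount, whereas summing in float64 and converting once truncates only a single time:

package main

import "fmt"

func main() {
	// Contrived per-segment rates and durations; the real code derives these
	// from pacers[i].Rate(time.Second) and durations[i].Seconds().
	rates := []float64{0.29, 0.29} // requests per second
	secs := []float64{100, 100}    // seconds per segment

	// Per-term conversion: each product is truncated toward zero before summing.
	var perTerm uint64
	for i := range rates {
		// 0.29*100 evaluates to 28.999999999999996 in float64,
		// so the uint64 conversion truncates it to 28.
		perTerm += uint64(rates[i] * secs[i])
	}

	// Sum first in float64, then convert once at the end.
	var sum float64
	for i := range rates {
		sum += rates[i] * secs[i]
	}

	// perTerm == 56, uint64(sum) == 57: there is still a single truncation at
	// the end, but no longer one lost request per segment.
	fmt.Println(perTerm, uint64(sum))
}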

rollout-probe-queue-direct

2023/12/13 14:46:40 SLA 2 failed. vegeta rate is 1000.004559, expected Rate is 1000.000000

rollout-probe-activator-direct

2023/12/13 14:40:03 SLA 2 failed. vegeta rate is 1000.001162, expected Rate is 1000.000000

We have more failures like the above:

rollout-probe-activator-direct-lin

2023/12/13 14:43:18 SLA 2 failed. vegeta rate is 1000.000960, expected Rate is 1000.000000

Regarding the rate comparison, for the constant pacers we have:

// Rate returns a ConstantPacer's instantaneous hit rate (i.e. requests per second)
// at the given elapsed duration of an attack. Since it's constant, the return
// value is independent of the given elapsed duration.
func (cp ConstantPacer) Rate(elapsed time.Duration) float64 {
	return cp.hitsPerNs() * 1e9
}

// hitsPerNs returns the attack rate this ConstantPacer represents, in
// fractional hits per nanosecond.
func (cp ConstantPacer) hitsPerNs() float64 {
	return float64(cp.Freq) / float64(cp.Per)
}

I think just rounding the observed rate here is enough, e.g. 1000.001162 -> 1000.
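
A sketch of what that rounding check could look like (the function name and signature are illustrative only, not the actual SLA helpers in this repo):

package main

import (
	"fmt"
	"math"
)

// checkRateSLA compares the observed vegeta rate against the expected rate
// after rounding both to the nearest whole request per second, so a value
// like 1000.001162 is treated as 1000. Illustrative helper, not the repo's API.
func checkRateSLA(observed, expected float64) error {
	if math.Round(observed) != math.Round(expected) {
		return fmt.Errorf("vegeta rate is %f, expected rate is %f", observed, expected)
	}
	return nil
}

func main() {
	fmt.Println(checkRateSLA(1000.001162, 1000)) // <nil>
	fmt.Println(checkRateSLA(997.3, 1000))       // error
}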

@ReToCode
Member

ReToCode commented Jan 3, 2024

/test performance-tests

@skonto
Contributor

skonto commented Jan 9, 2024

It seems we are still getting the rounding errors. Re-running to make sure the results reflect the latest changes:
/test performance-tests

@dprotaso
Member

dprotaso commented Jan 9, 2024

/test performance-tests

@ReToCode
Member

Yep, we still have issues:

job.batch/load-test-zero created
pod/load-test-zero-b99mf condition met
{"level":"info","ts":1704825797.3743527,"logger":"fallback","caller":"injection/injection.go:63","msg":"Starting informers..."}
2024/01/09 18:43:17 Starting the load test.
2024/01/09 18:44:49 All pods are done (scaled to zero) or terminating after 1m32.001094628s
Requests      [total, rate, throughput]         719998, 2000.00, 1999.42
Duration      [total, attack, wait]             6m0s, 6m0s, 103.497ms
Latencies     [min, mean, 50, 90, 95, 99, max]  101.447ms, 104.7ms, 102.927ms, 103.859ms, 104.299ms, 110.409ms, 1.68s
Bytes In      [total, mean]                     17200124, 23.89
Bytes Out     [total, mean]                     0, 0.00
2024/01/09 18:50:49 SLA 1 passed. P95 latency is in 100-115ms time range
2024/01/09 18:50:49 SLA 2 passed. Max latency is below 10s
2024/01/09 18:50:49 SLA 3 passed. No errors occurred
2024/01/09 18:50:49 Shutting down InfluxReporter
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:719998  
Error Set:
2024/01/09 18:50:49 SLA 4 failed. total requests is 719998, expected total requests is 720000

@xiangpingjiang can you add a threshold in that test?

@xiangpingjiang
Contributor Author

Hello @ReToCode,
Do you mean adding a range like [expectedRequests-5, expectedRequests+5]?

@ReToCode
Member

Yeah, or maybe in %, like we accept a deviation of 0.1% or something like this.
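
For example, a relative-tolerance check could look roughly like this (withinTolerance is a hypothetical helper named only for illustration, not the repo's actual SLA code):

package main

import (
	"fmt"
	"math"
)

// withinTolerance reports whether the observed request count is within the
// given relative tolerance of the expected count; a tolerance of 0.001
// allows a 0.1% deviation.
func withinTolerance(observed, expected uint64, tolerance float64) bool {
	diff := math.Abs(float64(observed) - float64(expected))
	return diff <= float64(expected)*tolerance
}

func main() {
	// 719998 vs 720000 is roughly a 0.0003% deviation, well inside a 0.1% budget.
	fmt.Println(withinTolerance(719998, 720000, 0.001)) // true
	fmt.Println(withinTolerance(700000, 720000, 0.001)) // false
}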

@xiangpingjiang
Contributor Author

/test performance-tests

@xiangpingjiang
Contributor Author

/test performance-tests

@dprotaso
Member

@ReToCode I still see SLA failures in the performance test, but fixing them seems out of scope for this PR (unless the SLA is being computed incorrectly).

Another thing that would be useful is failing the performance test job so that SLA failures are surfaced.

@ReToCode
Member

@ReToCode I still see SLA failures in the performance test, but fixing them seems out of scope for this PR (unless the SLA is being computed incorrectly).

The SLAs were never consistently stable to begin with, but that is a separate topic we need to look into. So let's get this in, as it's better than before.

Another thing that would be useful is failing the performance test job so that SLA failures are surfaced.

Yeah, but probably only after we make them stable; otherwise we just end up with red builds and/or partial test results in InfluxDB.

/lgtm

Thanks @xiangpingjiang for doing this!

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Jan 15, 2024
@knative-prow knative-prow bot merged commit 8162fe2 into knative:main Jan 15, 2024
56 checks passed
@skonto
Contributor

skonto commented Jan 15, 2024

@ReToCode @dprotaso Shouldn't rounding errors or inaccurate conditions be fixed in this PR?
We are adding more failure points, and I am not sure how that helps. I am also not sure how we can distinguish a rate failure caused by some other issue from one that is simply due to this inaccuracy.

For example for scale-from-zero-25:

2024/01/12 16:02:10 SLA 3 failed. total requests is 24, expected total requests is 25

@ReToCode
Member

@ReToCode @dprotaso Shouldn't rounding errors or inaccurate conditions be fixed in this PR?

I partially agree. The SLAs were never really stable to begin with (not even before my refactoring PR). We should look into that topic separately, aside from this specific PR (or phrased differently, I would not revert it for that). I created this issue: #14793 to follow up on this.
