
fix(controller): fix waiting for stable rs to be fully scaled before canary scale down #3899

Open

y-rabie wants to merge 1 commit into master from fix-stablers-scaledown

Conversation

@y-rabie (Contributor) commented Oct 16, 2024

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

As described in the code comment, when running at large scale with a cluster autoscaler that can disrupt nodes and evict pods, the canary RS stays scaled up for a while until the stable RS is fully scaled again. This makes sense if the controller itself scaled down the stable RS during the rollout (using dynamicStableScale), but it doesn't make sense if it didn't.
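For illustration only, here is a minimal Go sketch of the gating behavior this PR targets, under the assumption described above. The names (`canScaleDownCanary`, the boolean inputs) are hypothetical, not the actual argo-rollouts identifiers; the point is the condition: only block the canary scale-down on the stable RS's full availability when dynamicStableScale is enabled, i.e. when the controller itself shrank the stable RS.

```go
package main

import "fmt"

// canScaleDownCanary is a hypothetical, simplified stand-in for the
// controller's check; the real code lives in the rollout controller and
// uses different names and inputs.
func canScaleDownCanary(dynamicStableScale, stableRSFullyAvailable bool) bool {
	if dynamicStableScale {
		// The controller scaled the stable RS down during the rollout, so it
		// must wait for the stable RS to be fully available again before
		// removing canary pods.
		return stableRSFullyAvailable
	}
	// The controller never scaled the stable RS down; unavailability caused
	// by external events (e.g. cluster-autoscaler evictions) should not keep
	// the canary pods around.
	return true
}

func main() {
	fmt.Println(canScaleDownCanary(true, false))  // false: wait for stable RS
	fmt.Println(canScaleDownCanary(false, false)) // true: scale down immediately
}
```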

github-actions bot commented Oct 16, 2024

Published E2E Test Results

  4 files    4 suites   3h 8m 13s ⏱️
113 tests 104 ✅  7 💤 2 ❌
454 runs  424 ✅ 28 💤 2 ❌

For more details on these failures, see this check.

Results for commit 01eaf42.

♻️ This comment has been updated with latest results.

github-actions bot commented Oct 16, 2024

Published Unit Test Results

2 293 tests   2 293 ✅  2m 59s ⏱️
  128 suites      0 💤
    1 files        0 ❌

Results for commit 01eaf42.

♻️ This comment has been updated with latest results.

codecov bot commented Oct 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.70%. Comparing base (5f59344) to head (ec44f8d).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3899   +/-   ##
=======================================
  Coverage   82.69%   82.70%           
=======================================
  Files         163      163           
  Lines       22895    22902    +7     
=======================================
+ Hits        18934    18940    +6     
- Misses       3087     3088    +1     
  Partials      874      874           

☔ View full report in Codecov by Sentry.

@y-rabie force-pushed the fix-stablers-scaledown branch from 1b5655c to 958336c on October 17, 2024 08:59

@zachaller (Collaborator) commented:

Can we get a test for this? It would also help me understand the exact issue a bit better. It sounds like the controller just ends up taking a really long time to complete the rollout due to scaling events, etc.?

@zachaller modified the milestones: v1.9, v1.8 on Dec 4, 2024
@zachaller force-pushed the fix-stablers-scaledown branch from ec44f8d to 42cec92 on December 4, 2024 20:11
@y-rabie force-pushed the fix-stablers-scaledown branch from 42cec92 to 3fe25d3 on December 6, 2024 11:46

sonarqubecloud bot commented Dec 6, 2024

@y-rabie (Contributor, Author) commented Dec 6, 2024

@zachaller

Yep, exactly. As a principle, if the controller had nothing to do with scaling down the stable RS, why should it wait for the stable RS's full availability? To me it self-evidently shouldn't. [In my case, that means forcing the cluster autoscaler to consolidate on the current state (stable and canary pods), only for the controller to terminate the canary pods once it finishes, which changes that state and kicks off a new round of consolidation. A sort of wasted effort.]

When the controller does have something to do with it, however, I don't think there is a way around waiting. I thought of waiting for the stable RS's PDB (if one exists) to not be violated, instead of waiting for full availability. But I don't think that would be correct/expected behavior, especially since the scaling up/down of the stable/canary in that case is incremental. So I left that part out of the PR; if you have any thoughts on it, I'd love to hear them.

I'm not sure what kind of test to write for this bit though.
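For context on the PDB idea floated in the previous comment, here is a rough, purely illustrative Go sketch of what "wait for the stable RS's PDB to not be violated" could mean. The `pdbStatus` struct and `stablePDBNotViolated` function are stand-ins invented for this sketch, not the Kubernetes PodDisruptionBudget API type or anything in argo-rollouts.

```go
package main

import "fmt"

// pdbStatus is an illustrative stand-in for the two PodDisruptionBudget
// status fields relevant here; it is not the Kubernetes API type.
type pdbStatus struct {
	CurrentHealthy int32
	DesiredHealthy int32
}

// stablePDBNotViolated sketches the alternative discussed above: instead of
// requiring the stable RS to be fully available, only require that its PDB
// (if one exists) is not violated. Name and shape are hypothetical.
func stablePDBNotViolated(pdb *pdbStatus) bool {
	if pdb == nil {
		// No PDB on the stable workload: nothing to wait for.
		return true
	}
	return pdb.CurrentHealthy >= pdb.DesiredHealthy
}

func main() {
	fmt.Println(stablePDBNotViolated(&pdbStatus{CurrentHealthy: 8, DesiredHealthy: 10}))  // false
	fmt.Println(stablePDBNotViolated(&pdbStatus{CurrentHealthy: 10, DesiredHealthy: 10})) // true
	fmt.Println(stablePDBNotViolated(nil))                                                // true
}
```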

@zachaller self-assigned this Dec 10, 2024
@zachaller (Collaborator) commented:

@y-rabie I want to give this PR a good review; we'll still get this into a 1.8 release.

@zachaller force-pushed the fix-stablers-scaledown branch from 3fe25d3 to 01eaf42 on January 8, 2025 15:05

sonarqubecloud bot commented Jan 8, 2025
