
fix(controller): fix waiting for stable rs to be fully scaled before canary scale down #3899

Open

y-rabie wants to merge 1 commit into master from fix-stablers-scaledown

Conversation

@y-rabie (Contributor) commented Oct 16, 2024

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

As described in the code comment, when running at large scale with a cluster autoscaler that can disrupt nodes and evict pods, the canary RS stays scaled up for a while until the stable RS is fully scaled again. This makes sense if the controller itself scaled down the stable RS during the rollout (using dynamicStableScale), but it doesn't make sense if it didn't.
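For illustration only, here is a minimal Go sketch of the gating behavior this PR targets, under the assumption described above. The names (`canScaleDownCanary`, the boolean inputs) are hypothetical, not the actual argo-rollouts identifiers; the point is the condition: only block the canary scale-down on the stable RS's full availability when dynamicStableScale is enabled, i.e. when the controller itself shrank the stable RS.

```go
package main

import "fmt"

// canScaleDownCanary is a hypothetical, simplified stand-in for the
// controller's check; the real code lives in the rollout controller and
// uses different names and inputs.
func canScaleDownCanary(dynamicStableScale, stableRSFullyAvailable bool) bool {
	if dynamicStableScale {
		// The controller scaled the stable RS down during the rollout, so it
		// must wait for the stable RS to be fully available again before
		// removing canary pods.
		return stableRSFullyAvailable
	}
	// The controller never scaled the stable RS down; unavailability caused
	// by external events (e.g. cluster-autoscaler evictions) should not keep
	// the canary pods around.
	return true
}

func main() {
	fmt.Println(canScaleDownCanary(true, false))  // false: wait for stable RS
	fmt.Println(canScaleDownCanary(false, false)) // true: scale down immediately
}
```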

github-actions bot commented Oct 16, 2024

Published E2E Test Results

  4 files    4 suites   3h 8m 13s ⏱️
113 tests 104 ✅  7 💤 2 ❌
454 runs  424 ✅ 28 💤 2 ❌

For more details on these failures, see this check.

Results for commit 01eaf42.

♻️ This comment has been updated with latest results.

github-actions bot commented Oct 16, 2024

Published Unit Test Results

2 293 tests   2 293 ✅  2m 59s ⏱️
  128 suites      0 💤
    1 files        0 ❌

Results for commit 01eaf42.

♻️ This comment has been updated with latest results.

codecov bot commented Oct 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.70%. Comparing base (5f59344) to head (ec44f8d).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3899   +/-   ##
=======================================
  Coverage   82.69%   82.70%           
=======================================
  Files         163      163           
  Lines       22895    22902    +7     
=======================================
+ Hits        18934    18940    +6     
- Misses       3087     3088    +1     
  Partials      874      874           

☔ View full report in Codecov by Sentry.

@y-rabie force-pushed the fix-stablers-scaledown branch from 1b5655c to 958336c on October 17, 2024 08:59

@zachaller (Collaborator) commented:

Can we get a test for this? It would also help me understand the exact issue a bit better. It sounds like the controller just ends up taking a really long time to complete the rollout due to scaling events, etc.?

@zachaller modified the milestones: v1.9, v1.8 on Dec 4, 2024
@zachaller force-pushed the fix-stablers-scaledown branch from ec44f8d to 42cec92 on December 4, 2024 20:11
@y-rabie force-pushed the fix-stablers-scaledown branch from 42cec92 to 3fe25d3 on December 6, 2024 11:46

sonarqubecloud bot commented Dec 6, 2024

@y-rabie (Contributor, Author) commented Dec 6, 2024

@zachaller

Yep, exactly. As a principle, if the controller had nothing to do with scaling down the stable RS, why should it wait for the stable RS's full availability? To me it self-evidently shouldn't. [In my case, that means forcing the cluster autoscaler to consolidate on the current state (stable and canary pods), only for the controller to terminate the canary pods once it finishes, which changes that state and kicks off a new round of consolidation. A sort of wasted effort.]

When the controller does have something to do with it, however, I don't think there is a way around waiting. I thought of waiting for the stable RS's PDB (if one exists) to not be violated, instead of waiting for full availability. But I don't think that would be correct/expected behavior, especially since the scaling up/down of the stable/canary in that case is incremental. So I left that part out of the PR; if you have any thoughts on it, I'd love to hear them.

I'm not sure what kind of test to write for this bit though.
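For context on the PDB idea floated in the previous comment, here is a rough, purely illustrative Go sketch of what "wait for the stable RS's PDB to not be violated" could mean. The `pdbStatus` struct and `stablePDBNotViolated` function are stand-ins invented for this sketch, not the Kubernetes PodDisruptionBudget API type or anything in argo-rollouts.

```go
package main

import "fmt"

// pdbStatus is an illustrative stand-in for the two PodDisruptionBudget
// status fields relevant here; it is not the Kubernetes API type.
type pdbStatus struct {
	CurrentHealthy int32
	DesiredHealthy int32
}

// stablePDBNotViolated sketches the alternative discussed above: instead of
// requiring the stable RS to be fully available, only require that its PDB
// (if one exists) is not violated. Name and shape are hypothetical.
func stablePDBNotViolated(pdb *pdbStatus) bool {
	if pdb == nil {
		// No PDB on the stable workload: nothing to wait for.
		return true
	}
	return pdb.CurrentHealthy >= pdb.DesiredHealthy
}

func main() {
	fmt.Println(stablePDBNotViolated(&pdbStatus{CurrentHealthy: 8, DesiredHealthy: 10}))  // false
	fmt.Println(stablePDBNotViolated(&pdbStatus{CurrentHealthy: 10, DesiredHealthy: 10})) // true
	fmt.Println(stablePDBNotViolated(nil))                                                // true
}
```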

@zachaller self-assigned this Dec 10, 2024
@zachaller (Collaborator) commented:

@y-rabie I want to give this PR a good review; we'll still get this into a 1.8 release.

@zachaller force-pushed the fix-stablers-scaledown branch from 3fe25d3 to 01eaf42 on January 8, 2025 15:05

sonarqubecloud bot commented Jan 8, 2025
