fix: Attempt to repair disconnected/failed master nodes before failing over #1105
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #1105      +/-   ##
==========================================
+ Coverage   35.20%   44.34%   +9.14%
==========================================
  Files          19       20       +1
  Lines        3213     3412     +199
==========================================
+ Hits         1131     1513     +382
+ Misses       2015     1813     -202
- Partials       67       86      +19

☔ View full report in Codecov by Sentry.
@drivebyer do you mind having a look please?
Sure, I would add some end-to-end tests to improve this fix.
Oh, thanks for adding them! Was just getting around to it :) I've had a hell of a time battling flakes on the e2e tests.
Thanks for your help @drivebyer! Is there a planned release coming soon so I can pick up this change? Looks like the last release was in July of this year.
I’m not sure about the exact timing of the next release. If you’re in a hurry, you could build your own image using the Dockerfile from this link: https://github.com/OT-CONTAINER-KIT/redis-operator/blob/master/Dockerfile.
Ended up closing #1101, as I kept forgetting to sign commits and the git history was getting out of control from all the rebasing.
Description
Fixes #1100
As stated in the above issue, a cluster whose leaders become unhealthy as a result of being scaled to zero nodes can be recovered without issuing a failover (which leads to data loss).
The failed/disconnected nodes simply need to have their address updated with the IP of the new leader pods. CLUSTER MEET is able to map the specified address onto the existing host & port, meaning we don't need to wipe the master and start afresh. If this fails, we fall back to the failover; a minimal sketch of the flow is shown below.
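To make the ordering concrete, here is a hypothetical sketch of the idea, not the operator's actual code. It assumes the go-redis v9 client, and the function name repairOrFailover and the newLeaderIP/port parameters are illustrative only:

```go
// Hypothetical sketch of "repair before failover", assuming go-redis v9.
// Not the operator's real implementation.
package repair

import (
	"context"
	"strings"

	"github.com/redis/go-redis/v9"
)

// repairOrFailover first tries to re-point unhealthy entries at the new leader
// pod's IP with CLUSTER MEET; only if that repair attempt errors does it fall
// back to a failover (which can lose data).
func repairOrFailover(ctx context.Context, rdb *redis.Client, newLeaderIP, port string) error {
	nodes, err := rdb.ClusterNodes(ctx).Result()
	if err != nil {
		return err
	}
	for _, line := range strings.Split(strings.TrimSpace(nodes), "\n") {
		fields := strings.Fields(line)
		if len(fields) < 8 {
			continue // defensively skip short or malformed lines
		}
		// In CLUSTER NODES output, field 2 holds the flags and field 7 the link state.
		flags, linkState := fields[2], fields[7]
		if !strings.Contains(flags, "fail") && linkState != "disconnected" {
			continue
		}
		// CLUSTER MEET maps the given address onto the existing host & port,
		// so the master does not need to be wiped and re-created.
		if err := rdb.ClusterMeet(ctx, newLeaderIP, port).Err(); err != nil {
			// Best-effort repair failed: fall back to the pre-existing failover path.
			return rdb.ClusterFailover(ctx).Err()
		}
	}
	return nil
}
```

Note that CLUSTER FAILOVER has to be issued against a replica of the failed master; the sketch glosses over client targeting and only shows the ordering: attempt the CLUSTER MEET repair first, and fail over only if it errors.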
Type of change

This is a best-effort attempt.
Checklist
- If new strategy fails, failover still works as expected
- … and new strategy working as expected
- … corresponding logs
Additional Context
There's a small bit of refactoring too, around parsing of the CLUSTER NODES response, for a bit more safety in the helper functions nodeFailedOrDisconnected and nodeIsOfType.
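For illustration, here is a hedged sketch of what that kind of defensive parsing might look like; the signatures, the field-layout assumptions, and the role argument are guesses made for the example, not the PR's actual code:

```go
// Illustrative only: possible shapes of helpers like nodeFailedOrDisconnected
// and nodeIsOfType that parse a single line of a CLUSTER NODES response.
// The real implementations in the PR may differ.
package repair

import "strings"

// A CLUSTER NODES line looks like:
//   <id> <ip:port@cport> <flags> <master-id> <ping> <pong> <epoch> <link-state> [slots...]
// so the flags live in field 2 and the link state in field 7.

// nodeFailedOrDisconnected reports whether the node is flagged as failing
// (or possibly failing) or its cluster bus link is disconnected.
func nodeFailedOrDisconnected(nodeLine string) bool {
	fields := strings.Fields(nodeLine)
	if len(fields) < 8 {
		return false // too short to be a valid node line
	}
	for _, f := range strings.Split(fields[2], ",") {
		if f == "fail" || f == "fail?" {
			return true
		}
	}
	return fields[7] == "disconnected"
}

// nodeIsOfType reports whether the node's flags contain the given role,
// e.g. "master" or "slave".
func nodeIsOfType(nodeLine, role string) bool {
	fields := strings.Fields(nodeLine)
	if len(fields) < 3 {
		return false
	}
	for _, f := range strings.Split(fields[2], ",") {
		if f == role {
			return true
		}
	}
	return false
}
```

The bounds checks are the point of the "bit more safety": a short or malformed CLUSTER NODES line should simply return false rather than cause an index-out-of-range panic.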