
fix: Attempt to repair disconnected/failed master nodes before failing over #1105

Merged: 9 commits, Oct 18, 2024

Conversation

@Nashluffy (Contributor) commented Oct 16, 2024

I ended up closing #1101 because I kept forgetting to sign commits, and the git history was getting out of control from all the rebasing.

Description

Fixes #1100 (rediscluster failing to recover when scaling up from zero nodes)

As stated in the above issue, a cluster left with unhealthy leaders after being scaled to zero nodes can be recovered without issuing a failover (which leads to data loss).

The failed/disconnected nodes simply need their address updated to the IP of the new leader pods. CLUSTER MEET maps the specified address onto the existing host & port, meaning we don't need to wipe the master and start afresh. If this fails, we fall back to the failover.
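
For illustration, here's a minimal sketch of that repair step using go-redis v9; the function name, client wiring, and error handling are my own assumptions, not the PR's actual code:

```go
package repair

import (
	"context"
	"fmt"
	"net"

	"github.com/redis/go-redis/v9"
)

// repairDisconnectedMaster asks a healthy cluster member to CLUSTER MEET the
// failed master's new pod IP. Because the node ID is unchanged, Redis updates
// the existing entry's address rather than registering a brand-new node, so
// slot ownership and data survive.
func repairDisconnectedMaster(ctx context.Context, healthy *redis.Client, newAddr string) error {
	host, port, err := net.SplitHostPort(newAddr)
	if err != nil {
		return fmt.Errorf("split %q: %w", newAddr, err)
	}
	if err := healthy.ClusterMeet(ctx, host, port).Err(); err != nil {
		return fmt.Errorf("cluster meet %s %s: %w", host, port, err)
	}
	return nil
}
```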

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

This repair is a best-effort attempt.

Checklist

  • Tests have been added/modified and all tests pass.
  • Functionality/bugs have been confirmed to be unchanged or fixed.
  • I have performed a self-review of my own code.
  • Documentation has been updated or added where necessary.

If the new strategy fails, failover still works as expected:

{"level":"info","ts":"2024-10-13T19:26:29Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:31Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:31Z","logger":"controllers.RedisCluster","msg":"healthy leader count does not match desired; attempting to repair disconnected masters","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:41Z","logger":"controllers.RedisCluster","msg":"unhealthy nodes exist after attempting to repair disconnected masters; starting failover","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:52Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:53Z","logger":"controllers.RedisCluster","msg":"Creating redis cluster by executing cluster creation commands","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:26:53Z","logger":"controllers.RedisCluster","msg":"Not all leader are part of the cluster...","Request.Namespace":"default","Request.Name":"redis-cluster","Leaders.Count":1,"Instance.Size":3}
{"level":"info","ts":"2024-10-13T19:27:58Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T19:27:59Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}

And the new strategy working as expected:

# existing data
nash:~/code/redis-operator$ k exec redis-cluster-leader-2 -- redis-cli -c get k1
v1

# scale down statefulset & operator
nash:~/code/redis-operator$ k scale -n redis-operator-system deploy -l control-plane=redis-operator --replicas=0
deployment.apps/redis-operator-redis-operator scaled
nash:~/code/redis-operator$ k scale sts redis-cluster-leader --replicas=0
statefulset.apps/redis-cluster-leader scaled

# scale back up
nash:~/code/redis-operator$ k scale sts redis-cluster-leader --replicas=3
statefulset.apps/redis-cluster-leader scaled
nash:~/code/redis-operator$ k scale -n redis-operator-system deploy -l control-plane=redis-operator --replicas=1
deployment.apps/redis-operator-redis-operator scaled

# observe data persisted
nash:~/code/redis-operator$ k exec redis-cluster-leader-2 -- redis-cli -c get k1
v1

Corresponding logs:

{"level":"info","ts":"2024-10-13T20:04:11Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:14Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:14Z","logger":"controllers.RedisCluster","msg":"healthy leader count does not match desired; attempting to repair disconnected masters","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:19Z","logger":"controllers.RedisCluster","msg":"repairing unhealthy masters successful, no unhealthy masters left","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:49Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"default","Request.Name":"redis-cluster"}
{"level":"info","ts":"2024-10-13T20:04:50Z","logger":"controllers.RedisCluster","msg":"Number of Redis nodes match desired","Request.Namespace":"default","Request.Name":"redis-cluster"}

Additional Context

There's a small bit of refactoring too:

  • define a type for the CLUSTER NODES response, for a bit more safety in helper functions
  • extract some common functionality that operates on that type, i.e. nodeFailedOrDisconnected and nodeIsOfType (see the sketch after this list)
  • rename some functions to better represent what they do
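
As a rough sketch of what such a typed CLUSTER NODES view and the two helpers could look like (field and struct names here are my own, and may differ from what was merged):

```go
package k8sutils

import "strings"

// clusterNode is an illustrative typed view of one CLUSTER NODES line:
//   <id> <ip:port@cport> <flags> <master> <ping> <pong> <epoch> <link-state> <slots...>
type clusterNode struct {
	ID        string
	Addr      string
	Flags     string // comma-separated, e.g. "master,fail?"
	LinkState string // "connected" or "disconnected"
}

// parseClusterNode splits one raw CLUSTER NODES line into the typed view;
// the boolean is false for lines that are too short to be valid.
func parseClusterNode(line string) (clusterNode, bool) {
	f := strings.Fields(line)
	if len(f) < 8 {
		return clusterNode{}, false
	}
	return clusterNode{ID: f[0], Addr: f[1], Flags: f[2], LinkState: f[7]}, true
}

// nodeFailedOrDisconnected is true when the node carries a fail/fail? flag
// or its cluster bus link is down.
func nodeFailedOrDisconnected(n clusterNode) bool {
	return strings.Contains(n.Flags, "fail") || n.LinkState == "disconnected"
}

// nodeIsOfType reports whether the node carries the given role flag
// ("master" or "slave").
func nodeIsOfType(n clusterNode, role string) bool {
	for _, flag := range strings.Split(n.Flags, ",") {
		if flag == role {
			return true
		}
	}
	return false
}
```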

Signed-off-by: mluffman <[email protected]>

codecov bot commented Oct 16, 2024

Codecov Report

Attention: Patch coverage is 34.04255% with 62 lines in your changes missing coverage. Please review.

Project coverage is 44.34%. Comparing base (d121d86) to head (628d1b2).
Report is 121 commits behind head on master.

Files with missing lines | Patch % | Lines
pkg/k8sutils/redis.go | 47.69% | 30 Missing and 4 partials ⚠️
...ontrollers/rediscluster/rediscluster_controller.go | 3.44% | 28 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1105      +/-   ##
==========================================
+ Coverage   35.20%   44.34%   +9.14%     
==========================================
  Files          19       20       +1     
  Lines        3213     3412     +199     
==========================================
+ Hits         1131     1513     +382     
+ Misses       2015     1813     -202     
- Partials       67       86      +19     


@Nashluffy (Contributor, Author)

@drivebyer do you mind having a look please?

@drivebyer (Collaborator)

> @drivebyer do you mind having a look please?

Sure, I'll add some end-to-end tests to improve this fix.

Signed-off-by: drivebyer <[email protected]>
@Nashluffy (Contributor, Author) commented Oct 17, 2024

> @drivebyer do you mind having a look please?

> Sure, I'll add some end-to-end tests to improve this fix.

Oh, thanks for adding them! I was just getting around to it :) I've had a hell of a time battling flakes on the e2e tests.

Signed-off-by: drivebyer <[email protected]>
@drivebyer drivebyer merged commit de6b066 into OT-CONTAINER-KIT:master Oct 18, 2024
19 of 20 checks passed
@Nashluffy (Contributor, Author)

Thanks for your help @drivebyer! Is there a planned release coming soon so I can pick up this change? It looks like the last release was in July of this year.

@drivebyer (Collaborator)

> Thanks for your help @drivebyer! Is there a planned release coming soon so I can pick up this change?

I’m not sure about the exact timing of the next release. If you’re in a hurry, you could build your own image using the Dockerfile from this link: https://github.com/OT-CONTAINER-KIT/redis-operator/blob/master/Dockerfile.
