Graceful leave will often timeout on a large cluster though nothing is wrong #8435
We are seeing a similar error when a node is leaving. In our case this happens when leaving a network area.
This causes the remote Consul DCs peered over the network area to mark the leaving node as failed. Cross-DC traffic is still routed to that "failed" node, causing cross-DC requests to fail.
I have a customer who is likely bumping into this issue as well; it is causing their remote state to grow out of control because dead members are not removed from the member list. This puts them at risk of reaching the 25 MB maximum. Feel free to contact me on Slack for more information - bbarkhouse.
I've modified our local installation of Consul to have a configurable serf-broadcast-timeout and serf-leave-propagation-delay. On our cluster of ~3600 hosts, I had to raise the serf-broadcast-timeout to 25s in order to get the "shutdown without leave" issue to happen less than 10% of the time. I don't think using the cluster size is a good guideline here. (Not to mention that the default for serf is 5 seconds, so your proposed values would lower it for all cluster sizes.) I think a proper fix would need to change the serf message queue to be a priority queue (it should prioritize the leave message above anything else, really). I also recommend:
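A minimal sketch of the priority-queue idea above. This is not serf's or memberlist's actual queue; the `leaveFirstQueue` type and the `isLeave` flag are made up for illustration, and the only point shown is that leave intents would be handed to the transport before other pending broadcasts.

```go
package main

import (
	"container/heap"
	"fmt"
)

// broadcast is a stand-in for a pending gossip message.
type broadcast struct {
	name    string
	isLeave bool // hypothetical flag marking leave intents
}

type item struct {
	b   broadcast
	seq int // preserves FIFO order among equal-priority messages
}

// leaveFirstQueue implements heap.Interface; leave intents sort ahead of
// everything else, otherwise insertion order is preserved.
type leaveFirstQueue []item

func (q leaveFirstQueue) Len() int { return len(q) }
func (q leaveFirstQueue) Less(i, j int) bool {
	if q[i].b.isLeave != q[j].b.isLeave {
		return q[i].b.isLeave // leave intents first
	}
	return q[i].seq < q[j].seq
}
func (q leaveFirstQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *leaveFirstQueue) Push(x interface{}) { *q = append(*q, x.(item)) }
func (q *leaveFirstQueue) Pop() interface{} {
	old := *q
	it := old[len(old)-1]
	*q = old[:len(old)-1]
	return it
}

func main() {
	q := &leaveFirstQueue{}
	heap.Init(q)
	for i, b := range []broadcast{
		{name: "tag-update"},
		{name: "user-event"},
		{name: "leave-intent", isLeave: true},
	} {
		heap.Push(q, item{b: b, seq: i})
	}
	for q.Len() > 0 {
		// Prints leave-intent first, then the others in FIFO order.
		fmt.Println("broadcast next:", heap.Pop(q).(item).b.name)
	}
}
```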
In a cluster with about 1000 nodes, we notice that the fix from hashicorp/serf#640 doesn't always work. Example log lines when it works and the terminating pod's member status becomes
Example log lines when it doesn't work and the terminating pod's status becomes
The patterns are almost the same, except for what the other pods log about. After some checking, this is likely caused by the logic in https://github.com/hashicorp/serf/blob/v0.10.0/serf/serf.go#L1018-L1026:
together with the limitation of the user-defined broadcast queue described in hashicorp/memberlist#299. As the leave intent message from https://github.com/hashicorp/serf/blob/v0.10.0/serf/serf.go#L732 has to go through a user-defined broadcast queue, sometimes the message will not be broadcast at all. Subsequently, when the actual memberlist leave message is broadcast by https://github.com/hashicorp/serf/blob/v0.10.0/serf/serf.go#L744 via the system broadcast queue, we will get the failure case described above.
This is really a Serf/Memberlist issue, but it causes trouble for Consul users, so I'm reporting it here; the solution likely involves all three layers.
When a node attempts to gracefully leave a cluster, it calls serf.Leave, which will wait for BroadcastTimeout for the message to be sent out: https://github.com/hashicorp/serf/blob/2a20f94a0fd71f606ea5f58ad3c1457f5ee11bf5/serf/serf.go#L717

Resulting in output like:
BroadcastTimeout is 5 seconds by default (and not configurable in Consul). But if you follow the code through into memberlist, the notify channel we are waiting on is actually just waiting for the message to get through the broadcast queue the appropriate number of times: https://github.com/hashicorp/serf/blob/2a20f94a0fd71f606ea5f58ad3c1457f5ee11bf5/vendor/github.com/hashicorp/memberlist/queue.go#L353-L356
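As a rough illustration of that wait (a minimal sketch, not the actual serf source; `broadcastLeave` and the timings here are invented), the pattern is a select between the queue's notify channel and the BroadcastTimeout deadline:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// broadcastLeave stands in for queuing the leave intent on the gossip
// broadcast queue; the real queue closes notifyCh only once the message has
// been handed to the transport the full retransmit-limit number of times.
func broadcastLeave(notifyCh chan struct{}) {
	go func() {
		time.Sleep(6 * time.Second) // pretend the retransmits take longer than the timeout
		close(notifyCh)
	}()
}

func leave(broadcastTimeout time.Duration) error {
	notifyCh := make(chan struct{})
	broadcastLeave(notifyCh)

	select {
	case <-notifyCh:
		return nil // every scheduled retransmit happened before the deadline
	case <-time.After(broadcastTimeout):
		// This is the error surfaced to operators, even though the message
		// may still be gossiped on a best-effort basis afterwards.
		return errors.New("timeout while waiting for graceful leave")
	}
}

func main() {
	fmt.Println(leave(5 * time.Second))
}
```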
In Consul, the retransmit mult defaults to 4, and so the retransmit limit ends up being: https://github.com/hashicorp/serf/blob/2a20f94a0fd71f606ea5f58ad3c1457f5ee11bf5/vendor/github.com/hashicorp/memberlist/util.go#L71-L76

i.e. for different cluster sizes, the following numbers of retransmits are used:
https://play.golang.org/p/7Jk6G34fkle
Now at worst getBroadcasts is only called once every GossipInterval (defaults to 200ms in Consul). It is sometimes better than this, since we attempt to piggyback on any other message being sent too, which might occur more often, but let's assume it's common to only be making one transmit attempt every 200ms.

So minBroadcastTime is the theoretical minimum time it would take to make enough rebroadcasts for the notify channel to be closed. In theory even with 10k nodes this fits inside the 5s default, but only just, and it's only a minimum: any other messages being broadcast are competing for the limited space in each UDP packet sent, which may well mean it takes several rounds of gossip for each broadcast to go out. In fact we prioritize messages that have been sent fewer times, so it gets increasingly likely on each re-broadcast that we won't deliver it in the next gossip round.

Anecdotally, anyone running a large enough cluster (on the order of 1000 nodes or more) will often see graceful leave "timeout", especially if there are any other changes in the cluster causing more gossip messages to be broadcast than usual.
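A short sketch of this calculation, assuming the retransmit limit is retransmitMult * ceil(log10(n+1)) as the linked memberlist util.go suggests, and one transmit attempt per 200ms gossip round:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// retransmitLimit mirrors the formula from the linked memberlist code:
// the limit scales with the log of the cluster size.
func retransmitLimit(retransmitMult, numNodes int) int {
	nodeScale := math.Ceil(math.Log10(float64(numNodes + 1)))
	return retransmitMult * int(nodeScale)
}

func main() {
	const retransmitMult = 4                      // Consul's default
	const gossipInterval = 200 * time.Millisecond // Consul's LAN default

	for _, n := range []int{10, 100, 1000, 10000} {
		limit := retransmitLimit(retransmitMult, n)
		// Assumes exactly one transmit attempt per gossip round.
		minBroadcastTime := time.Duration(limit) * gossipInterval
		fmt.Printf("nodes=%-6d retransmits=%-3d minBroadcastTime=%s\n",
			n, limit, minBroadcastTime)
	}
}
```

Under these assumptions a 10k-node cluster needs 20 retransmits, i.e. a minimum of 4s of gossip rounds, which is why the 5s default fits "only just".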
The question is: what does this broadcast timeout achieve? If the goal is to keep the sending node around for long enough to ensure the message is sent, then the timeout should probably be proportional to the cluster size/number of attempts that will be made to send it. If we only care that we made some effort to send it more than a few times, we should probably not wait for every single retransmit. The ultimate question is: why report this as an error to operators when it's just natural in any large cluster and doesn't typically mean that the broadcast was actually any less effective?
Possible solutions:
One or more of these are possible.
- Don't wait for every one of the rebroadcast_mult broadcasts before saying "we sent this".
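As one hedged illustration of the "proportional to cluster size" idea above: derive the broadcast timeout from the retransmit limit and gossip interval instead of a flat 5 seconds. The scaledBroadcastTimeout helper and its parameters are hypothetical; nothing here exists in Serf or Consul.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// scaledBroadcastTimeout picks a timeout proportional to the number of
// retransmits the queue will attempt, with some headroom for congested
// gossip rounds where the broadcast doesn't fit in the UDP packet.
func scaledBroadcastTimeout(numNodes, retransmitMult int, gossipInterval time.Duration, headroom float64) time.Duration {
	limit := retransmitMult * int(math.Ceil(math.Log10(float64(numNodes+1))))
	return time.Duration(float64(limit) * headroom * float64(gossipInterval))
}

func main() {
	// e.g. 10k nodes, 3x headroom on the theoretical minimum -> 12s
	fmt.Println(scaledBroadcastTimeout(10000, 4, 200*time.Millisecond, 3))
}
```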