
Graceful leave will often timeout on a large cluster though nothing is wrong #8435

Open

banks opened this issue Aug 5, 2020 · 4 comments

Labels: theme/internals, theme/operator-usability, theme/reliability

@banks (Member) commented Aug 5, 2020

This is really a Serf/Memberlist issue, but it causes trouble for Consul users, so I'm reporting it here as the solution likely involves all three layers.

When a node attempts to gracefully leave a cluster, it calls serf.Leave which will wait for BroadcastTimeout for the message to be sent out: https://github.com/hashicorp/serf/blob/2a20f94a0fd71f606ea5f58ad3c1457f5ee11bf5/serf/serf.go#L717
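
The wait boils down to the pattern below (a simplified, self-contained sketch of the select on the notify channel, not the actual serf code; the channel close is simulated here, whereas in serf it is done by memberlist's broadcast queue once the retransmit limit is reached):

package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForLeaveBroadcast waits for the broadcast queue to signal (by closing
// notifyCh) that the leave message has used up its retransmit quota, or gives
// up after broadcastTimeout (5s by default in Consul).
func waitForLeaveBroadcast(notifyCh <-chan struct{}, broadcastTimeout time.Duration) error {
	select {
	case <-notifyCh:
		return nil
	case <-time.After(broadcastTimeout):
		return errors.New("timeout while waiting for graceful leave")
	}
}

func main() {
	notifyCh := make(chan struct{})

	// Simulate a large cluster where the final retransmit only happens after
	// ~6s, i.e. after the default 5s BroadcastTimeout has already expired.
	go func() {
		time.Sleep(6 * time.Second)
		close(notifyCh)
	}()

	if err := waitForLeaveBroadcast(notifyCh, 5*time.Second); err != nil {
		fmt.Println("leave:", err) // leave: timeout while waiting for graceful leave
	}
}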

Resulting in output like:

<...> [INFO] agent: Gracefully shutting down agent...
<...> [ERR] consul: Failed to leave LAN Serf cluster: timeout while waiting for graceful leave

BroadcastTimeout is 5 seconds by default (and not configurable in Consul).

But if you follow the code through into memberlist, the notify channel we are waiting on isn't closed until the message has made it through the broadcast queue the appropriate number of times: https://github.com/hashicorp/serf/blob/2a20f94a0fd71f606ea5f58ad3c1457f5ee11bf5/vendor/github.com/hashicorp/memberlist/queue.go#L353-L356

In Consul, RetransmitMult defaults to 4, and so the retransmit limit ends up being: https://github.com/hashicorp/serf/blob/2a20f94a0fd71f606ea5f58ad3c1457f5ee11bf5/vendor/github.com/hashicorp/memberlist/util.go#L71-L76

i.e. for different cluster sizes, the following numbers of retransmits are used:

numNodes =      1, retransmitLimit =      4, minBroadcastTime: 800ms
numNodes =     10, retransmitLimit =      8, minBroadcastTime: 1.6s
numNodes =    100, retransmitLimit =     12, minBroadcastTime: 2.4s
numNodes =   1000, retransmitLimit =     16, minBroadcastTime: 3.2s
numNodes =   2000, retransmitLimit =     16, minBroadcastTime: 3.2s
numNodes =   5000, retransmitLimit =     16, minBroadcastTime: 3.2s
numNodes =   7500, retransmitLimit =     16, minBroadcastTime: 3.2s
numNodes =  10000, retransmitLimit =     20, minBroadcastTime: 4s

https://play.golang.org/p/7Jk6G34fkle
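
The table can be reproduced with a small program along these lines (a sketch of the retransmit-limit scaling from the linked util.go, RetransmitMult × ceil(log10(n+1)), assuming Consul's defaults of RetransmitMult = 4 and a 200ms gossip interval):

package main

import (
	"fmt"
	"math"
	"time"
)

// retransmitLimit mirrors memberlist's scaling: the limit grows with the log
// of the cluster size, multiplied by RetransmitMult (4 by default in Consul).
func retransmitLimit(retransmitMult, numNodes int) int {
	scale := math.Ceil(math.Log10(float64(numNodes + 1)))
	return retransmitMult * int(scale)
}

func main() {
	const retransmitMult = 4
	const gossipInterval = 200 * time.Millisecond // Consul LAN default

	for _, n := range []int{1, 10, 100, 1000, 2000, 5000, 7500, 10000} {
		limit := retransmitLimit(retransmitMult, n)
		// Best case: one transmit attempt per gossip interval.
		minBroadcastTime := time.Duration(limit) * gossipInterval
		fmt.Printf("numNodes = %6d, retransmitLimit = %6d, minBroadcastTime: %v\n",
			n, limit, minBroadcastTime)
	}
}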

Now, at worst, getBroadcasts is only called once every GossipInterval (which defaults to 200ms in Consul). It is sometimes better than this, since we attempt to piggyback on any other message being sent too, which might occur more often, but let's assume it's common to only be making one transmit attempt every 200ms.

So the minBroadcastTime column shows the theoretical minimum time it would take to make enough rebroadcasts for the notify channel to be closed. In theory, even with 10k nodes this fits inside the 5s default, but only just, and it is only a minimum: any other messages being broadcast are competing for the limited space in each UDP packet sent, which may well mean it takes several rounds of gossip for each broadcast to go out. In fact, we prioritize messages that have been sent fewer times, so it gets increasingly likely on each re-broadcast that we won't deliver it in the next gossip round.
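
The crowding-out effect can be illustrated as follows (illustrative only, not memberlist's actual queue implementation: a greedy fill of a fixed per-gossip UDP budget that prefers messages with fewer prior transmits, which is the ordering described above):

package main

import (
	"fmt"
	"sort"
)

// pendingBroadcast is a stand-in for a queued gossip message.
type pendingBroadcast struct {
	name      string
	size      int // encoded size in bytes
	transmits int // how many times it has already been sent
}

// pickBroadcasts greedily fills a fixed UDP budget, preferring messages that
// have been transmitted fewer times. A message that has already been
// rebroadcast several times (like a leave intent on its later rounds) can
// therefore be crowded out by newer traffic and miss this gossip round.
func pickBroadcasts(queue []pendingBroadcast, budget int) []pendingBroadcast {
	sort.SliceStable(queue, func(i, j int) bool {
		return queue[i].transmits < queue[j].transmits
	})
	var picked []pendingBroadcast
	for _, b := range queue {
		if b.size <= budget {
			picked = append(picked, b)
			budget -= b.size
		}
	}
	return picked
}

func main() {
	queue := []pendingBroadcast{
		{name: "leave-intent", size: 200, transmits: 10},
		{name: "new-join-1", size: 700, transmits: 0},
		{name: "new-join-2", size: 700, transmits: 0},
	}
	// With a ~1400-byte UDP budget, the two fresh messages win and the
	// leave intent has to wait for a later gossip round.
	for _, b := range pickBroadcasts(queue, 1400) {
		fmt.Println("sending:", b.name)
	}
}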

Anecdotally, anyone running a large enough cluster (on the order of 1000 nodes or more) will often see graceful leave "timeout", especially if there are any other changes in the cluster causing more gossip messages to be broadcast than usual.

The question is: what does this broadcast timeout achieve? If the goal is to keep the sending node around for long enough to ensure the message is sent, then the timeout should probably be proportional to the cluster size / the number of attempts that will be made to send it. If we only care that we made some effort to send it more than a few times, we should probably not wait for every single retransmit. The ultimate question is: why report this as an error to operators when it's just natural in any large cluster and doesn't typically mean that the broadcast was actually any less effective?

Possible solutions:

One or more of these are possible.

  1. don't show that error on leave, as it makes no real difference, especially since we immediately sleep afterwards for longer than 5s anyway, which probably means the broadcast does complete even when we say it "timed out"
  2. change Serf to make that timeout proportional to cluster size, so it at least indicates that something unusual happened (rather than just always being hit past a certain practical size)
  3. change Serf/memberlist to only wait for N broadcasts before we consider the broadcast "sent" even if we send more - i.e. regardless of cluster size, only wait on say rebroadcast_mult broadcasts before saying "we sent this" (see the sketch after this list)
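
As a rough illustration of option 3 (hypothetical, not an existing serf/memberlist API): close the notify channel after a fixed small number of transmissions, while letting the message keep being rebroadcast up to the usual cluster-size-scaled limit. The earlyNotifyBroadcast type and notifyAfter parameter below are made up for illustration:

package main

import (
	"fmt"
	"sync"
)

// earlyNotifyBroadcast is a hypothetical wrapper around a queued message: it
// closes notify after notifyAfter transmissions, even though the queue may
// keep rebroadcasting the message up to the full retransmit limit.
type earlyNotifyBroadcast struct {
	msg         []byte
	notifyAfter int
	transmits   int
	notify      chan struct{}
	once        sync.Once
}

func newEarlyNotifyBroadcast(msg []byte, notifyAfter int) *earlyNotifyBroadcast {
	return &earlyNotifyBroadcast{msg: msg, notifyAfter: notifyAfter, notify: make(chan struct{})}
}

// Transmitted would be called by the queue each time the message is sent.
func (b *earlyNotifyBroadcast) Transmitted() {
	b.transmits++
	if b.transmits >= b.notifyAfter {
		b.once.Do(func() { close(b.notify) })
	}
}

func main() {
	// The leaver only waits for e.g. 4 transmissions ("we made a real effort
	// to send this") instead of the full cluster-size-scaled limit.
	b := newEarlyNotifyBroadcast([]byte("leave-intent"), 4)
	for i := 0; i < 16; i++ { // the queue may still retransmit 16 times
		b.Transmitted()
	}
	select {
	case <-b.notify:
		fmt.Println("leave considered sent after", b.notifyAfter, "transmissions")
	default:
		fmt.Println("still waiting")
	}
}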
@banks added the theme/operator-usability and theme/internals labels on Aug 5, 2020
@banks changed the title from "Graceful leave will always timeout on a large cluster" to "Graceful leave will often timeout on a large cluster though nothing is wrong" on Aug 5, 2020
@ro-stripe

We are seeing a similar error when a node is leaving. In our case this happens when leaving a network area.

failed to leave area: area=8dad7f10-b874-1d43-eb70-3a5c8b0abdba error="timeout while waiting for graceful leave"

This causes the remote Consul DCs peered via the network area to mark the leaving node as failed. Cross-DC traffic is actually routed to that "failed" node, causing cross-DC requests to fail.

@bbarkhouse-hashicorp

I have a customer who is likely bumping into this issue as well; it is causing the remote state to grow out of control because dead members are not being removed from the member list. This puts them at risk of reaching the maximum limit of 25 MB. Feel free to contact me on Slack for more information - bbarkhouse.

@nahratzah

I've modified our local installation of consul to have configurable serf-broadcast-timeout and serf-leave-propagation-delay.

On our cluster of ~3600 hosts, I had to raise the serf-broadcast-timeout to 25s, in order to get the "shutdown without leave" issue to happen less than 10% of the time. I don't think using the cluster size is a good guideline here. (Not to mention that the default for serf is 5 seconds, so your proposed values would lower it for all cluster sizes.)

I think a proper fix would need to change the serf message queue to be a priority queue (it should prioritize the leave message above anything else really).
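
A minimal sketch of the ordering such a priority queue might use (hypothetical; the queuedMessage shape and isLeaveIntent helper are assumptions for illustration, not serf's actual queue):

package main

import (
	"fmt"
	"sort"
)

// queuedMessage is a stand-in for an entry in the serf broadcast queue.
type queuedMessage struct {
	kind      string // e.g. "leave", "join", "user-event"
	transmits int
}

// isLeaveIntent is a hypothetical helper marking messages that announce a
// graceful leave.
func isLeaveIntent(m queuedMessage) bool { return m.kind == "leave" }

// orderQueue sorts leave intents ahead of everything else, then falls back to
// the usual fewest-transmits-first ordering.
func orderQueue(queue []queuedMessage) {
	sort.SliceStable(queue, func(i, j int) bool {
		li, lj := isLeaveIntent(queue[i]), isLeaveIntent(queue[j])
		if li != lj {
			return li // leave intents first
		}
		return queue[i].transmits < queue[j].transmits
	})
}

func main() {
	queue := []queuedMessage{
		{kind: "join", transmits: 0},
		{kind: "user-event", transmits: 1},
		{kind: "leave", transmits: 3},
	}
	orderQueue(queue)
	for _, m := range queue {
		fmt.Println(m.kind) // leave, then join, then user-event
	}
}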

I also recommend:

@sayap commented Apr 1, 2024

In a cluster with about 1000 nodes, we notice that the fix from hashicorp/serf#640 doesn't always work.

Example log lines when it works and the terminating pod's member status becomes left:

# from the terminating pod
consul-75rcl consul 2024-04-01T02:41:40.794Z [INFO]  agent: Caught: signal=terminated
consul-75rcl consul 2024-04-01T02:41:40.794Z [INFO]  agent: Gracefully shutting down agent...
consul-75rcl consul 2024-04-01T02:41:40.794Z [INFO]  agent.client: client starting leave
...
# from the terminating pod
consul-75rcl consul 2024-04-01T02:41:45.794Z [WARN]  agent.client.serf.lan: serf: timeout while waiting for graceful leave
consul-75rcl consul 2024-04-01T02:41:45.794Z [INFO]  agent.client.serf.lan: serf: EventMemberLeave: ip-10-6-6-6.ap-southeast-1.compute.internal 10.6.6.6
...
# from the other pods
consul-7zzhc consul 2024-04-01T02:41:46.524Z [INFO]  agent.client.serf.lan: serf: EventMemberLeave: ip-10-6-6-6.ap-southeast-1.compute.internal 10.6.6.6
consul-bb8sx consul 2024-04-01T02:41:46.621Z [INFO]  agent.client.serf.lan: serf: EventMemberLeave: ip-10-6-6-6.ap-southeast-1.compute.internal 10.6.6.6
consul-jwvgz consul 2024-04-01T02:41:46.717Z [INFO]  agent.client.serf.lan: serf: EventMemberLeave: ip-10-6-6-6.ap-southeast-1.compute.internal 10.6.6.6
...
# from the terminating pod
consul-75rcl consul 2024-04-01T02:41:50.795Z [WARN]  agent.client.serf.lan: serf: timeout waiting for leave broadcast: timeout waiting for leave broadcast

Example log lines when it doesn't work and the terminating pod's status becomes failed:

# from the terminating pod
consul-5rjvv consul 2024-04-01T02:41:10.224Z [INFO]  agent: Caught: signal=terminated
consul-5rjvv consul 2024-04-01T02:41:10.224Z [INFO]  agent: Gracefully shutting down agent...
consul-5rjvv consul 2024-04-01T02:41:10.224Z [INFO]  agent.client: client starting leave
...
# from the terminating pod
consul-5rjvv consul 2024-04-01T02:41:15.225Z [WARN]  agent.client.serf.lan: serf: timeout while waiting for graceful leave
consul-5rjvv consul 2024-04-01T02:41:15.225Z [INFO]  agent.client.serf.lan: serf: EventMemberLeave: ip-10-4-4-4.ap-southeast-1.compute.internal 10.4.4.4
...
# from the other pods
consul-4kqhp consul 2024-04-01T02:41:16.431Z [INFO]  agent.client.serf.lan: serf: EventMemberFailed: ip-10-4-4-4.ap-southeast-1.compute.internal 10.4.4.4
consul-f5cps consul 2024-04-01T02:41:16.462Z [INFO]  agent.client.serf.lan: serf: EventMemberFailed: ip-10-4-4-4.ap-southeast-1.compute.internal 10.4.4.4
consul-4hts6 consul 2024-04-01T02:41:16.460Z [INFO]  agent.client.serf.lan: serf: EventMemberFailed: ip-10-4-4-4.ap-southeast-1.compute.internal 10.4.4.4
...
# from the terminating pod
consul-5rjvv consul 2024-04-01T02:41:20.225Z [WARN]  agent.client.serf.lan: serf: timeout waiting for leave broadcast: timeout waiting for leave broadcast

The patterns are almost the same, except that the other pods log EventMemberLeave when it works, and log EventMemberFailed when it doesn't.

After some checking, this is likely caused by the logic in https://github.com/hashicorp/serf/blob/v0.10.0/serf/serf.go#L1018-L1026:

	switch member.Status {
	case StatusLeaving:
		member.Status = StatusLeft
		member.leaveTime = time.Now()
		s.leftMembers = append(s.leftMembers, member)
	case StatusAlive:
		member.Status = StatusFailed
		member.leaveTime = time.Now()
		s.failedMembers = append(s.failedMembers, member)

together with the limitation of user-defined broadcast queues described in hashicorp/memberlist#299.

As the leave intent message from https://github.com/hashicorp/serf/blob/v0.10.0/serf/serf.go#L732 has to go through a user-defined broadcast queue, sometimes the message will not be broadcast at all. Subsequently, when the actual memberlist leave message is broadcast by https://github.com/hashicorp/serf/blob/v0.10.0/serf/serf.go#L744 via the system broadcast queue, we get the Alive -> Failed case instead of the more desirable Leaving -> Left case.
