
Graceful leave will often timeout on a large cluster though nothing is wrong #8435

Open

banks opened this issue Aug 5, 2020 · 4 comments

Labels: theme/internals, theme/operator-usability, theme/reliability

@banks (Member) commented Aug 5, 2020

This is really a Serf/Memberlist issue, but it causes trouble for Consul users, so I'm reporting it here as the solution likely involves all three layers.

When a node attempts to gracefully leave a cluster, it calls serf.Leave which will wait for BroadcastTimeout for the message to be sent out: https://github.com/hashicorp/serf/blob/2a20f94a0fd71f606ea5f58ad3c1457f5ee11bf5/serf/serf.go#L717
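
The wait boils down to the pattern below (a simplified, self-contained sketch of the select on the notify channel, not the actual serf code; the channel close is simulated here, whereas in serf it is done by memberlist's broadcast queue once the retransmit limit is reached):

package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForLeaveBroadcast waits for the broadcast queue to signal (by closing
// notifyCh) that the leave message has used up its retransmit quota, or gives
// up after broadcastTimeout (5s by default in Consul).
func waitForLeaveBroadcast(notifyCh <-chan struct{}, broadcastTimeout time.Duration) error {
	select {
	case <-notifyCh:
		return nil
	case <-time.After(broadcastTimeout):
		return errors.New("timeout while waiting for graceful leave")
	}
}

func main() {
	notifyCh := make(chan struct{})

	// Simulate a large cluster where the final retransmit only happens after
	// ~6s, i.e. after the default 5s BroadcastTimeout has already expired.
	go func() {
		time.Sleep(6 * time.Second)
		close(notifyCh)
	}()

	if err := waitForLeaveBroadcast(notifyCh, 5*time.Second); err != nil {
		fmt.Println("leave:", err) // leave: timeout while waiting for graceful leave
	}
}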

Resulting in output like:

<...> [INFO] agent: Gracefully shutting down agent...
<...> [ERR] consul: Failed to leave LAN Serf cluster: timeout while waiting for graceful leave

BroadcastTimeout is 5 seconds by default (and not configurable in Consul).

But if you follow the code through into memberlist, the notify channel we are waiting on isn't closed until the message has made it through the broadcast queue the appropriate number of times: https://github.com/hashicorp/serf/blob/2a20f94a0fd71f606ea5f58ad3c1457f5ee11bf5/vendor/github.com/hashicorp/memberlist/queue.go#L353-L356

In Consul, RetransmitMult defaults to 4, and so the retransmit limit ends up being: https://github.com/hashicorp/serf/blob/2a20f94a0fd71f606ea5f58ad3c1457f5ee11bf5/vendor/github.com/hashicorp/memberlist/util.go#L71-L76

i.e. for different cluster sizes, the following numbers of retransmits are used:

numNodes =      1, retransmitLimit =      4, minBroadcastTime: 800ms
numNodes =     10, retransmitLimit =      8, minBroadcastTime: 1.6s
numNodes =    100, retransmitLimit =     12, minBroadcastTime: 2.4s
numNodes =   1000, retransmitLimit =     16, minBroadcastTime: 3.2s
numNodes =   2000, retransmitLimit =     16, minBroadcastTime: 3.2s
numNodes =   5000, retransmitLimit =     16, minBroadcastTime: 3.2s
numNodes =   7500, retransmitLimit =     16, minBroadcastTime: 3.2s
numNodes =  10000, retransmitLimit =     20, minBroadcastTime: 4s

https://play.golang.org/p/7Jk6G34fkle
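
The table can be reproduced with a small program along these lines (a sketch of the retransmit-limit scaling from the linked util.go, RetransmitMult × ceil(log10(n+1)), assuming Consul's defaults of RetransmitMult = 4 and a 200ms gossip interval):

package main

import (
	"fmt"
	"math"
	"time"
)

// retransmitLimit mirrors memberlist's scaling: the limit grows with the log
// of the cluster size, multiplied by RetransmitMult (4 by default in Consul).
func retransmitLimit(retransmitMult, numNodes int) int {
	scale := math.Ceil(math.Log10(float64(numNodes + 1)))
	return retransmitMult * int(scale)
}

func main() {
	const retransmitMult = 4
	const gossipInterval = 200 * time.Millisecond // Consul LAN default

	for _, n := range []int{1, 10, 100, 1000, 2000, 5000, 7500, 10000} {
		limit := retransmitLimit(retransmitMult, n)
		// Best case: one transmit attempt per gossip interval.
		minBroadcastTime := time.Duration(limit) * gossipInterval
		fmt.Printf("numNodes = %6d, retransmitLimit = %6d, minBroadcastTime: %v\n",
			n, limit, minBroadcastTime)
	}
}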

Now, at worst, getBroadcasts is only called once every GossipInterval (which defaults to 200ms in Consul). It is sometimes better than this, since we attempt to piggyback on any other message being sent too, which might occur more often, but let's assume it's common to only be making one transmit attempt every 200ms.

So the minBroadcastTime column shows the theoretical minimum time it would take to make enough rebroadcasts for the notify channel to be closed. In theory, even with 10k nodes this fits inside the 5s default, but only just, and it is only a minimum: any other messages being broadcast are competing for the limited space in each UDP packet sent, which may well mean it takes several rounds of gossip for each broadcast to go out. In fact, we prioritize messages that have been sent fewer times, so it gets increasingly likely on each re-broadcast that we won't deliver it in the next gossip round.
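
The crowding-out effect can be illustrated as follows (illustrative only, not memberlist's actual queue implementation: a greedy fill of a fixed per-gossip UDP budget that prefers messages with fewer prior transmits, which is the ordering described above):

package main

import (
	"fmt"
	"sort"
)

// pendingBroadcast is a stand-in for a queued gossip message.
type pendingBroadcast struct {
	name      string
	size      int // encoded size in bytes
	transmits int // how many times it has already been sent
}

// pickBroadcasts greedily fills a fixed UDP budget, preferring messages that
// have been transmitted fewer times. A message that has already been
// rebroadcast several times (like a leave intent on its later rounds) can
// therefore be crowded out by newer traffic and miss this gossip round.
func pickBroadcasts(queue []pendingBroadcast, budget int) []pendingBroadcast {
	sort.SliceStable(queue, func(i, j int) bool {
		return queue[i].transmits < queue[j].transmits
	})
	var picked []pendingBroadcast
	for _, b := range queue {
		if b.size <= budget {
			picked = append(picked, b)
			budget -= b.size
		}
	}
	return picked
}

func main() {
	queue := []pendingBroadcast{
		{name: "leave-intent", size: 200, transmits: 10},
		{name: "new-join-1", size: 700, transmits: 0},
		{name: "new-join-2", size: 700, transmits: 0},
	}
	// With a ~1400-byte UDP budget, the two fresh messages win and the
	// leave intent has to wait for a later gossip round.
	for _, b := range pickBroadcasts(queue, 1400) {
		fmt.Println("sending:", b.name)
	}
}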

Anecdotally, anyone running a large enough cluster (on the order of 1000 nodes or more) will often see graceful leave "timeout", especially if there are any other changes in the cluster causing more gossip messages to be broadcast than usual.

The question is: what does this broadcast timeout achieve? If the goal is to keep the sending node around for long enough to ensure the message is sent, then the timeout should probably be proportional to the cluster size / the number of attempts that will be made to send it. If we only care that we made some effort to send it more than a few times, we should probably not wait for every single retransmit. The ultimate question is: why report this as an error to operators when it's just natural in any large cluster and doesn't typically mean that the broadcast was actually any less effective?

Possible solutions:

One or more of these are possible.

  1. don't show that error on leave, as it makes no real difference, especially since we immediately sleep afterwards for longer than 5s anyway, which probably means the broadcast does complete even when we say it "timed out"
  2. change Serf to make that timeout proportional to cluster size, so it at least indicates that something unusual happened (rather than just always being hit past a certain practical size)
  3. change Serf/memberlist to only wait for N broadcasts before we consider the broadcast "sent" even if we send more - i.e. regardless of cluster size, only wait on say rebroadcast_mult broadcasts before saying "we sent this" (see the sketch after this list)
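
As a rough illustration of option 3 (hypothetical, not an existing serf/memberlist API): close the notify channel after a fixed small number of transmissions, while letting the message keep being rebroadcast up to the usual cluster-size-scaled limit. The earlyNotifyBroadcast type and notifyAfter parameter below are made up for illustration:

package main

import (
	"fmt"
	"sync"
)

// earlyNotifyBroadcast is a hypothetical wrapper around a queued message: it
// closes notify after notifyAfter transmissions, even though the queue may
// keep rebroadcasting the message up to the full retransmit limit.
type earlyNotifyBroadcast struct {
	msg         []byte
	notifyAfter int
	transmits   int
	notify      chan struct{}
	once        sync.Once
}

func newEarlyNotifyBroadcast(msg []byte, notifyAfter int) *earlyNotifyBroadcast {
	return &earlyNotifyBroadcast{msg: msg, notifyAfter: notifyAfter, notify: make(chan struct{})}
}

// Transmitted would be called by the queue each time the message is sent.
func (b *earlyNotifyBroadcast) Transmitted() {
	b.transmits++
	if b.transmits >= b.notifyAfter {
		b.once.Do(func() { close(b.notify) })
	}
}

func main() {
	// The leaver only waits for e.g. 4 transmissions ("we made a real effort
	// to send this") instead of the full cluster-size-scaled limit.
	b := newEarlyNotifyBroadcast([]byte("leave-intent"), 4)
	for i := 0; i < 16; i++ { // the queue may still retransmit 16 times
		b.Transmitted()
	}
	select {
	case <-b.notify:
		fmt.Println("leave considered sent after", b.notifyAfter, "transmissions")
	default:
		fmt.Println("still waiting")
	}
}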
@banks added the theme/operator-usability and theme/internals labels on Aug 5, 2020
@banks changed the title from "Graceful leave will always timeout on a large cluster" to "Graceful leave will often timeout on a large cluster though nothing is wrong" on Aug 5, 2020
@ro-stripe

We are seeing a similar error when a node is leaving. In our case this happens when leaving a network area.

failed to leave area: area=8dad7f10-b874-1d43-eb70-3a5c8b0abdba error="timeout while waiting for graceful leave"

This causes the remote Consul DCs peered via the network area to mark the leaving node as failed. Cross-DC traffic is actually routed to that "failed" node, causing cross-DC requests to fail.

@bbarkhouse-hashicorp

I have a customer who is likely bumping into this issue as well; it is causing the remote state to grow out of control because dead members are not being removed from the member list. This puts them at risk of reaching the maximum limit of 25 MB. Feel free to contact me on Slack for more information - bbarkhouse.

@nahratzah

I've modified our local installation of consul to have configurable serf-broadcast-timeout and serf-leave-propagation-delay.

On our cluster of ~3600 hosts, I had to raise the serf-broadcast-timeout to 25s, in order to get the "shutdown without leave" issue to happen less than 10% of the time. I don't think using the cluster size is a good guideline here. (Not to mention that the default for serf is 5 seconds, so your proposed values would lower it for all cluster sizes.)

I think a proper fix would need to change the serf message queue to be a priority queue (it should prioritize the leave message above anything else really).
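
A minimal sketch of the ordering such a priority queue might use (hypothetical; the queuedMessage shape and isLeaveIntent helper are assumptions for illustration, not serf's actual queue):

package main

import (
	"fmt"
	"sort"
)

// queuedMessage is a stand-in for an entry in the serf broadcast queue.
type queuedMessage struct {
	kind      string // e.g. "leave", "join", "user-event"
	transmits int
}

// isLeaveIntent is a hypothetical helper marking messages that announce a
// graceful leave.
func isLeaveIntent(m queuedMessage) bool { return m.kind == "leave" }

// orderQueue sorts leave intents ahead of everything else, then falls back to
// the usual fewest-transmits-first ordering.
func orderQueue(queue []queuedMessage) {
	sort.SliceStable(queue, func(i, j int) bool {
		li, lj := isLeaveIntent(queue[i]), isLeaveIntent(queue[j])
		if li != lj {
			return li // leave intents first
		}
		return queue[i].transmits < queue[j].transmits
	})
}

func main() {
	queue := []queuedMessage{
		{kind: "join", transmits: 0},
		{kind: "user-event", transmits: 1},
		{kind: "leave", transmits: 3},
	}
	orderQueue(queue)
	for _, m := range queue {
		fmt.Println(m.kind) // leave, then join, then user-event
	}
}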

I also recommend:

@sayap commented Apr 1, 2024

In a cluster with about 1000 nodes, we notice that the fix from hashicorp/serf#640 doesn't always work.

Example log lines when it works and the terminating pod's member status becomes left:

# from the terminating pod
consul-75rcl consul 2024-04-01T02:41:40.794Z [INFO]  agent: Caught: signal=terminated
consul-75rcl consul 2024-04-01T02:41:40.794Z [INFO]  agent: Gracefully shutting down agent...
consul-75rcl consul 2024-04-01T02:41:40.794Z [INFO]  agent.client: client starting leave
...
# from the terminating pod
consul-75rcl consul 2024-04-01T02:41:45.794Z [WARN]  agent.client.serf.lan: serf: timeout while waiting for graceful leave
consul-75rcl consul 2024-04-01T02:41:45.794Z [INFO]  agent.client.serf.lan: serf: EventMemberLeave: ip-10-6-6-6.ap-southeast-1.compute.internal 10.6.6.6
...
# from the other pods
consul-7zzhc consul 2024-04-01T02:41:46.524Z [INFO]  agent.client.serf.lan: serf: EventMemberLeave: ip-10-6-6-6.ap-southeast-1.compute.internal 10.6.6.6
consul-bb8sx consul 2024-04-01T02:41:46.621Z [INFO]  agent.client.serf.lan: serf: EventMemberLeave: ip-10-6-6-6.ap-southeast-1.compute.internal 10.6.6.6
consul-jwvgz consul 2024-04-01T02:41:46.717Z [INFO]  agent.client.serf.lan: serf: EventMemberLeave: ip-10-6-6-6.ap-southeast-1.compute.internal 10.6.6.6
...
# from the terminating pod
consul-75rcl consul 2024-04-01T02:41:50.795Z [WARN]  agent.client.serf.lan: serf: timeout waiting for leave broadcast: timeout waiting for leave broadcast

Example log lines when it doesn't work and the terminating pod's status becomes failed:

# from the terminating pod
consul-5rjvv consul 2024-04-01T02:41:10.224Z [INFO]  agent: Caught: signal=terminated
consul-5rjvv consul 2024-04-01T02:41:10.224Z [INFO]  agent: Gracefully shutting down agent...
consul-5rjvv consul 2024-04-01T02:41:10.224Z [INFO]  agent.client: client starting leave
...
# from the terminating pod
consul-5rjvv consul 2024-04-01T02:41:15.225Z [WARN]  agent.client.serf.lan: serf: timeout while waiting for graceful leave
consul-5rjvv consul 2024-04-01T02:41:15.225Z [INFO]  agent.client.serf.lan: serf: EventMemberLeave: ip-10-4-4-4.ap-southeast-1.compute.internal 10.4.4.4
...
# from the other pods
consul-4kqhp consul 2024-04-01T02:41:16.431Z [INFO]  agent.client.serf.lan: serf: EventMemberFailed: ip-10-4-4-4.ap-southeast-1.compute.internal 10.4.4.4
consul-f5cps consul 2024-04-01T02:41:16.462Z [INFO]  agent.client.serf.lan: serf: EventMemberFailed: ip-10-4-4-4.ap-southeast-1.compute.internal 10.4.4.4
consul-4hts6 consul 2024-04-01T02:41:16.460Z [INFO]  agent.client.serf.lan: serf: EventMemberFailed: ip-10-4-4-4.ap-southeast-1.compute.internal 10.4.4.4
...
# from the terminating pod
consul-5rjvv consul 2024-04-01T02:41:20.225Z [WARN]  agent.client.serf.lan: serf: timeout waiting for leave broadcast: timeout waiting for leave broadcast

The patterns are almost the same, except that the other pods log EventMemberLeave when it works, and log EventMemberFailed when it doesn't.

After some checking, this is likely caused by the logic in https://github.com/hashicorp/serf/blob/v0.10.0/serf/serf.go#L1018-L1026:

	switch member.Status {
	case StatusLeaving:
		member.Status = StatusLeft
		member.leaveTime = time.Now()
		s.leftMembers = append(s.leftMembers, member)
	case StatusAlive:
		member.Status = StatusFailed
		member.leaveTime = time.Now()
		s.failedMembers = append(s.failedMembers, member)

together with the limitation of user-defined broadcast queues described in hashicorp/memberlist#299.

As the leave intent message from https://github.com/hashicorp/serf/blob/v0.10.0/serf/serf.go#L732 has to go through a user-defined broadcast queue, sometimes the message will not be broadcast at all. Subsequently, when the actual memberlist leave message is broadcast by https://github.com/hashicorp/serf/blob/v0.10.0/serf/serf.go#L744 via the system broadcast queue, we get the Alive -> Failed case instead of the more desirable Leaving -> Left case.
