
Bug Report: vtgate buffering times out even after receiving healthcheck from new primary #17629

Open
deepthi opened this issue Jan 27, 2025 · 5 comments · May be fixed by #17632

deepthi commented Jan 27, 2025

Overview of the Issue

When doing a software rollout across many shards, it has been observed that some vtgates are ending buffering with a timeout, instead of by marking the shard as consistent.

Reproduction Steps

Roll out across a 32-shard keyspace simultaneously, using PlannedReparentShard to elect a new primary before restarting the old primary.

Binary Version

v19, but affects all versions since keyspace_events buffering became the default

Operating System and Environment details

any

Log Fragments

I0124 16:16:31.646648       1 shard_buffer.go:565] Stopping buffering for shard: test-keyspace/01-02 after: 10.0 seconds due to: stopping buffering because failover did not finish in time (10s). Draining 26 buffered requests now.

Logs also show that the vtgate has received a healthcheck from the new primary several seconds before this message.

deepthi added the Needs Triage and Type: Bug labels on Jan 27, 2025

deepthi commented Jan 27, 2025

The KeyspaceEventWatcher subscribes to events from the healthcheck. The healthcheck broadcasts to a buffered channel:

func (hc *HealthCheckImpl) broadcast(th *TabletHealth) {
	hc.subMu.Lock()
	defer hc.subMu.Unlock()
	for c := range hc.subscribers {
		select {
		case c <- th:
		default:
			// If a subscriber's channel is full, the update is silently dropped.
		}
	}
}

When it receives a healthcheck update, the KEW processes it:

			case result := <-hcChan:
				if result == nil {
					return
				}
				kew.processHealthCheck(ctx, result)
			}

This ends up calling kss.onHealthCheck, which has:

	kss.mu.Lock()
	defer kss.mu.Unlock()

So processing of these healthcheck updates is serialized by that mutex. When tens of updates arrive within microseconds of each other, it is possible for the channel to fill up and for some updates to be lost. The subscriber channel is created here:

func (hc *HealthCheckImpl) Subscribe() chan *TabletHealth {
	hc.subMu.Lock()
	defer hc.subMu.Unlock()
	c := make(chan *TabletHealth, 2) // buffered channel with capacity 2
	hc.subscribers[c] = struct{}{}
	return c
}

The channel capacity is only 2!! This might have been fine with the old healthcheck buffering which had no synchronization between shards, but is insufficient now.
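
To make the failure mode concrete, here is a small standalone sketch (not vtgate code) of what happens when a burst of updates hits a capacity-2 channel guarded by a non-blocking send while the consumer is busy; the count of 32 is just the shard count from the reproduction above.

package main

import "fmt"

func main() {
	// Same capacity as the channel created by hc.Subscribe().
	updates := make(chan int, 2)

	// Simulate a burst of healthcheck updates arriving while the
	// KeyspaceEventWatcher is still busy processing earlier ones
	// (i.e. nobody is draining the channel yet).
	dropped := 0
	for i := 0; i < 32; i++ { // e.g. one new-primary update per shard
		select {
		case updates <- i:
		default: // same non-blocking pattern as HealthCheckImpl.broadcast
			dropped++
		}
	}
	fmt.Printf("buffered %d updates, dropped %d\n", len(updates), dropped)
	// Output: buffered 2 updates, dropped 30
}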

Credit to @maxenglander for pointing us to the root cause.

deepthi added the Component: Query Serving label and removed the Needs Triage label on Jan 27, 2025
GuptaManan100 (Member) commented

I have two alternate fixes for this problem, and I've coded and benchmarked them both: #17632 (an infinite buffer using a message queue) and #17634 (just increasing the channel capacity to 1024).

The results for the benchmark -

goos: darwin
goarch: arm64
pkg: vitess.io/vitess/go/vt/discovery
cpu: Apple M1 Max
                        │ benchmarks/v1.txt │      benchmarks/v2.txt       │
                        │      sec/op       │    sec/op     vs base        │
Access_FastConsumer-10        101.6m ± 0%     101.7m ± 4%       ~ (p=0.485 n=6)
Access_SlowConsumer-10         4.953 ± 0%      5.105 ± 1%  +3.08% (p=0.002 n=6)
geomean                       709.3m          720.5m       +1.59%

                        │ benchmarks/v1.txt │      benchmarks/v2.txt       │
                        │       B/op        │     B/op      vs base        │
Access_FastConsumer-10       97.10Ki ± 1%    88.01Ki ± 0%   -9.37% (p=0.002 n=6)
Access_SlowConsumer-10       20.92Ki ± 3%    28.04Ki ± 2%  +34.06% (p=0.002 n=6)
geomean                      45.07Ki         49.68Ki       +10.23%

                        │ benchmarks/v1.txt │      benchmarks/v2.txt       │
                        │     allocs/op     │   allocs/op   vs base        │
Access_FastConsumer-10        1.023k ± 0%     1.008k ± 0%  -1.47% (p=0.002 n=6)
Access_SlowConsumer-10         220.5 ± 3%      212.0 ± 3%  -3.85% (p=0.002 n=6)
geomean                        474.9           462.3       -2.67%

I personally prefer the first solution because it doesn't place a theoretical bound on the size of the queue, and it also seems to use less memory in the case of a slow consumer. The performance is comparable in the fast consumer scenario, and in the slow consumer scenario the second approach is slightly faster.
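
For illustration only (this is a minimal sketch of the general idea, not the actual code in #17632), an unbounded message queue can be built from a slice guarded by a mutex and a condition variable, so the producer never blocks and never drops:

package main

import (
	"fmt"
	"sync"
)

// queue is an unbounded FIFO: push never blocks and never drops,
// pop waits until an item is available.
type queue[T any] struct {
	mu    sync.Mutex
	cond  *sync.Cond
	items []T
}

func newQueue[T any]() *queue[T] {
	q := &queue[T]{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

func (q *queue[T]) push(v T) {
	q.mu.Lock()
	q.items = append(q.items, v)
	q.mu.Unlock()
	q.cond.Signal()
}

func (q *queue[T]) pop() T {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 {
		q.cond.Wait()
	}
	v := q.items[0]
	q.items = q.items[1:]
	return v
}

func main() {
	q := newQueue[int]()
	go func() {
		for i := 0; i < 5; i++ {
			q.push(i) // the producer (broadcast) is never slowed down or lossy
		}
	}()
	for i := 0; i < 5; i++ {
		fmt.Println(q.pop()) // the consumer drains at its own pace
	}
}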

dbussink (Contributor) commented

> I personally prefer the first solution because it doesn't place a theoretical bound on the size of the queue, and it also seems to use less memory in the case of a slow consumer. The performance is comparable in the fast consumer scenario, and in the slow consumer scenario the second approach is slightly faster.

Isn't the second solution here much simpler? An unbounded queue doesn't exist; it's always limited by practical concerns like the memory and CPU available.

Wouldn't raising the channel size to something significantly higher than any practical limit on the size of a Vitess cluster suffice then? We could go significantly higher than 1024 as well if that's safer, like 64k or something along those lines?

GuptaManan100 (Member) commented

@dbussink I would agree with you that 64k would theoretically be the same fix, but as per the benchmarks in #17629 (comment), the message queue seems to use less memory with a slower consumer.

arthurschreiber (Contributor) commented

There was/is a second issue with doing PlannedReparentShard across different shards at the same time. I think I talked to @GuptaManan100 about this a while ago?

[screenshot]

And then I never opened an issue. 😆

So, basically what happens is that vtgates don't track the health of individual shards; they track the health of the keyspace as a whole. When a rolling deployment runs PlannedReparentShard across shards, the time that the keyspace is seen as unhealthy is the combined span of the overlapping per-shard unhealthy windows. This can easily be longer than the buffer duration, which causes the buffering timeouts.
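
A hypothetical timeline makes the arithmetic concrete (the numbers below are made up; only the 10-second window comes from the log fragment above): three shards reparented back to back, each unhealthy for only a few seconds but with small overlaps, keep the keyspace unhealthy for longer than the buffer window.

package main

import (
	"fmt"
	"time"
)

// window is the period during which a single shard has no healthy primary,
// expressed as offsets from the start of the rollout.
type window struct {
	start, end time.Duration
}

// combinedSpan returns the total span covered by the windows, assuming they
// are sorted by start time and that consecutive windows overlap, which is
// what a rolling PlannedReparentShard across shards tends to produce.
func combinedSpan(ws []window) time.Duration {
	if len(ws) == 0 {
		return 0
	}
	return ws[len(ws)-1].end - ws[0].start
}

func main() {
	bufferWindow := 10 * time.Second // the 10s from the log fragment above

	shards := []window{
		{0, 4 * time.Second},
		{3 * time.Second, 7 * time.Second},
		{6 * time.Second, 11 * time.Second},
	}

	span := combinedSpan(shards)
	fmt.Printf("keyspace unhealthy for %v, buffer window %v, times out: %v\n",
		span, bufferWindow, span > bufferWindow)
	// Output: keyspace unhealthy for 11s, buffer window 10s, times out: true
}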
