
Bug Report: vtgate buffering times out even after receiving healthcheck from new primary #17629

Open
deepthi opened this issue Jan 27, 2025 · 5 comments · May be fixed by #17632

deepthi commented Jan 27, 2025

Overview of the Issue

When doing a software rollout across many shards, it has been observed that some vtgates are ending buffering with a timeout, instead of by marking the shard as consistent.

Reproduction Steps

Roll out across a 32-shard keyspace simultaneously, using PlannedReparentShard to elect a new primary before restarting the old primary.

Binary Version

v19, but affects all versions since keyspace_events buffering became the default

Operating System and Environment details

any

Log Fragments

I0124 16:16:31.646648       1 shard_buffer.go:565] Stopping buffering for shard: test-keyspace/01-02 after: 10.0 seconds due to: stopping buffering because failover did not finish in time (10s). Draining 26 buffered requests now.

Logs also show that the vtgate has received a healthcheck from the new primary several seconds before this message.

deepthi added the Needs Triage and Type: Bug labels on Jan 27, 2025

deepthi commented Jan 27, 2025

The KeyspaceEventWatcher subscribes to events from the healthcheck. The healthcheck broadcasts to a buffered channel:

func (hc *HealthCheckImpl) broadcast(th *TabletHealth) {
	hc.subMu.Lock()
	defer hc.subMu.Unlock()
	for c := range hc.subscribers {
		select {
		case c <- th:
		default:
			// If a subscriber's channel is full, the update is silently dropped.
		}
	}
}

When it receives a healthcheck update, the KEW processes it:

			case result := <-hcChan:
				if result == nil {
					return
				}
				kew.processHealthCheck(ctx, result)
			}

This ends up calling kss.onHealthCheck, which has:

	kss.mu.Lock()
	defer kss.mu.Unlock()

So processing of these healthcheck updates is serialized by that mutex. When tens of updates arrive within microseconds of each other, it is possible for the channel to fill up and for some updates to be lost. The subscriber channel is created here:

func (hc *HealthCheckImpl) Subscribe() chan *TabletHealth {
	hc.subMu.Lock()
	defer hc.subMu.Unlock()
	c := make(chan *TabletHealth, 2) // buffered channel with capacity 2
	hc.subscribers[c] = struct{}{}
	return c
}

The channel capacity is only 2!! This might have been fine with the old healthcheck buffering which had no synchronization between shards, but is insufficient now.
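
To make the failure mode concrete, here is a small standalone sketch (not vtgate code) of what happens when a burst of updates hits a capacity-2 channel guarded by a non-blocking send while the consumer is busy; the count of 32 is just the shard count from the reproduction above.

package main

import "fmt"

func main() {
	// Same capacity as the channel created by hc.Subscribe().
	updates := make(chan int, 2)

	// Simulate a burst of healthcheck updates arriving while the
	// KeyspaceEventWatcher is still busy processing earlier ones
	// (i.e. nobody is draining the channel yet).
	dropped := 0
	for i := 0; i < 32; i++ { // e.g. one new-primary update per shard
		select {
		case updates <- i:
		default: // same non-blocking pattern as HealthCheckImpl.broadcast
			dropped++
		}
	}
	fmt.Printf("buffered %d updates, dropped %d\n", len(updates), dropped)
	// Output: buffered 2 updates, dropped 30
}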

Credit to @maxenglander for pointing us to the root cause.

deepthi added the Component: Query Serving label and removed the Needs Triage label on Jan 27, 2025
GuptaManan100 (Member) commented

I have two alternate fixes for this problem, and I've coded and benchmarked them both: #17632 (an infinite buffer using a message queue) and #17634 (just increasing the channel capacity to 1024).

The results for the benchmark -

goos: darwin
goarch: arm64
pkg: vitess.io/vitess/go/vt/discovery
cpu: Apple M1 Max
                        │ benchmarks/v1.txt │      benchmarks/v2.txt       │
                        │      sec/op       │    sec/op     vs base        │
Access_FastConsumer-10        101.6m ± 0%     101.7m ± 4%       ~ (p=0.485 n=6)
Access_SlowConsumer-10         4.953 ± 0%      5.105 ± 1%  +3.08% (p=0.002 n=6)
geomean                       709.3m          720.5m       +1.59%

                        │ benchmarks/v1.txt │      benchmarks/v2.txt       │
                        │       B/op        │     B/op      vs base        │
Access_FastConsumer-10       97.10Ki ± 1%    88.01Ki ± 0%   -9.37% (p=0.002 n=6)
Access_SlowConsumer-10       20.92Ki ± 3%    28.04Ki ± 2%  +34.06% (p=0.002 n=6)
geomean                      45.07Ki         49.68Ki       +10.23%

                        │ benchmarks/v1.txt │      benchmarks/v2.txt       │
                        │     allocs/op     │   allocs/op   vs base        │
Access_FastConsumer-10        1.023k ± 0%     1.008k ± 0%  -1.47% (p=0.002 n=6)
Access_SlowConsumer-10         220.5 ± 3%      212.0 ± 3%  -3.85% (p=0.002 n=6)
geomean                        474.9           462.3       -2.67%

I personally prefer the first solution because it doesn't place a theoretical bound on the size of the queue, and it also seems to use less memory in the case of a slow consumer. The performance is comparable in the fast consumer scenario, and in the slow consumer scenario the second approach is slightly faster.
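
For illustration only (this is a minimal sketch of the general idea, not the actual code in #17632), an unbounded message queue can be built from a slice guarded by a mutex and a condition variable, so the producer never blocks and never drops:

package main

import (
	"fmt"
	"sync"
)

// queue is an unbounded FIFO: push never blocks and never drops,
// pop waits until an item is available.
type queue[T any] struct {
	mu    sync.Mutex
	cond  *sync.Cond
	items []T
}

func newQueue[T any]() *queue[T] {
	q := &queue[T]{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

func (q *queue[T]) push(v T) {
	q.mu.Lock()
	q.items = append(q.items, v)
	q.mu.Unlock()
	q.cond.Signal()
}

func (q *queue[T]) pop() T {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 {
		q.cond.Wait()
	}
	v := q.items[0]
	q.items = q.items[1:]
	return v
}

func main() {
	q := newQueue[int]()
	go func() {
		for i := 0; i < 5; i++ {
			q.push(i) // the producer (broadcast) is never slowed down or lossy
		}
	}()
	for i := 0; i < 5; i++ {
		fmt.Println(q.pop()) // the consumer drains at its own pace
	}
}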

dbussink (Contributor) commented

> I personally prefer the first solution because it doesn't place a theoretical bound on the size of the queue, and it also seems to use less memory in the case of a slow consumer. The performance is comparable in the fast consumer scenario, and in the slow consumer scenario the second approach is slightly faster.

Isn't the second solution here much simpler? An unbounded queue doesn't exist; it's always limited by practical concerns like the memory and CPU available.

Wouldn't raising the channel size to something significantly higher than any practical limit on the size of a Vitess cluster suffice then? We could go significantly higher than 1024 as well if that's safer, like 64k or something along those lines?

GuptaManan100 (Member) commented

@dbussink I would agree with you that 64k would theoretically be the same fix, but as per the benchmarks in #17629 (comment), the message queue seems to use less memory with a slower consumer.

arthurschreiber (Contributor) commented

There was/is a second issue with doing PlannedReparentShard across different shards at the same time. I think I talked to @GuptaManan100 about this a while ago?

[screenshot]

And then I never opened an issue. 😆

So, basically what happens is that vtgates don't track the health of individual shards; they track the health of the keyspace as a whole. When a rolling deployment runs PlannedReparentShard across shards, the time that the keyspace is seen as unhealthy is the combined span of the overlapping per-shard unhealthy windows. This can easily be longer than the buffer duration, which causes the buffering timeouts.
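
A hypothetical timeline makes the arithmetic concrete (the numbers below are made up; only the 10-second window comes from the log fragment above): three shards reparented back to back, each unhealthy for only a few seconds but with small overlaps, keep the keyspace unhealthy for longer than the buffer window.

package main

import (
	"fmt"
	"time"
)

// window is the period during which a single shard has no healthy primary,
// expressed as offsets from the start of the rollout.
type window struct {
	start, end time.Duration
}

// combinedSpan returns the total span covered by the windows, assuming they
// are sorted by start time and that consecutive windows overlap, which is
// what a rolling PlannedReparentShard across shards tends to produce.
func combinedSpan(ws []window) time.Duration {
	if len(ws) == 0 {
		return 0
	}
	return ws[len(ws)-1].end - ws[0].start
}

func main() {
	bufferWindow := 10 * time.Second // the 10s from the log fragment above

	shards := []window{
		{0, 4 * time.Second},
		{3 * time.Second, 7 * time.Second},
		{6 * time.Second, 11 * time.Second},
	}

	span := combinedSpan(shards)
	fmt.Printf("keyspace unhealthy for %v, buffer window %v, times out: %v\n",
		span, bufferWindow, span > bufferWindow)
	// Output: keyspace unhealthy for 11s, buffer window 10s, times out: true
}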
