Bug Report: vtgate buffering times out even after receiving healthcheck from new primary #17629
Comments
The KeyspaceEventWatcher (KEW) subscribes to events from the healthcheck, and the healthcheck broadcasts each update to a buffered channel.
When the KEW receives a healthcheck update, it processes it. That processing ends up calling a method that acquires a mutex, so the handling of healthcheck updates is serialized by that mutex. When there are tens of updates within microseconds of each other, it is possible for the channel to fill up and lose some updates.
The channel capacity is only 2!! This might have been fine with the old code, but it is clearly not enough here.
Credit to @maxenglander for pointing us to the root cause.
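To make the failure mode concrete, here is a minimal, self-contained Go sketch. This is not the Vitess code; the capacity-2 channel, the non-blocking send, and the mutex-guarded handler mirror the description above, but all names are invented. A broadcaster doing a non-blocking send into a tiny buffer silently drops updates whenever the consumer is busy holding the mutex:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// healthUpdate stands in for a healthcheck update; not a Vitess type.
type healthUpdate struct{ id int }

func main() {
	// Capacity 2, matching the subscriber channel described above.
	updates := make(chan healthUpdate, 2)
	var mu sync.Mutex // stands in for the mutex that serializes KEW processing
	dropped := 0

	var wg sync.WaitGroup
	wg.Add(1)
	// Consumer: every update is handled under the mutex, one at a time.
	go func() {
		defer wg.Done()
		for u := range updates {
			mu.Lock()
			time.Sleep(time.Millisecond) // simulate slow, serialized processing
			_ = u
			mu.Unlock()
		}
	}()

	// Broadcaster: a non-blocking send means updates are silently lost
	// whenever the tiny buffer is already full.
	for i := 0; i < 50; i++ {
		select {
		case updates <- healthUpdate{id: i}:
		default:
			dropped++ // buffer full: this healthcheck update is dropped
		}
	}
	close(updates)
	wg.Wait()
	fmt.Printf("dropped %d of 50 updates\n", dropped)
}
```

With tens of near-simultaneous reparents, most of the sends hit the `default` branch, which is exactly the "lost updates" behavior described above.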
I have two alternate fixes for this problem; I've coded and benchmarked both: #17632 (infinite buffer using a message queue) and #17634 (just increasing the channel capacity to 1024). The benchmark results:

    goos: darwin
    goarch: arm64
    pkg: vitess.io/vitess/go/vt/discovery
    cpu: Apple M1 Max
                             │ benchmarks/v1.txt │      benchmarks/v2.txt       │
                             │      sec/op       │    sec/op     vs base        │
    Access_FastConsumer-10        101.6m ± 0%    101.7m ± 4%   ~ (p=0.485 n=6)
    Access_SlowConsumer-10         4.953 ± 0%     5.105 ± 1%   +3.08% (p=0.002 n=6)
    geomean                       709.3m         720.5m        +1.59%

                             │ benchmarks/v1.txt │      benchmarks/v2.txt       │
                             │       B/op        │     B/op      vs base        │
    Access_FastConsumer-10       97.10Ki ± 1%   88.01Ki ± 0%   -9.37% (p=0.002 n=6)
    Access_SlowConsumer-10       20.92Ki ± 3%   28.04Ki ± 2%   +34.06% (p=0.002 n=6)
    geomean                      45.07Ki        49.68Ki        +10.23%

                             │ benchmarks/v1.txt │      benchmarks/v2.txt       │
                             │     allocs/op     │   allocs/op   vs base        │
    Access_FastConsumer-10        1.023k ± 0%    1.008k ± 0%   -1.47% (p=0.002 n=6)
    Access_SlowConsumer-10         220.5 ± 3%     212.0 ± 3%   -3.85% (p=0.002 n=6)
    geomean                        474.9          462.3        -2.67%

I personally prefer the first solution because it doesn't place a theoretical bound on the size of the queue, and it also seems to use less memory in the case of a slow consumer. The performance is comparable in the fast-consumer scenario; in the slow-consumer scenario the second approach is slightly faster.
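For context, the unbounded-queue idea can be sketched roughly as follows. This is a generic pattern, not the actual code in #17632, and the function name is invented: a bridging goroutine keeps a growable backlog between the producer and the subscriber, so the producer never blocks and no update is dropped.

```go
// unboundedBridge forwards values from in to out without ever blocking the
// sender: anything the consumer has not yet picked up is kept in a slice
// that grows as needed. Generic sketch only, not the #17632 implementation.
func unboundedBridge[T any](in <-chan T, out chan<- T) {
	var backlog []T
	for in != nil || len(backlog) > 0 {
		// Only offer a value to out when the backlog is non-empty;
		// sending on a nil channel blocks forever, which disables that case.
		var sendCh chan<- T
		var next T
		if len(backlog) > 0 {
			sendCh = out
			next = backlog[0]
		}
		select {
		case v, ok := <-in:
			if !ok {
				in = nil // producer is done; keep draining the backlog
				continue
			}
			backlog = append(backlog, v)
		case sendCh <- next:
			backlog = backlog[1:]
		}
	}
	close(out)
}
```

The backlog is bounded only by available memory, which is exactly the trade-off raised in the next comment.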
Isn't the second solution here much simpler? An unbounded queue doesn't really exist; it's always limited by practical concerns like the memory and CPU available. Wouldn't raising the channel size to something significantly higher than any practical limit on the size of a Vitess cluster suffice as well, then? We could also go significantly higher than 1024 if that's safer, like 64k or something along those lines?
@dbussink I would agree with you that 64k would theoretically be the same fix, but as per the benchmarks in #17629 (comment), the message queue seems to use less memory with a slower consumer.
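One more consideration with a very large channel (purely illustrative, not Vitess code; the type name is invented): a buffered channel reserves its ring buffer up front at the chosen capacity, whereas a slice-backed queue only allocates when updates actually back up. A minimal sketch:

```go
package main

import "fmt"

type tabletHealthUpdate struct{} // stand-in for the real healthcheck update type

func main() {
	// A buffered channel of capacity 64k reserves all 64k pointer slots
	// immediately (~512 KiB on a 64-bit platform), whether or not it ever fills.
	ch := make(chan *tabletHealthUpdate, 65536)

	// A slice-backed backlog allocates nothing until updates actually queue up,
	// and then grows in proportion to the real backlog.
	var backlog []*tabletHealthUpdate

	fmt.Println(cap(ch), len(backlog)) // 65536 0
}
```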
There was/is a second issue with doing this (the details were in a screenshot), and then I never opened an issue for it. 😆 So, basically what happens is that …
Overview of the Issue
When doing a software rollout across many shards, it has been observed that some vtgates end buffering with a timeout instead of ending it by marking the shard as consistent.
Reproduction Steps
Roll out simultaneously across a 32-shard keyspace, using PlannedReparentShard to elect a new primary before restarting the old primary.
Binary Version
Operating System and Environment details
Log Fragments
Logs also show that the vtgate has received a healthcheck from the new primary several seconds before this message.