rls: Fix flaky test Test/ControlChannelConnectivityStateMonitoring #8055

Open · wants to merge 5 commits into base: master

Conversation

eshitachandwani (Member)

fixes #5468
The test flakes because:

    • Test creates a ClientConn with service config setting the LB policy to "rls"
    • RLS LB policy initializes the control channel to the RLS server. As part of this, it spawns a goroutine to monitor the control channel connectivity state changes.
    • But by the time the first RPC is successfully made, the above goroutine has not gotten a chance to run yet.
    • And at this time, the test stops the RLS server. This moves the control channel to IDLE and it is only now that the monitoring goroutine gets to run, and it has already missed the first transition to READY.

FIX: Use a channel to make sure the monitoring goroutine has started (see the sketch below).

  1. Our current state change API is lossy: state changes can be lost between WaitForStateChange returning and the caller invoking GetState.
     FIX: Use the grpcsync PubSub to subscribe to connectivity state changes so that no state change is lost.
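A minimal, illustrative sketch of the "make sure the goroutine has started" idea (names and placement here are assumptions, not the actual patch):

	started := make(chan struct{})
	go func() {
		close(started) // signal that the monitoring goroutine is now running
		cc.monitorConnectivityState()
	}()
	<-started // do not proceed until the goroutine has been scheduled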

RELEASE NOTES: N/A


codecov bot commented Jan 30, 2025

Codecov Report

Attention: Patch coverage is 92.85714% with 2 lines in your changes missing coverage. Please review.

Project coverage is 82.29%. Comparing base (e0d191d) to head (990c949).

Files with missing lines          Patch %   Lines
balancer/rls/control_channel.go   92.85%    1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8055      +/-   ##
==========================================
- Coverage   82.29%   82.29%   -0.01%     
==========================================
  Files         387      387              
  Lines       39065    39081      +16     
==========================================
+ Hits        32150    32163      +13     
+ Misses       5584     5581       -3     
- Partials     1331     1337       +6     
Files with missing lines          Coverage Δ
balancer/rls/control_channel.go   92.51% <92.85%> (-0.62%) ⬇️

... and 20 files with indirect coverage changes

easwars (Contributor) commented Feb 3, 2025

Can you try 10K or 1M runs in forge before and after the fix to ensure that flakes are eliminated by the fix?

func (c *ccStateSubscriber) OnMessage(msg any) {
	st, ok := msg.(connectivity.State)
	if !ok {
		return // Ignore invalid messages

This is an error we don't expect to happen in practice, and if it does, it indicates a severe programming error. I would be OK with adding a panic here that includes the type being received.
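A possible shape for that suggestion (sketch only; assumes the subscriber forwards states onto its unbounded buffer as in the snippet below, and that fmt is imported):

	func (c *ccStateSubscriber) OnMessage(msg any) {
		st, ok := msg.(connectivity.State)
		if !ok {
			// Receiving anything other than a connectivity.State indicates a
			// severe programming error; surface the offending type in the panic.
			panic(fmt.Sprintf("rls: unexpected message type %T in connectivity state subscriber", msg))
		}
		c.state.Put(st)
	}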

stateSubscriber := &ccStateSubscriber{
	state: buffer.NewUnbounded(),
}
unsubscribe := internal.SubscribeToConnectivityStateChanges.(func(cc *grpc.ClientConn, s grpcsync.Subscriber) func())(cc.cc, stateSubscriber)

When a new subscriber is added to the pubsub, it receives the most recent message posted on the pubsub. This means that if there were N messages posted on the pubsub when a new subscriber is added, the subscriber only receives the most recently posted message. This might not be good enough for our purposes here. So, I suggest making the following changes.

  • Get rid of the ccStateSubscriber. Instead store the buffer.Unbounded as a field of controlChannel.
  • Initialize the unbounded buffer when the control channel is created in newControlChannel.
  • Change grpc.Dial to grpc.NewClient in newControlChannel.
  • Register the subscriber right after creating the ClientConn to the RLS server, but before calling Connect on it. This will ensure that the subscriber will receive every single state change on the ClientConn.
    • Implement the OnMessage method on the controlChannel type and pass it to the call to internal.SubscribeToConnectivityStateChanges
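A rough sketch of the suggested shape (constructor signature simplified; field names and details are assumptions rather than the final implementation):

	// controlChannel owns the unbounded buffer and implements grpcsync.Subscriber itself.
	type controlChannel struct {
		cc          *grpc.ClientConn
		stateBuffer *buffer.Unbounded // connectivity state updates from the pubsub
		unsubscribe func()
		// ...existing fields elided...
	}

	// OnMessage implements grpcsync.Subscriber by forwarding every connectivity
	// state update onto the unbounded buffer.
	func (cc *controlChannel) OnMessage(msg any) {
		cc.stateBuffer.Put(msg)
	}

	func newControlChannel(rlsServerName string, opts ...grpc.DialOption) (*controlChannel, error) {
		ctrlCh := &controlChannel{stateBuffer: buffer.NewUnbounded()}

		// grpc.NewClient (unlike grpc.Dial) does not start connecting, so the
		// subscriber can be registered before the first connection attempt.
		cc, err := grpc.NewClient(rlsServerName, opts...)
		if err != nil {
			return nil, err
		}
		ctrlCh.cc = cc

		// Subscribe before calling Connect so that every state change on the
		// ClientConn is delivered to the buffer.
		ctrlCh.unsubscribe = internal.SubscribeToConnectivityStateChanges.(func(cc *grpc.ClientConn, s grpcsync.Subscriber) func())(cc, ctrlCh)
		cc.Connect()

		return ctrlCh, nil
	}

Registering the subscriber before Connect is what guarantees that even the very first transitions (IDLE, CONNECTING, READY) are observed.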

Comment on lines 178 to 179
unsubscribe()
stateSubscriber.state.Close()

This would move to controlChannel.close.
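For example, a sketch using the field names assumed in the rough sketch above (not necessarily the final code):

	func (cc *controlChannel) close() {
		cc.unsubscribe()       // stop receiving connectivity state updates
		cc.stateBuffer.Close() // release the unbounded buffer
		cc.cc.Close()          // tear down the ClientConn to the RLS server
	}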

for {
	// Wait for the control channel to become READY.
	for s := cc.cc.GetState(); s != connectivity.Ready; s = cc.cc.GetState() {
	var s any

Can we define this variable to be of the concrete type connectivity.State instead?

@@ -176,11 +197,15 @@ func (cc *controlChannel) monitorConnectivityState() {
	first = false

	// Wait for the control channel to move out of READY.
	cc.cc.WaitForStateChange(ctx, connectivity.Ready)
	if cc.cc.GetState() == connectivity.Shutdown {
	for s = <-stateSubscriber.state.Get(); s == connectivity.Ready; s = <-stateSubscriber.state.Get() {

Do we need a for loop here? We know for a fact that we are in READY. So, the first time we actually read anything out of the unbounded buffer, we can be sure that we have moved out of READY. Am I missing something?
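A sketch of that simplification (using a concrete connectivity.State as suggested earlier; the Load() call is the usual companion to Get() on the internal buffer.Unbounded, and the Shutdown handling mirrors the existing code):

	// The channel is known to be READY here, so a single read from the buffer
	// is enough to observe the transition out of READY.
	v, ok := <-stateSubscriber.state.Get()
	if !ok {
		return // the buffer was closed; the control channel is shutting down
	}
	stateSubscriber.state.Load() // let the next buffered update through
	if s := v.(connectivity.State); s == connectivity.Shutdown {
		return
	}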

Development

Successfully merging this pull request may close these issues.

Flaky test: 1/10k: ControlChannelConnectivityStateMonitoring
3 participants