Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make bootstrapping handle its own timeouts #3410

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

yacovm
Copy link
Contributor

@yacovm yacovm commented Sep 24, 2024

Why this should be merged

Currently, an engine registers timeouts into the handler, which schedules the timeouts on behalf of the the engine.

The handler then notifies the engine when the timeout expired.

However, the only engine that uses this mechanism is the bootstrapping engine, and not the other engine types such as the snowman and state sync engines.

It therefore makes sense to consolidate the timeout handling instead of delegating them to the handler.

By moving the timeout handling closer to the bootstrapper, we can make the API of the common.Engine be slimmer by removing the Timeout() method from it.

How this works

I refactored the timeout logic and decoupled it from the handler.
I also re-implemented the timeout mechanism in order to be more robust and testable.

How this was tested

  • CI tests
  • Added unit tests with 100% code coverage.
  • I ran a node which I built with the data race detector flag until after it was bootstrapped in order to ensure there were no data races introduced in this PR.
diff --git a/scripts/build_avalanche.sh b/scripts/build_avalanche.sh
index 74ec675e2..26320a562 100755
--- a/scripts/build_avalanche.sh
+++ b/scripts/build_avalanche.sh
@@ -29,4 +29,4 @@ source "$AVALANCHE_PATH"/scripts/constants.sh
 
 build_args="$race"
 echo "Building AvalancheGo..."
-go build $build_args -ldflags "-X github.com/ava-labs/avalanchego/version.GitCommit=$git_commit $static_ld_flags" -o "$avalanchego_path" "$AVALANCHE_PATH/main/"*.go
+go build -race $build_args -ldflags "-X github.com/ava-labs/avalanchego/version.GitCommit=$git_commit $static_ld_flags" -o "$avalanchego_path" "$AVALANCHE_PATH/main/"*.go

The node successfully finished bootstrapping and there were no data races observed in its log.

@yacovm yacovm force-pushed the simplifyTimeouts branch 2 times, most recently from e2641d7 to 09ad764 Compare September 24, 2024 00:56
@yacovm yacovm marked this pull request as draft September 24, 2024 01:07
@yacovm yacovm force-pushed the simplifyTimeouts branch 3 times, most recently from 25122c6 to b54e285 Compare September 24, 2024 01:59
@yacovm yacovm self-assigned this Sep 24, 2024
@yacovm yacovm force-pushed the simplifyTimeouts branch 7 times, most recently from 31d051e to c25a68d Compare September 25, 2024 01:02
@yacovm yacovm marked this pull request as ready for review September 25, 2024 01:15
}

// RegisterTimeout fires the function the timeout handler is initialized with no later than the given timeout.
func (th *timeoutHandler) RegisterTimeout(d time.Duration) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just moved this code section from handler.go

snow/engine/common/bootstrap_tracker.go Outdated Show resolved Hide resolved
snow/engine/snowman/bootstrap/bootstrapper.go Show resolved Hide resolved
snow/engine/common/timer.go Outdated Show resolved Hide resolved
snow/engine/common/timer.go Outdated Show resolved Hide resolved
Comment on lines 49 to 52
newTimer: func(d time.Duration) (<-chan time.Time, func() bool) {
timer := time.NewTimer(d)
return timer.C, timer.Stop
},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for specifying this function here rather than just using time.NewTimer directly?

It looks like we replace this with an identical implementation in one of our test cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I needed it for the tests :-)

},
},
} {
t.Run(testCase.desc, func(t *testing.T) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to remove clock from these tests?

What do you think of adding test cases for multiple timeouts w/ and w/o preemption?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added two tests as requested, but I need the clock for these tests.

@@ -14,5 +14,6 @@ type BootstrapTracker interface {
// Bootstrapped marks the named chain as being bootstrapped
Bootstrapped(chainID ids.ID)

OnBootstrapCompleted() chan struct{}
// AllBootstrapped returns a channel that is closed when all chains have been bootstrapped
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// AllBootstrapped returns a channel that is closed when all chains have been bootstrapped
// AllBootstrapped returns a channel that is closed when all chains in this
// subnet have been bootstrapped

return ps.signal
}

// Preempt causes any past and future call of Listen return a closed channel.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Preempt causes any past and future call of Listen return a closed channel.
// Preempt causes any past and future calls of Listen to return a closed channel.

Comment on lines 54 to 57
onTimeout: func() {
onTimeout()
<-timeoutBuff
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we wrap onTimeout rather than just draining the channel when we call this function? The code that executes this function is already managing the channel right?

Comment on lines 36 to 44
type timeoutHandler struct {
newTimer func(time.Duration) (<-chan time.Time, func() bool)
timeouts chan struct{}
onTimeout func()
preemptionSignal <-chan struct{}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been looking at the timeoutHandler for the past 10m and I'm still not really sure what this is doing.

Is all this complexity really needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it this way because I didn't want to change the original code pattern, and to make as minimal changes as possible.

I now re-implemented it. I hope it reads better and is easier to reason about.

Comment on lines 72 to 79
// If there is already a timeout ready to fire - just drop the
// additional timeout. This ensures that all goroutines that are spawned
// here are able to close if the chain is shutdown.
select {
case th.timeouts <- struct{}{}:
th.onTimeout()
default:
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels racy to me. What are the expected guarantees around onTimeout being called?

As this is currently written, it seems like a caller may call RegisterTimeout and then never see an onTimeout called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be at least a single single timeout called, as th.timeouts is initialized to a buffered channel with a capacity of one.

If two are called concurrently, then there is an attempt to only call one.

This code was not made in this PR, I just moved it around.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-implemented this part, and added a lock on the context lock in the timeout callback

Comment on lines 144 to 151
timeout := func() {
if err := bs.Timeout(); err != nil {
bs.Config.Ctx.Log.Warn("Encountered error during bootstrapping: %w", zap.Error(err))
}
}
Copy link
Contributor

@StephenButtolph StephenButtolph Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is going to be called in a separate goroutine without the context lock held. Right? I think that can cause some racy behavior

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, you're saying that the current code synchronizes all the calls to Timeout and calls them via the handler, while this new code calls each function in its own goroutine, is that it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm now locking the context lock in the timeout callback. I am not thrilled about it, but since we already have a few dozens of places where we lock this context lock, i figured it makes sense to proceed than to revert and have a tighter coupling between the engine and the handler via the timeout feedback loop.

@yacovm yacovm force-pushed the simplifyTimeouts branch 5 times, most recently from 2d6f9b4 to 4f949b8 Compare October 3, 2024 19:56
Currently, an engine registers timeouts into the handler, which schedules the timeouts on behalf of the the engine.
The handler then notifies the engine when the timeout expired.

However, the only engine that uses this mechanism is the bootstrapping engine, and not the other engine types such as the snowman and state sync engines.

It therefore makes sense to consolidate the timeout handling instead of delegating them to the handler.

By moving the timeout handling closer to the bootstrapper, we can make the API of the common.Engine be slimmer by removing the Timeout() method from it.

Signed-off-by: Yacov Manevich <[email protected]>
@yacovm yacovm added the cleanup Code quality improvement label Oct 3, 2024
Comment on lines 64 to 67
if !preempted && !acquiredToken {
return
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there special handling around the preemption here? Specifically, I'm not sure why we allow multiple goroutines to be scheduled if we have been preempted. I think just having:

if !th.acquirePendingTimeoutToken() {
	return
}

should be sufficient for our expected invariants.

If we did this, I think we could get rid of the acquiredToken argument in scheduleTimeout and remove the preempted helper entirely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the behavior of the previous code scheduled a new goroutine regardless.
The trick in the bottom select doesn't work in most cases as it's very time reliant - it schedules a goroutine regardless, and then attempts to "cancel it" after the timeout.

func (h *handler) RegisterTimeout(d time.Duration) {
	go func() {
		timer := time.NewTimer(d)
		defer timer.Stop()

		select {
		case <-timer.C:
		case <-h.preemptTimeouts:
		}

		// If there is already a timeout ready to fire - just drop the
		// additional timeout. This ensures that all goroutines that are spawned
		// here are able to close if the chain is shutdown.
		select {
		case h.timeouts <- struct{}{}:
		default:
		}
	}()
}

I agree though I should probably take advantage of the opportunity and not replicate a pattern if it does not add any value.

timer := th.newTimer(d)
defer timer.Stop()

defer th.onTimeout()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I feel like this makes a bit more sense to just be called inline at the end of the function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inlined

}

if acquiredToken {
th.relinquishPendingTimeoutToken()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we comment why it is important to return the token prior to calling onTimeout?

// Allow the next timeout to be scheduled. It is important that this is
// performed prior to invoking the timeout handler to allow the timeout
// handler (or another goroutine synchronized with the timeout handler) to
// schedule a future timeout and be guaranteed it will fire.

or something

Comment on lines 111 to 113
// RegisterTimeout specifies how much time to delay the next timeout message
// by. If the subnet has been bootstrapped, the timeout will fire
// immediately.
// immediately via calling Preempt().
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we are here, it probably makes sense to expand on the interface comment a bit:

// RegisterTimeout specifies how much time to delay the next timeout message
// by.
//
// If there is already a pending timeout message, this call is a no-op.
// However, it is guaranteed that the timeout will fire at least once after
// calling this function.
//
// If the subnet has been bootstrapped, the timeout will fire immediately
// via calling Preempt().

Comment on lines 49 to 56
pendingTimout := make(chan struct{}, 1)
pendingTimout <- struct{}{}
return &timeoutScheduler{
preemptionSignal: preemptionSignal,
newTimer: newTimer,
onTimeout: onTimeout,
pendingTimeout: pendingTimout,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙏 I think this is much cleaner than before - ty


// NewTimeoutScheduler constructs a new timeout scheduler with the given function to be invoked upon a timeout,
// unless the preemptionSignal is closed and in which case it invokes the function immediately.
func NewTimeoutScheduler(onTimeout func(), preemptionSignal <-chan struct{}, newTimer func(duration time.Duration) *time.Timer) *timeoutScheduler {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

q: since we only ever specify newTimer manually in tests in this package, would it make sense to remove this from the signature (initializing to time.NewTimer in this constructer) and then override the newTimer variable in the tests?

It seems like that would give a nicer UX for the prod code using this... But I don't have strong opinions here, just a small suggestion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes makes sense. That was also how it was in the previous incarnation of this code.

Comment on lines 89 to 96
newTimer := func(_ time.Duration) *time.Timer {
// We use a duration of 0 to not leave a lingering timer
// after the test finishes.
// Then we replace the time channel to have control over the timer.
timer := time.NewTimer(0)
timer.C = clock
return timer
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make this a helper (since we define and use it twice?)

Comment on lines 98 to 103
var wg sync.WaitGroup
wg.Add(1)

onTimeout := func() {
wg.Done()
}

roChan := make(<-chan struct{})

ts := NewTimeoutScheduler(onTimeout, roChan, newTimer)

ts.RegisterTimeout(time.Hour) // First timeout is registered
ts.RegisterTimeout(time.Hour) // Second should not

// Clock ticks are after registering, in order to ensure onTimeout() isn't fired until second registration is invoked.
clock <- time.Now()
clock <- time.Now()

wg.Wait()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I wasn't sure what roChan meant, felt like preemptChan was a more clear name.
  2. We can just pass wg.Done rather than wrapping it into another function.
Suggested change
var wg sync.WaitGroup
wg.Add(1)
onTimeout := func() {
wg.Done()
}
roChan := make(<-chan struct{})
ts := NewTimeoutScheduler(onTimeout, roChan, newTimer)
ts.RegisterTimeout(time.Hour) // First timeout is registered
ts.RegisterTimeout(time.Hour) // Second should not
// Clock ticks are after registering, in order to ensure onTimeout() isn't fired until second registration is invoked.
clock <- time.Now()
clock <- time.Now()
wg.Wait()
var wg sync.WaitGroup
wg.Add(1)
preemptChan := make(<-chan struct{})
ts := NewTimeoutScheduler(wg.Done, preemptChan, newTimer)
ts.RegisterTimeout(time.Hour) // First timeout is registered
ts.RegisterTimeout(time.Hour) // Second should not
// Clock ticks are after registering, in order to ensure wg.Done() isn't fired until second registration is invoked.
clock <- time.Now()
clock <- time.Now()
wg.Wait()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

roChan means read-only channel. Agree preemptChan is better.

}
}

func TestTimeoutSchedulerConcurrentRegister(_ *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we comment that this test relies on wg.Done to panic if the counter goes negative? Typically a test that don't use the *testing.T isn't really a test (as it can't fail normally).

Comment on lines 26 to 27
initClock: func(_ chan time.Time) {},
advanceTime: func(_ chan time.Time) {},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super nit, but typically we don't put the _ if we don't need to.

Suggested change
initClock: func(_ chan time.Time) {},
advanceTime: func(_ chan time.Time) {},
initClock: func(chan time.Time) {},
advanceTime: func(chan time.Time) {},

Signed-off-by: Yacov Manevich <[email protected]>
@yacovm
Copy link
Contributor Author

yacovm commented Oct 11, 2024

@StephenButtolph thanks for the review, addressed your comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cleanup Code quality improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants