Make bootstrapping handle its own timeouts #3410

yacovm · 2024-09-24T00:40:41Z

Why this should be merged

Currently, an engine registers timeouts into the handler, which schedules the timeouts on behalf of the the engine.

The handler then notifies the engine when the timeout expired.

However, the only engine that uses this mechanism is the bootstrapping engine, and not the other engine types such as the snowman and state sync engines.

It therefore makes sense to consolidate the timeout handling instead of delegating them to the handler.

By moving the timeout handling closer to the bootstrapper, we can make the API of the common.Engine be slimmer by removing the Timeout() method from it.

How this works

I refactored the timeout logic and decoupled it from the handler.
I also re-implemented the timeout mechanism in order to be more robust and testable.

How this was tested

CI tests
Added unit tests with 100% code coverage.
I ran a node which I built with the data race detector flag until after it was bootstrapped in order to ensure there were no data races introduced in this PR.

diff --git a/scripts/build_avalanche.sh b/scripts/build_avalanche.sh
index 74ec675e2..26320a562 100755
--- a/scripts/build_avalanche.sh
+++ b/scripts/build_avalanche.sh
@@ -29,4 +29,4 @@ source "$AVALANCHE_PATH"/scripts/constants.sh
 
 build_args="$race"
 echo "Building AvalancheGo..."
-go build $build_args -ldflags "-X github.com/ava-labs/avalanchego/version.GitCommit=$git_commit $static_ld_flags" -o "$avalanchego_path" "$AVALANCHE_PATH/main/"*.go
+go build -race $build_args -ldflags "-X github.com/ava-labs/avalanchego/version.GitCommit=$git_commit $static_ld_flags" -o "$avalanchego_path" "$AVALANCHE_PATH/main/"*.go

The node successfully finished bootstrapping and there were no data races observed in its log.

yacovm · 2024-09-25T01:18:18Z

snow/engine/common/timer.go

+}
+
+// RegisterTimeout fires the function the timeout handler is initialized with no later than the given timeout.
+func (th *timeoutHandler) RegisterTimeout(d time.Duration) {


I just moved this code section from handler.go

snow/engine/common/bootstrap_tracker.go

snow/engine/snowman/bootstrap/bootstrapper.go

snow/engine/common/timer.go

aaronbuchwald · 2024-10-01T11:43:27Z

snow/engine/common/timer.go

+		newTimer: func(d time.Duration) (<-chan time.Time, func() bool) {
+			timer := time.NewTimer(d)
+			return timer.C, timer.Stop
+		},


What's the reason for specifying this function here rather than just using time.NewTimer directly?

It looks like we replace this with an identical implementation in one of our test cases

Because I needed it for the tests :-)

aaronbuchwald · 2024-10-01T11:46:01Z

snow/engine/common/timer_test.go

+			},
+		},
+	} {
+		t.Run(testCase.desc, func(t *testing.T) {


Is it possible to remove clock from these tests?

What do you think of adding test cases for multiple timeouts w/ and w/o preemption?

I added two tests as requested, but I need the clock for these tests.

StephenButtolph · 2024-10-02T21:24:19Z

snow/engine/common/bootstrap_tracker.go

@@ -14,5 +14,6 @@ type BootstrapTracker interface {
 	// Bootstrapped marks the named chain as being bootstrapped
 	Bootstrapped(chainID ids.ID)

-	OnBootstrapCompleted() chan struct{}
+	// AllBootstrapped returns a channel that is closed when all chains have been bootstrapped


Suggested change

// AllBootstrapped returns a channel that is closed when all chains have been bootstrapped

// AllBootstrapped returns a channel that is closed when all chains in this

// subnet have been bootstrapped

StephenButtolph · 2024-10-02T21:25:55Z

snow/engine/common/timer.go

+	return ps.signal
+}
+
+// Preempt causes any past and future call of Listen return a closed channel.


Suggested change

// Preempt causes any past and future call of Listen return a closed channel.

// Preempt causes any past and future calls of Listen to return a closed channel.

StephenButtolph · 2024-10-02T21:36:16Z

snow/engine/common/timer.go

+		onTimeout: func() {
+			onTimeout()
+			<-timeoutBuff
+		},


Why do we wrap onTimeout rather than just draining the channel when we call this function? The code that executes this function is already managing the channel right?

StephenButtolph · 2024-10-02T21:36:31Z

snow/engine/common/timer.go

+type timeoutHandler struct {
+	newTimer         func(time.Duration) (<-chan time.Time, func() bool)
+	timeouts         chan struct{}
+	onTimeout        func()
+	preemptionSignal <-chan struct{}
+}


I've been looking at the timeoutHandler for the past 10m and I'm still not really sure what this is doing.

Is all this complexity really needed?

I made it this way because I didn't want to change the original code pattern, and to make as minimal changes as possible.

I now re-implemented it. I hope it reads better and is easier to reason about.

StephenButtolph · 2024-10-02T21:41:03Z

snow/engine/common/timer.go

+		// If there is already a timeout ready to fire - just drop the
+		// additional timeout. This ensures that all goroutines that are spawned
+		// here are able to close if the chain is shutdown.
+		select {
+		case th.timeouts <- struct{}{}:
+			th.onTimeout()
+		default:
+		}


This feels racy to me. What are the expected guarantees around onTimeout being called?

As this is currently written, it seems like a caller may call RegisterTimeout and then never see an onTimeout called.

There should be at least a single single timeout called, as th.timeouts is initialized to a buffered channel with a capacity of one.

If two are called concurrently, then there is an attempt to only call one.

This code was not made in this PR, I just moved it around.

I re-implemented this part, and added a lock on the context lock in the timeout callback

StephenButtolph · 2024-10-02T21:44:24Z

snow/engine/snowman/bootstrap/bootstrapper.go

+	timeout := func() {
+		if err := bs.Timeout(); err != nil {
+			bs.Config.Ctx.Log.Warn("Encountered error during bootstrapping: %w", zap.Error(err))
+		}
+	}


I think this is going to be called in a separate goroutine without the context lock held. Right? I think that can cause some racy behavior

So, you're saying that the current code synchronizes all the calls to Timeout and calls them via the handler, while this new code calls each function in its own goroutine, is that it?

I'm now locking the context lock in the timeout callback. I am not thrilled about it, but since we already have a few dozens of places where we lock this context lock, i figured it makes sense to proceed than to revert and have a tighter coupling between the engine and the handler via the timeout feedback loop.

Currently, an engine registers timeouts into the handler, which schedules the timeouts on behalf of the the engine. The handler then notifies the engine when the timeout expired. However, the only engine that uses this mechanism is the bootstrapping engine, and not the other engine types such as the snowman and state sync engines. It therefore makes sense to consolidate the timeout handling instead of delegating them to the handler. By moving the timeout handling closer to the bootstrapper, we can make the API of the common.Engine be slimmer by removing the Timeout() method from it. Signed-off-by: Yacov Manevich <[email protected]>

StephenButtolph · 2024-10-11T18:33:53Z

snow/engine/common/timer.go

+	if !preempted && !acquiredToken {
+		return
+	}


Why is there special handling around the preemption here? Specifically, I'm not sure why we allow multiple goroutines to be scheduled if we have been preempted. I think just having:

if !th.acquirePendingTimeoutToken() { return }

should be sufficient for our expected invariants.

If we did this, I think we could get rid of the acquiredToken argument in scheduleTimeout and remove the preempted helper entirely.

Because the behavior of the previous code scheduled a new goroutine regardless.
The trick in the bottom select doesn't work in most cases as it's very time reliant - it schedules a goroutine regardless, and then attempts to "cancel it" after the timeout.

func (h *handler) RegisterTimeout(d time.Duration) { go func() { timer := time.NewTimer(d) defer timer.Stop() select { case <-timer.C: case <-h.preemptTimeouts: } // If there is already a timeout ready to fire - just drop the // additional timeout. This ensures that all goroutines that are spawned // here are able to close if the chain is shutdown. select { case h.timeouts <- struct{}{}: default: } }() }

I agree though I should probably take advantage of the opportunity and not replicate a pattern if it does not add any value.

StephenButtolph · 2024-10-11T18:34:56Z

snow/engine/common/timer.go

+	timer := th.newTimer(d)
+	defer timer.Stop()
+
+	defer th.onTimeout()


nit: I feel like this makes a bit more sense to just be called inline at the end of the function.

StephenButtolph · 2024-10-11T18:35:48Z

snow/engine/common/timer.go

+	}
+
+	if acquiredToken {
+		th.relinquishPendingTimeoutToken()


Can we comment why it is important to return the token prior to calling onTimeout?

// Allow the next timeout to be scheduled. It is important that this is // performed prior to invoking the timeout handler to allow the timeout // handler (or another goroutine synchronized with the timeout handler) to // schedule a future timeout and be guaranteed it will fire.

or something

StephenButtolph · 2024-10-11T18:37:29Z

snow/engine/common/timer.go

 	// RegisterTimeout specifies how much time to delay the next timeout message
 	// by. If the subnet has been bootstrapped, the timeout will fire
-	// immediately.
+	// immediately via calling Preempt().


While we are here, it probably makes sense to expand on the interface comment a bit:

// RegisterTimeout specifies how much time to delay the next timeout message // by. // // If there is already a pending timeout message, this call is a no-op. // However, it is guaranteed that the timeout will fire at least once after // calling this function. // // If the subnet has been bootstrapped, the timeout will fire immediately // via calling Preempt().

StephenButtolph · 2024-10-11T18:40:43Z

snow/engine/common/timer.go

+	pendingTimout := make(chan struct{}, 1)
+	pendingTimout <- struct{}{}
+	return &timeoutScheduler{
+		preemptionSignal: preemptionSignal,
+		newTimer:         newTimer,
+		onTimeout:        onTimeout,
+		pendingTimeout:   pendingTimout,
+	}


🙏 I think this is much cleaner than before - ty

StephenButtolph · 2024-10-11T18:44:07Z

snow/engine/common/timer.go

+
+// NewTimeoutScheduler constructs a new timeout scheduler with the given function to be invoked upon a timeout,
+// unless the preemptionSignal is closed and in which case it invokes the function immediately.
+func NewTimeoutScheduler(onTimeout func(), preemptionSignal <-chan struct{}, newTimer func(duration time.Duration) *time.Timer) *timeoutScheduler {


q: since we only ever specify newTimer manually in tests in this package, would it make sense to remove this from the signature (initializing to time.NewTimer in this constructer) and then override the newTimer variable in the tests?

It seems like that would give a nicer UX for the prod code using this... But I don't have strong opinions here, just a small suggestion.

Yes makes sense. That was also how it was in the previous incarnation of this code.

StephenButtolph · 2024-10-11T18:45:16Z

snow/engine/common/timer_test.go

+	newTimer := func(_ time.Duration) *time.Timer {
+		// We use a duration of 0 to not leave a lingering timer
+		// after the test finishes.
+		// Then we replace the time channel to have control over the timer.
+		timer := time.NewTimer(0)
+		timer.C = clock
+		return timer
+	}


Should we make this a helper (since we define and use it twice?)

StephenButtolph · 2024-10-11T18:48:32Z

snow/engine/common/timer_test.go

+	var wg sync.WaitGroup
+	wg.Add(1)
+
+	onTimeout := func() {
+		wg.Done()
+	}
+
+	roChan := make(<-chan struct{})
+
+	ts := NewTimeoutScheduler(onTimeout, roChan, newTimer)
+
+	ts.RegisterTimeout(time.Hour) // First timeout is registered
+	ts.RegisterTimeout(time.Hour) // Second should not
+
+	// Clock ticks are after registering, in order to ensure onTimeout() isn't fired until second registration is invoked.
+	clock <- time.Now()
+	clock <- time.Now()
+
+	wg.Wait()


I wasn't sure what roChan meant, felt like preemptChan was a more clear name.

We can just pass wg.Done rather than wrapping it into another function.

Suggested change

var wg sync.WaitGroup

wg.Add(1)

onTimeout := func() {

wg.Done()

}

roChan := make(<-chan struct{})

ts := NewTimeoutScheduler(onTimeout, roChan, newTimer)

ts.RegisterTimeout(time.Hour) // First timeout is registered

ts.RegisterTimeout(time.Hour) // Second should not

// Clock ticks are after registering, in order to ensure onTimeout() isn't fired until second registration is invoked.

clock <- time.Now()

clock <- time.Now()

wg.Wait()

var wg sync.WaitGroup

wg.Add(1)

preemptChan := make(<-chan struct{})

ts := NewTimeoutScheduler(wg.Done, preemptChan, newTimer)

ts.RegisterTimeout(time.Hour) // First timeout is registered

ts.RegisterTimeout(time.Hour) // Second should not

// Clock ticks are after registering, in order to ensure wg.Done() isn't fired until second registration is invoked.

clock <- time.Now()

clock <- time.Now()

wg.Wait()

roChan means read-only channel. Agree preemptChan is better.

StephenButtolph · 2024-10-11T18:50:42Z

snow/engine/common/timer_test.go

+	}
+}
+
+func TestTimeoutSchedulerConcurrentRegister(_ *testing.T) {


Should we comment that this test relies on wg.Done to panic if the counter goes negative? Typically a test that don't use the *testing.T isn't really a test (as it can't fail normally).

StephenButtolph · 2024-10-11T18:52:44Z

snow/engine/common/timer_test.go

+			initClock:               func(_ chan time.Time) {},
+			advanceTime:             func(_ chan time.Time) {},


Super nit, but typically we don't put the _ if we don't need to.

Suggested change

initClock: func(_ chan time.Time) {},

advanceTime: func(_ chan time.Time) {},

initClock: func(chan time.Time) {},

advanceTime: func(chan time.Time) {},

Signed-off-by: Yacov Manevich <[email protected]>

yacovm · 2024-10-11T20:27:28Z

@StephenButtolph thanks for the review, addressed your comments.

yacovm requested a review from StephenButtolph as a code owner September 24, 2024 00:40

yacovm force-pushed the simplifyTimeouts branch 2 times, most recently from e2641d7 to 09ad764 Compare September 24, 2024 00:56

yacovm marked this pull request as draft September 24, 2024 01:07

yacovm force-pushed the simplifyTimeouts branch 3 times, most recently from 25122c6 to b54e285 Compare September 24, 2024 01:59

yacovm self-assigned this Sep 24, 2024

yacovm force-pushed the simplifyTimeouts branch 7 times, most recently from 31d051e to c25a68d Compare September 25, 2024 01:02

yacovm marked this pull request as ready for review September 25, 2024 01:15

yacovm commented Sep 25, 2024

View reviewed changes

marun reviewed Sep 25, 2024

View reviewed changes

snow/engine/common/bootstrap_tracker.go Outdated Show resolved Hide resolved

snow/engine/snowman/bootstrap/bootstrapper.go Show resolved Hide resolved

snow/engine/common/timer.go Outdated Show resolved Hide resolved

snow/engine/common/timer.go Outdated Show resolved Hide resolved

marun approved these changes Sep 26, 2024

View reviewed changes

aaronbuchwald requested changes Oct 1, 2024

View reviewed changes

yacovm force-pushed the simplifyTimeouts branch from 9be93a5 to fe44d95 Compare October 1, 2024 18:18

yacovm requested a review from aaronbuchwald October 1, 2024 19:23

StephenButtolph reviewed Oct 2, 2024

View reviewed changes

yacovm force-pushed the simplifyTimeouts branch 5 times, most recently from 2d6f9b4 to 4f949b8 Compare October 3, 2024 19:56

yacovm force-pushed the simplifyTimeouts branch from 4f949b8 to bdf0063 Compare October 3, 2024 20:14

yacovm added the cleanup Code quality improvement label Oct 3, 2024

StephenButtolph reviewed Oct 11, 2024

View reviewed changes

Address code review comments

a7b3ab4

Signed-off-by: Yacov Manevich <[email protected]>

yacovm force-pushed the simplifyTimeouts branch from 214ed68 to a7b3ab4 Compare October 11, 2024 20:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make bootstrapping handle its own timeouts #3410

Make bootstrapping handle its own timeouts #3410

yacovm commented Sep 24, 2024 •

edited

Loading

yacovm Sep 25, 2024

aaronbuchwald Oct 1, 2024

yacovm Oct 1, 2024

aaronbuchwald Oct 1, 2024

yacovm Oct 1, 2024

StephenButtolph Oct 2, 2024

StephenButtolph Oct 2, 2024

StephenButtolph Oct 2, 2024

StephenButtolph Oct 2, 2024

yacovm Oct 3, 2024

StephenButtolph Oct 2, 2024

yacovm Oct 2, 2024

yacovm Oct 3, 2024

StephenButtolph Oct 2, 2024 •

edited

Loading

yacovm Oct 2, 2024

yacovm Oct 3, 2024

StephenButtolph Oct 11, 2024

yacovm Oct 11, 2024

StephenButtolph Oct 11, 2024

yacovm Oct 11, 2024

StephenButtolph Oct 11, 2024

StephenButtolph Oct 11, 2024

StephenButtolph Oct 11, 2024

StephenButtolph Oct 11, 2024

yacovm Oct 11, 2024

StephenButtolph Oct 11, 2024

StephenButtolph Oct 11, 2024

yacovm Oct 11, 2024

StephenButtolph Oct 11, 2024

StephenButtolph Oct 11, 2024

yacovm commented Oct 11, 2024

	// AllBootstrapped returns a channel that is closed when all chains have been bootstrapped
	// AllBootstrapped returns a channel that is closed when all chains in this
	// subnet have been bootstrapped

	// Preempt causes any past and future call of Listen return a closed channel.
	// Preempt causes any past and future calls of Listen to return a closed channel.

		initClock: func(_ chan time.Time) {},
		advanceTime: func(_ chan time.Time) {},

Make bootstrapping handle its own timeouts #3410

Are you sure you want to change the base?

Make bootstrapping handle its own timeouts #3410

Conversation

yacovm commented Sep 24, 2024 • edited Loading

Why this should be merged

How this works

How this was tested

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

StephenButtolph Oct 2, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yacovm commented Oct 11, 2024

yacovm commented Sep 24, 2024 •

edited

Loading

StephenButtolph Oct 2, 2024 •

edited

Loading