
Call the ShutdownWorker API as part of workflow worker cleanup #1645

Merged: 7 commits merged into master from shutdown-worker on Oct 1, 2024

Conversation

lina-temporal (Contributor)

What was changed

  • The ShutdownWorker API is now called from the Go SDK as part of workflow worker shutdown.

Why?

  • ShutdownWorker is primarily used to signal to the Temporal Matching service that a given sticky task queue will no longer be polled. This lets Matching immediately direct tasks to the regular task queue when a workflow worker shuts down, rather than waiting for the sticky queue's timeout to elapse.
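
For illustration, a best-effort call of this shape might look like the sketch below; the helper name and the request field names are assumptions based on the workflowservice proto, not the PR's exact code.

package example

import (
	"context"
	"time"

	"go.temporal.io/api/workflowservice/v1"
)

// notifyWorkerShutdown is a hypothetical helper showing the shape of the call.
// The error is deliberately ignored: ShutdownWorker is a best-effort
// optimization, not a requirement for a clean shutdown.
func notifyWorkerShutdown(ctx context.Context, client workflowservice.WorkflowServiceClient, namespace, stickyTaskQueue, identity string) {
	grpcCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	_, _ = client.ShutdownWorker(grpcCtx, &workflowservice.ShutdownWorkerRequest{
		Namespace:       namespace,
		StickyTaskQueue: stickyTaskQueue,
		Identity:        identity,
		Reason:          "worker is shutting down",
	})
}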

Checklist

  1. How was this tested:
  • Updated tests/mocks
  • go test ./...
  • cd internal/cmd/build && go run . check
  2. Any docs updates needed?
  • No

lina-temporal requested a review from a team as a code owner on September 20, 2024 19:41
// Best-effort cleanup, used to signal to the server that the sticky queue will
// no longer be polled. Errors are ignored.
defer func() {
	_ = ww.poller.Cleanup()
}()
Quinn-With-Two-Ns (Contributor) commented Sep 20, 2024:

Stop should not complete until this call is finished. The contract for stopping a worker is that once a worker is stopped, all resources associated with it should be cleaned up and the worker will no longer use the connection/client.

lina-temporal (Contributor, Author):

defer is synchronous; Stop won't complete until ww.poller.Cleanup() does.
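
A minimal self-contained illustration of that ordering (the function names here are illustrative, not the SDK's):

package example

import "fmt"

// stop mirrors the pattern under discussion: the deferred cleanup always runs
// before stop returns to its caller.
func stop() {
	defer fmt.Println("cleanup finished") // runs second, just before stop returns
	fmt.Println("stopping")               // runs first
}

// Calling stop() prints "stopping", then "cleanup finished", and only then
// does control return to the caller.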

Quinn-With-Two-Ns (Contributor):

You'll probably want to call this after all the pollers have stopped as well; otherwise, couldn't a sticky poll request come in after you call Shutdown?

lina-temporal (Contributor, Author):

That's why this call is in a defer; is the implication that workflowTaskPoller is still running after workflowWorker? It shouldn't be, since:

  1. baseWorker.Stop() is synchronous and responsible for preventing new calls to PollTask, and
  2. close(bw.stopCh) occurs within baseWorker.Stop(),
  3. All polling requests are select'd against their results, as well as bw.stopCh, to immediately cancel the gRPC call. Additionally, before a task is actually processed, isStopping() is checked, to account for the interleaving between polling and processing.

There's a slight interleaving where the gRPC call completes service-side right as a gRPC context cancel occurs client-side, but I'm not worried about that edge case, given that this is a best-effort optimization anyway.
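
A rough sketch of the stop/poll interleaving described above, with simplified types rather than the SDK's actual poller:

package example

import (
	"context"
	"errors"
)

// pollOnce illustrates the pattern: the in-flight poll's gRPC context is
// cancelled as soon as stopCh closes, and the poll result is raced against
// stopCh so shutdown never waits on a long poll.
func pollOnce(ctx context.Context, stopCh <-chan struct{}, doPoll func(context.Context) (any, error)) (any, error) {
	grpcCtx, cancel := context.WithCancel(ctx)
	defer cancel()

	type result struct {
		task any
		err  error
	}
	resCh := make(chan result, 1)

	go func() {
		task, err := doPoll(grpcCtx)
		resCh <- result{task, err}
	}()

	select {
	case <-stopCh:
		cancel() // immediately cancel the outstanding gRPC call
		return nil, errors.New("worker is stopping")
	case r := <-resCh:
		return r.task, r.err
	}
}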

Quinn-With-Two-Ns (Contributor):

is the implication that workflowTaskPoller is still running after workflowWorker?

Sorry, I should have clarified: a task worker could still respond to a sticky poll request, since you're running this regardless of whether the worker stop actually succeeded. It's probably pointless to make this call if the worker didn't stop, right?

Quinn-With-Two-Ns (Contributor):

I don't see any issue with it being in the base worker; that was what I did in my POC, unless I am missing something?

Note: my branch is incorrect because it is not waiting for the pollers to shut down.

lina-temporal (Contributor, Author):

Oh, I didn't realize you could access the task worker from bw.options. Sure, I'll update.

lina-temporal (Contributor, Author):

One thing I noticed after running the tests is that the default WorkerStopTimeout is 0; that means a time.After on it will return immediately, preventing our cleanup goroutine from getting a chance to run. Should we set a default WorkerStopTimeout (or even enforce a tiny minimum, like 1-3s) for users?
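
To make the failure mode concrete, here is a simplified sketch (not the SDK's code) of why a zero timeout starves the cleanup goroutine:

package example

import "time"

// waitForCleanup illustrates the problem: with workerStopTimeout == 0,
// time.After(0) fires essentially immediately, so the goroutine signalling
// doneCh rarely gets a chance to finish before we give up on it.
func waitForCleanup(doneCh <-chan struct{}, workerStopTimeout time.Duration) bool {
	select {
	case <-doneCh:
		return true // cleanup completed within the timeout
	case <-time.After(workerStopTimeout):
		return false // fires almost instantly when the timeout is 0
	}
}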

Quinn-With-Two-Ns (Contributor):

Should we set a default WorkerStopTimeout (or even enforce a tiny minimum, like 1-3s) for users?

Hm, we can't start enforcing a minimum now; that would break users who aren't setting one. We could add a default; let me double-check what other SDKs do.

Quinn-With-Two-Ns (Contributor):

Discussed offline: I think your current approach is fine. Apologies for wasting time on another path.

grpcCtx, cancel := newGRPCContext(ctx, grpcMetricsHandler(wtp.metricsHandler))
defer cancel()

_, err := wtp.service.ShutdownWorker(grpcCtx, &workflowservice.ShutdownWorkerRequest{
cretz (Member) commented Sep 24, 2024:

Is there any way for us to test in our integration tests that this works and has effect? Even if not now, can we open an issue for when it is available server side to confirm it has effect (e.g. that sticky was reset across two workers)? My fear is that we may do something wrong on the request building or something and this gRPC call always errors and we would just never know that it never worked.

lina-temporal (Contributor, Author):

Is there any way for us to test in our integration tests that this works and has effect? Even if not now, can we open an issue for when it is available server side to confirm it has effect (e.g. that sticky was reset across two workers)?

Not exactly. The call signals Matching to route tasks away from a shut-down worker's sticky queue (where the total timeout is 10 seconds). However, the caller that would actually see the effect of the rerouting isn't the end client; it's the History service, which handles the rerouting. The end result is that any future tasks previously destined for the sticky queue will have a lower schedule-to-start latency.

The important takeaway is that this isn't very observable from a testing point of view this far away in the stack: the signals distinguishing success from failure are fuzzy; it's just latency. If the call fails, the worst case is that schedule-to-start is more latent (by up to 10s). If the call succeeds, schedule-to-start is hopefully less latent, but that isn't guaranteed by any means. ShutdownWorker could work perfectly, and the task's ScheduleToStart could still be slow for an entirely unrelated reason in the History or Matching services.

My fear is that we may do something wrong on the request building or something and this gRPC call always errors and we would just never know that it never worked.

Because the effect of this call takes place entirely within the service, asynchronously, IMO the testing needs to verify two independent facts:

  • Does the worker call ShutdownWorker and receive positive acknowledgement from the service? The call doesn't have an effect directly visible to the worker, so here we instead need to verify that the server will validate and accept the request.
  • Does the service process ShutdownWorker properly/does the call have an effect? Because we're testing deeper within the service, we get a strong, discrete signal for whether the call had an effect; I can directly call MatchingService and assert that it returns StickyWorkerUnavailable for a shut-down queue. This is a much better signal of test success than merely trying to guess based on downstream latency.

cretz (Member) commented Sep 24, 2024:

👍 I wonder if we can demonstrate a user-visible effect, e.g. having two workers, where the worker that got a sticky workflow can be shut down and the next workflow task immediately goes to the other worker without penalty (meaning confirm it happens in <= 9s). I think that kind of <= 9s leeway is ok to assert in the tests, but agreed we don't want to assert timing in any detail because of the other moving parts.

Regardless, I understand we cannot assert this behavior in integration tests yet anyways, so this would be a future issue (or a features repo test).

Quinn-With-Two-Ns (Contributor):

@lina-temporal did you add any test to verify ShutdownWorker is only treated as best effort? I was expecting to see a unit test that mocks the service stub to return an error for ShutdownWorker and verifies that, from the user's perspective, the shutdown still succeeded. If I missed it, please just point me at the test.

lina-temporal (Contributor, Author):

Sure, added a test to assert that Cleanup is best effort (TestCleanupIsBestEffort).
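
For reference, a test of that shape could look roughly like the sketch below. It uses a hand-rolled fake and a stand-in cleanup function (bestEffortCleanup) rather than the SDK's actual TestCleanupIsBestEffort, so treat it as an illustration of the idea only.

package example

import (
	"context"
	"errors"
	"testing"

	"go.temporal.io/api/workflowservice/v1"
	"google.golang.org/grpc"
)

// shutdownWorkerCaller is the one method the cleanup path needs; it is
// satisfied by workflowservice.WorkflowServiceClient.
type shutdownWorkerCaller interface {
	ShutdownWorker(ctx context.Context, in *workflowservice.ShutdownWorkerRequest, opts ...grpc.CallOption) (*workflowservice.ShutdownWorkerResponse, error)
}

// failingService always rejects ShutdownWorker, simulating a server-side error.
type failingService struct{}

func (failingService) ShutdownWorker(context.Context, *workflowservice.ShutdownWorkerRequest, ...grpc.CallOption) (*workflowservice.ShutdownWorkerResponse, error) {
	return nil, errors.New("shutdown worker unavailable")
}

// bestEffortCleanup stands in for the worker's cleanup path: the error from
// ShutdownWorker is deliberately swallowed.
func bestEffortCleanup(svc shutdownWorkerCaller) {
	_, _ = svc.ShutdownWorker(context.Background(), &workflowservice.ShutdownWorkerRequest{})
}

// TestCleanupIsBestEffortSketch verifies that a ShutdownWorker failure does
// not propagate out of (or panic) the cleanup path.
func TestCleanupIsBestEffortSketch(t *testing.T) {
	defer func() {
		if r := recover(); r != nil {
			t.Fatalf("cleanup panicked: %v", r)
		}
	}()
	bestEffortCleanup(failingService{})
}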

cretz (Member) commented Sep 24, 2024:

If at all possible, would like this to wait on temporalio/temporal#6511 to be merged just to confirm we don't need any backwards-incompatible API changes.

cretz (Member) left a review comment:

Marking approved because I made a comment or two on this PR and don't want there to be confusion that any of it is blocking. Please wait for @Quinn-With-Two-Ns's approval before merging.

lina-temporal (Contributor, Author):

FYI, the server implementation of the API has been merged.

lina-temporal enabled auto-merge (squash) on October 1, 2024 21:02
lina-temporal merged commit 974ccc0 into master on Oct 1, 2024
14 checks passed
lina-temporal deleted the shutdown-worker branch on October 1, 2024 23:28