
Call the ShutdownWorker API as part of workflow worker cleanup #1645

Merged: 7 commits merged into master from shutdown-worker on Oct 1, 2024

Conversation

lina-temporal (Contributor)

What was changed

  • The ShutdownWorker API is now called from the Go SDK as part of workflow worker shutdown.

Why?

  • ShutdownWorker is primarily used to signal to the Temporal Matching service that a given sticky task queue will no longer be polled. This lets Matching immediately direct tasks to the regular task queue when a workflow worker shuts down, rather than waiting for the sticky queue's timeout to elapse.
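
For illustration, a best-effort call of this shape might look like the sketch below; the helper name and the request field names are assumptions based on the workflowservice proto, not the PR's exact code.

package example

import (
	"context"
	"time"

	"go.temporal.io/api/workflowservice/v1"
)

// notifyWorkerShutdown is a hypothetical helper showing the shape of the call.
// The error is deliberately ignored: ShutdownWorker is a best-effort
// optimization, not a requirement for a clean shutdown.
func notifyWorkerShutdown(ctx context.Context, client workflowservice.WorkflowServiceClient, namespace, stickyTaskQueue, identity string) {
	grpcCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	_, _ = client.ShutdownWorker(grpcCtx, &workflowservice.ShutdownWorkerRequest{
		Namespace:       namespace,
		StickyTaskQueue: stickyTaskQueue,
		Identity:        identity,
		Reason:          "worker is shutting down",
	})
}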

Checklist

  1. How was this tested:
  • Updated tests/mocks
  • go test ./...
  • cd internal/cmd/build && go run . check
  2. Any docs updates needed?
  • No

lina-temporal requested a review from a team as a code owner on September 20, 2024 19:41
// Best-effort cleanup, used to signal to the server that the sticky queue will
// no longer be polled. Errors are ignored.
defer func() {
	_ = ww.poller.Cleanup()
}()
Quinn-With-Two-Ns (Contributor) commented Sep 20, 2024:

Stop should not complete until this call is finished. The contract for stopping a worker is that once a worker is stopped, all resources associated with it should be cleaned up and the worker will no longer use the connection/client.

lina-temporal (Contributor, Author):

defer is synchronous; Stop won't complete until ww.poller.Cleanup() does.
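
A minimal self-contained illustration of that ordering (the function names here are illustrative, not the SDK's):

package example

import "fmt"

// stop mirrors the pattern under discussion: the deferred cleanup always runs
// before stop returns to its caller.
func stop() {
	defer fmt.Println("cleanup finished") // runs second, just before stop returns
	fmt.Println("stopping")               // runs first
}

// Calling stop() prints "stopping", then "cleanup finished", and only then
// does control return to the caller.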

Quinn-With-Two-Ns (Contributor):

You'll probably want to call this after all the pollers have stopped as well; otherwise, couldn't a sticky poll request come in after you call Shutdown?

lina-temporal (Contributor, Author):

That's why this call is in a defer; is the implication that workflowTaskPoller is still running after workflowWorker? It shouldn't be, since:

  1. baseWorker.Stop() is synchronous and responsible for preventing new calls to PollTask, and
  2. close(bw.stopCh) occurs within baseWorker.Stop(),
  3. All polling requests are select'd against their results, as well as bw.stopCh, to immediately cancel the gRPC call. Additionally, before a task is actually processed, isStopping() is checked, to account for the interleaving between polling and processing.

There's a slight interleaving where the gRPC call completes service-side right as a gRPC context cancel occurs client-side, but I'm not worried about that edge case, given that this is a best-effort optimization anyway.
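
A rough sketch of the stop/poll interleaving described above, with simplified types rather than the SDK's actual poller:

package example

import (
	"context"
	"errors"
)

// pollOnce illustrates the pattern: the in-flight poll's gRPC context is
// cancelled as soon as stopCh closes, and the poll result is raced against
// stopCh so shutdown never waits on a long poll.
func pollOnce(ctx context.Context, stopCh <-chan struct{}, doPoll func(context.Context) (any, error)) (any, error) {
	grpcCtx, cancel := context.WithCancel(ctx)
	defer cancel()

	type result struct {
		task any
		err  error
	}
	resCh := make(chan result, 1)

	go func() {
		task, err := doPoll(grpcCtx)
		resCh <- result{task, err}
	}()

	select {
	case <-stopCh:
		cancel() // immediately cancel the outstanding gRPC call
		return nil, errors.New("worker is stopping")
	case r := <-resCh:
		return r.task, r.err
	}
}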

Quinn-With-Two-Ns (Contributor):

is the implication that workflowTaskPoller is still running after workflowWorker?

Sorry, I should have clarified: a task worker could still respond to a sticky poll request, since you're running this regardless of whether the worker stop actually succeeded. It's probably pointless to make this call if the worker didn't stop, right?

Quinn-With-Two-Ns (Contributor):

I don't see any issue with it being in the base worker; that was what I did in my POC, unless I am missing something?

Note: my branch is incorrect because it is not waiting for the pollers to shut down.

lina-temporal (Contributor, Author):

Oh, I didn't realize you could access the task worker from bw.options. Sure, I'll update.

lina-temporal (Contributor, Author):

One thing I noticed after running the tests is that the default WorkerStopTimeout is 0; that means a time.After on it will return immediately, preventing our cleanup goroutine from getting a chance to run. Should we set a default WorkerStopTimeout (or even enforce a tiny minimum, like 1-3s) for users?
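
To make the failure mode concrete, here is a simplified sketch (not the SDK's code) of why a zero timeout starves the cleanup goroutine:

package example

import "time"

// waitForCleanup illustrates the problem: with workerStopTimeout == 0,
// time.After(0) fires essentially immediately, so the goroutine signalling
// doneCh rarely gets a chance to finish before we give up on it.
func waitForCleanup(doneCh <-chan struct{}, workerStopTimeout time.Duration) bool {
	select {
	case <-doneCh:
		return true // cleanup completed within the timeout
	case <-time.After(workerStopTimeout):
		return false // fires almost instantly when the timeout is 0
	}
}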

Quinn-With-Two-Ns (Contributor):

Should we set a default WorkerStopTimeout (or even enforce a tiny minimum, like 1-3s) for users?

Hm, we can't start enforcing a minimum now; that would break users who aren't setting one. We could add a default; let me double-check what other SDKs do.

Quinn-With-Two-Ns (Contributor):

Discussed offline: I think your current approach is fine. Apologies for wasting time on another path.

grpcCtx, cancel := newGRPCContext(ctx, grpcMetricsHandler(wtp.metricsHandler))
defer cancel()

_, err := wtp.service.ShutdownWorker(grpcCtx, &workflowservice.ShutdownWorkerRequest{
cretz (Member) commented Sep 24, 2024:

Is there any way for us to test in our integration tests that this works and has effect? Even if not now, can we open an issue for when it is available server side to confirm it has effect (e.g. that sticky was reset across two workers)? My fear is that we may do something wrong on the request building or something and this gRPC call always errors and we would just never know that it never worked.

lina-temporal (Contributor, Author):

Is there any way for us to test in our integration tests that this works and has effect? Even if not now, can we open an issue for when it is available server side to confirm it has effect (e.g. that sticky was reset across two workers)?

Not exactly. The call signals Matching to route tasks away from a shut-down worker's sticky queue (where the total timeout is 10 seconds). However, the caller that would actually see the effect of the rerouting isn't the end client; it's the History service, which handles the rerouting. The end result is that any future tasks previously destined for the sticky queue will have a lower schedule-to-start latency.

The important takeaway is that this isn't very observable from a testing point of view this far away in the stack: the signals distinguishing success from failure are fuzzy; it's just latency. If the call fails, the worst case is that schedule-to-start is more latent (by up to 10s). If the call succeeds, schedule-to-start is hopefully less latent, but that isn't guaranteed by any means. ShutdownWorker could work perfectly, and the task's ScheduleToStart could still be slow for an entirely unrelated reason in the History or Matching services.

My fear is that we may do something wrong on the request building or something and this gRPC call always errors and we would just never know that it never worked.

Because the effect of this call takes place entirely within the service, asynchronously, IMO the testing needs to verify two independent facts:

  • Does the worker call ShutdownWorker and receive positive acknowledgement from the service? The call doesn't have an effect directly visible to the worker, so here we instead need to verify that the server will validate and accept the request.
  • Does the service process ShutdownWorker properly/does the call have an effect? Because we're testing deeper within the service, we get a strong, discrete signal for whether the call had an effect; I can directly call MatchingService and assert that it returns StickyWorkerUnavailable for a shut-down queue. This is a much better signal of test success than merely trying to guess based on downstream latency.

cretz (Member) commented Sep 24, 2024:

👍 I wonder if we can demonstrate a user-visible effect, e.g. having two workers, where the worker that got a sticky workflow can be shut down and the next workflow task immediately goes to the other worker without penalty (meaning confirm it happens in <= 9s). I think that kind of <= 9s leeway is ok to assert in the tests, but agreed we don't want to assert timing in any detail because of the other moving parts.

Regardless, I understand we cannot assert this behavior in integration tests yet anyways, so this would be a future issue (or a features repo test).

Quinn-With-Two-Ns (Contributor):

@lina-temporal did you add any test to verify ShutdownWorker is only treated as best effort? I was expecting to see a unit test that mocks the service stub to return an error for ShutdownWorker and verifies that, from the user's perspective, the shutdown still succeeded. If I missed it, please just point me at the test.

lina-temporal (Contributor, Author):

Sure, added a test to assert that Cleanup is best effort (TestCleanupIsBestEffort).
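
For reference, a test of that shape could look roughly like the sketch below. It uses a hand-rolled fake and a stand-in cleanup function (bestEffortCleanup) rather than the SDK's actual TestCleanupIsBestEffort, so treat it as an illustration of the idea only.

package example

import (
	"context"
	"errors"
	"testing"

	"go.temporal.io/api/workflowservice/v1"
	"google.golang.org/grpc"
)

// shutdownWorkerCaller is the one method the cleanup path needs; it is
// satisfied by workflowservice.WorkflowServiceClient.
type shutdownWorkerCaller interface {
	ShutdownWorker(ctx context.Context, in *workflowservice.ShutdownWorkerRequest, opts ...grpc.CallOption) (*workflowservice.ShutdownWorkerResponse, error)
}

// failingService always rejects ShutdownWorker, simulating a server-side error.
type failingService struct{}

func (failingService) ShutdownWorker(context.Context, *workflowservice.ShutdownWorkerRequest, ...grpc.CallOption) (*workflowservice.ShutdownWorkerResponse, error) {
	return nil, errors.New("shutdown worker unavailable")
}

// bestEffortCleanup stands in for the worker's cleanup path: the error from
// ShutdownWorker is deliberately swallowed.
func bestEffortCleanup(svc shutdownWorkerCaller) {
	_, _ = svc.ShutdownWorker(context.Background(), &workflowservice.ShutdownWorkerRequest{})
}

// TestCleanupIsBestEffortSketch verifies that a ShutdownWorker failure does
// not propagate out of (or panic) the cleanup path.
func TestCleanupIsBestEffortSketch(t *testing.T) {
	defer func() {
		if r := recover(); r != nil {
			t.Fatalf("cleanup panicked: %v", r)
		}
	}()
	bestEffortCleanup(failingService{})
}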

cretz (Member) commented Sep 24, 2024:

If at all possible, would like this to wait on temporalio/temporal#6511 to be merged just to confirm we don't need any backwards-incompatible API changes.

cretz (Member) left a review comment:

Marking approved because I made a comment or two on this PR and don't want there to be confusion that any of it is blocking. Please wait for @Quinn-With-Two-Ns's approval before merging.

lina-temporal (Contributor, Author):

FYI, the server implementation of the API has been merged.

lina-temporal enabled auto-merge (squash) on October 1, 2024 21:02
lina-temporal merged commit 974ccc0 into master on Oct 1, 2024
14 checks passed
lina-temporal deleted the shutdown-worker branch on October 1, 2024 23:28