Deployment workflow improvements #6882

Merged: dnr merged 6 commits into temporalio:versioning-3 on Nov 25, 2024

Conversation

dnr (Member) commented Nov 23, 2024

What changed?

  • Start series workflow only once instead of trying on every update
  • Use validator to reject update if already present
  • Don't update local state if sync to user data fails
  • Extract anonymous functions to methods
  • Fix escaping

Why?

Reduce overhead, avoid unnecessary calls

How did you test it?

existing tests

dnr requested a review from a team as a code owner on November 23, 2024 at 07:47
@@ -191,7 +151,70 @@ func (d *DeploymentWorkflowRunner) run() error {

d.logger.Debug("Deployment doing continue-as-new")
return workflow.NewContinueAsNewError(d.ctx, DeploymentWorkflow, d.DeploymentWorkflowArgs)
}

func (d *DeploymentWorkflowRunner) validateRegisterWorker(args *deploymentspb.RegisterWorkerInDeploymentArgs) error {
Collaborator

nice!

Member

Just pasting this here since it's something I learned and might help somebody: I was curious what would happen if the validator were called before the function performing the update. At first glance it looks like this could panic, since args.TaskQueue may not have a value present in the map.

Turns out any panic from the validator function is returned as an error, which is sweet since nothing breaks.

Member Author

The validator is called before the update; that's the whole point.

If args.TaskQueueName isn't in the map, the lookup will return nil. GetTaskQueues() on nil returns nil, and a lookup on that nil map will return not found. It will not panic.
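
For reference, a minimal, self-contained sketch of that chain (the types here are stand-ins, not the real deployment state), showing that a missing-key lookup, a nil-receiver getter, and a read from a nil map all return zero values in Go rather than panicking:

```go
package main

import "fmt"

// Stand-in type for illustration only; the real state lives in the
// deployment workflow's local state and generated proto types.
type taskQueueFamilyData struct {
	taskQueues map[int32]struct{}
}

// Getter that is safe to call on a nil receiver, mirroring how generated
// proto getters behave (GetTaskQueues() on nil returns nil).
func (d *taskQueueFamilyData) GetTaskQueues() map[int32]struct{} {
	if d == nil {
		return nil
	}
	return d.taskQueues
}

func main() {
	byName := map[string]*taskQueueFamilyData{} // no entries yet

	// Lookup of a missing key returns the zero value (nil pointer), not a panic.
	family := byName["my-task-queue"]

	// Calling the nil-safe getter on that nil pointer returns a nil map,
	// and reading from a nil map simply reports "not found".
	_, ok := family.GetTaskQueues()[0]
	fmt.Println(ok) // false, with no panic anywhere along the chain
}
```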

Member

> lookup on that nil map will return not found.

I thought we would get a panic there, since we could be doing a lookup on a nil map, right? My apologies if my assumption was wrong...

}

func (d *DeploymentWorkflowRunner) handleRegisterWorker(ctx workflow.Context, args *deploymentspb.RegisterWorkerInDeploymentArgs) error {
// Note: use ctx in here (provided by update) instead of d.ctx
Collaborator

For my understanding, what would be the difference?

Member Author

I don't know! The update handler is passed a context, but I don't know if it's different from the context passed to the runner. Probably not, but it seems better to use it until I confirm it's okay.

Member

I'm curious too - although I do think it's d.ctx being passed

Member Author

Confirmed with the SDK team that update handlers must use the Context passed to them, not the Context of the top-level workflow. This is actually a pretty good argument for not putting the Context in the struct.
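
A rough sketch of how that wiring can look with the Go SDK, mirroring the validator and handler signatures visible in this diff (the args type and update name below are placeholders, not the real deploymentspb types): the validator runs before the handler and can reject the update, and the handler uses the Context provided by the update rather than one captured on the runner struct.

```go
package deployment

import (
	"errors"

	"go.temporal.io/sdk/workflow"
)

// Placeholder args type for illustration; the real code uses
// *deploymentspb.RegisterWorkerInDeploymentArgs.
type registerWorkerArgs struct {
	TaskQueueName string
}

func deploymentWorkflowSketch(ctx workflow.Context) error {
	registered := map[string]bool{}

	// Validator: rejects the update up front if the task queue is already present.
	validate := func(args *registerWorkerArgs) error {
		if registered[args.TaskQueueName] {
			return errors.New("task queue already registered in this deployment")
		}
		return nil
	}

	// Handler: uses the Context the update provides (updateCtx), not the
	// top-level ctx.
	handle := func(updateCtx workflow.Context, args *registerWorkerArgs) error {
		registered[args.TaskQueueName] = true
		return nil
	}

	if err := workflow.SetUpdateHandlerWithOptions(
		ctx,
		"register-task-queue-worker", // hypothetical update name
		handle,
		workflow.UpdateHandlerOptions{Validator: validate},
	); err != nil {
		return err
	}

	// Keep the workflow running so it can receive updates (illustrative only).
	return workflow.Await(ctx, func() bool { return false })
}
```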

// wait until series workflow started
err = workflow.Await(ctx, func() bool { return d.DeploymentLocalState.StartedSeriesWorkflow })
if err != nil {
d.logger.Error("Update canceled before series workflow started")
Collaborator

So if the series fails to create because, say, it exceeds a limit on the number of series, the wf will return on line 126 and here we'd return this error to matching? Not for now, but would it be possible to propagate the "series count limit exceeded" error to matching and eventually to pollers somehow?

Member Author

I don't know how we're implementing those limits... StartWorkflowExecution or SignalWithStart couldn't fail for some application-defined limit; we'd have to change it to update-with-start? In any case, yeah, I'd expect this to change when we add limits.

Member

@ShahabT I might be wrong, but the client kicks off an update-with-start which, thanks to David's changes, starts a series workflow. If we enforce a limit check and return an error in the series workflow definition (per-namespace limit), we reach line 126, at which point the whole multi-operation should fail. In that case the update should not even be executed, right? Moreover, that error should be propagated to pollers too.

Collaborator

I guess my question is whether matching would receive a specific error message such as "series count limit exceeded" instead of a generic message such as "Update canceled before series workflow started", or whatever other generic error the server may send because the multi-operation failed.

Member Author

There is no multi-operation at this level... matching -> deployment wf is an update-with-start, but deployment wf -> deployment series wf is currently just a start. If we want to return an error, that has to change to update-with-start also.

Member Author

Also, if this deployment wf exceeds some limit, what happens to the wf? Does it stick around to tell matching "too many"? If not, then matching will try to re-create it every time, which is bad. But if it does stick around, that's using resources that shouldn't be used. Seems like we need a cache somewhere?

Collaborator

I was thinking matching would remember in memory and not retry registration, at least for a few minutes.
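
Sketched out, that suggestion could look roughly like the following (all names and the TTL are hypothetical; nothing like this is part of this PR): a small in-memory record of recent failures that matching consults before retrying registration.

```go
package matching

import (
	"sync"
	"time"
)

// registrationBackoff is a hypothetical in-memory cache of recent
// registration failures, keyed by deployment series name, so matching
// can skip re-attempting registration for a short window.
type registrationBackoff struct {
	mu      sync.Mutex
	ttl     time.Duration
	blocked map[string]time.Time // series name -> time of last failure
}

func newRegistrationBackoff(ttl time.Duration) *registrationBackoff {
	return &registrationBackoff{ttl: ttl, blocked: make(map[string]time.Time)}
}

// recordFailure notes that registration for this series recently failed
// (e.g. because a series count limit was exceeded).
func (r *registrationBackoff) recordFailure(series string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.blocked[series] = time.Now()
}

// shouldSkip reports whether matching should skip re-attempting
// registration because a failure happened within the TTL window.
func (r *registrationBackoff) shouldSkip(series string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	at, ok := r.blocked[series]
	if !ok {
		return false
	}
	if time.Since(at) > r.ttl {
		delete(r.blocked, series)
		return false
	}
	return true
}
```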

dnr merged commit 6cf162d into temporalio:versioning-3 on Nov 25, 2024
29 of 40 checks passed
dnr deleted the v3workflow branch on November 25, 2024 at 19:11
dnr added a commit that referenced this pull request Nov 26, 2024
## What changed?
- Start series workflow only once instead of trying on every update
- Use validator to reject update if already present
- Don't update local state if sync to user data fails
- Extract anonymous functions to methods
- Fix escaping

## Why?
Reduce overhead, avoid unnecessary calls

## How did you test it?
existing tests
dnr added a commit that referenced this pull request Nov 26, 2024