
Revert raw history change #7309

Closed
wants to merge 160 commits

Conversation


@prathyushpv (Contributor) commented Feb 10, 2025

What changed?

Revert the change to return raw history events from the history service to the frontend.

Why?

In the GetWorkflowExecutionHistory request there is a filter, HISTORY_EVENT_FILTER_TYPE_CLOSE_EVENT. The history service used to apply this filter and return only the last event. Now the history service returns the last history batch, and the frontend extracts the last event from that blob and returns it.
When a cluster is downgraded, the frontend is downgraded first. If a worker is talking to an old frontend, which then talks to a new history service, the frontend does not have the logic to filter out the last event.
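
For illustration, a minimal sketch of the frontend-side filtering this revert removes (function and variable names are assumptions, not the actual server code):

```go
package frontend

import (
	enumspb "go.temporal.io/api/enums/v1"
	historypb "go.temporal.io/api/history/v1"
)

// filterCloseEvent is a hypothetical sketch: given the last history batch
// returned by the history service, keep only the final (close) event when
// the CLOSE_EVENT filter was requested.
func filterCloseEvent(
	filter enumspb.HistoryEventFilterType,
	lastBatch []*historypb.HistoryEvent,
) []*historypb.HistoryEvent {
	if filter != enumspb.HISTORY_EVENT_FILTER_TYPE_CLOSE_EVENT || len(lastBatch) == 0 {
		return lastBatch
	}
	// The close event is, by definition, the last event of the last batch.
	return lastBatch[len(lastBatch)-1:]
}
```

An old frontend has no such logic, so during a downgrade it would return the whole batch to the worker; hence the revert.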

How did you test it?

Existing tests

Potential risks

Is hotfix candidate?

Yes

alexshtin and others added 30 commits January 17, 2025 22:21
## What changed?
<!-- Describe what has changed in this PR -->
Refactor: remove `cluster.yaml` config.

## Why?
<!-- Tell your future self why have you made these changes -->
It is the same as `es_cluster.yaml` but doesn't have the `esConfig` section.
The intent was to use this config file when Elasticsearch is not used.
But this section is not read when SQL persistence is used, so there is no
harm in always using `es_cluster.yaml`.

I am going to significantly decrease the number of config files for
functional tests (if not remove them all). This is a small step in that
direction.

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
Ran tests.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
No risks.

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->
No.

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
No.
## What changed?
<!-- Describe what has changed in this PR -->

Moved code from `common/namespace` to `common/namespace/nsreplication`.
Next to the existing `common/namespace/nsregistry`.

## Why?
<!-- Tell your future self why have you made these changes -->

The goal is to remove all references from `common/namespace` to
`common/persistence`. One step towards using `common/dynamicconfig` from
`common/namespace`.

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

No behavior changes here.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->

Moved code from `common/namespace` to `common/namespace/nsattr`.

## Why?
<!-- Tell your future self why have you made these changes -->

This removes the dependency from `common/namespace` on `common/cluster`.
One step towards using `common/dynamicconfig` from `common/namespace`.

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

No behavior changes here.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
- Add generic hook interface for fine-grained control of behavior in
tests
- Use the hooks for matching varying behavior tests (force load balancer
to target partitions and disable sync match)
- Use the hooks to force a race condition in an update-with-start test
(by @stephanos)

## Why?
To write integration/functional tests that require tweaking behavior of
code under test, without affecting non-test builds.

## Potential risks
Hooks are disabled by default, so there should be zero risk to
production code, and zero overhead (assuming the Go compiler can do very
basic inlining and dead code elimination).

The downside is that functional tests now have to be run with `-tags
test_dep` everywhere.
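
As a rough sketch of the pattern (package and function names here are illustrative, not the actual hook API):

```go
// File: testhooks/hooks.go, compiled only with `-tags test_dep`.

//go:build test_dep

package testhooks

var hooks = map[string]func(){}

// Set lets a test register a hook to tweak behavior of code under test.
func Set(key string, fn func()) { hooks[key] = fn }

// Call invokes the hook for key, if a test registered one.
func Call(key string) {
	if fn, ok := hooks[key]; ok {
		fn()
	}
}

// File: testhooks/hooks_noop.go, the default build:
//
//	//go:build !test_dep
//
//	func Set(key string, fn func()) {}
//	func Call(key string)           {}
//
// In non-test builds the no-op bodies can be inlined and eliminated by the
// compiler, which is where the zero-overhead expectation comes from.
```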

---------

Co-authored-by: Stephan Behnke <[email protected]>
…atedEvent (#7091)

## What changed?
Use the `UnsetVersioningOverride` field in
`ApplyWorkflowExecutionOptionsUpdatedEvent`.

## Why?
So that users of this event don't need to load VersioningOverride from
mutable state every time they create this event.
Now, a nil Versioning Override in this event means "no change" instead
of "remove".
This reduces the chance that someone accidentally unsets an override in
the future, and is also more efficient.
We've discussed this change internally in the server team and are ok
with changing the meaning of this history event, because it is such a
small change and the scope of impact is small (pre-release versioning
users who have unset a versioning override and are building mutable
state from that history).
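
A sketch of the new apply semantics (types and field names are simplified stand-ins, not the actual mutable-state code):

```go
package sketch

type VersioningOverride struct{ /* fields elided */ }

type mutableState struct {
	versioningOverride *VersioningOverride
}

type optionsUpdatedAttributes struct {
	VersioningOverride      *VersioningOverride
	UnsetVersioningOverride bool
}

func applyOptionsUpdated(ms *mutableState, attrs *optionsUpdatedAttributes) {
	switch {
	case attrs.UnsetVersioningOverride:
		ms.versioningOverride = nil // explicit removal
	case attrs.VersioningOverride != nil:
		ms.versioningOverride = attrs.VersioningOverride // replace
	default:
		// nil now means "no change" instead of "remove"
	}
}
```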

## How did you test it?
Made sure that versioning override functional tests pass.

## Potential risks
Now, a nil Versioning Override in this event means "no change" instead
of "remove".
If an event exists with the previous meaning and the mutable state is
rebuilt, the Versioning Override would not be removed.
But the chance of that happening is very low.

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->
Add a way to parse SQL-like queries and evaluate them against a
mutable state.

## Why?
<!-- Tell your future self why have you made these changes -->
There is a need for arbitrary mutable state filtering.
The immediate need is to support filtering when applying namespace-level
rules (in development).
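
As a toy illustration of the idea (the PR's actual grammar and evaluator API are not shown in this description), a single-predicate evaluator over a flattened view of mutable state might look like:

```go
package main

import (
	"fmt"
	"strings"
)

// evaluate checks one `field = "value"` predicate against a flattened view
// of mutable state. The real parser supports a richer SQL-like grammar;
// this only illustrates the shape of the feature.
func evaluate(query string, state map[string]string) bool {
	parts := strings.SplitN(query, "=", 2)
	if len(parts) != 2 {
		return false
	}
	field := strings.TrimSpace(parts[0])
	want := strings.Trim(strings.TrimSpace(parts[1]), `"'`)
	return state[field] == want
}

func main() {
	state := map[string]string{"ExecutionStatus": "Running"}
	fmt.Println(evaluate(`ExecutionStatus = "Running"`, state)) // true
}
```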

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
Added unit tests.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
n/a

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->
not yet


## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
no

---------

Co-authored-by: Rodrigo Zhou <[email protected]>
## What changed?
<!-- Describe what has changed in this PR -->

Simplifies the gauge-related functions so they don't take references to
PhysicalTaskQueueManager or TaskQueueManager anymore.

## Why?
<!-- Tell your future self why have you made these changes -->

- makes it clearer what inputs the functions really need
- enables more refactorings to break up cyclical dependencies 

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

Pure refactoring. Existing tests.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->

Moved `TaskTokenSerializer` to the already existing package
`common/tasktoken`.

## Why?
<!-- Tell your future self why have you made these changes -->

- the serializer's dependency on `utf8validator` prevents a refactoring
I plan to do; moving it out of `common` helps
- the serializer belongs in that package
- allows dropping the "TaskToken" part of the name

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

Pure refactoring; no behavior change.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

This will break (at least) two references in other repos I've
identified. I'll update those once this is merged.

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
Set delete_on_completion to reserved

## Why?
delete_on_completion is no longer needed

## How did you test it?
make

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
No
## What changed?
- Extended tombstone tracking to capture HSM state machine deletions
from Operation Log
- Added `StateMachinePath` tombstone type for HSM deletions alongside
existing activity/timer/etc. tombstones
- Maintained existing tombstone batch management and cleanup mechanisms

## Why?
Without this:
- HSM state machine deletions would not be properly replicated across
clusters
- Could lead to state inconsistencies between active/passive clusters
when using HSM
- No way to know if an HSM node was intentionally deleted or is missing
due to an error

## Potential risks
- Potential memory pressure from large HSM deletion batches
- New panic path if operation log access fails

Risk Mitigation:
- Operation log errors surface quickly via panic rather than silent
failure

## Documentation
No changes required - this extends an internal implementation detail of
how deletions are tracked.

## Is hotfix candidate?
No - this should follow normal deployment process.
Reinstate what was reverted in
#7024

## What changed?

Add deletion calls for Nexus operation nodes once they reach a terminal
state (Completed, Failed, Canceled, Timed Out). Previously, terminal
operations lingered in the state machine, potentially causing confusion
and wasting resources. In addition, this PR addresses previously
existing TODOs related to node cleanup.

### Key Changes
- **Automatic Node Deletion on Terminal States:**  
Operations are now removed from the state machine immediately after they
complete, fail, cancel, or time out.
- **Handler Updates:**  
The command handlers (schedule and cancel) now reflect the new terminal
state deletion. Attempts to cancel an operation that has already reached
a terminal state will produce a clear, non-retryable error.
- **Tests for Terminal State Deletion:**  
  Introduced two new test suites:
- `TestTerminalStatesDeletion` to verify that applying terminal events
removes the operation node as expected.
- `TestOperationNodeDeletionOnTerminalEvents` to ensure no further
modifications (like cancelation) can be made after an operation is
terminal.

### Areas for Review Focus
1. **Deletion Semantics:** Verify that node removal is triggered
correctly for all terminal states and that no non-terminal states
inadvertently trigger deletion.
2. **Handler Consistency:** Ensure that the updated schedule/cancel
handlers function correctly when dealing with non-existent or already
terminated operations.
3. **Error Handling:** Confirm that attempts to modify a deleted
operation produce the intended error behavior.
4. **Test Coverage:** Review the newly added test suites to ensure all
terminal states and edge cases are thoroughly tested.

## Why?

- Prevent confusion and wasted resources by removing irrelevant
operation nodes.
- Prevent customers from experiencing workflow task failures
prematurely.

## How did you test it?

- **New Test Suites:**  
Added `TestTerminalStatesDeletion` and
`TestOperationNodeDeletionOnTerminalEvents` to specifically target and
validate the new deletion behavior.
- **Existing Tests:**  
  All pre-existing tests have been re-run to ensure no regressions.

## Potential risks

- **Assumptions About Completed Operations:**  
If any internal or external code implicitly relied on the presence of
completed operations, its assumptions might break. Such usage was never
officially supported.
  
## Documentation

No external or user-facing documentation changes are required since this
feature matches the intended design.

## Is hotfix candidate?

No. This change is an enhancement to internal logic and state
management, not a fix for a critical bug, so it should follow the normal
release process.

---------

Co-authored-by: Roey Berman <[email protected]>
Co-authored-by: Roey Berman <[email protected]>

## What changed?
<!-- Describe what has changed in this PR -->

Added support for OpenTelemetry (OTEL) to task processing.

## Why?
<!-- Tell your future self why have you made these changes -->

To extend the OTEL coverage and gain insight into task processing.
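
A minimal sketch of what span-wrapping task execution can look like (the tracer name and attributes are assumptions, not the server's actual instrumentation):

```go
package sketch

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// executeWithTracing wraps task execution in an OTEL span. When no tracer
// provider is configured, otel.Tracer returns a no-op tracer, so these
// calls cost almost nothing; that is why no performance impact is expected.
func executeWithTracing(ctx context.Context, taskType string, run func(context.Context) error) error {
	ctx, span := otel.Tracer("task-processing").Start(ctx, "ExecuteTask")
	defer span.End()
	span.SetAttributes(attribute.String("task.type", taskType))
	return run(ctx)
}
```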

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

Ran Grafana Tempo locally:

[Screenshot: Grafana Tempo trace view]


## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

I don't expect any negative performance impact on task processing
because most operations are noops.

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->
Do not send a backfill task when there are no events.
## Why?
<!-- Tell your future self why have you made these changes -->
Backfill tasks should always have associated events. If there are no
events (i.e., for a state-only transition), there is no backfill to
perform and we should skip the task.
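
A sketch of the guard (names are illustrative):

```go
package sketch

import historypb "go.temporal.io/api/history/v1"

// maybeSendBackfillTask skips the task entirely when the transition
// produced no events; a state-only transition has nothing to backfill.
func maybeSendBackfillTask(events []*historypb.HistoryEvent, send func() error) error {
	if len(events) == 0 {
		return nil
	}
	return send()
}
```
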
## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
n/a
## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
n/a
## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->
n/a
## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
no
#7113)

## What changed?
<!-- Describe what has changed in this PR -->
Added an Eventually condition around getting the callback info, to
account for delays in the workflow retry in
`TestWorkflowNexusCallbacks_CarriedOver/WorkflowFailureRetry`.
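
A sketch of the shape of the fix, assuming testify's `require` package (the helper here is hypothetical, not the test's actual code):

```go
package sketch

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// Poll for the callback info instead of asserting immediately, since the
// workflow retry introduces a variable delay before it appears.
func assertCallbackInfoEventually(t *testing.T, hasCallbackInfo func() bool) {
	require.Eventually(t, hasCallbackInfo, 10*time.Second, 100*time.Millisecond)
}
```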

## Why?
<!-- Tell your future self why have you made these changes -->
Fix flake

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
Ran a couple hundred times locally. Failed within 50 runs before the change.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
…#7136)

This commit references a repository GitHub Actions variable,
`SHOULD_TRIGGER_DOCKER_BUILD`, and only triggers a Docker build if this
variable is set to `true`. This prevents forks of this repository from
triggering builds in the `temporalio/docker-builds` repository, which is
almost certainly not what those forks want.
)

## What changed?
<!-- Describe what has changed in this PR -->
Embed `Unimplemented*ServiceServer` instead of `Unsafe*ServiceServer` in
all gRPC handlers.

## Why?
<!-- Tell your future self why have you made these changes -->
Embedding `Unimplemented*ServiceServer` is the right way to keep
backward compatibility with future interface changes.
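
A sketch of the difference, using the workflow service as an example:

```go
package sketch

import workflowservice "go.temporal.io/api/workflowservice/v1"

// Embedding the Unimplemented struct makes the handler forward compatible:
// any RPC added to the service interface later returns codes.Unimplemented
// instead of breaking the build.
type workflowHandler struct {
	workflowservice.UnimplementedWorkflowServiceServer
}

// Embedding the Unsafe struct only opts out of forward compatibility, so
// the build breaks whenever the generated interface grows:
//
//	type workflowHandler struct {
//		workflowservice.UnsafeWorkflowServiceServer
//	}
```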

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
Build and run.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
No risks.

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->
No.

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
No.
## What changed?
<!-- Describe what has changed in this PR -->

Introduces vars for the noop tracer and trace provider.

## Why?
<!-- Tell your future self why have you made these changes -->

Communicates more clearly what they are and where they live.

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
…ails (#6936)

## What changed?
Added the target cluster to the error message to make future debugging
clearer.

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
)

Previous behavior was to not retry if the number of failures exceeded
10. We want to be more forgiving.
## What changed?
<!-- Describe what has changed in this PR -->

Moved all telemetry resource names to one place.

## Why?
<!-- Tell your future self why have you made these changes -->

Having them in one place.

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->
See title.

## Why?
<!-- Tell your future self why have you made these changes -->
Flaky tests no good!

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
Wait for clusters to be synced instead of using Sleep. Make sure the
second run is started before failover.

## Why?
The current implementation depends on timing, which is not reliable.

## How did you test it?
Repeatedly ran the test locally with no failures found.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
Just move some code around.

## Why?
Separate concerns.
## What changed?
Remove this obsolete config.

## Why?
It's not used anymore.
## What changed?
<!-- Describe what has changed in this PR -->
- update old deployment component, for existing pre-release versioning-3
customers, so that it always registers itself when versioning-3 is
enabled on the namespace

## Why?
<!-- Tell your future self why have you made these changes -->
- fix any potential delays in worker registration in the near future

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- existing suite of tests

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
- not a lot since this feature is still pre-release

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
- No
## What changed?
<!-- Describe what has changed in this PR -->

Removed dependency of `common/persistence` from `common/namespace`
package.

And removed dependency of `common/namespace` on
`common/namespace/nsreplication`.

## Why?
<!-- Tell your future self why have you made these changes -->

Last step towards using `common/namespace` from `common/dynamicconfig`
(there is a circular dependency right now).

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

No behavior changes here.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->

Instead of passing `namespaceID string`, dynamicconfig receives `ns
namespace.ID` now.

_NOTE: I'll do `namespace string` in a follow-up PR. I want to get this
out the door to establish the dependency of `common/dynamicconfig` on
`common/namespace`, as it already took several PRs to make that possible.
And it's less risky._

## Why?
<!-- Tell your future self why have you made these changes -->

Type safety.
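
As a rough illustration of what the defined type buys (a simplified stand-in for the real `namespace.ID`):

```go
package sketch

// ID is a distinct string type; a value of type string no longer satisfies
// the parameter, so mix-ups between IDs and arbitrary strings are caught
// at compile time (string literals still convert, as untyped constants).
type ID string

func (id ID) String() string { return string(id) }

func getIntProperty(key string, ns ID) int {
	_ = ns // lookup elided
	return 0
}

func example(someString string) {
	// getIntProperty("limit.foo", someString) // compile error: string is not ID
	getIntProperty("limit.foo", ID(someString)) // explicit conversion required
}
```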

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

Compiler, as there is no behavior change.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
Shivs11 and others added 26 commits February 6, 2025 15:41
## What changed?
<!-- Describe what has changed in this PR -->
- make deletes idempotent

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- added functional tests
- also changed the functional tests to use a stricter consistency check

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
…oyments excludes closed (#7267)

## What changed?
Make describing a closed workflow return not-found, and add tests.

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->

---------

Co-authored-by: ShahabT <[email protected]>
Co-authored-by: Shivam Saraf <[email protected]>
## What changed?
Make updates and signals use args.
Try to delete versions from oldest to newest if we hit the limit.

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
New functional test

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->

---------

Co-authored-by: Shivam <[email protected]>
Co-authored-by: Shahab Tajik <[email protected]>
Co-authored-by: Shivam Saraf <[email protected]>
Co-authored-by: ShahabT <[email protected]>
## What changed?
<!-- Describe what has changed in this PR -->
- Skip drainage

## Why?
<!-- Tell your future self why have you made these changes -->
- Versioning-3.1

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- Added a new test which passes when poller_history config is changed

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->

---------

Co-authored-by: ShahabT <[email protected]>
Co-authored-by: Carly de Frondeville <[email protected]>
## What changed?
<!-- Describe what has changed in this PR -->
- reset drainageInfo to nil after the version starts accepting new
workflows again

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- existing suite

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->

---------

Co-authored-by: Carly de Frondeville <[email protected]>
…7268)

## What changed?
Use MutableSideEffect to access dynamic config from inside workflows.

## Why?
To prevent non-determinism in the workflow code.
To allow the workflows to use the latest dynamic config values at all
times.
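
A sketch of the pattern with the Go SDK (the config key, default, and loader below are assumptions, not the actual deployment workflow code):

```go
package sketch

import "go.temporal.io/sdk/workflow"

// loadDynamicConfigInt is a hypothetical stand-in for the server's dynamic
// config client; the key and default are made up for illustration.
func loadDynamicConfigInt(key string, def int) int { return def }

// maxVersions reads a dynamic config value from workflow code without
// breaking determinism: MutableSideEffect records the value in history and
// re-records it only when it changes, so replays see the same sequence
// while live runs pick up the latest config.
func maxVersions(ctx workflow.Context) (int, error) {
	val := workflow.MutableSideEffect(ctx, "max-versions",
		func(ctx workflow.Context) interface{} {
			return loadDynamicConfigInt("maxVersions", 100)
		},
		func(a, b interface{}) bool { return a == b },
	)
	var max int
	if err := val.Get(&max); err != nil {
		return 0, err
	}
	return max, nil
}
```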

## How did you test it?
Tests are failing for other reasons, so I'm unable to test until I pull
in those changes.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->

---------

Co-authored-by: ShahabT <[email protected]>
## What changed?
<!-- Describe what has changed in this PR -->

As of #7254 a total in-flight
payload size limit is enforced. The default is 10MB right now; this PR
makes it 20MB.

As the max payload size for Updates is 2MB and the current Update
in-flight limit is 10, the effective limit right now is (about) 20MB.

## Why?
<!-- Tell your future self why have you made these changes -->

By enforcing a payload-based limit, we can raise the total in-flight
update limit from 10 to ... more, eventually.

But we decided to couple these together, i.e., increasing the in-flight
count together with introducing the in-flight size limit.

Therefore we'll set it to be effectively a no-op for now.

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->
- Log an error instead of panicking for existing Tally users when the
Prometheus listener can't be created.
- No behavior change for existing OTEL users

## Why?
<!-- Tell your future self why have you made these changes -->
- For backward compatibility
## What changed?
Move history batch validations from ExecutionManager to history/api
package.

## Why?
ReadRawHistoryBranch in ExecutionManager may not return the full list of
history events in some situations, so we cannot check whether the
history is complete in that layer.
This moves the check to a layer above, which has the full list of
history events.

## How did you test it?
Unit tests
## What changed?
Handle state machine deletion for state-based replication.

## Why?
Deletion is needed as we want to provide support for more Nexus
operations.

## How did you test it?
unit test.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
This PR gates this feature behind the feature flag before the full
release.

## Why?
The feature is brand new and we may not have identified all edge cases.
So it's better to roll it out slowly.

## How did you test it?
Manually flipped the flag off and verified that the old child was not
terminated.

## Potential risks
N/A

## Documentation
N/A

## Is hotfix candidate?
No.
## What changed?
<!-- Describe what has changed in this PR -->
The library version accidentally went backwards in a previous PR. Fixing it here.

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->
- forceCAN for worker-deployment

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- added test

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->
- Eliminate the DescribeWorkerDeployment fan-out to Versions for getting
the summaries.
- Sync drainage status with the deployment workflow when it changes.
- The drainage workflow can now CaN!
- Move drainage args to their own message to be future-proof.
- Made WorkerDeploymentLocalState.versions a map for easier access to
versions.
 
## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
Write to WorkerDeployment, WorkerDeploymentVersion, and
WorkflowVersioningBehavior Search Attributes in all the places where we
used to only write to BuildIds SA.

If using worker deployments and pinned, write `pinned:version` to
BuildIds list instead of `pinned:series_name:build_id`.

Use `pinned:version` instead of `pinned:series_name:build_id` for
deployment version drainage status.
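
Illustratively, the value written to the BuildIds list changes shape roughly like this (the helper name is made up):

```go
package sketch

// Old form, per the previous format: "pinned:" + seriesName + ":" + buildID
// New form, with worker deployments:  "pinned:" + version
func pinnedBuildIDEntry(version string) string {
	return "pinned:" + version
}
```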

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
- Confirmed that all Deployment Reachability tests pass
- New tests (in progress)

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->
- Log the Prometheus listener error at Warn level

## Why?
<!-- Tell your future self why have you made these changes -->
- For backward compatibility. There might be some CI systems that
monitor logs and fail tests on Error-level logs.
## What changed?
<!-- Describe what has changed in this PR -->
Making sure the following SAs are populated correctly for Versioning
v3.1
- TemporalWorkerDeployment 
- TemporalWorkerDeploymentVersion 
- TemporalWorkflowVersioningBehavior
- BuildIds

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?

Validate that Nexus operation tokens don't exceed a configured length in
all APIs that accept them.
Also tidied up code in `completions.go`, where we now apply the start
event via the event definition, skipping the `MachineTransition` call.
There won't be any behavior change since this transition did not
generate tasks.
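
A sketch of the length check (the limit plumbing and error text are assumptions):

```go
package sketch

import (
	"fmt"

	"go.temporal.io/api/serviceerror"
)

// validateOperationToken rejects over-long Nexus operation tokens up front,
// before any further processing, with a non-retryable InvalidArgument error.
func validateOperationToken(token string, maxLen int) error {
	if len(token) > maxLen {
		return serviceerror.NewInvalidArgument(fmt.Sprintf(
			"operation token length %d exceeds configured limit %d", len(token), maxLen))
	}
	return nil
}
```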

## Why?

It's dangerous for us to accept strings without limiting their length.

## How did you test it?

Added all of the relevant tests.
@prathyushpv requested a review from a team as a code owner February 10, 2025 23:12
@CLAassistant commented Feb 10, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
8 out of 9 committers have signed the CLA.

✅ pdoerner
✅ carlydf
✅ stephanos
✅ prathyushpv
✅ gow
✅ rodrigozhou
✅ ShahabT
✅ hai719
❌ temporal-data
You have signed the CLA already but the status is still pending? Let us recheck it.

@prathyushpv changed the base branch from main to cloud/v1.27.0-127 February 10, 2025 23:13