
Revert raw history change #7309

Closed
wants to merge 160 commits

Conversation


@prathyushpv (Contributor) commented Feb 10, 2025

What changed?

Revert the change to return raw history events from the history service to the frontend.

Why?

In the GetWorkflowExecutionHistory request there is a filter, HISTORY_EVENT_FILTER_TYPE_CLOSE_EVENT. The history service used to apply this filter and return only the last event. Now the history service returns the last history batch, and the frontend extracts the last event from that blob and returns it.
When a cluster is downgraded, the frontend is downgraded first. If a worker is talking to an old frontend, which then talks to a new history service, the frontend does not have the logic to filter out the last event.
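
For illustration, a minimal sketch of the frontend-side filtering this revert removes (function and variable names are assumptions, not the actual server code):

```go
package frontend

import (
	enumspb "go.temporal.io/api/enums/v1"
	historypb "go.temporal.io/api/history/v1"
)

// filterCloseEvent is a hypothetical sketch: given the last history batch
// returned by the history service, keep only the final (close) event when
// the CLOSE_EVENT filter was requested.
func filterCloseEvent(
	filter enumspb.HistoryEventFilterType,
	lastBatch []*historypb.HistoryEvent,
) []*historypb.HistoryEvent {
	if filter != enumspb.HISTORY_EVENT_FILTER_TYPE_CLOSE_EVENT || len(lastBatch) == 0 {
		return lastBatch
	}
	// The close event is, by definition, the last event of the last batch.
	return lastBatch[len(lastBatch)-1:]
}
```

An old frontend has no such logic, so during a downgrade it would return the whole batch to the worker; hence the revert.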

How did you test it?

Existing tests

Potential risks

Is hotfix candidate?

Yes

alexshtin and others added 30 commits January 17, 2025 22:21
## What changed?
<!-- Describe what has changed in this PR -->
Refactor: remove `cluster.yaml` config.

## Why?
<!-- Tell your future self why have you made these changes -->
It is the same as `es_cluster.yaml` but doesn't have the `esConfig` section.
The intent was to use this config file when Elasticsearch is not used.
But this section is not read when SQL persistence is used, so there is no
harm in always using `es_cluster.yaml`.

I am going to significantly decrease the number of config files for
functional tests (if not remove them all). This is a small step in that
direction.

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
Ran tests.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
No risks.

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->
No.

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
No.
## What changed?
<!-- Describe what has changed in this PR -->

Moved code from `common/namespace` to `common/namespace/nsreplication`.
Next to the existing `common/namespace/nsregistry`.

## Why?
<!-- Tell your future self why have you made these changes -->

The goal is to remove all references from `common/namespace` to
`common/persistence`. One step towards using `common/dynamicconfig` from
`common/namespace`.

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

No behavior changes here.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->

Moved code from `common/namespace` to `common/namespace/nsattr`.

## Why?
<!-- Tell your future self why have you made these changes -->

This removes the dependency from `common/namespace` on `common/cluster`.
One step towards using `common/dynamicconfig` from `common/namespace`.

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

No behavior changes here.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
- Add generic hook interface for fine-grained control of behavior in
tests
- Use the hooks for matching varying behavior tests (force load balancer
to target partitions and disable sync match)
- Use the hooks to force a race condition in an update-with-start test
(by @stephanos)

## Why?
To write integration/functional tests that require tweaking behavior of
code under test, without affecting non-test builds.

## Potential risks
Hooks are disabled by default, so there should be zero risk to
production code, and zero overhead (assuming the Go compiler can do very
basic inlining and dead code elimination).

The downside is that functional tests now have to be run with `-tags
test_dep` everywhere.
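
As a rough sketch of the pattern (package and function names here are illustrative, not the actual hook API):

```go
// File: testhooks/hooks.go, compiled only with `-tags test_dep`.

//go:build test_dep

package testhooks

var hooks = map[string]func(){}

// Set lets a test register a hook to tweak behavior of code under test.
func Set(key string, fn func()) { hooks[key] = fn }

// Call invokes the hook for key, if a test registered one.
func Call(key string) {
	if fn, ok := hooks[key]; ok {
		fn()
	}
}

// File: testhooks/hooks_noop.go, the default build:
//
//	//go:build !test_dep
//
//	func Set(key string, fn func()) {}
//	func Call(key string)           {}
//
// In non-test builds the no-op bodies can be inlined and eliminated by the
// compiler, which is where the zero-overhead expectation comes from.
```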

---------

Co-authored-by: Stephan Behnke <[email protected]>
…atedEvent (#7091)

## What changed?
Use the `UnsetVersioningOverride` field in
`ApplyWorkflowExecutionOptionsUpdatedEvent`.

## Why?
So that users of this event don't need to load VersioningOverride from
mutable state every time they create this event.
Now, a nil Versioning Override in this event means "no change" instead
of "remove".
This reduces the chance that someone accidentally unsets an override in
the future, and is also more efficient.
We've discussed this change internally in the server team and are ok
with changing the meaning of this history event, because it is such a
small change and the scope of impact is small (pre-release versioning
users who have unset a versioning override and are building mutable
state from that history).
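
A sketch of the new apply semantics (types and field names are simplified stand-ins, not the actual mutable-state code):

```go
package sketch

type VersioningOverride struct{ /* fields elided */ }

type mutableState struct {
	versioningOverride *VersioningOverride
}

type optionsUpdatedAttributes struct {
	VersioningOverride      *VersioningOverride
	UnsetVersioningOverride bool
}

func applyOptionsUpdated(ms *mutableState, attrs *optionsUpdatedAttributes) {
	switch {
	case attrs.UnsetVersioningOverride:
		ms.versioningOverride = nil // explicit removal
	case attrs.VersioningOverride != nil:
		ms.versioningOverride = attrs.VersioningOverride // replace
	default:
		// nil now means "no change" instead of "remove"
	}
}
```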

## How did you test it?
Made sure that versioning override functional tests pass.

## Potential risks
Now, a nil Versioning Override in this event means "no change" instead
of "remove".
If an event exists with the previous meaning and the mutable state is
rebuilt, the Versioning Override would not be removed.
But the chance of that happening is very low.

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->
Add a way to parse SQL-like queries and evaluate them against a
mutable state.

## Why?
<!-- Tell your future self why have you made these changes -->
There is a need for arbitrary mutable state filtering.
The immediate need is to support filtering when applying namespace-level
rules (in development).
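
As a toy illustration of the idea (the PR's actual grammar and evaluator API are not shown in this description), a single-predicate evaluator over a flattened view of mutable state might look like:

```go
package main

import (
	"fmt"
	"strings"
)

// evaluate checks one `field = "value"` predicate against a flattened view
// of mutable state. The real parser supports a richer SQL-like grammar;
// this only illustrates the shape of the feature.
func evaluate(query string, state map[string]string) bool {
	parts := strings.SplitN(query, "=", 2)
	if len(parts) != 2 {
		return false
	}
	field := strings.TrimSpace(parts[0])
	want := strings.Trim(strings.TrimSpace(parts[1]), `"'`)
	return state[field] == want
}

func main() {
	state := map[string]string{"ExecutionStatus": "Running"}
	fmt.Println(evaluate(`ExecutionStatus = "Running"`, state)) // true
}
```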

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
Added unit tests.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
n/a

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->
not yet


## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
no

---------

Co-authored-by: Rodrigo Zhou <[email protected]>
## What changed?
<!-- Describe what has changed in this PR -->

Simplifies the gauge-related functions so they don't take references to
PhysicalTaskQueueManager or TaskQueueManager anymore.

## Why?
<!-- Tell your future self why have you made these changes -->

- makes it clearer what inputs the functions really need
- enables more refactorings to break up cyclical dependencies 

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

Pure refactoring. Existing tests.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->

Moved `TaskTokenSerializer` to the already existing package
`common/tasktoken`.

## Why?
<!-- Tell your future self why have you made these changes -->

- the serializer's dependency on `utf8validator` prevents a refactoring
I plan to do; moving it out of `common` helps
- the serializer belongs in that package
- allows dropping the "TaskToken" part of the name

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

Pure refactoring; no behavior change.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

This will break (at least) two references in other repos I've
identified. I'll update those once this is merged.

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
Set delete_on_completion to reserved

## Why?
delete_on_completion is no longer needed

## How did you test it?
make

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
No
## What changed?
- Extended tombstone tracking to capture HSM state machine deletions
from Operation Log
- Added `StateMachinePath` tombstone type for HSM deletions alongside
existing activity/timer/etc. tombstones
- Maintained existing tombstone batch management and cleanup mechanisms

## Why?
Without this:
- HSM state machine deletions would not be properly replicated across
clusters
- Could lead to state inconsistencies between active/passive clusters
when using HSM
- No way to know if an HSM node was intentionally deleted or is missing
due to an error

## Potential risks
- Potential memory pressure from large HSM deletion batches
- New panic path if operation log access fails

Risk Mitigation:
- Operation log errors surface quickly via panic rather than silent
failure

## Documentation
No changes required - this extends an internal implementation detail of
how deletions are tracked.

## Is hotfix candidate?
No - this should follow normal deployment process.
Reinstate what was reverted in
#7024

## What changed?

Add deletion calls for Nexus operation nodes once they reach a terminal
state (Completed, Failed, Canceled, Timed Out). Previously, terminal
operations lingered in the state machine, potentially causing confusion
and wasting resources. In addition, this PR addresses previously
existing TODOs related to node cleanup.

### Key Changes
- **Automatic Node Deletion on Terminal States:**  
Operations are now removed from the state machine immediately after they
complete, fail, cancel, or time out.
- **Handler Updates:**  
The command handlers (schedule and cancel) now reflect the new terminal
state deletion. Attempts to cancel an operation that has already reached
a terminal state will produce a clear, non-retryable error.
- **Tests for Terminal State Deletion:**  
  Introduced two new test suites:
- `TestTerminalStatesDeletion` to verify that applying terminal events
removes the operation node as expected.
- `TestOperationNodeDeletionOnTerminalEvents` to ensure no further
modifications (like cancelation) can be made after an operation is
terminal.

### Areas for Review Focus
1. **Deletion Semantics:** Verify that node removal is triggered
correctly for all terminal states and that no non-terminal states
inadvertently trigger deletion.
2. **Handler Consistency:** Ensure that the updated schedule/cancel
handlers function correctly when dealing with non-existent or already
terminated operations.
3. **Error Handling:** Confirm that attempts to modify a deleted
operation produce the intended error behavior.
4. **Test Coverage:** Review the newly added test suites to ensure all
terminal states and edge cases are thoroughly tested.

## Why?

- Prevent confusion and wasted resources by removing irrelevant
operation nodes.
- Prevent customers from experiencing workflow task failures
prematurely.

## How did you test it?

- **New Test Suites:**  
Added `TestTerminalStatesDeletion` and
`TestOperationNodeDeletionOnTerminalEvents` to specifically target and
validate the new deletion behavior.
- **Existing Tests:**  
  All pre-existing tests have been re-run to ensure no regressions.

## Potential risks

- **Assumptions About Completed Operations:**  
If any internal or external code implicitly relied on the presence of
completed operations, its assumptions might break. Such usage was never
officially supported.
  
## Documentation

No external or user-facing documentation changes are required since this
feature matches the intended design.

## Is hotfix candidate?

No. This change is an enhancement to internal logic and state
management, not a fix for a critical bug, so it should follow the normal
release process.

---------

Co-authored-by: Roey Berman <[email protected]>
Co-authored-by: Roey Berman <[email protected]>

## What changed?
<!-- Describe what has changed in this PR -->

Added support for OpenTelemetry (OTEL) to task processing.

## Why?
<!-- Tell your future self why have you made these changes -->

To extend the OTEL coverage and gain insight into task processing.
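
A minimal sketch of what span-wrapping task execution can look like (the tracer name and attributes are assumptions, not the server's actual instrumentation):

```go
package sketch

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// executeWithTracing wraps task execution in an OTEL span. When no tracer
// provider is configured, otel.Tracer returns a no-op tracer, so these
// calls cost almost nothing; that is why no performance impact is expected.
func executeWithTracing(ctx context.Context, taskType string, run func(context.Context) error) error {
	ctx, span := otel.Tracer("task-processing").Start(ctx, "ExecuteTask")
	defer span.End()
	span.SetAttributes(attribute.String("task.type", taskType))
	return run(ctx)
}
```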

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

Ran Grafana Tempo locally:

[Screenshot: Grafana Tempo trace view]


## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

I don't expect any negative performance impact on task processing
because most operations are noops.

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->
Do not send a backfill task when there are no events.
## Why?
<!-- Tell your future self why have you made these changes -->
Backfill tasks should always have associated events. If there are no
events (i.e., for a state-only transition), there is no backfill to
perform and we should skip the task.
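
A sketch of the guard (names are illustrative):

```go
package sketch

import historypb "go.temporal.io/api/history/v1"

// maybeSendBackfillTask skips the task entirely when the transition
// produced no events; a state-only transition has nothing to backfill.
func maybeSendBackfillTask(events []*historypb.HistoryEvent, send func() error) error {
	if len(events) == 0 {
		return nil
	}
	return send()
}
```
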
## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
n/a
## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
n/a
## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->
n/a
## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
no
#7113)

## What changed?
<!-- Describe what has changed in this PR -->
Added an Eventually condition around getting the callback info, to
account for delays in the workflow retry in
`TestWorkflowNexusCallbacks_CarriedOver/WorkflowFailureRetry`.
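
A sketch of the shape of the fix, assuming testify's `require` package (the helper here is hypothetical, not the test's actual code):

```go
package sketch

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// Poll for the callback info instead of asserting immediately, since the
// workflow retry introduces a variable delay before it appears.
func assertCallbackInfoEventually(t *testing.T, hasCallbackInfo func() bool) {
	require.Eventually(t, hasCallbackInfo, 10*time.Second, 100*time.Millisecond)
}
```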

## Why?
<!-- Tell your future self why have you made these changes -->
Fix flake

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
Ran a couple hundred times locally. Failed within 50 runs before the change.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
…#7136)

This commit references a repository GitHub Actions variable,
`SHOULD_TRIGGER_DOCKER_BUILD`, and only triggers a Docker build if this
variable is set to `true`. This prevents forks of this repository from
triggering builds in the `temporalio/docker-builds` repository, which is
almost certainly not what those forks want.
)

## What changed?
<!-- Describe what has changed in this PR -->
Embed `Unimplemented*ServiceServer` instead of `Unsafe*ServiceServer` in
all gRPC handlers.

## Why?
<!-- Tell your future self why have you made these changes -->
Embedding `Unimplemented*ServiceServer` is the right way to keep
backward compatibility with future interface changes.
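
A sketch of the difference, using the workflow service as an example:

```go
package sketch

import workflowservice "go.temporal.io/api/workflowservice/v1"

// Embedding the Unimplemented struct makes the handler forward compatible:
// any RPC added to the service interface later returns codes.Unimplemented
// instead of breaking the build.
type workflowHandler struct {
	workflowservice.UnimplementedWorkflowServiceServer
}

// Embedding the Unsafe struct only opts out of forward compatibility, so
// the build breaks whenever the generated interface grows:
//
//	type workflowHandler struct {
//		workflowservice.UnsafeWorkflowServiceServer
//	}
```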

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
Build and run.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
No risks.

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->
No.

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
No.
## What changed?
<!-- Describe what has changed in this PR -->

Introduces vars for the noop tracer and trace provider.

## Why?
<!-- Tell your future self why have you made these changes -->

Communicates more clearly what they are and where they live.

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
…ails (#6936)

## What changed?
Added the target cluster to the error message to make future debugging
clearer.

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
)

Previous behavior was to not retry if the number of failures exceeded
10. We want to be more forgiving.
## What changed?
<!-- Describe what has changed in this PR -->

Moved all telemetry resource names to one place.

## Why?
<!-- Tell your future self why have you made these changes -->

Having them in one place.

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->
See title.

## Why?
<!-- Tell your future self why have you made these changes -->
Flaky tests no good!

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
Wait for clusters to be synced instead of using Sleep. Make sure the
second run is started before failover.

## Why?
The current implementation depends on timing, which is not reliable.

## How did you test it?
Repeatedly ran the test locally with no failures found.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
Just move some code around.

## Why?
Separate concerns.
## What changed?
Remove this obsolete config.

## Why?
It's not used anymore.
## What changed?
<!-- Describe what has changed in this PR -->
- update old deployment component, for existing pre-release versioning-3
customers, so that it always registers itself when versioning-3 is
enabled on the namespace

## Why?
<!-- Tell your future self why have you made these changes -->
- fix any potential delays in worker registration in the near future

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- existing suite of tests

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
- not a lot since this feature is still pre-release

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
- No
## What changed?
<!-- Describe what has changed in this PR -->

Removed dependency of `common/persistence` from `common/namespace`
package.

And removed dependency of `common/namespace` on
`common/namespace/nsreplication`.

## Why?
<!-- Tell your future self why have you made these changes -->

Last step towards using `common/namespace` from `common/dynamicconfig`
(there is a circular dependency right now).

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

No behavior changes here.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->

Instead of passing `namespaceID string`, dynamicconfig receives `ns
namespace.ID` now.

_NOTE: I'll do `namespace string` in a follow-up PR. I want to get this
out the door to establish the dependency of `common/dynamicconfig` on
`common/namespace`, as it already took several PRs to make that possible.
And it's less risky._

## Why?
<!-- Tell your future self why have you made these changes -->

Type safety.
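
As a rough illustration of what the defined type buys (a simplified stand-in for the real `namespace.ID`):

```go
package sketch

// ID is a distinct string type; a value of type string no longer satisfies
// the parameter, so mix-ups between IDs and arbitrary strings are caught
// at compile time (string literals still convert, as untyped constants).
type ID string

func (id ID) String() string { return string(id) }

func getIntProperty(key string, ns ID) int {
	_ = ns // lookup elided
	return 0
}

func example(someString string) {
	// getIntProperty("limit.foo", someString) // compile error: string is not ID
	getIntProperty("limit.foo", ID(someString)) // explicit conversion required
}
```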

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

Compiler, as there is no behavior change.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
Shivs11 and others added 26 commits February 6, 2025 15:41
## What changed?
<!-- Describe what has changed in this PR -->
- make deletes idempotent

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- added functional tests
- also changed the functional tests to use a stricter consistency check

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
…oyments excludes closed (#7267)

## What changed?
Make describing a closed workflow return not-found, and add tests.

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->

---------

Co-authored-by: ShahabT <[email protected]>
Co-authored-by: Shivam Saraf <[email protected]>
## What changed?
Make updates and signals use args.
Try to delete versions from oldest to newest if we hit the limit.

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
New functional test

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->

---------

Co-authored-by: Shivam <[email protected]>
Co-authored-by: Shahab Tajik <[email protected]>
Co-authored-by: Shivam Saraf <[email protected]>
Co-authored-by: ShahabT <[email protected]>
## What changed?
<!-- Describe what has changed in this PR -->
- Skip drainage

## Why?
<!-- Tell your future self why have you made these changes -->
- Versioning-3.1

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- Added a new test which passes when poller_history config is changed

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->

---------

Co-authored-by: ShahabT <[email protected]>
Co-authored-by: Carly de Frondeville <[email protected]>
## What changed?
<!-- Describe what has changed in this PR -->
- reset drainageInfo to nil after the version starts accepting new
workflows again

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- existing suite

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->

---------

Co-authored-by: Carly de Frondeville <[email protected]>
…7268)

## What changed?
Use MutableSideEffect to access dynamic config from inside workflows.

## Why?
To prevent non-determinism in the workflow code.
To allow the workflows to use the latest dynamic config values at all
times.
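
A sketch of the pattern with the Go SDK (the config key, default, and loader below are assumptions, not the actual deployment workflow code):

```go
package sketch

import "go.temporal.io/sdk/workflow"

// loadDynamicConfigInt is a hypothetical stand-in for the server's dynamic
// config client; the key and default are made up for illustration.
func loadDynamicConfigInt(key string, def int) int { return def }

// maxVersions reads a dynamic config value from workflow code without
// breaking determinism: MutableSideEffect records the value in history and
// re-records it only when it changes, so replays see the same sequence
// while live runs pick up the latest config.
func maxVersions(ctx workflow.Context) (int, error) {
	val := workflow.MutableSideEffect(ctx, "max-versions",
		func(ctx workflow.Context) interface{} {
			return loadDynamicConfigInt("maxVersions", 100)
		},
		func(a, b interface{}) bool { return a == b },
	)
	var max int
	if err := val.Get(&max); err != nil {
		return 0, err
	}
	return max, nil
}
```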

## How did you test it?
Tests are failing for other reasons, so I'm unable to test until I pull
in those changes.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->

---------

Co-authored-by: ShahabT <[email protected]>
## What changed?
<!-- Describe what has changed in this PR -->

As of #7254 a total in-flight
payload size limit is enforced. The default is 10MB right now; this PR
makes it 20MB.

As the max payload size for Updates is 2MB and the current Update
in-flight limit is 10, the effective limit right now is (about) 20MB.

## Why?
<!-- Tell your future self why have you made these changes -->

By enforcing a payload-based limit, we can raise the total in-flight
update limit from 10 to ... more, eventually.

But we decided to couple these together, i.e., increasing the in-flight
count together with introducing the in-flight size limit.

Therefore we'll set it to be effectively a no-op for now.

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->
- Log an error instead of panicking for existing Tally users when the
Prometheus listener can't be created.
- No behavior change for existing OTEL users

## Why?
<!-- Tell your future self why have you made these changes -->
- For backward compatibility
## What changed?
Move history batch validations from ExecutionManager to history/api
package.

## Why?
ReadRawHistoryBranch in ExecutionManager may not return the full list of
history events in some situations, so we cannot check whether the
history is complete in that layer.
This moves the check to a layer above, which has the full list of
history events.

## How did you test it?
Unit tests
## What changed?
Handle state machine deletion for state-based replication.

## Why?
Deletion is needed as we want to provide support for more Nexus
operations.

## How did you test it?
unit test.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
This PR gates this feature behind the feature flag before the full
release.

## Why?
The feature is brand new and we may not have identified all edge cases.
So it's better to roll it out slowly.

## How did you test it?
Manually flipped the flag off and verified that the old child was not
terminated.

## Potential risks
N/A

## Documentation
N/A

## Is hotfix candidate?
No.
## What changed?
<!-- Describe what has changed in this PR -->
The library version accidentally went backwards in a previous PR. Fixing it here.

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->
- forceCAN for worker-deployment

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- added test

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->
- Eliminate the DescribeWorkerDeployment fan-out to Versions for getting
the summaries.
- Sync drainage status with the deployment workflow when it changes.
- The drainage workflow can now CaN!
- Move drainage args to their own message to be future-proof.
- Made WorkerDeploymentLocalState.versions a map for easier access to
versions.
 
## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
Write to WorkerDeployment, WorkerDeploymentVersion, and
WorkflowVersioningBehavior Search Attributes in all the places where we
used to only write to BuildIds SA.

If using worker deployments and pinned, write `pinned:version` to
BuildIds list instead of `pinned:series_name:build_id`.

Use `pinned:version` instead of `pinned:series_name:build_id` for
deployment version drainage status.
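
Illustratively, the value written to the BuildIds list changes shape roughly like this (the helper name is made up):

```go
package sketch

// Old form, per the previous format: "pinned:" + seriesName + ":" + buildID
// New form, with worker deployments:  "pinned:" + version
func pinnedBuildIDEntry(version string) string {
	return "pinned:" + version
}
```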

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
- Confirmed that all Deployment Reachability tests pass
- New tests (in progress)

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->
- Log the Prometheus listener error at Warn level

## Why?
<!-- Tell your future self why have you made these changes -->
- For backward compatibility. There might be some CI systems that
monitor logs and fail tests on Error-level logs.
## What changed?
<!-- Describe what has changed in this PR -->
Making sure the following SAs are populated correctly for Versioning
v3.1
- TemporalWorkerDeployment 
- TemporalWorkerDeploymentVersion 
- TemporalWorkflowVersioningBehavior
- BuildIds

## Why?
<!-- Tell your future self why have you made these changes -->

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?

Validate that Nexus operation tokens don't exceed a configured length in
all APIs that accept them.
Also tidied up code in `completions.go`, where we now apply the start
event via the event definition, skipping the `MachineTransition` call.
There won't be any behavior change since this transition did not
generate tasks.
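
A sketch of the length check (the limit plumbing and error text are assumptions):

```go
package sketch

import (
	"fmt"

	"go.temporal.io/api/serviceerror"
)

// validateOperationToken rejects over-long Nexus operation tokens up front,
// before any further processing, with a non-retryable InvalidArgument error.
func validateOperationToken(token string, maxLen int) error {
	if len(token) > maxLen {
		return serviceerror.NewInvalidArgument(fmt.Sprintf(
			"operation token length %d exceeds configured limit %d", len(token), maxLen))
	}
	return nil
}
```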

## Why?

It's dangerous for us to accept strings without limiting their length.

## How did you test it?

Added all of the relevant tests.
@prathyushpv requested a review from a team as a code owner February 10, 2025 23:12
@CLAassistant commented Feb 10, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
8 out of 9 committers have signed the CLA.

✅ pdoerner
✅ carlydf
✅ stephanos
✅ prathyushpv
✅ gow
✅ rodrigozhou
✅ ShahabT
✅ hai719
❌ temporal-data
You have signed the CLA already but the status is still pending? Let us recheck it.

@prathyushpv changed the base branch from main to cloud/v1.27.0-127 February 10, 2025 23:13