[history] Adding more metrics for replication #6673

Merged: 3 commits into cadence-workflow:master from replication-lag-time, Feb 21, 2025

Conversation

3vilhamster (Contributor):

What changed?
Added more metrics for replication.
Simplified the code a bit to avoid extra allocations and improve readability.

Why?
We've observed two cases:

  1. If the passive cluster cannot fetch tasks, it does not report any replication lag. I've introduced replication_tasks_lag_raw, which reports the lag based on the most recently processed task.
  2. The lag in task IDs does not explain the real impact, so I've introduced replication_tasks_delay, which reports the replication delay in seconds. Ideally every task is replicated immediately, but in practice there is always some delay; this metric lets us track that delay per shard (see the sketch after this list).
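
For illustration, here is a minimal, self-contained Go sketch of the two signals described above. The Scope interface, printScope, and reportReplicationMetrics are hypothetical stand-ins for this example; only the metric names and the lag computation mirror the actual change.

package main

import (
	"fmt"
	"time"
)

// Scope is a stand-in for the shard's metrics scope (an assumption for this sketch).
type Scope interface {
	RecordTimer(name string, d time.Duration)
}

type printScope struct{}

func (printScope) RecordTimer(name string, d time.Duration) {
	fmt.Printf("%s = %v\n", name, d)
}

// reportReplicationMetrics records the two signals:
//   - replication_tasks_lag_raw: the gap in raw task IDs between the shard's
//     max read level and the oldest task that has not been processed yet;
//   - replication_tasks_delay: how long ago the oldest pending task was
//     created, i.e. the replication delay in wall-clock time, per shard.
func reportReplicationMetrics(scope Scope, maxReadLevel, oldestUnprocessedTaskID int64, oldestTaskCreationTime time.Time) {
	scope.RecordTimer("replication_tasks_lag_raw", time.Duration(maxReadLevel-oldestUnprocessedTaskID))
	scope.RecordTimer("replication_tasks_delay", time.Since(oldestTaskCreationTime))
}

func main() {
	// Example: 50 task IDs behind, oldest pending task created 3 seconds ago.
	reportReplicationMetrics(printScope{}, 1050, 1000, time.Now().Add(-3*time.Second))
}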

How did you test it?
Unit tests.

Potential risks

Release notes

Documentation Changes

		readLevel = task.TaskID
		if replicationTask != nil {
			replicationTasks = append(replicationTasks, replicationTask)
		}
	}
	taskGeneratedTimer.Stop()

	// Report the raw replication lag: how far the oldest unprocessed task trails the shard's max read level.
	t.scope.RecordTimer(metrics.ReplicationTasksLagRaw, time.Duration(t.ackLevels.GetTransferMaxReadLevel()-oldestUnprocessedTaskID))
Member:

I might be missing how this will help with "If passive cluster cannot fetch tasks, does not report replication lag". oldestUnprocessedTaskID is almost identical to readLevel except when the loop breaks, so it can be off by one?
I'd assume we would want to report the lag based on lastReadTaskID when t.reader.Read() returns an error. That part is untouched.

3vilhamster (Author):

Good point. However, the issue was that we never update readLevel if we fail in t.store.Get (e.g. due to hitting MaxQPS or Cassandra failures), so we never reported any lag because the value was never updated.
I'll try it out and add a check to the test case to verify what we get from the metric.
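
A rough, self-contained reconstruction of that failure mode; the store interface, ackManager type, and the Get signature are simplified assumptions for illustration, not the actual Cadence code.

package main

import (
	"errors"
	"fmt"
)

// store is a simplified stand-in for the task store being read.
type store interface {
	Get(readLevel int64) ([]int64, error)
}

// failingStore simulates MaxQPS throttling or a Cassandra failure.
type failingStore struct{}

func (failingStore) Get(int64) ([]int64, error) {
	return nil, errors.New("MaxQPS exceeded")
}

type ackManager struct {
	store     store
	readLevel int64
}

// readTasks advances readLevel only when the read succeeds, so while reads
// keep failing the value (and any lag metric derived from it) stays frozen.
func (a *ackManager) readTasks() error {
	taskIDs, err := a.store.Get(a.readLevel)
	if err != nil {
		return err // early return: readLevel is never updated
	}
	for _, id := range taskIDs {
		a.readLevel = id
	}
	return nil
}

func main() {
	a := &ackManager{store: failingStore{}, readLevel: 1000}
	_ = a.readTasks()
	fmt.Println("readLevel after failed read:", a.readLevel) // still 1000
}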

3vilhamster (Author):

I've updated the logic so that we always pick the oldest task from the batch and report zero delay when there are no tasks to replicate.
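
A rough sketch of the updated behaviour, assuming a hypothetical Task type with a CreationTime field; this is not the actual PR code.

package main

import (
	"fmt"
	"time"
)

// Task is a minimal, illustrative replication task.
type Task struct {
	TaskID       int64
	CreationTime time.Time
}

// replicationDelay returns the age of the oldest task in the fetched batch,
// or zero when there is nothing left to replicate.
func replicationDelay(batch []Task, now time.Time) time.Duration {
	if len(batch) == 0 {
		return 0 // no pending tasks: report no delay
	}
	oldest := batch[0].CreationTime
	for _, task := range batch[1:] {
		if task.CreationTime.Before(oldest) {
			oldest = task.CreationTime
		}
	}
	return now.Sub(oldest)
}

func main() {
	now := time.Now()
	batch := []Task{
		{TaskID: 1001, CreationTime: now.Add(-5 * time.Second)},
		{TaskID: 1002, CreationTime: now.Add(-2 * time.Second)},
	}
	fmt.Println(replicationDelay(batch, now)) // 5s
	fmt.Println(replicationDelay(nil, now))   // 0s
}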

Member:

should we also report the metrics before returning at line 98?

3vilhamster (Author):

If we fail to read history from the active side, we don't know the delay (as far as I understand), and we should be notified when we fail to pull the history.

Member:

I see. I thought we could still report lag based on lastReadTaskID


3vilhamster merged commit eeba22d into cadence-workflow:master on Feb 21, 2025
22 checks passed
3vilhamster deleted the replication-lag-time branch on February 21, 2025 at 12:01