[history] Adding more metrics for replication #6673

Merged: 3 commits into cadence-workflow:master from replication-lag-time, Feb 21, 2025

Conversation

3vilhamster (Contributor):

What changed?
Added more metrics for replication.
Simplified the code a bit to avoid extra allocations and improve readability.

Why?
We've observed two cases:

  1. If the passive cluster cannot fetch tasks, it does not report any replication lag. I've introduced replication_tasks_lag_raw, which reports the lag based on the most recently processed task.
  2. The lag in task IDs does not explain the real impact, so I've introduced replication_tasks_delay, which reports the replication delay in seconds. Ideally every task is replicated immediately, but in practice there is always some delay; this metric lets us track that delay per shard (see the sketch after this list).
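
For illustration, here is a minimal, self-contained Go sketch of the two signals described above. The Scope interface, printScope, and reportReplicationMetrics are hypothetical stand-ins for this example; only the metric names and the lag computation mirror the actual change.

package main

import (
	"fmt"
	"time"
)

// Scope is a stand-in for the shard's metrics scope (an assumption for this sketch).
type Scope interface {
	RecordTimer(name string, d time.Duration)
}

type printScope struct{}

func (printScope) RecordTimer(name string, d time.Duration) {
	fmt.Printf("%s = %v\n", name, d)
}

// reportReplicationMetrics records the two signals:
//   - replication_tasks_lag_raw: the gap in raw task IDs between the shard's
//     max read level and the oldest task that has not been processed yet;
//   - replication_tasks_delay: how long ago the oldest pending task was
//     created, i.e. the replication delay in wall-clock time, per shard.
func reportReplicationMetrics(scope Scope, maxReadLevel, oldestUnprocessedTaskID int64, oldestTaskCreationTime time.Time) {
	scope.RecordTimer("replication_tasks_lag_raw", time.Duration(maxReadLevel-oldestUnprocessedTaskID))
	scope.RecordTimer("replication_tasks_delay", time.Since(oldestTaskCreationTime))
}

func main() {
	// Example: 50 task IDs behind, oldest pending task created 3 seconds ago.
	reportReplicationMetrics(printScope{}, 1050, 1000, time.Now().Add(-3*time.Second))
}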

How did you test it?
Unit tests.

Potential risks

Release notes

Documentation Changes

		readLevel = task.TaskID
		if replicationTask != nil {
			replicationTasks = append(replicationTasks, replicationTask)
		}
	}
	taskGeneratedTimer.Stop()

	// Report the raw replication lag: how far the oldest unprocessed task trails the shard's max read level.
	t.scope.RecordTimer(metrics.ReplicationTasksLagRaw, time.Duration(t.ackLevels.GetTransferMaxReadLevel()-oldestUnprocessedTaskID))
Member:

I might be missing how this will help with "If passive cluster cannot fetch tasks, does not report replication lag". oldestUnprocessedTaskID is almost identical to readLevel except when the loop breaks, so it can be off by one?
I'd assume we would want to report the lag based on lastReadTaskID when t.reader.Read() returns an error. That part is untouched.

3vilhamster (Author):

Good point. However, the issue was that we never update readLevel if we fail in t.store.Get (e.g. due to hitting MaxQPS or Cassandra failures), so we never reported any lag because the value was never updated.
I'll try it out and add a check to the test case to verify what we get from the metric.
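
A rough, self-contained reconstruction of that failure mode; the store interface, ackManager type, and the Get signature are simplified assumptions for illustration, not the actual Cadence code.

package main

import (
	"errors"
	"fmt"
)

// store is a simplified stand-in for the task store being read.
type store interface {
	Get(readLevel int64) ([]int64, error)
}

// failingStore simulates MaxQPS throttling or a Cassandra failure.
type failingStore struct{}

func (failingStore) Get(int64) ([]int64, error) {
	return nil, errors.New("MaxQPS exceeded")
}

type ackManager struct {
	store     store
	readLevel int64
}

// readTasks advances readLevel only when the read succeeds, so while reads
// keep failing the value (and any lag metric derived from it) stays frozen.
func (a *ackManager) readTasks() error {
	taskIDs, err := a.store.Get(a.readLevel)
	if err != nil {
		return err // early return: readLevel is never updated
	}
	for _, id := range taskIDs {
		a.readLevel = id
	}
	return nil
}

func main() {
	a := &ackManager{store: failingStore{}, readLevel: 1000}
	_ = a.readTasks()
	fmt.Println("readLevel after failed read:", a.readLevel) // still 1000
}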

3vilhamster (Author):

I've updated the logic so that we always pick the oldest task from the batch and report zero delay when there are no tasks to replicate.
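
A rough sketch of the updated behaviour, assuming a hypothetical Task type with a CreationTime field; this is not the actual PR code.

package main

import (
	"fmt"
	"time"
)

// Task is a minimal, illustrative replication task.
type Task struct {
	TaskID       int64
	CreationTime time.Time
}

// replicationDelay returns the age of the oldest task in the fetched batch,
// or zero when there is nothing left to replicate.
func replicationDelay(batch []Task, now time.Time) time.Duration {
	if len(batch) == 0 {
		return 0 // no pending tasks: report no delay
	}
	oldest := batch[0].CreationTime
	for _, task := range batch[1:] {
		if task.CreationTime.Before(oldest) {
			oldest = task.CreationTime
		}
	}
	return now.Sub(oldest)
}

func main() {
	now := time.Now()
	batch := []Task{
		{TaskID: 1001, CreationTime: now.Add(-5 * time.Second)},
		{TaskID: 1002, CreationTime: now.Add(-2 * time.Second)},
	}
	fmt.Println(replicationDelay(batch, now)) // 5s
	fmt.Println(replicationDelay(nil, now))   // 0s
}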

Member:

should we also report the metrics before returning at line 98?

3vilhamster (Author):

If we fail to read history from the active side, we don't know the delay (as far as I understand), and we should be notified when we fail to pull the history.

Member:

I see. I thought we could still report lag based on lastReadTaskID


3vilhamster merged commit eeba22d into cadence-workflow:master on Feb 21, 2025
22 checks passed
3vilhamster deleted the replication-lag-time branch on February 21, 2025 at 12:01