[history] Adding more metrics for replication #6673
Conversation
```go
		readLevel = task.TaskID
		if replicationTask != nil {
			replicationTasks = append(replicationTasks, replicationTask)
		}
	}
	taskGeneratedTimer.Stop()

	t.scope.RecordTimer(metrics.ReplicationTasksLagRaw, time.Duration(t.ackLevels.GetTransferMaxReadLevel()-oldestUnprocessedTaskID))
```
I might be missing how this helps with "If passive cluster cannot fetch tasks, does not report replication lag". `oldestUnprocessedTaskID` is almost identical to `readLevel`, except when the loop breaks, so it can be off by one? I'd assume we would want to report the lag based on `lastReadTaskID` when `t.reader.Read()` returns an error. That part is untouched.
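For concreteness, a minimal self-contained sketch of that suggestion, with stub types standing in for the real reader and metrics scope (all names, signatures, and values below are assumptions for illustration, not the code in this PR):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// replicationTasksLagRaw is a stand-in metric name for illustration only.
const replicationTasksLagRaw = "replication_tasks_lag_raw"

// metricsScope is a minimal stub for the real metrics scope.
type metricsScope struct{}

func (metricsScope) RecordTimer(name string, d time.Duration) {
	fmt.Printf("timer %s = %d\n", name, int64(d))
}

// readTasks simulates t.reader.Read failing, e.g. due to rate limiting or a
// storage error on the passive side.
func readTasks(readLevel int64) ([]int64, error) {
	return nil, errors.New("read failed")
}

func main() {
	scope := metricsScope{}
	maxReadLevel := int64(1000) // stand-in for t.ackLevels.GetTransferMaxReadLevel()
	lastReadTaskID := int64(940)

	if _, err := readTasks(lastReadTaskID); err != nil {
		// Even though the read failed, report lag from the last task that was
		// successfully read, so the passive cluster still surfaces how far behind it is.
		scope.RecordTimer(replicationTasksLagRaw, time.Duration(maxReadLevel-lastReadTaskID))
	}
}
```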
Good point. However, the issue was: we never update `readLevel` if we fail in `t.store.Get` (e.g. due to hitting MaxQPS or Cassandra failures), so we never reported any lag because the value was never updated.
I'll try it out and add a check to the test case to verify what we get from the metric.
I've updated the logic so that we always pick the oldest task from the batch and report no delay if there are no tasks to replicate.
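A rough, self-contained sketch of that updated behaviour, using stub types for the metrics scope (names and values are illustrative, not the actual change):

```go
package main

import (
	"fmt"
	"time"
)

// Stand-ins for the real metrics plumbing; names are illustrative only.
const replicationTasksLagRaw = "replication_tasks_lag_raw"

type metricsScope struct{}

func (metricsScope) RecordTimer(name string, d time.Duration) {
	fmt.Printf("timer %s = %d\n", name, int64(d))
}

// reportLag mirrors the described behaviour: lag is measured against the
// oldest (first) task ID in the fetched batch; with an empty batch, zero lag
// is reported because there is nothing left to replicate.
func reportLag(scope metricsScope, maxReadLevel int64, batchTaskIDs []int64) {
	oldestUnprocessedTaskID := maxReadLevel // no tasks => lag of zero
	if len(batchTaskIDs) > 0 {
		oldestUnprocessedTaskID = batchTaskIDs[0]
	}
	scope.RecordTimer(replicationTasksLagRaw, time.Duration(maxReadLevel-oldestUnprocessedTaskID))
}

func main() {
	scope := metricsScope{}
	reportLag(scope, 1000, []int64{940, 950, 960}) // reports 60
	reportLag(scope, 1000, nil)                    // reports 0
}
```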
Should we also report the metrics before returning at line 98?
If we fail to read history from the active side, we don't know the delay (as far as I understand), and we should be notified when we fail to pull the history.
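If the failure path should only alert rather than report a lag, a hedged sketch of what that could look like (the counter name and stubs below are hypothetical, not the metric added in this PR):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical counter name; the actual metric may differ.
const replicationTasksFetchErrors = "replication_tasks_fetch_errors"

type metricsScope struct{}

func (metricsScope) IncCounter(name string) { fmt.Println("counter", name, "+1") }

// fetchHistory simulates a failure to pull history from the active cluster.
func fetchHistory() error { return errors.New("active side unavailable") }

func main() {
	scope := metricsScope{}
	if err := fetchHistory(); err != nil {
		// The delay is unknown here, so emit a failure counter to alert on
		// instead of guessing a lag value.
		scope.IncCounter(replicationTasksFetchErrors)
	}
}
```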
I see. I thought we could still report lag based on `lastReadTaskID`.
Branch updated from 9399745 to 5232f2a.
What changed?
Added more metrics for replication.
Simplified the code a bit to avoid extra allocations and improve readability.
Why?
We've observed two cases:
How did you test it?
Unit tests.
Potential risks
Release notes
Documentation Changes