Add label to staleness metrics to distinguish published vs duplicate entries #93
+18
−9
Oplogtoredis runs two replicas in parallel, both of which write to Redis, with dedup done by a Lua script. However, each replica reports its latency to reach Redis regardless of whether it was the first to get there. This means that if one replica is lagging behind, it will report larger latency values than what's actually reaching Redis, and we have no way to filter these out, since they are reported later than the earlier successful write from the other replica.
This adds a return value to the dedup script and uses it as a label for the latency metrics.
Note: for some reason, returning true/false from the Lua script caused the performance and fault_injection tests to fail (after an extra-long time). Switching to an int return value of 1 or 2 worked. Testing showed that returning false was the problem, and a search found this, which seems to align with the issue: a Lua script returning false is treated as an error, which then caused OTR to retry (until the dedupe token expired). I opened an issue with go-redis.
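For reviewers, here is a rough sketch of the idea, not the actual diff. The script body, the key layout, the label names (`published`, `duplicate`), and the assignment of 1 and 2 to those labels are all assumptions made for illustration. As far as I can tell, the reason false misbehaves is that Redis converts a Lua `false` into a nil reply, which a client can surface as an error rather than a value, so the sketch only ever returns integers.

```go
package main

import "fmt"

// dedupScript is a hypothetical dedup script, not the one in this repo.
// KEYS[1] is the dedup key, KEYS[2] the publish channel,
// ARGV[1] the dedup TTL in seconds, ARGV[2] the message.
// It returns an integer instead of a boolean, because a Lua false
// becomes a nil reply and may be reported by the client as an error.
const dedupScript = `
if redis.call("SET", KEYS[1], 1, "NX", "EX", ARGV[1]) then
  redis.call("PUBLISH", KEYS[2], ARGV[2])
  return 1
end
return 2
`

// statusLabel maps the script's integer result to a metric label value.
// The mapping (1 = published, 2 = duplicate) is an assumption.
func statusLabel(result int64) (string, error) {
	switch result {
	case 1:
		return "published", nil
	case 2:
		return "duplicate", nil
	default:
		return "", fmt.Errorf("unexpected dedup script result: %d", result)
	}
}

func main() {
	// Simulated script results; a real caller would get these from Redis.
	for _, r := range []int64{1, 2, 3} {
		label, err := statusLabel(r)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("status label:", label)
	}
}
```

With a label like this on the latency metric, dashboards can filter to the entries that were actually published and ignore the slower duplicate write from the lagging replica.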