-
Notifications
You must be signed in to change notification settings - Fork 815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added exponential retry to the domain cache #6676
Added exponential retry to the domain cache #6676
Conversation
ac53f8f
to
7991588
Compare
} | ||
err := c.refreshDomains() | ||
if err != nil { | ||
c.logger.Error("Error refreshing domain cache", tag.Error(err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could consider logging fatal here? Since we already retryed quite a lot, and we still failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave it like this. It may have failed due to persistence rps limits and in those cases we don't want to get into crash loop. Ideally fetching domains shouldn't be subject to the same limits. @davidporter-id-au will look into granular rate limits area.
common/util.go
Outdated
@@ -88,6 +88,10 @@ const ( | |||
taskCompleterMaxInterval = 10 * time.Second | |||
taskCompleterExpirationInterval = 5 * time.Minute | |||
|
|||
domainCacheInitialInterval = 1 * time.Second | |||
domainCacheMaxInterval = 20 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I request we make this lower, like 5 seconds I think? I think there's the problem that this logic is smeared around in a few places (at least
cadence/common/cache/domainCache.go
Line 63 in e36c818
// DomainCacheRefreshFailureRetryInterval is the wait time |
cadence/service/history/task/task.go
Line 311 in e36c818
if t.timeSource.Now().Sub(t.submitTime) > 5*cache.DomainCacheRefreshInterval { |
I agree it's the subject of unnecessary load, but the domain cache (in)consistency across hosts during transfer tasks is a subject of some pretty difficult-to-figure out bugs during failover (because a child workflow might go to a different history host, with a different view of which region is active, and if hte domain cache doesn't update within a reasonable time-period we have no clear resolution strategy)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure - that's also a good justification for looking into this more. Since in the end, there is no guarantee that this will succeed in any reasonable time when it has already started failing.
7991588
to
64f975f
Compare
a8bd03f
to
fb5586b
Compare
What changed? Added a proper retry policy to the domain cache so it does not just retry every second Why? The load from retrying every second can cause too much load on the database. How did you test it? Unit test Potential risks Might cause domain data to be out of date, but we are already in that situation when it's failing. Release notes Documentation Changes
What changed? Added a proper retry policy to the domain cache so it does not just retry every second Why? The load from retrying every second can cause too much load on the database. How did you test it? Unit test Potential risks Might cause domain data to be out of date, but we are already in that situation when it's failing. Release notes Documentation Changes
What changed?
Added a proper retry policy to the domain cache so it does not just retry every second
Why?
The load from retrying every second can cause too much load on the database.
How did you test it?
Unit test
Potential risks
Might cause domain data to be out of date, but we are already in that situation when it's failing.
Release notes
Documentation Changes