Configurable error handling and retrying in HTTP target #330
Conversation
Drafting:

* Flexible configuration allowing us to categorize HTTP responses and match specific retrying strategies.
* Very simple application health store as a global variable.
* Very simple monitoring interface used to send alerts during retries in the HTTP target.

It doesn't compile yet! It's just to show what such configuration could look like and to validate the approach.
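As a rough illustration of the "very simple application health store as a global variable" item above, it could be as small as the following sketch (package and function names are hypothetical, not the actual draft code):

```go
// Hypothetical sketch of a process-wide health flag; names are illustrative only.
package monitoring

import "sync/atomic"

// healthy is the single global flag a liveness endpoint would read.
var healthy atomic.Bool

// SetHealthy marks the app healthy, e.g. after a successful write.
func SetHealthy() { healthy.Store(true) }

// SetUnhealthy marks the app unhealthy, e.g. after a failed retry.
func SetUnhealthy() { healthy.Store(false) }

// IsHealthy reports the current health status.
func IsHealthy() bool { return healthy.Load() }
```

A Kubernetes liveness endpoint could then return 200 or 503 based on IsHealthy().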
target {
  use "http" {
    // by default no custom invalid/retryable rules so we use current default retrying strategy for target failures in `sourceWriteFunc` (in cli.go)
Is the idea that we have retries in the target, but also retain the retry that wraps it in the sourceWriteFunc?
Yes, that was the initial idea: don't touch sourceWriteFunc and treat it as a fallback. That way I know I don't break anything for other targets.
But now, looking at other targets (e.g. Kinesis), the clients we use there already have retrying implemented under the hood, correct? So now that the HTTP target is covered by its own retries, do we need this extra layer of retrying in sourceWriteFunc at all? We could simply make it explicit that retrying is the responsibility of a target.
That would be similar to what we have in loaders.
Well, we will make sure we don't break anything for other targets regardless :D
My initial assumption about what we'd do here was that it should replace the way we currently do retries, but I hadn't really considered what you're suggesting here.
Thinking about it now, my initial thought is that having a retry wrapped in a retry makes it more complicated to follow - and one of the things I'm keen for us to improve about Snowbridge is that it's quite complicated to support, with lots of little details like this that are hard to follow.
I'm not sure about the Kinesis implementation, but the EventHubs target is a great example of this - there's a MaxAutoRetries config option, which is the max retry setting for the EH client itself. But that is also wrapped in a retry, and the API is very confusing here. One would reasonably expect that setting to dictate how many retries are made in total.
But it doesn't: the actual maximum number of retries is 5x that setting. Why 5x? Because that's what we hardcoded into the app long before we built that target. How do we expect a user, or one of our own support team, to understand this when there's some issue? Well 🤷 - we do have it in the docs, I believe, but good design is when the app does what someone would reasonably expect by intuition.
My gut says there's a nice design to be found that makes retries the responsibility of the target, like you've described. So if it's possible, one outcome of this spike would be to flesh that out and get a grip on how hard it is to do (and how we can do it with confidence that we don't break things). :)
    // not only http status but check if response body matches. Extract some part of body (e.g. error code) and check if it's equal/not equal to value.
    body {
      path: "some jq here??"
Interesting way to solve the problem. Might well work. Might be suitable for other targets too - worth figuring that out.
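For what it's worth, a rough sketch of how the jq-style body matching could be implemented, assuming a jq library such as github.com/itchyny/gojq (the function name and parameters here are hypothetical):

```go
package target

import (
	"encoding/json"

	"github.com/itchyny/gojq"
)

// bodyMatches runs a jq expression (the configured "path") against the
// response body and reports whether any result equals the expected value.
func bodyMatches(body []byte, jqPath string, expected interface{}) (bool, error) {
	query, err := gojq.Parse(jqPath) // e.g. ".error.code"
	if err != nil {
		return false, err
	}

	var parsed interface{}
	if err := json.Unmarshal(body, &parsed); err != nil {
		return false, err
	}

	iter := query.Run(parsed)
	for {
		v, ok := iter.Next()
		if !ok {
			break
		}
		if qErr, isErr := v.(error); isErr {
			return false, qErr
		}
		if v == expected {
			return true, nil
		}
	}
	return false, nil
}
```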
    // For these retryable errors (assuming we have health check):
    // - set app status as unhealthy for each failed retry
Do the streaming loaders have this concept of being set as unhealthy after a failed retry? Are there very few retries?
Would like to understand/think through this bit together. We have an app here that makes thousands of requests per second, and depending on the target, 10-15% of them might get rejected and retried.
If we make it configurable this might not be a problem, but I'd like to understand it better regardless. :)
> Do the streaming loaders have this concept of being set as unhealthy after a failed retry? Are there very few retries?

Yes, kind of. But in loaders we do that for synchronous actions like init table, create columns, open channel. If it's a transient error or a setup error, we mark the app as unhealthy.
Here I think it's a bit different. We can have a lot of concurrent writes to the HTTP target, right? So we may end up in the situation you described - 15% of requests are retried (some transient issue) and they mark the app as unhealthy, while the remaining 85% are OK and set it back to healthy.
So we may have a situation where the health toggle flips back and forth, and we're unlucky with the periodic Kubernetes liveness probe: it only ever sees the unhealthy state... and kills the pod, even though overall everything looks pretty good?
It would make sense to mark the app unhealthy only for those persistent, setup-like issues. I could add a configuration option like toggle_health to the retry strategy and set it to true for our default setup strategy. Or, instead of a boolean, use a threshold like unhealthy_after with the number of retries after which we mark the app as unhealthy...?
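To make the unhealthy_after idea concrete, a minimal sketch could look like this (type and setting names are hypothetical, not the actual draft code):

```go
package monitoring

import "sync"

// thresholdHealthStore marks the app unhealthy only once the number of
// consecutive failed retries reaches the hypothetical unhealthy_after
// setting, and flips back to healthy on the next success.
type thresholdHealthStore struct {
	mu             sync.Mutex
	failedRetries  int
	unhealthyAfter int
}

// RetryFailed records one more failed retry attempt.
func (h *thresholdHealthStore) RetryFailed() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.failedRetries++
}

// Succeeded resets the failure counter after a successful request.
func (h *thresholdHealthStore) Succeeded() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.failedRetries = 0
}

// Healthy reports whether the failure count is still below the threshold.
func (h *thresholdHealthStore) Healthy() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.failedRetries < h.unhealthyAfter
}
```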
> Here I think it's a bit different. We can have a lot of concurrent writes to the HTTP target, right? So we may end up in the situation you described - 15% of requests are retried (some transient issue) and they mark the app as unhealthy, while the remaining 85% are OK and set it back to healthy.

Pretty much - what's more, we can expect to sometimes be in that state for a long time (e.g. when being rate limited during a spike of traffic), and we actually rely on getting failures in those scenarios. So something feels somewhat odd about treating individual failures as meaning 'unhealthy' status.

> So we may have a situation where the health toggle flips back and forth, and we're unlucky with the periodic Kubernetes liveness probe: it only ever sees the unhealthy state... and kills the pod, even though overall everything looks pretty good?

I'm not sure how much of a real concern this scenario is, because we'd likely have many more successes than failures. But even so, something feels odd about the design here.

> It would make sense to mark the app unhealthy only for those persistent, setup-like issues. I could add a configuration option like toggle_health to the retry strategy and set it to true for our default setup strategy. Or, instead of a boolean, use a threshold like unhealthy_after with the number of retries after which we mark the app as unhealthy...?

If we're only concerned with setup issues, then one option is a global flag marking whether any request has been successful: on any success we'd set it to true; on failure we wouldn't set it back to false, but we would mark the app as unhealthy.
But I'm not sure about this - would you mind jotting down a quick synopsis of this scenario? (Maybe add a details section to the existing doc - but anywhere is fine.)
I feel like getting a grip on the desired behaviour for this type of app might lead us to a more confident solution. :)
    // - if we run out of attempts => exit/crash/restart
    // - if we don't have max attempts (e.g. like setup errors), we don't get stuck retrying indefinitely, because we are protected by health check. If app is unhealthy for extended period of time, it will be killed eventually by kubernetes.
    retryable = [
      { http_status: ["503"], strategy: "transient"},
What does 'strategy' mean in this config? It's 'type of error to treat this as'. Got it.
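For illustration, the rules above could map to something like the following on the Go side (a sketch only; the actual field and type names in the draft may differ):

```go
package target

import "strconv"

// RetryRule pairs matching criteria (HTTP status, and potentially body checks)
// with a named retrying strategy such as "transient" or "setup".
type RetryRule struct {
	HTTPStatus []string `hcl:"http_status,optional"`
	Strategy   string   `hcl:"strategy"`
}

// matchStrategy returns the strategy of the first rule matching the status
// code, or "" when no rule matches and the default behaviour should apply.
func matchStrategy(rules []RetryRule, statusCode int) string {
	status := strconv.Itoa(statusCode)
	for _, rule := range rules {
		for _, s := range rule.HTTPStatus {
			if s == status {
				return rule.Strategy
			}
		}
	}
	return ""
}
```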
@@ -49,6 +84,9 @@ type HTTPTargetConfig struct {
	OAuth2ClientSecret string `hcl:"oauth2_client_secret,optional" env:"TARGET_HTTP_OAUTH2_CLIENT_SECRET"`
	OAuth2RefreshToken string `hcl:"oauth2_refresh_token,optional" env:"TARGET_HTTP_OAUTH2_REFRESH_TOKEN"`
	OAuth2TokenURL string `hcl:"oauth2_token_url,optional" env:"TARGET_HTTP_OAUTH2_TOKEN_URL"`
//we have flat config structure everywhere so not sure if it's good idea to add struct here? |
No objection on principle. As long as the config itself is legible.
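For example, a nested block alongside the existing flat fields could look roughly like this (hypothetical names, standard gohcl block tags assumed, not the actual draft code):

```go
// ResponseRulesConfig groups the invalid/retryable rules under one block.
type ResponseRulesConfig struct {
	Invalid   []RetryRule `hcl:"invalid,block"`
	Retryable []RetryRule `hcl:"retryable,block"`
}

type HTTPTargetConfig struct {
	// ... existing flat fields (URL, OAuth2 settings, etc.) ...
	ResponseRules *ResponseRulesConfig `hcl:"response_rules,block"`
}
```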
	return nil
}

if ht.isInvalid(response) {
Oh this is very interesting. Configuring the flow of data around the app as well. Nice idea
}

func (ht *HTTPTarget) retry(config RetryConfig, action func() (interface{}, error)) interface{} {
	// Should we use some third-party retrying library? Where we can configure more stuff.
If we find a well supported library that does it, let's look at that - especially if it can simplify our implementation.
We also have the option of adding what we need to this package.
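If we do go the library route, one well-supported option is github.com/cenkalti/backoff/v4; a minimal sketch of what the wrapper could collapse to (the function here is hypothetical, not the actual draft code):

```go
package target

import "github.com/cenkalti/backoff/v4"

// retryWithBackoff retries action with exponential backoff, allowing up to
// maxRetries additional attempts after the first try.
func retryWithBackoff(maxRetries uint64, action func() error) error {
	policy := backoff.WithMaxRetries(backoff.NewExponentialBackOff(), maxRetries)
	return backoff.Retry(action, policy)
}
```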
This is great @pondzix! I left some thoughts - broadly love the ideas & direction here. Two things I'm thinking about on this design:

1. Whether retries should become the responsibility of the target itself, rather than the sourceWriteFunc wrapper.
2. How retries should interact with the app health status.

I'm not sure I have a firm opinion on the second yet, just that I would be keen to understand it and think it through. :)
Force-pushed from b3c04cb to d688908.
Implemented partially in 9fdf373.
Jira ref: PDP-1203
In general, the goal is to make the default error handling and retries similar to what we have in our new streaming loaders.