Configurable error handling and retrying in HTTP target #330
Conversation
Drafting:

* Flexible configuration allowing us to categorize HTTP responses and match specific retrying strategies.
* Very simple application health store as a global variable.
* Very simple monitoring interface used to send alerts during retries in the HTTP target.

It doesn't compile yet! It's just to show what such configuration could look like and to validate the approach.
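As a rough illustration of the "very simple application health store as a global variable" item above, it could be as small as the following sketch (package and function names are hypothetical, not the actual draft code):

```go
// Hypothetical sketch of a process-wide health flag; names are illustrative only.
package monitoring

import "sync/atomic"

// healthy is the single global flag a liveness endpoint would read.
var healthy atomic.Bool

// SetHealthy marks the app healthy, e.g. after a successful write.
func SetHealthy() { healthy.Store(true) }

// SetUnhealthy marks the app unhealthy, e.g. after a failed retry.
func SetUnhealthy() { healthy.Store(false) }

// IsHealthy reports the current health status.
func IsHealthy() bool { return healthy.Load() }
```

A Kubernetes liveness endpoint could then return 200 or 503 based on IsHealthy().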
target {
  use "http" {
    // by default no custom invalid/retryable rules so we use current default retrying strategy for target failures in `sourceWriteFunc` (in cli.go)
Is the idea that we have retries in the target, but also retain the retry that wraps it in the sourceWriteFunc?
Yes, that was the initial idea: don't touch sourceWriteFunc and treat it as a fallback. That way I know I don't break anything for other targets.
But now, looking at other targets (e.g. Kinesis), the clients we use there already have retrying implemented under the hood, correct? So now that the HTTP target is covered by its own retries, do we need this extra layer of retrying in sourceWriteFunc at all? We could simply make it explicit that retrying is the responsibility of a target.
That would be similar to what we have in loaders.
Well, we will make sure we don't break anything for other targets regardless :D
My initial assumption about what we'd do here was that it should replace the way we currently do retries, but I hadn't really considered what you're suggesting here.
Thinking about it now, my initial thought is that having a retry wrapped in a retry makes it more complicated to follow - and one of the things I'm keen for us to improve about Snowbridge is that it's quite complicated to support, with lots of little details like this that are hard to follow.
I'm not sure about the Kinesis implementation, but the EventHubs target is a great example of this - there's a MaxAutoRetries config option, which is the max retry setting for the EH client itself. But that is also wrapped in a retry, and the API is very confusing here. One would reasonably expect that setting to dictate how many retries are made in total.
But it doesn't: the actual maximum number of retries is 5x that setting. Why 5x? Because that's what we hardcoded into the app long before we built that target. How do we expect a user, or one of our own support team, to understand this when there's some issue? Well 🤷 - we do have it in the docs, I believe, but good design is when the app does what someone would reasonably expect by intuition.
My gut says there's a nice design to be found that makes retries the responsibility of the target, like you've described. So if it's possible, one outcome of this spike would be to flesh that out and get a grip on how hard it is to do (and how we can do it with confidence that we don't break things). :)
    // not only http status but check if response body matches. Extract some part of body (e.g. error code) and check if it's equal/not equal to value.
    body {
      path: "some jq here??"
Interesting way to solve the problem. Might well work. Might be suitable for other targets too - worth figuring that out.
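For what it's worth, a rough sketch of how the jq-style body matching could be implemented, assuming a jq library such as github.com/itchyny/gojq (the function name and parameters here are hypothetical):

```go
package target

import (
	"encoding/json"

	"github.com/itchyny/gojq"
)

// bodyMatches runs a jq expression (the configured "path") against the
// response body and reports whether any result equals the expected value.
func bodyMatches(body []byte, jqPath string, expected interface{}) (bool, error) {
	query, err := gojq.Parse(jqPath) // e.g. ".error.code"
	if err != nil {
		return false, err
	}

	var parsed interface{}
	if err := json.Unmarshal(body, &parsed); err != nil {
		return false, err
	}

	iter := query.Run(parsed)
	for {
		v, ok := iter.Next()
		if !ok {
			break
		}
		if qErr, isErr := v.(error); isErr {
			return false, qErr
		}
		if v == expected {
			return true, nil
		}
	}
	return false, nil
}
```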
    // For these retryable errors (assuming we have health check):
    // - set app status as unhealthy for each failed retry
Do the streaming loaders have this concept of being set as unhealthy after a failed retry? Are there very few retries?
Would like to understand/think through this bit together. We have an app here that makes thousands of requests per second, and depending on the target, 10-15% of them might get rejected and retried.
If we make it configurable this might not be a problem, but I'd like to understand it better regardless. :)
> Do the streaming loaders have this concept of being set as unhealthy after a failed retry? Are there very few retries?

Yes, kind of. But in loaders we do that for synchronous actions like init table, create columns, open channel. If it's a transient error or a setup error, we mark the app as unhealthy.
Here I think it's a bit different. We can have a lot of concurrent writes to the HTTP target, right? So we may end up in the situation you described - 15% of requests are retried (some transient issue) and they mark the app as unhealthy, while the remaining 85% are OK and set it back to healthy.
So we may have a situation where the health toggle flips back and forth, and we're unlucky with the periodic Kubernetes liveness probe: it only ever sees the unhealthy state... and kills the pod, even though overall everything looks pretty good?
It would make sense to mark the app unhealthy only for those persistent, setup-like issues. I could add a configuration option like toggle_health to the retry strategy and set it to true for our default setup strategy. Or, instead of a boolean, use a threshold like unhealthy_after with the number of retries after which we mark the app as unhealthy...?
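To make the unhealthy_after idea concrete, a minimal sketch could look like this (type and setting names are hypothetical, not the actual draft code):

```go
package monitoring

import "sync"

// thresholdHealthStore marks the app unhealthy only once the number of
// consecutive failed retries reaches the hypothetical unhealthy_after
// setting, and flips back to healthy on the next success.
type thresholdHealthStore struct {
	mu             sync.Mutex
	failedRetries  int
	unhealthyAfter int
}

// RetryFailed records one more failed retry attempt.
func (h *thresholdHealthStore) RetryFailed() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.failedRetries++
}

// Succeeded resets the failure counter after a successful request.
func (h *thresholdHealthStore) Succeeded() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.failedRetries = 0
}

// Healthy reports whether the failure count is still below the threshold.
func (h *thresholdHealthStore) Healthy() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.failedRetries < h.unhealthyAfter
}
```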
> Here I think it's a bit different. We can have a lot of concurrent writes to the HTTP target, right? So we may end up in the situation you described - 15% of requests are retried (some transient issue) and they mark the app as unhealthy, while the remaining 85% are OK and set it back to healthy.

Pretty much - what's more, we can expect to sometimes be in that state for a long time (e.g. when being rate limited during a spike of traffic), and we actually rely on getting failures in those scenarios. So something feels somewhat odd about treating individual failures as meaning 'unhealthy' status.

> So we may have a situation where the health toggle flips back and forth, and we're unlucky with the periodic Kubernetes liveness probe: it only ever sees the unhealthy state... and kills the pod, even though overall everything looks pretty good?

I'm not sure how much of a real concern this scenario is, because we'd likely have many more successes than failures. But even so, something feels odd about the design here.

> It would make sense to mark the app unhealthy only for those persistent, setup-like issues. I could add a configuration option like toggle_health to the retry strategy and set it to true for our default setup strategy. Or, instead of a boolean, use a threshold like unhealthy_after with the number of retries after which we mark the app as unhealthy...?

If we're only concerned with setup issues, then one option is a global flag marking whether any request has been successful: on any success we'd set it to true; on failure we wouldn't set it back to false, but we would mark the app as unhealthy.
But I'm not sure about this - would you mind jotting down a quick synopsis of this scenario? (Maybe add a details section to the existing doc - but anywhere is fine.)
I feel like getting a grip on the desired behaviour for this type of app might lead us to a more confident solution. :)
    // - if we run out of attempts => exit/crash/restart
    // - if we don't have max attempts (e.g. like setup errors), we don't get stuck retrying indefinitely, because we are protected by health check. If app is unhealthy for extended period of time, it will be killed eventually by kubernetes.
    retryable = [
      { http_status: ["503"], strategy: "transient"},
What does 'strategy' mean in this config? It's 'type of error to treat this as'. Got it.
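For illustration, the rules above could map to something like the following on the Go side (a sketch only; the actual field and type names in the draft may differ):

```go
package target

import "strconv"

// RetryRule pairs matching criteria (HTTP status, and potentially body checks)
// with a named retrying strategy such as "transient" or "setup".
type RetryRule struct {
	HTTPStatus []string `hcl:"http_status,optional"`
	Strategy   string   `hcl:"strategy"`
}

// matchStrategy returns the strategy of the first rule matching the status
// code, or "" when no rule matches and the default behaviour should apply.
func matchStrategy(rules []RetryRule, statusCode int) string {
	status := strconv.Itoa(statusCode)
	for _, rule := range rules {
		for _, s := range rule.HTTPStatus {
			if s == status {
				return rule.Strategy
			}
		}
	}
	return ""
}
```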
@@ -49,6 +84,9 @@ type HTTPTargetConfig struct {
	OAuth2ClientSecret string `hcl:"oauth2_client_secret,optional" env:"TARGET_HTTP_OAUTH2_CLIENT_SECRET"`
	OAuth2RefreshToken string `hcl:"oauth2_refresh_token,optional" env:"TARGET_HTTP_OAUTH2_REFRESH_TOKEN"`
	OAuth2TokenURL string `hcl:"oauth2_token_url,optional" env:"TARGET_HTTP_OAUTH2_TOKEN_URL"`
//we have flat config structure everywhere so not sure if it's good idea to add struct here? |
No objection on principle. As long as the config itself is legible.
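For example, a nested block alongside the existing flat fields could look roughly like this (hypothetical names, standard gohcl block tags assumed, not the actual draft code):

```go
// ResponseRulesConfig groups the invalid/retryable rules under one block.
type ResponseRulesConfig struct {
	Invalid   []RetryRule `hcl:"invalid,block"`
	Retryable []RetryRule `hcl:"retryable,block"`
}

type HTTPTargetConfig struct {
	// ... existing flat fields (URL, OAuth2 settings, etc.) ...
	ResponseRules *ResponseRulesConfig `hcl:"response_rules,block"`
}
```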
	return nil
}

if ht.isInvalid(response) {
Oh this is very interesting. Configuring the flow of data around the app as well. Nice idea
}

func (ht *HTTPTarget) retry(config RetryConfig, action func() (interface{}, error)) interface{} {
	// Should we use some third-party retrying library? Where we can configure more stuff.
If we find a well supported library that does it, let's look at that - especially if it can simplify our implementation.
We also have the option of adding what we need to this package.
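If we do go the library route, one well-supported option is github.com/cenkalti/backoff/v4; a minimal sketch of what the wrapper could collapse to (the function here is hypothetical, not the actual draft code):

```go
package target

import "github.com/cenkalti/backoff/v4"

// retryWithBackoff retries action with exponential backoff, allowing up to
// maxRetries additional attempts after the first try.
func retryWithBackoff(maxRetries uint64, action func() error) error {
	policy := backoff.WithMaxRetries(backoff.NewExponentialBackOff(), maxRetries)
	return backoff.Retry(action, policy)
}
```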
This is great @pondzix! I left some thoughts - broadly love the ideas & direction here. Two things I'm thinking about on this design:

1. Whether retries should become the responsibility of the target itself, rather than the sourceWriteFunc wrapper.
2. How retries should interact with the app health status.

I'm not sure I have a firm opinion on the second yet, just that I would be keen to understand it and think it through. :)
Force-pushed from b3c04cb to d688908.
Implemented partially in 9fdf373.
Jira ref: PDP-1203
In general, the goal is to make the default error handling and retries similar to what we have in our new streaming loaders.