Configurable error handling and retrying in HTTP target #330

Closed - wants to merge 1 commit
24 changes: 24 additions & 0 deletions assets/docs/configuration/targets/default.hcl
@@ -0,0 +1,24 @@
target {
  use "http" {

    // By default there are no custom invalid/retryable rules, so we use the current default retrying strategy for target failures in `sourceWriteFunc` (in cli.go).
Collaborator:
Is the idea that we have retries in the target, but also retain the retry that wraps it in the sourceWriteFunc?

Contributor Author:
Yes, that was the initial idea: don't touch sourceWriteFunc and treat it as a fallback. Then I know I don't break anything for other targets.

But now, looking at other targets (e.g. Kinesis), the clients that we use there already have retrying implemented under the hood, correct? So now that the HTTP target is covered with retries, do we need this extra layer of retrying in sourceWriteFunc at all? We could simply make it explicit that retrying is the responsibility of a target.

That would be similar to what we have in loaders.

Collaborator:
Well, we will make sure we don't break anything for other targets regardless :D

My initial assumption about what we'd do here was that it should replace the way we currently do retries, but I hadn't really considered what you're suggesting here.

Thinking about it now, my initial thoughts are that having a retry wrapped in a retry makes it more complicated to follow - and one of the things I'm keen for us to improve about Snowbridge is that it's quite complicated to support, and there are lots of little details like this that are hard to follow.

I'm not sure about the Kinesis implementation, but the EventHubs target is a great example of this - there's a MaxAutoRetries config option. This is the max retry setting for the EH client itself. But that is also wrapped in a retry - the API is very confusing here. One would reasonably expect that setting to dictate how many retries are made in total.

But it isn't: the actual max number of retries in total is 5x that setting. Why 5x? Because that's what we hardcoded into the app long before we built that target. How do we expect a user, or one of our own support team, to understand this when there's some issue? Well 🤷 - we do have it in the docs, I believe, but good design is when the app does what someone would reasonably expect by intuition.

My gut says that there's a nice design to be found to make retries the responsibility of the target, like how you've described it. So if it's possible, one outcome of this spike would be to flesh that out and get a grip on how hard it is to do (and how we can do it with confidence that we don't break things). :)

    response_handler {
      success = [{ http_status: ["2**"] }]
      invalid/badrow = []
      retryable = []

      retry_strategies {
        transient {
          policy: "exponential"
          max_attempts: 5
          delay: "2 seconds"
        }
        setup {
          policy: "exponential"
          max_attempts: 10000 // very high value making it basically unlimited
          delay: "30 seconds"
        }
      }
    }
  }
}
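To make the nested-retry concern from the thread above concrete: if an outer loop (like the current hardcoded retry around sourceWriteFunc) wraps a target whose client also retries internally (like the EventHubs MaxAutoRetries setting), the totals multiply and neither setting reflects the real number of requests. A minimal, self-contained Go sketch with hypothetical names (outerAttempts, innerAttempts, send), not Snowbridge code:

```go
package main

import (
	"errors"
	"fmt"
)

// send simulates a target write whose client retries internally
// before giving up (e.g. an SDK's own max-retries setting).
func send(innerAttempts int, attempts *int) error {
	for i := 0; i < innerAttempts; i++ {
		*attempts++ // each inner retry is a real request
	}
	return errors.New("still failing")
}

func main() {
	const outerAttempts = 5 // retry loop wrapping the target (sourceWriteFunc-style)
	const innerAttempts = 3 // retries inside the target's own client

	attempts := 0
	for i := 0; i < outerAttempts; i++ {
		if err := send(innerAttempts, &attempts); err == nil {
			break
		}
	}
	// 5 outer * 3 inner = 15 requests, even though neither setting says "15".
	fmt.Println("total requests:", attempts)
}
```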
60 changes: 60 additions & 0 deletions assets/docs/configuration/targets/http_response_handler.hcl
@@ -0,0 +1,60 @@
target {
  use "http" {

    response_handler {

      // Everything is fine, we can ack our data from the source.
      // Can such configuration be useful, or is it overkill and should we commit to 2** all the time?
      success = [
        // We can have a concrete status or a wildcard using *
        { http_status: ["2**"] }
      ]

      // Data is for some reason not accepted by the target and there is no point in retrying. Just send it to the bad/failed target and unblock processing.
      invalid/badrow = [

        // First rule...
        { http_status: ["41*"] },

        // Second rule...
        {
          http_status: ["499", "424"],

          // Not only HTTP status, but also check whether the response body matches. Extract some part of the body (e.g. an error code) and check if it's equal/not equal to a value.
          body {
            path: "some jq here??"
Collaborator:
Interesting way to solve the problem. Might well work. Might be suitable for other targets too - worth figuring that out.

            equal/not_equal: "some value"
          }
        }
      ]

      // For these retryable errors (assuming we have a health check):
      // - set the app status as unhealthy for each failed retry
Collaborator:
Do the streaming loaders have this concept of being set as unhealthy after a failed retry? Are there very few retries?

Would like to understand/think through this bit together. We have an app here that makes thousands of requests per second, and depending on the target, 10-15% of them might get rejected and retried.

If we make it configurable this might not be a problem, but I would like to understand it better regardless. :)

Contributor Author:
> Do the streaming loaders have this concept of being set as unhealthy after a failed retry? Are there very few retries?

Yes, kind of. But in the loaders we do that for synchronous actions like init table, create columns, open channel. If it's a transient error or a setup error, we mark the app as unhealthy.

Here I think it's a bit different. We can have a lot of concurrent writes to the HTTP target, right? So we may end up in a situation like you described - 15% of requests are retried (some transient issue) and they mark the app as unhealthy. The other 85% are OK and they set it back to healthy.

Sooo we may have a situation where we flip the health toggle back and forth, and we are unlucky with the Kubernetes periodic liveness probe when it sees only the unhealthy state... and kills the pod, even though overall everything looks pretty good?

It would make sense to mark it unhealthy only for those persistent, setup-like issues. I could add a configuration option like toggle_health to the retry strategy and set it to true for our default setup one. Or, instead of a boolean, use a threshold like unhealthy_after, with the number of retries after which we mark the app as unhealthy...?

Collaborator:
> Here I think it's a bit different. We can have a lot of concurrent writes to the HTTP target, right? So we may end up in a situation like you described - 15% of requests are retried (some transient issue) and they mark the app as unhealthy. The other 85% are OK and they set it back to healthy.

Pretty much - what's more, we can expect to sometimes be in that state for a long time (e.g. when being rate limited on a spike of traffic), and we actually rely on getting failures in those scenarios. So something feels somewhat odd about considering individual failures to mean 'unhealthy' status.

> Sooo we may have a situation where we flip the health toggle back and forth, and we are unlucky with the Kubernetes periodic liveness probe when it sees only the unhealthy state... and kills the pod, even though overall everything looks pretty good?

I'm not sure how much of a real concern this scenario is, because we'd likely have many more successes than failures. But even still, something feels odd about the design in this scenario.

> It would make sense to mark it unhealthy only for those persistent, setup-like issues. I could add a configuration option like toggle_health to the retry strategy and set it to true for our default setup one. Or, instead of a boolean, use a threshold like unhealthy_after, with the number of retries after which we mark the app as unhealthy...?

If we're only concerned with setup issues, then one option is a global flag to mark whether any request has been successful - on any success we'd mark it as true; on failure we wouldn't mark it as false, but we would mark the app as unhealthy.

But I'm not sure about this - would you mind jotting down a quick synopsis of this scenario? (Maybe add a details section to the existing doc - but anywhere is fine.)

I feel like getting a grip on the desired behaviour for this type of app might lead us to a more confident solution. :)

      // - set the app status as healthy after a successful retry
      // - if we run out of attempts => exit/crash/restart
      // - if we don't have max attempts (e.g. for setup errors), we don't get stuck retrying indefinitely, because we are protected by the health check. If the app is unhealthy for an extended period of time, it will eventually be killed by Kubernetes.
      retryable = [
        { http_status: ["503"], strategy: "transient" },
Collaborator (@colmsnowplow, Jun 13, 2024):
What does 'strategy' mean in this config? It's 'type of error to treat this as'. Got it.

        {
          http_status: ["403"],
          strategy: "setup",
          alert: "This is an alert sent to ops1. You can configure the message for a specific error code"
        }
      ]

      // ...and the list of retry strategies; you can define as many as you want and then reference them in your 'retryable' rules.
      retry_strategies {
        transient {
          policy: "exponential"
          max_attempts: 5
          delay: "2 seconds"
        }
        setup {
          policy: "exponential"
          max_attempts: 10000
          delay: "30 seconds"
        }
      }
    }
  }
}
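As a rough illustration of how the body { path / equal } rule sketched in this config might be evaluated, here is a small Go example using the gojq library. This is an assumption about a possible implementation, not part of the PR; the bodyRule struct, its fields, and the sample jq path are hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/itchyny/gojq" // assumption: jq evaluation via gojq
)

// bodyRule is a hypothetical representation of the proposed config:
//   body { path: "...", equal: "..." }
type bodyRule struct {
	Path  string
	Equal string
}

// matches reports whether the jq expression in rule.Path, applied to the
// response body, yields a value equal to rule.Equal.
func (r bodyRule) matches(body []byte) (bool, error) {
	query, err := gojq.Parse(r.Path)
	if err != nil {
		return false, err
	}
	var parsed any
	if err := json.Unmarshal(body, &parsed); err != nil {
		return false, err
	}
	iter := query.Run(parsed)
	for {
		v, ok := iter.Next()
		if !ok {
			return false, nil
		}
		if qerr, isErr := v.(error); isErr {
			return false, qerr
		}
		if fmt.Sprint(v) == r.Equal {
			return true, nil
		}
	}
}

func main() {
	rule := bodyRule{Path: ".error.code", Equal: "INVALID_EVENT"}
	body := []byte(`{"error": {"code": "INVALID_EVENT"}}`)
	ok, err := rule.matches(body)
	fmt.Println(ok, err) // true <nil>
}
```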
30 changes: 30 additions & 0 deletions pkg/health/health_check.go
@@ -0,0 +1,30 @@
/**
* Copyright (c) 2020-present Snowplow Analytics Ltd.
* All rights reserved.
*
* This software is made available by Snowplow Analytics, Ltd.,
* under the terms of the Snowplow Limited Use License Agreement, Version 1.0
* located at https://docs.snowplow.io/limited-use-license-1.0
* BY INSTALLING, DOWNLOADING, ACCESSING, USING OR DISTRIBUTING ANY PORTION
* OF THE SOFTWARE, YOU AGREE TO THE TERMS OF SUCH LICENSE AGREEMENT.
*/

package health

import (
	"sync/atomic"
)

var isHealthy atomic.Bool

// SetHealthy marks the application as healthy.
func SetHealthy() {
	isHealthy.Store(true)
}

// SetUnhealthy marks the application as unhealthy.
func SetUnhealthy() {
	isHealthy.Store(false)
}

// IsHealthy reports whether the application is currently considered healthy.
func IsHealthy() bool {
	return isHealthy.Load()
}
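One way the unhealthy_after idea discussed in the thread above could sit on top of this health package is sketched below. The retryWithHealth helper, its parameters, and the fixed delay are hypothetical additions for illustration; a real implementation would follow the configured retry policy:

```go
// Hypothetical sketch, not part of the PR: a retry loop that only flips the
// health status after `unhealthyAfter` consecutive failures, so a handful of
// transient retries doesn't trip the Kubernetes liveness probe.
package target

import (
	"time"

	"github.com/snowplow/snowbridge/pkg/health"
)

func retryWithHealth(maxAttempts, unhealthyAfter int, delay time.Duration, send func() error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = send(); err == nil {
			health.SetHealthy() // success clears any earlier unhealthy state
			return nil
		}
		if attempt >= unhealthyAfter {
			// Only persistent, setup-like failures mark the app unhealthy.
			health.SetUnhealthy()
		}
		time.Sleep(delay) // a real implementation would apply the configured policy (e.g. exponential)
	}
	return err
}
```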
45 changes: 45 additions & 0 deletions pkg/monitoring/monitoring.go
@@ -0,0 +1,45 @@
/**
* Copyright (c) 2020-present Snowplow Analytics Ltd.
* All rights reserved.
*
* This software is made available by Snowplow Analytics, Ltd.,
* under the terms of the Snowplow Limited Use License Agreement, Version 1.0
* located at https://docs.snowplow.io/limited-use-license-1.0
* BY INSTALLING, DOWNLOADING, ACCESSING, USING OR DISTRIBUTING ANY PORTION
* OF THE SOFTWARE, YOU AGREE TO THE TERMS OF SUCH LICENSE AGREEMENT.
*/

package monitoring

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net"
	"net/http"
	"net/url"
	"time"

	"github.com/hashicorp/go-multierror"
	"github.com/pkg/errors"
	log "github.com/sirupsen/logrus"

	"github.com/snowplow/snowbridge/pkg/common"
	"github.com/snowplow/snowbridge/pkg/models"

	"golang.org/x/oauth2"
)

// Alert is the payload sent to the monitoring endpoint.
type Alert struct {
	Message string
}

// Monitoring sends alerts over HTTP to a configured monitoring endpoint.
type Monitoring struct {
	client  *http.Client
	httpURL string
}

func (m *Monitoring) SendAlert(alert Alert) {
	m.client.Do(....) // placeholder in this spike: building the request from the alert is not implemented yet
}
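The SendAlert body above is left as a stub in this spike. A minimal sketch of how it might be completed, assuming the alert is simply posted as JSON to the configured monitoring URL (the error return and status check are additions for illustration, not the PR's design):

```go
// Hypothetical completion of SendAlert, not the PR's implementation.
func (m *Monitoring) SendAlert(alert Alert) error {
	payload, err := json.Marshal(alert)
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPost, m.httpURL, bytes.NewReader(payload))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := m.client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return fmt.Errorf("monitoring alert request failed with status %d", resp.StatusCode)
	}
	return nil
}
```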