Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add backoff strategy interfaces and implementations #252

Closed

Conversation

ViToni
Copy link
Contributor

@ViToni ViToni commented Apr 19, 2024

This PR introduces the ability to use a backoff strategy for reconnect attempts.

As default a backoff strategy is used whichs return the same backoff value mimicking the previous behaviour.
In addition there is an exponential backoff available which will increase the max value on each attempt up to a configurable limit.
The created backoff is a random value between a configured non-zero minimum and the increasing max (up to the configured limit). Using a random value from an increasing range helps spread client reconnect attempts.

Testing

Existing code has been changed to use the same duration as before but now uses the constant backoff implementation instead. All tests pass.

Additionally there are dedicated tests for:

  • Constant backoff
  • Exponential backoff

Closing issues

closes #249 `

@ViToni ViToni force-pushed the feature/add_backoff_strategy branch from 9284b81 to fda2b50 Compare April 20, 2024 17:37
Copy link
Contributor

@MattBrittan MattBrittan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry - I'm suggesting a big change here to deal with an issue that is completly missed by the current ConnectRetryDelay . Please feel free to say "no"!

ConnectRetryDelay time.Duration // How long to wait between connection attempts (defaults to 10s)
ConnectTimeout time.Duration // How long to wait for the connection process to complete (defaults to 10s)
WebSocketCfg *WebSocketConfig // Enables customisation of the websocket connection
ReconnectBackoffStrategy BackoffStrategy // How long to wait between connection attempts (defaults to 10s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing ConnectRetryDelay makes this a breaking change. I wonder if it's best to keep ConnectRetryDelay (perhaps mark as depreciated) and use it (with a NewConstantBackoffStrategy) if ReconnectBackoffStrategyisnil`.

Copy link
Contributor Author

@ViToni ViToni Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to have only one way to support backoffs to avoid double code. Using the new interface would mean a one-time effort on the consumer side. But it's your call.
(Acutally I have one more idea for a breaking change: #253)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we are still pre-V1 a breaking change like this is possible; however my preference would be to make as many breaking changes as possible in one release (there are a few options that have changed and are now marked as depreciated). I'm happy either way as I we are getting fairly close to V1, but if you do make the breaking change then I'll hold off making a release until the other breaking changes are made (think it's best to put as many of these in one release as possible so users don't need to update their code with every release).

autopaho/auto.go Outdated Show resolved Hide resolved
@@ -0,0 +1,51 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this adds much value? I'm just trying to think what a user would find relevant; they would expect an example like this to show how to use the backup strategy with Autopaho rather than this (which might be more appropriate as a test?)

Copy link
Contributor Author

@ViToni ViToni May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it helped (me) visualize the range of the ExponentialBackoffStrategy. There was a tiny bug while refactoring, which would have been difficult to catch due to the randomness of values but was easily spotable using this file.

@@ -106,7 +107,7 @@ func establishServerConnection(ctx context.Context, cfg ClientConfig, firstConne

// Delay before attempting another connection
select {
case <-time.After(cfg.ConnectRetryDelay):
case <-time.After(backoff.Next()):
Copy link
Contributor

@MattBrittan MattBrittan Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One potential issue here (not handled by the current solution either) is that the broker might reject the CONNECT packet meaning that establishServerConnection is called rapidly. I wonder if we can address this issue at the same time - see comments below.


// The BackoffStrategy is a container for the configuration of a given strategy whereas the returned Backoff is the actual implementation
// As such BackoffStrategy instances are supposed to be thread safe and freely sharable between users.
type BackoffStrategy interface {
Copy link
Contributor

@MattBrittan MattBrittan Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm suggesting fairly major changes to handle an issue you were not aware of so feel free to tell me to get lost!

Consider the following possibility:

  1. Establish connection and send CONNECT (so establishServerConnection has completed)
  2. Broker rejects CONNECT (say bad password) so loop back to step 1

This can result in us hammering the broker with CONNECT packets (not a good look!). It would be nice if we could solve this issue at the same time as dealing with the backoff on the network level connection.

I've had a quick think about options and wonder if the following might work (and provide a solution that's a bit simpler than your approach):

ConnectionBackoff func(ctx Context.Context, establishConnection func(ctx Context.Context) error) error // Passed function that will establish network connection and calls it until a `nil` error is returned or the context is cancelled.

Then in establishServerConnection you would do something like:

var cli *paho.Client,
var connack *paho.Connack)

err := cfg.ConnectionBackoff( ctx, func(ctx Context.Context) error {
     for _, u := range cfg.ServerUrls {
		connectionCtx, cancelConnCtx := context.WithTimeout(ctx, cfg.ConnectTimeout)
....
    }
}
if err != nil { // this would only happen when context cancelled

}
return cli, connack

The advantage of this is that it means that the ConnectionBackoff has complete control of the process (including resetting, or not, between connections). ConstantBackoff could be very simple (pseudo code!):

type ConstantBackoff struct {
	Delay       time.Duration
	lastAttempt time.Time
}

func (cb *ConstantBackoff) Run(ctx context.Context, establishConnection func(ctx Context.Context) error) {
	// The previous connection attempt may have been very recent; we want at least delay between attempts
	select {
	case time.After(cb.Delay - time.Since(cb.lastAttempt)): // Might be negative; i.e. no delay
	case <-ctx.Done():
		return ctx.Err()
	}

	for {
		err := establishConnection(ctx)
                cb.lastAttempt = time.Now()
		if err == nil || ctx.Err() != nil {
			return err
		}
		select {
		case <-time.After(cb.Delay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// This would be used as follows:
cb := ConstantBackoff{Delay: 10 * time.Second}
pahoCfg.ConnectionBackoff = cb.Run

Other backoffs could be a bit more complex (and take into account whatever factors are relevant).

Copy link
Contributor Author

@ViToni ViToni Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My approach was to have a simple implementation. Therefore one can re-use / share the BackoffStrategy while the actual Backoff is "throwaway". It doesn't need a "reset" or anything fancy.
In my experience these mechanisms, where the server could potentially say: "come back in one hour", sound nice but are not worth the hassle. Often the "one hour" is either to early or to late.
Having a rate limiter on the server side and an exponential backoff on the client side has worked for me with planned and unplanned outages in the past. But that's only my use case... so your mileage might vary.

One aspect which is not covered by the CONNECT approach is mentiond in #253. One needs to get a token first before receiving a CONNECT and a fallback in case the server is just gone and doesn't return a CONNECT, the token provider endpoint should be taken care of, too.

Either way, IMO something like taking care of CONNECT should have a dedicated PR and discussion (as it seems to me way more complex than the backoff itself).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO something like taking care of CONNECT should have a dedicated PR and discussion (as it seems to me way more complex than the backoff itself).

Sorry - I made an error here (the issue with working on multiple clients concurrently!). In this client establishServerConnection only returns after the CONNACK is received. As such this is not the issue I thought it was (and your current strategy would work).

Note that the reason I raised this (and want to put some time into considering this change) is due to experience with the V3 client where we hit issues because the options became very complicated, so I'm careful with new options in this project (once an option is added it's very hard to remove!).

where the server could potentially say: "come back in one hour",

The one thing that may be a little bit like the scenario you mention ("come back later") is duplicate client id connect loops. This is very common for users new to the protocol where they have two clients with the same client ID and end up with a connect loop (two connections "fighting"). With v3.1 this was difficult to detect (the server just dropped the connection) but with V5 a specific Disconnect would be sent (Reason Code 0x8E - Session taken over) so there is a potential that we could handle it in the future. I only mention this as being able to prevent connection storms when this is happening would be beneficial longer term (so would be nice to have the option of adding support for this).

My approach was to have a simple implementation. Therefore one can re-use / share the BackoffStrategy while the actual Backoff is "throwaway". It doesn't need a "reset" of anything fancy.

I understand that; however part of my reaction here might have been because your code does look a bit Java'ish. I avoided this in the last set of comments because it's highly subjective; but as you specifically asked "So if I hit any no-Go-s" here are a few thoughts (I'll add a few more comments in the code):

  • The structure made me think of the Java factory pattern; obviously this pattern is used in Go but it's a lot less common than in Java.
  • Ive not seen use of underscore prefixes (i.e. _Duration) other than in unexported globals. I'd expect this variable to be called delay or similar (with the lower case d indicating it's private).

My suggestion would be to simplify this to (using the Reset option you felt complicated things):

// backoff represents a configured instance of the strategy which computes the next backoff duration.
type backoff interface {
        Reset() 
	NextBackOff() time.Duration
}

There are a few reasons for this:

It's simple.

It's compatible with cenkalti/backoff which is the most well known back-off library I'm aware of (so would provide a range of options users could simply plug in). Note: I would prefer to avoid a dependency so the basic strategies would be implemented in this library (which you have done).

It allows authors to consider previous state (e.g. never attempt to connect twice in two minutes) which is doable, but more difficult, with the current structure.

We can extend it in the future without breaking users code. For example to handle the connection fight situation mentioned above we could add:

type backoffWithDisconnected interface {
     backoff 
    Disconnected(*package.Disconnect) // called when Disconnect packet received
}

Then, when a disconnect packet is received, we would do something like:

if on, dwd := cfg.Backoff.(backoffWithDisconnected) ; ok {
   dwd.Disconnected(pkt)
}

This is just an off the cuff thought; so treat as pseudo code but aims to show why this provided more flexibility.

Note: Once more I'm not saying that this is the right approach but do want to be careful with this because once an interface is selected I think we need to consider it locked in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that; however part of my reaction here might have been because your code does look a bit Java'ish.

You are totally right. Funny how it shows. I've been doing Java for many years and as you correctly guessed this code originates from Java. In Java it was very simple:

public interface BackoffStrategy {
    Iterator<Duration> backoff();
}

As the custom Iterator was a nested class in the implementation it could easily/safely resuse final fields from the outer class.

But somehow I was not able to implement something similar simple in Go.

Copy link
Contributor

@MattBrittan MattBrittan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few of those style/possible java influences I mentioned.


// The ConstantBackoffStrategy implements the BackoffStrategy interface and provides instances a constant duration.
type ConstantBackoffStrategy struct {
BackoffStrategy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why embed BackoffStrategy? I suspect this may be a fallback to Java where you would need to declare what interface you are implementing but it's not needed here (and is harmful because it would prevent the compiler from warning you if your implementation of GetBackoff() has the wrong signature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks for pointintg this out. I wasn't aware this would be actuall harmful. This was definitely my Java way of thinking...


// The ConstantBackoff implements the Backoff interface and provides a constant duration.
type ConstantBackoff struct {
Backoff
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to embed the interface here. Doing this will prevent the compiler from identifying issues (and could lead to a runtime crash see here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again thanks for pointing it out (and for the example code showing the issue in action)

// The ConstantBackoff implements the Backoff interface and provides a constant duration.
type ConstantBackoff struct {
Backoff
_Duration time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An underscore prefix is not a go standard; just use lower case for the first letter (looks a bit weird for a while but you get used to it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing it out. Seems I didn't get far with reading on Go style. I read only about the globals / export part.
Will try to make it be more natural Go.

@ViToni ViToni force-pushed the feature/add_backoff_strategy branch from fda2b50 to 0100371 Compare May 7, 2024 16:46
@ViToni
Copy link
Contributor Author

ViToni commented May 7, 2024

@MattBrittan The PR now has been updated. Hopefully it looks more Go-ish now.
I'd be more than happy for any feedback.

Things covered in this update:

  • backwards compatibility and fallback to ConnectRetryDelay if it is set and no strategy was configured
  • changes to the code making it (hopefully) simpler and less Java-ish

@ViToni ViToni force-pushed the feature/add_backoff_strategy branch 2 times, most recently from ac3a681 to 5230ed3 Compare May 8, 2024 07:43
@MattBrittan
Copy link
Contributor

MattBrittan commented May 8, 2024

@MattBrittan The PR now has been updated. Hopefully it looks more Go-ish now. I'd be more than happy for any feedback.

Thanks - I'll try to have a look next week (sorry, have limited time to spend on this currently).
Update 20th May - still struggling for time; took a quick look but really need to pull the repo and work through it in detail. I am really keen to see this address the situation where the broker drops the connection soon after it comes up (as this is quite a common issue).

@ViToni
Copy link
Contributor Author

ViToni commented Aug 11, 2024

Superseded by #258.

@ViToni ViToni closed this Aug 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support exponential backoff for reconnects
2 participants