spv: delay reconnect attempts #2442
base: master
@@ -480,7 +480,9 @@ var errBreaksMinVersionTarget = errors.New("peer uses too low version to satisif

// connectAndRunPeer connects to and runs the syncing process with the specified
// peer. It blocks until the peer disconnects and logs any errors.
func (s *Syncer) connectAndRunPeer(ctx context.Context, raddr string, persistent bool) {
// connectAndRunPeer returns backoff flag indicating whether it is advised to
// wait a bit before attempting to connect to this peer again.
func (s *Syncer) connectAndRunPeer(ctx context.Context, raddr string, persistent bool) (backoff bool) {
	// Attempt connection to peer.
	rp, err := s.lp.ConnectOutbound(ctx, raddr, reqSvcs)
	if err != nil {
@@ -491,6 +493,7 @@ func (s *Syncer) connectAndRunPeer(ctx context.Context, raddr string, persistent

		s.remotesMu.Unlock()
		if !errors.Is(err, context.Canceled) {
			log.Warnf("Peering attempt failed: %v", err)
			backoff = true
		}
		return
	}
@@ -531,6 +534,7 @@ func (s *Syncer) connectAndRunPeer(ctx context.Context, raddr string, persistent

	if err != nil {
		if !errors.Is(err, context.Canceled) {
			log.Warnf("Unable to complete startup sync with peer %v: %v", raddr, err)
			backoff = true
		} else {
			log.Infof("Lost peer %v", raddr)
		}
@@ -545,6 +549,7 @@ func (s *Syncer) connectAndRunPeer(ctx context.Context, raddr string, persistent

		} else {
			log.Infof("Lost peer %v", raddr)
		}
		return
	}

func (s *Syncer) breaksMinVersionTarget(rp *p2p.RemotePeer) bool {
@@ -569,13 +574,17 @@ func (s *Syncer) breaksMinVersionTarget(rp *p2p.RemotePeer) bool {

func (s *Syncer) connectToPersistent(ctx context.Context, raddr string) error {
	for {
		s.connectAndRunPeer(ctx, raddr, true)

		// Retry persistent peer after 5 seconds.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second):
		default:
		}

		backoff := s.connectAndRunPeer(ctx, raddr, true)
		if backoff {
			// Delay next connect attempt to save resources and not overwhelm peers,
			// it's unlikely to succeed anyway if we do retry immediately.
			time.Sleep(5 * time.Second)
Comment on lines +580 to +587:
Reviewer: This was already sleeping 5 seconds. The change would make it spin more? It is also blocking shutdown here. I don't think anything needs to change here.

Author: I believe it was sleeping in some previous code version, then it got updated/rewritten, and I'm essentially adding this sleep back here. Shutdown on context cancellation? In that case, the general idea is: we want to back off when the peering attempt or the startup sync fails with an error other than context cancellation, and we don't want to back off when the context is canceled (shutdown) or the peer is simply lost. It might be overcomplicating things and I might not have implemented it 100% correctly, but I don't see obvious mistakes with it; perhaps you can point those out.

Reviewer: These are persistent peers, which I don't guess many people set. I think it can stay simple. It won't spin up the CPU with the sleep that was already there. The sleep and what was already there do the same thing, except one does not block shutdown. Not blocking shutdown is the correct thing to do here.
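As an illustration of the non-blocking delay being discussed, here is a minimal sketch; the helper name sleepCtx is invented for this example and is not part of the wallet code. Waiting on a select that also observes ctx.Done() gives the same 5-second pause as time.Sleep without ever holding up shutdown.

```go
package example // illustration only

import (
	"context"
	"time"
)

// sleepCtx waits for d, but returns early (false) when ctx is canceled,
// so a retry delay never blocks shutdown the way time.Sleep would.
func sleepCtx(ctx context.Context, d time.Duration) bool {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-ctx.Done():
		return false // canceled: caller should stop retrying
	case <-t.C:
		return true // delay elapsed: caller may retry
	}
}
```

The persistent-peer loop could then call sleepCtx only when backoff is true and return ctx.Err() as soon as it reports cancellation.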
Author: I don't disagree with that; I've adjusted the way [...]. We could just ignore [...]. We could also just remove [...].
		}
	}
}
@@ -607,9 +616,14 @@ func (s *Syncer) connectToCandidates(ctx context.Context) error {

		wg.Add(1)
		go func() {
			defer wg.Done()
			raddr := na.String()
			s.connectAndRunPeer(ctx, raddr, false)
			wg.Done()
			backoff := s.connectAndRunPeer(ctx, raddr, false)
			if backoff {
				// Delay next connect attempt to save resources and not overwhelm peers,
				// it's unlikely to succeed anyway if we do retry immediately.
				time.Sleep(5 * time.Second)
			}
Comment on lines +621 to +626:
Reviewer: This is blocking. If you want to sleep you need to use a pattern like the select above.
Reviewer: I think a better way to solve this is to detect that we are not connected and pause this loop until we think we are connected. Unsure how to go about that, though. Maybe a combination of sniffing out the error [...].

Reviewer: Yeah, maybe what you are doing is best. If you could try to sniff out that particular error and change the sleeps, I think this is OK. Of course @jrick may have better ideas.
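To illustrate the "sniff out that particular error" idea, a hedged sketch follows; the concrete error types that ConnectOutbound surfaces are an assumption here, not something stated in this PR. It simply classifies dial/DNS-level failures, which is the kind of error a disconnected machine would see, so the caller could apply the longer pause only in that case.

```go
package example // illustration only

import (
	"errors"
	"net"
)

// isLikelyOffline reports whether err looks like a local connectivity
// failure (dial or DNS error) rather than a peer or protocol problem.
// Assumption: lower-level dial failures are wrapped so errors.As can
// find *net.OpError or *net.DNSError in the chain.
func isLikelyOffline(err error) bool {
	var opErr *net.OpError
	if errors.As(err, &opErr) {
		return true
	}
	var dnsErr *net.DNSError
	return errors.As(err, &dnsErr)
}
```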
Author: Yeah, essentially it would be best to classify all the errors that can happen in [...].

Second reviewer: It seems to me like this should mirror the backoff functionality for permanent (persistent) peers in dcrd, which is an actual backoff versus a static retry every 5 seconds. The code structure is obviously a bit different here, but the key point is that it keeps a retry count, increases the backoff timeout by the configured number of seconds (5 seconds here) times the retry count, and clamps it to a maximum so it doesn't grow indefinitely (that max is 5 minutes in dcrd). In other words, it'll sleep 5 seconds, then 10 seconds, then 15 seconds, and continue increasing by 5 seconds all the way to 5 minutes.

Second reviewer: Another thing that dcrd does is keep a global failed-attempts counter for non-persistent peers that is reset to 0 any time there is a successful connection. Whenever that global counter reaches or exceeds a maximum number of failed attempts (dcrd uses 25), it imposes a static delay of the retry duration (5 seconds) on all non-persistent connection attempts too. That ensures that when the network goes down, it doesn't go nuts trying to connect thousands of times per second due to the connection attempts immediately failing.
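A rough sketch of the dcrd-style behavior described above; the names (retryDuration, maxRetryDelay, maxFailedAttempts) and the structure are assumptions for illustration, since none of this is implemented in the PR yet.

```go
package example // illustration only

import "time"

const (
	retryDuration     = 5 * time.Second // base delay between attempts
	maxRetryDelay     = 5 * time.Minute // clamp so the delay stops growing
	maxFailedAttempts = 25              // global threshold for non-persistent peers
)

// persistentRetryDelay grows by retryDuration for every consecutive
// failure and is clamped at maxRetryDelay: 5s, 10s, 15s, ... up to 5m.
func persistentRetryDelay(retries int) time.Duration {
	d := time.Duration(retries) * retryDuration
	if d > maxRetryDelay {
		d = maxRetryDelay
	}
	return d
}

// failedAttempts sketches the global counter for non-persistent peers:
// it resets on any successful connection, and once it reaches
// maxFailedAttempts every non-persistent attempt gets a static
// retryDuration delay. (Not synchronized; the real syncer would need
// locking or an atomic counter.)
type failedAttempts struct {
	n int
}

// onResult records the outcome of a connection attempt and returns the
// delay to impose before the next non-persistent attempt.
func (f *failedAttempts) onResult(connected bool) time.Duration {
	if connected {
		f.n = 0
		return 0
	}
	f.n++
	if f.n >= maxFailedAttempts {
		return retryDuration
	}
	return 0
}
```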
Author: Yep, essentially this protects us from overwhelming our peer(s), so that they don't ban us (or just get too many connections/requests, which slows them)? And yes, if the timeouts keep increasing, at some point we want to reset them so we don't starve ourselves of connections for too long. I can certainly implement that, but it's probably better to get some feedback on this from @jrick first, to see if I implemented the [...]
			<-sem
		}()
	}
Author: Essentially, the backoff flag will help the caller decide whether to retry immediately or wait (it seems there are cases for both).
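A small usage sketch of that caller-side decision, combining the backoff flag with the context-aware wait suggested in the review; retryLoop and its connect parameter are invented names for this illustration, not code from the PR.

```go
package example // illustration only

import (
	"context"
	"time"
)

// retryLoop keeps calling connect; when connect asks for a backoff it
// waits 5 seconds, but the wait is abandoned as soon as ctx is canceled.
func retryLoop(ctx context.Context, connect func(context.Context) (backoff bool)) error {
	for {
		if connect(ctx) {
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(5 * time.Second):
			}
		}
		// Stop once the context is canceled, even without a backoff.
		if err := ctx.Err(); err != nil {
			return err
		}
	}
}
```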