Possible relay discovery defect #2676

Open · mkermani144 opened this issue Aug 31, 2024 · 5 comments

mkermani144 commented Aug 31, 2024

  • Version: 1.9.2
  • Platform: Darwin / Linux
  • Subsystem: Circuit relay transport

Severity: High

Description:

I have a p2p network consisting of relay and non-relay nodes. The non-relay nodes are configured to discover 3 relays; I have tried this both with and without reservation concurrency configured. For some reason (connection issues or anything else; the exact cause is irrelevant to the main issue), connections between nodes and relays break, and after a while a node ends up disconnected from all of its relays.

After checking the logs and digging into the implementation of the circuit relay transport, I found that there is an already-implemented relay discovery mechanism which, in simple terms, works like this:

  1. When a relay disconnects, check the number of connected relays against discoverRelays
  2. If not enough relays are connected, start the discovery process and grab the discovery lock by setting a running flag on the RelayDiscovery instance
  3. After discovering relays, try to connect to them until we reach discoverRelays
  4. If enough relays are connected, release the lock so that discovery can run again on the next relay disconnection

I think there is a critical issue here. If, for any reason, discovery doesn't find enough relays, or we cannot connect to the ones it does find, the relay count stays below discoverRelays while discovery remains locked and cannot be run again. Over time, the relays disconnect one by one and we never reconnect to them.
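
To make the failure mode concrete, here is a toy model of the flag-based locking described above. This is a paraphrase for illustration only, not the actual RelayDiscovery code; all of the names are hypothetical.

```ts
// Toy model of the flag-based discovery lock described above.
// Hypothetical names; not the actual RelayDiscovery implementation.
class RelayDiscoveryModel {
  private running = false

  constructor (
    private readonly discoverRelays: number,
    private readonly connectedRelays: () => number,
    private readonly findAndConnectRelays: () => Promise<void>
  ) {}

  // Called whenever a relay disconnects
  async onRelayDisconnect (): Promise<void> {
    if (this.connectedRelays() >= this.discoverRelays) {
      return
    }

    if (this.running) {
      // Discovery is "locked", so nothing happens here
      return
    }

    this.running = true
    await this.findAndConnectRelays()

    // The lock is only released once enough relays are connected. If the run
    // above ended below discoverRelays, the flag stays set and no future
    // disconnection can trigger another run.
    if (this.connectedRelays() >= this.discoverRelays) {
      this.running = false
    }
  }
}
```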

As an example:

  1. Suppose discoverRelays is 3
  2. We connect to 3 relays
  3. We disconnect from 2 of them for some reason
  4. Discovery starts and finds 4 new relays
  5. For some reason, we only connect to 1 of the 4, leaving 1 new + 1 old connected relay, which is below the threshold of 3, so discovery won't be run again
  6. The process repeats until no relays are connected at all

I wonder why other possible solutions are not implemented, for example changing the condition under which the lock is released, or running discovery periodically.
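
As a rough sketch of the second suggestion, a periodic re-check could look something like the following. The names (connectedRelayCount, startDiscovery, and so on) are placeholders for the transport internals, not existing APIs.

```ts
// Hypothetical periodic re-check: if we are below the target relay count and
// no discovery run is in flight, start another one. Names are illustrative.
function startPeriodicRelayCheck (opts: {
  discoverRelays: number
  connectedRelayCount: () => number
  isDiscoveryRunning: () => boolean
  startDiscovery: () => Promise<void>
  intervalMs: number
}): () => void {
  const timer = setInterval(() => {
    if (opts.connectedRelayCount() < opts.discoverRelays && !opts.isDiscoveryRunning()) {
      opts.startDiscovery().catch(() => {
        // swallow errors here; the next tick will simply try again
      })
    }
  }, opts.intervalMs)

  // returns a function that stops the periodic check
  return () => clearInterval(timer)
}
```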

Steps to reproduce the error:

The scenario is described above. There are some other configuration options that may (or may not) affect it, though; for example, auto dial should be disabled.
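
For reference, the kind of non-relay node configuration involved looks roughly like this. The option names (discoverRelays, minConnections) are as I recall them from the js-libp2p 1.x line, so treat the exact shape as an assumption to check against your version rather than a verified reproduction recipe.

```ts
import { createLibp2p } from 'libp2p'
import { tcp } from '@libp2p/tcp'
import { noise } from '@chainsafe/libp2p-noise'
import { yamux } from '@chainsafe/libp2p-yamux'
import { identify } from '@libp2p/identify'
import { circuitRelayTransport } from '@libp2p/circuit-relay-v2'

// Rough sketch of a non-relay node for this scenario (js-libp2p 1.x era options)
const node = await createLibp2p({
  transports: [
    tcp(),
    // ask the transport to discover and keep reservations on 3 relays
    circuitRelayTransport({ discoverRelays: 3 })
  ],
  connectionEncryption: [noise()],
  streamMuxers: [yamux()],
  services: {
    identify: identify()
  },
  connectionManager: {
    // effectively disables auto dial, as mentioned above
    minConnections: 0
  }
})
```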

@mkermani144 mkermani144 added the need/triage Needs initial labeling and prioritization label Aug 31, 2024
@mkermani144 (Author)

I found out that the random walk part of the discovery is essentially an infinite async iterable (until the relevant signal is aborted), so it is not exactly the case I described in my previous comment. By the way, shouldn't we add peer routing to the list of the transport's dependencies when discoverRelays > 0, as is done for identify? In addition, the peer store search part of the discovery is still only run once, which I think may cause problems, although that situation is probably quite rare.

@achingbrain (Member)

> Over time, the relays disconnect one by one and we never reconnect to them.

> the random walk part of the discovery is essentially an infinite async iterable

Yes, so it should continue to discover peers until it eventually finds enough relays to bring the number with reservations up to discoverRelays.

It sounds like you're seeing something different?

> By the way, shouldn't we add peer routing to the list of the transport's dependencies when discoverRelays > 0, as is done for identify?

I think that would make it easier for users to discover misconfigurations, yes.

> In addition, the peer store search part of the discovery is still only run once, which I think may cause problems

Could you expand on this a little?

@achingbrain achingbrain added need/author-input Needs input from the original author and removed need/triage Needs initial labeling and prioritization labels Sep 3, 2024
@mkermani144 (Author)

> It sounds like you're seeing something different?

What I'm saying here is that, in my opinion, it's wrong to assume the random walk always succeeds. If any error is thrown inside the random walk, peer routing, or any related code, we get a "failed when finding relays on the network" log and the whole relay discovery process stops forever, with no mechanism to restart it automatically. Or maybe I'm wrong about that?
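
To illustrate the missing mechanism, a retry wrapper around a discovery routine could look like the sketch below. This is purely illustrative; the transport does not expose such a hook, and discover and the delay are placeholders.

```ts
// Illustrative retry wrapper: keep re-running an async discovery routine when
// it throws, instead of letting a single failure end discovery forever.
async function discoverWithRetry (
  discover: (signal: AbortSignal) => Promise<void>,
  signal: AbortSignal,
  retryDelayMs = 10_000
): Promise<void> {
  while (!signal.aborted) {
    try {
      await discover(signal)
      return
    } catch {
      // e.g. the "failed when finding relays on the network" case:
      // wait a bit and try again instead of giving up
      await new Promise<void>(resolve => setTimeout(resolve, retryDelayMs))
    }
  }
}
```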

> I think that would make it easier for users to discover misconfigurations, yes.

If it's as simple as adding it to the transport's serviceDependencies symbol, I can open a PR.
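
A hypothetical shape of that change, based on the serviceDependencies pattern mentioned above; the capability string for peer routing and the conditional are assumptions on my part, not the transport's actual code:

```ts
import { serviceDependencies } from '@libp2p/interface'

// Hypothetical sketch: declare a peer-routing capability as a dependency
// (in addition to identify) whenever relay discovery is enabled, so libp2p
// can flag the misconfiguration at startup. Not the real transport class.
class CircuitRelayTransportSketch {
  constructor (private readonly discoverRelays: number) {}

  get [serviceDependencies] (): string[] {
    const deps = ['@libp2p/identify']

    if (this.discoverRelays > 0) {
      deps.push('@libp2p/peer-routing')
    }

    return deps
  }
}
```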

> Could you expand on this a little?

Based on what I understood from the relay discovery code, it has two parts: it first tries to find some relays in the peer store, and then starts the random walk. Setting my answer to the first question aside and supposing the random walk always succeeds, the peer store search is still only run once. It is neither an infinite iterable like the random walk, nor is it re-run through some other mechanism. We may not be able to connect to the peer store relays at the moment of disconnection for some reason (say, a simple network partition), but we might be able to connect to them a moment later if we tried. In other words, there may be no need to run the random walk for a long time at all: we simply need to give the peer store another chance.
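
As a sketch of "giving the peer store another chance" from application code, using the public peer store API; the hop protocol string is the circuit relay v2 one, but the filtering and dial loop are my own illustration, not the transport's internal search:

```ts
import type { Libp2p } from 'libp2p'

// Relay v2 hop protocol advertised by relay servers
const HOP_PROTOCOL = '/libp2p/circuit/relay/0.2.0/hop'

// Sketch: re-query the peer store for peers that have advertised the hop
// protocol and try dialling them again. Illustrative only.
async function redialKnownRelays (node: Libp2p, limit: number): Promise<void> {
  const peers = await node.peerStore.all({
    filters: [(peer) => peer.protocols.includes(HOP_PROTOCOL)]
  })

  for (const peer of peers.slice(0, limit)) {
    try {
      await node.dial(peer.id)
    } catch {
      // ignore dial failures; another attempt can be made later
    }
  }
}
```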

zargarzadehm pushed a commit to rosen-bridge/rosenet that referenced this issue Sep 8, 2024
Because no peer router is configured for RoseNet, we should restart the
relay discovery service manually so that it is not halted forever. More
details can be found here:
libp2p/js-libp2p#2676
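
For what it's worth, one application-level way to do such a manual restart is to watch for disconnects and re-dial known relays, for example by combining libp2p's peer:disconnect event with a helper like the redialKnownRelays sketch above. This is an assumption about the general approach, not a description of RoseNet's actual code, and counting all connections is a simplification compared to counting relay reservations.

```ts
// Sketch: application-level watchdog that re-dials known relays whenever a
// peer disconnects and the connection count drops below the desired number.
// Reuses the hypothetical redialKnownRelays helper sketched earlier.
function watchRelayConnections (node: Libp2p, discoverRelays: number): void {
  node.addEventListener('peer:disconnect', () => {
    if (node.getConnections().length < discoverRelays) {
      void redialKnownRelays(node, discoverRelays)
    }
  })
}
```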


@mkermani144 (Author)

My previous comment contains the answers to @achingbrain's questions.

@achingbrain achingbrain removed the kind/stale and need/author-input Needs input from the original author labels Sep 10, 2024