Possible relay discovery defect #2676

Open · mkermani144 opened this issue Aug 31, 2024 · 5 comments

mkermani144 commented Aug 31, 2024

  • Version: 1.9.2
  • Platform: Darwin / Linux
  • Subsystem: Circuit relay transport

Severity: High

Description:

I have a p2p network consisting of relay and non-relay nodes. The non-relay nodes are configured to discover 3 relays; I have tried this both with and without reservation concurrency configured. For some reason (connection issues or anything else; the exact cause is irrelevant to the main issue), connections between nodes and relays break, and after a while a node ends up disconnected from all of its relays.

After checking the logs and digging into the implementation of the circuit relay transport, I found that there is an already-implemented relay discovery mechanism which, in simple terms, works like this:

  1. When a relay disconnects, check the number of connected relays against discoverRelays
  2. If not enough relays are connected, start the discovery process and grab the discovery lock by setting a running flag on the RelayDiscovery instance
  3. After discovering relays, try to connect to them until we reach discoverRelays
  4. If enough relays are connected, release the lock so that discovery can run again on the next relay disconnection

I think there is a critical issue here. If, for any reason, discovery doesn't find enough relays, or we cannot connect to the ones it does find, the relay count stays below discoverRelays while discovery remains locked and cannot be run again. Over time, the relays disconnect one by one and we never reconnect to them.
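
To make the failure mode concrete, here is a toy model of the flag-based locking described above. This is a paraphrase for illustration only, not the actual RelayDiscovery code; all of the names are hypothetical.

```ts
// Toy model of the flag-based discovery lock described above.
// Hypothetical names; not the actual RelayDiscovery implementation.
class RelayDiscoveryModel {
  private running = false

  constructor (
    private readonly discoverRelays: number,
    private readonly connectedRelays: () => number,
    private readonly findAndConnectRelays: () => Promise<void>
  ) {}

  // Called whenever a relay disconnects
  async onRelayDisconnect (): Promise<void> {
    if (this.connectedRelays() >= this.discoverRelays) {
      return
    }

    if (this.running) {
      // Discovery is "locked", so nothing happens here
      return
    }

    this.running = true
    await this.findAndConnectRelays()

    // The lock is only released once enough relays are connected. If the run
    // above ended below discoverRelays, the flag stays set and no future
    // disconnection can trigger another run.
    if (this.connectedRelays() >= this.discoverRelays) {
      this.running = false
    }
  }
}
```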

As an example:

  1. Suppose discoverRelays is 3
  2. We connect to 3 relays
  3. We disconnect from 2 of them for some reason
  4. Discovery starts and finds 4 new relays
  5. For some reason, we only connect to 1 of the 4, leaving 1 new + 1 old connected relay, which is below the threshold of 3, so discovery won't be run again
  6. The process repeats until no relays are connected at all

I wonder why other possible solutions are not implemented, for example changing the condition under which the lock is released, or running discovery periodically.
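
As a rough sketch of the second suggestion, a periodic re-check could look something like the following. The names (connectedRelayCount, startDiscovery, and so on) are placeholders for the transport internals, not existing APIs.

```ts
// Hypothetical periodic re-check: if we are below the target relay count and
// no discovery run is in flight, start another one. Names are illustrative.
function startPeriodicRelayCheck (opts: {
  discoverRelays: number
  connectedRelayCount: () => number
  isDiscoveryRunning: () => boolean
  startDiscovery: () => Promise<void>
  intervalMs: number
}): () => void {
  const timer = setInterval(() => {
    if (opts.connectedRelayCount() < opts.discoverRelays && !opts.isDiscoveryRunning()) {
      opts.startDiscovery().catch(() => {
        // swallow errors here; the next tick will simply try again
      })
    }
  }, opts.intervalMs)

  // returns a function that stops the periodic check
  return () => clearInterval(timer)
}
```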

Steps to reproduce the error:

The scenario is described above. There are some other configuration options that may (or may not) affect it, though; for example, auto dial should be disabled.
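
For reference, the kind of non-relay node configuration involved looks roughly like this. The option names (discoverRelays, minConnections) are as I recall them from the js-libp2p 1.x line, so treat the exact shape as an assumption to check against your version rather than a verified reproduction recipe.

```ts
import { createLibp2p } from 'libp2p'
import { tcp } from '@libp2p/tcp'
import { noise } from '@chainsafe/libp2p-noise'
import { yamux } from '@chainsafe/libp2p-yamux'
import { identify } from '@libp2p/identify'
import { circuitRelayTransport } from '@libp2p/circuit-relay-v2'

// Rough sketch of a non-relay node for this scenario (js-libp2p 1.x era options)
const node = await createLibp2p({
  transports: [
    tcp(),
    // ask the transport to discover and keep reservations on 3 relays
    circuitRelayTransport({ discoverRelays: 3 })
  ],
  connectionEncryption: [noise()],
  streamMuxers: [yamux()],
  services: {
    identify: identify()
  },
  connectionManager: {
    // effectively disables auto dial, as mentioned above
    minConnections: 0
  }
})
```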

@mkermani144 mkermani144 added the need/triage Needs initial labeling and prioritization label Aug 31, 2024
@mkermani144 (Author)

I found out that the random walk part of the discovery is essentially an infinite async iterable (until the relevant signal is aborted), so it is not exactly the case I described in my previous comment. By the way, shouldn't we add peer routing to the list of the transport's dependencies when discoverRelays > 0, as is done for identify? In addition, the peer store search part of the discovery is still only run once, which I think may cause problems, although that situation is probably quite rare.

@achingbrain (Member)

> Over time, the relays disconnect one by one and we never reconnect to them.

> the random walk part of the discovery is essentially an infinite async iterable

Yes, so it should continue to discover peers until it eventually finds enough relays to bring the number with reservations up to discoverRelays.

It sounds like you're seeing something different?

> By the way, shouldn't we add peer routing to the list of the transport's dependencies when discoverRelays > 0, as is done for identify?

I think that would make it easier for users to discover misconfigurations, yes.

> In addition, the peer store search part of the discovery is still only run once, which I think may cause problems

Could you expand on this a little?

@achingbrain achingbrain added need/author-input Needs input from the original author and removed need/triage Needs initial labeling and prioritization labels Sep 3, 2024
@mkermani144 (Author)

> It sounds like you're seeing something different?

What I'm saying here is that, in my opinion, it's wrong to assume the random walk always succeeds. If any error is thrown inside the random walk, peer routing, or any related code, we get a "failed when finding relays on the network" log and the whole relay discovery process stops forever, with no mechanism to restart it automatically. Or maybe I'm wrong about that?
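
To illustrate the missing mechanism, a retry wrapper around a discovery routine could look like the sketch below. This is purely illustrative; the transport does not expose such a hook, and discover and the delay are placeholders.

```ts
// Illustrative retry wrapper: keep re-running an async discovery routine when
// it throws, instead of letting a single failure end discovery forever.
async function discoverWithRetry (
  discover: (signal: AbortSignal) => Promise<void>,
  signal: AbortSignal,
  retryDelayMs = 10_000
): Promise<void> {
  while (!signal.aborted) {
    try {
      await discover(signal)
      return
    } catch {
      // e.g. the "failed when finding relays on the network" case:
      // wait a bit and try again instead of giving up
      await new Promise<void>(resolve => setTimeout(resolve, retryDelayMs))
    }
  }
}
```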

> I think that would make it easier for users to discover misconfigurations, yes.

If it's as simple as adding it to the transport's serviceDependencies symbol, I can open a PR.
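
A hypothetical shape of that change, based on the serviceDependencies pattern mentioned above; the capability string for peer routing and the conditional are assumptions on my part, not the transport's actual code:

```ts
import { serviceDependencies } from '@libp2p/interface'

// Hypothetical sketch: declare a peer-routing capability as a dependency
// (in addition to identify) whenever relay discovery is enabled, so libp2p
// can flag the misconfiguration at startup. Not the real transport class.
class CircuitRelayTransportSketch {
  constructor (private readonly discoverRelays: number) {}

  get [serviceDependencies] (): string[] {
    const deps = ['@libp2p/identify']

    if (this.discoverRelays > 0) {
      deps.push('@libp2p/peer-routing')
    }

    return deps
  }
}
```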

> Could you expand on this a little?

Based on what I understood from the relay discovery code, it has two parts: it first tries to find some relays in the peer store, and then starts the random walk. Setting my answer to the first question aside and supposing the random walk always succeeds, the peer store search is still only run once. It is neither an infinite iterable like the random walk, nor is it re-run through some other mechanism. We may not be able to connect to the peer store relays at the moment of disconnection for some reason (say, a simple network partition), but we might be able to connect to them a moment later if we tried. In other words, there may be no need to run the random walk for a long time at all: we simply need to give the peer store another chance.
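
As a sketch of "giving the peer store another chance" from application code, using the public peer store API; the hop protocol string is the circuit relay v2 one, but the filtering and dial loop are my own illustration, not the transport's internal search:

```ts
import type { Libp2p } from 'libp2p'

// Relay v2 hop protocol advertised by relay servers
const HOP_PROTOCOL = '/libp2p/circuit/relay/0.2.0/hop'

// Sketch: re-query the peer store for peers that have advertised the hop
// protocol and try dialling them again. Illustrative only.
async function redialKnownRelays (node: Libp2p, limit: number): Promise<void> {
  const peers = await node.peerStore.all({
    filters: [(peer) => peer.protocols.includes(HOP_PROTOCOL)]
  })

  for (const peer of peers.slice(0, limit)) {
    try {
      await node.dial(peer.id)
    } catch {
      // ignore dial failures; another attempt can be made later
    }
  }
}
```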

zargarzadehm pushed a commit to rosen-bridge/rosenet that referenced this issue Sep 8, 2024
Because no peer router is configured for RoseNet, we should restart the
relay discovery service manually so that it is not halted forever. More
details can be found here:
libp2p/js-libp2p#2676
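
For what it's worth, one application-level way to do such a manual restart is to watch for disconnects and re-dial known relays, for example by combining libp2p's peer:disconnect event with a helper like the redialKnownRelays sketch above. This is an assumption about the general approach, not a description of RoseNet's actual code, and counting all connections is a simplification compared to counting relay reservations.

```ts
// Sketch: application-level watchdog that re-dials known relays whenever a
// peer disconnects and the connection count drops below the desired number.
// Reuses the hypothetical redialKnownRelays helper sketched earlier.
function watchRelayConnections (node: Libp2p, discoverRelays: number): void {
  node.addEventListener('peer:disconnect', () => {
    if (node.getConnections().length < discoverRelays) {
      void redialKnownRelays(node, discoverRelays)
    }
  })
}
```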


@mkermani144 (Author)

My previous comment contains the answers to @achingbrain's questions.

@achingbrain achingbrain removed the kind/stale and need/author-input Needs input from the original author labels Sep 10, 2024