Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iroh::net::Endpoint::close should not be fallible #2867

Open
CGamesPlay opened this issue Oct 30, 2024 · 7 comments
Open

iroh::net::Endpoint::close should not be fallible #2867

CGamesPlay opened this issue Oct 30, 2024 · 7 comments

Comments

@CGamesPlay
Copy link

CGamesPlay commented Oct 30, 2024

I was advised to open an issue in #2860. The problem is that Endpoint::close is async and fallible.

Fallible. I investigated, the method cannot fail unless sending Shutdown to the Actor fails, but this isn't actually a failure: this send can only fail if all recievers are dropped, and the receiver is owned by Actor::run. Logically, if this method fails, the socket is already technically shut down.

Further, the Actor::run method cannot fail except via panicking or if the network monitor fails to initialize. It seems clear to me that having initialization code fail after the MagicSock is initialized is a different bug: either the error shouldn't stop the entire Actor, or the fallible code should have happened before the initializer returned Ok. The only other way this function can return is by receiving ActorMessage::Shutdown.

Also, if the method fails, self gets dropped anyways, so there's no way to handle the error.

Async. Quinn's API for closing endpoints is better, and I think would cleanly apply to Iroh: close can be called from anywhere and wait_idle can be called from my main server task to gracefully shut down. Discussion below provides rationale for keeping it async.

It's also unclear what happens if you cancel the close. Does the endpoint stay open? Does it close itself in the background?

Also, it's strange that Connection::close is sync, and that works just fine.

@matheus23
Copy link
Contributor

Also, it's strange that Connection::close is sync, and that works just fine.

IMO this is a pitfall: If you call close, then drop the Endpoint, you might not have actually closed the connection gracefully, because the close frame could be lost in transit.
Endpoint::wait_idle (is badly named and) makes sure that there's no more open connections (including connections where we're just waiting for an ACK for a close frame).

I see a major footgun in having iroh_net::Endpoint::close be synchronous without a call to quinn::Endpoint::wait_idle, as then people will expect this to work and just drop the endpoint. And now connection closes are flaky.

IMO it makes sense that a call to Endpoint::close, which does things on the network, is async.

It's also unclear what happens if you cancel the close. Does the endpoint stay open? Does it close itself in the background?

The close call may or may not go through.
It will close in the background iff you don't drop the Endpoint before it manages to finish ACKing.

@CGamesPlay
Copy link
Author

The cancelation story is weird and is why I don't think having this method be async makes sense. If you call close on an Endpoint, and cancel the future midway through (during the wait_idle period), the Endpoint is going to be in an inconsistent state. The server will already be disabled but some connections will still be alive. The state transition from "operational" to "closing" happens in the first poll; after this point the server stops accepting new connections and existing connections are already in the process of being closed. All tasks that were working with this connection will stop themselves now (LocallyClosed error), because they see the graceful shutdown has started. But here msock.close will never get called, just dropped later on (this may lead to a task / memory leak depending on the implementation; I haven't checked).

As a user of the library, I have some task that is accepting connections on this endpoint, and it's conceptually the "manager" of the endpoint. When my application receives a command to shut down, I need to do so gracefully for a time, and then forcefully. Right now, calling close does accomplish this goal, for the reasons above, but the fact that it works feels like an accident rather than by design, and (I haven't tested, but) it might be holding on to the socket, preventing rebinding. The only safe way to do it is to implement an out-of-band stop signal that goes to my "manager" task.

I propose making two methods, one which synchronously transitions the endpoint to "closing", and a second that waits for the endpoint to actually be closed. If this were the case, my shutdown command would call the synchronous method, then the "manager" task, as part of its orderly shutdown, would wait for the connection to be fully closed before exiting.

It seems like we are in agreement that the method shouldn't be fallible, at least. I also think it's very strange for a cheaply-cloned object to move self in an async method. It should probably take &self instead: if the logic of the shutdown process doesn't work with &self, then it also doesn't work with clones of the object.

@matheus23
Copy link
Contributor

Right now, calling close does accomplish this goal, for the reasons above, but the fact that it works feels like an accident rather than by design

Gracefully shutting down is exactly what the Endpoint::close method is designed for.

@matheus23
Copy link
Contributor

I propose making two methods, one which synchronously transitions the endpoint to "closing", and a second that waits for the endpoint to actually be closed. If this were the case, my shutdown command would call the synchronous method, then the "manager" task, as part of its orderly shutdown, would wait for the connection to be fully closed before exiting.

We are intentionally avoiding exposing a synchronous Endpoint::close method, because most people will just not know that they need to have another "manager" task that waits for orderly shutdown before dropping the Endpoint.

@CGamesPlay CGamesPlay changed the title iroh::net::Endpoint::close should not be fallible nor async iroh::net::Endpoint::close should not be fallible Oct 31, 2024
@CGamesPlay
Copy link
Author

CGamesPlay commented Oct 31, 2024

Well, ultimately I think that this is a minor inconvenience (Iroh can easily do something, but doesn't expose the API, but I can semi-easily recreate the behavior), so no need to push on it. I've removed the async portion from the issue description.

I do think the fallibility issue, move-self semantics issue, and cancel safety issue still stand, but feel free to close this if the team disagrees.

[append] The cancel-safety one is hypothetical, actually. There won't be any problem as long as the magic socket shuts itself down when the endpoint is fully dropped.

@flub
Copy link
Contributor

flub commented Oct 31, 2024

The move-self is odd because Endpoint is Clone anyway, so it doesn't really achieve anything. I think it was put in there as a guidance more than anything else.

I also think the fallibility analysis is interesting. I think normally this would never fail, unless there was a bug in the implementation already. So the fallibility is only there to catch implementation bugs. Question is whether you want the application to know there was an implementation bug or not? It can't do much about it, it may sometimes result in a bug report but I guess in the vast majority of cases that wouldn't be the case.

I guess cancel-safety needs to be documented clearly so users can understand the semantics.

@CGamesPlay
Copy link
Author

Question is whether you want the application to know there was an implementation bug or not?

I would think that anyone interested in detecting implementation bugs would have a panic handler set up to catch them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

3 participants