Improve utp Resilience #69
base: master
Conversation
Force-pushed from 973c366 to 9a4a0ee
I was also getting Connection timeouts, and noticed that the retry timeout was not increasing with each attempt (it was stuck at 1 second for every attempt). By switching to exponential backoff, I stopped getting connection errors like:
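A minimal sketch of the exponential-backoff change being described; the function name, starting timeout, and cap are illustrative values, not the crate's actual fields:

```rust
use std::time::Duration;

/// Double the retry timeout on each failed attempt, up to a cap,
/// instead of retrying at a fixed 1-second interval.
fn next_retry_timeout(current: Duration, max_timeout: Duration) -> Duration {
    (current * 2).min(max_timeout)
}

fn main() {
    // Hypothetical starting values, just for illustration.
    let mut timeout = Duration::from_secs(1);
    let max_timeout = Duration::from_secs(8);

    // Attempt 1 waits 1s, then 2s, 4s, and finally caps at 8s.
    for attempt in 1..=5 {
        println!("attempt {attempt}: retry timeout = {timeout:?}");
        timeout = next_retry_timeout(timeout, max_timeout);
    }
}
```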
I still regularly get the "idle timeout expired while closing" error at 200 concurrent connections, which is nearly what @njgheorghita said the bridge is doing. So I think this track is still worth pursuing.
Only need 200 concurrent streams to reproduce the error. Bonus cleanup: change the high end of the range to equal the number of concurrent streams.
Bonus: show the log for the socket test starting
Force-pushed from 9a4a0ee to 96802e9
Note that it's possible to make the test pass by setting the max_idle_timeout to 60s instead of 10s. This is unsatisfying for a few reasons, for example:
On the other hand, trin is already using a 32 second idle timeout, and is still seeing failures when used in the bridge. These are actually the target failures we care about the most, so maybe it's not top priority to hunt down these idle timeouts. Next up for me is to understand exactly which kinds of errors we're seeing in trin. Also, the spec doesn't specify any idle timeout that I could find, so maybe supporting a small one is unnecessary.
Force-pushed from ffa51f3 to 824217e
Report whether a connection is stressed (Connecting, or has available send window that is still unsent). Using this information, hold off on new connection attempts when stress is higher than the number of worker threads: in that case we can reasonably deduce a *local* problem keeping up with the load, so we shouldn't add more load to ourselves with a new connection. Also, report some connection stress stats in the logs.
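A minimal sketch of the backpressure rule described here; the struct, field names, and `is_stressed` helper are illustrative, not the crate's actual API:

```rust
/// Simplified view of a connection's state for stress accounting.
struct ConnState {
    connecting: bool,
    /// Bytes the send window would allow us to send but that are still unsent.
    unsent_in_window: usize,
}

impl ConnState {
    /// A connection counts as "stressed" if it is still connecting, or if it has
    /// window available that it hasn't filled yet (a sign that we are the local
    /// bottleneck).
    fn is_stressed(&self) -> bool {
        self.connecting || self.unsent_in_window > 0
    }
}

/// Hold off on opening a new connection while the number of stressed connections
/// exceeds the number of worker threads: the problem is local, so adding more
/// load would only make it worse.
fn should_delay_new_connection(conns: &[ConnState], worker_threads: usize) -> bool {
    let stressed = conns.iter().filter(|c| c.is_stressed()).count();
    stressed > worker_threads
}

fn main() {
    let conns = vec![
        ConnState { connecting: true, unsent_in_window: 0 },
        ConnState { connecting: false, unsent_in_window: 1200 },
        ConnState { connecting: false, unsent_in_window: 0 },
    ];
    println!("delay new connection: {}", should_delay_new_connection(&conns, 1));
}
```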
Now we can reliably handle the launching of many more streams, so let's really stress-test the library.
A longer idle timeout has the obvious effect of reducing connection failures due to idle timeout. This idle timeout doesn't seem to be specified anywhere; we are just doing it. 10s seemed reasonable, but when fighting against tokio, which is not evenly distributing tasks, it gets less reasonable.

The retry timeout should be small enough that, if it maxes out, there are still a few chances at reattempts before the idle timeout. Setting it to 1/4 of the idle timeout means we should get at least 3 attempts (the 4th will race with the idle timeout).

I was seeing regular connection failures, even with the stress limiter. By bumping up to 5 attempts, they seemed to effectively disappear. I don't have a strong model for how often SYNs are dropped in this test. To get an estimate of the allowed failure rate, making the silly-ish assumption that each attempt has an independent random chance of failing: to have the test pass 99.9% of the time, we need each connection attempt to succeed at least 75% of the time.
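Spelling out the arithmetic behind that 75% figure, under the independence assumption above: if each attempt succeeds with probability s, a connection fails only when all 5 attempts fail, i.e. with probability (1 − s)^5. Requiring (1 − s)^5 ≤ 0.001 gives s ≥ 1 − 0.001^(1/5) ≈ 1 − 0.25 = 0.75, i.e. roughly a 75% per-attempt success rate.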
Force-pushed from 824217e to 4244f87
@KolbyML posted an interesting idea that tokio was not doing fair round-robin processing of tasks, inspired by: https://users.rust-lang.org/t/tokio-round-robin-50-000-async-tasks-fairly/74120/13

This was definitely worth investigating, but it appears it's not the source of our problem (where the proposed problem was: there are a bunch of globally queued tasks getting ignored by the async runtime because there are so many notified tasks on the local queue). I had thought that the local notified tasks was … The list of notified tasks on the local thread is …

This is where the metrics are generated, for anyone who wants to deep dive: https://docs.rs/tokio/latest/src/tokio/runtime/scheduler/multi_thread/worker.rs.html#891-901
Here are example metrics when the test is about to time out from being overloaded:

Here are some example metrics when the UtpSocket was stressed, but not so stressed that it failed (because of the backpressure against launching all connections at the same time):

There are some bigger numbers in the more-stressed log (steal operations and local schedule count), but it's hard to see why that would be the tipping point.
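For anyone who wants to dump similar numbers themselves, here is a rough sketch of reading tokio's runtime metrics. Treat the exact method names as assumptions: most per-worker counters are only available when building with the tokio_unstable cfg, and the API has shifted between tokio versions.

```rust
use tokio::runtime::Handle;

/// Print a coarse snapshot of scheduler state. The per-worker counters below
/// assume RUSTFLAGS="--cfg tokio_unstable"; on stable tokio only num_workers()
/// is guaranteed to be available.
fn log_runtime_metrics() {
    let metrics = Handle::current().metrics();
    for worker in 0..metrics.num_workers() {
        println!(
            "worker={} local_queue_depth={} steal_count={} local_schedule_count={}",
            worker,
            metrics.worker_local_queue_depth(worker),
            metrics.worker_steal_count(worker),
            metrics.worker_local_schedule_count(worker),
        );
    }
}

#[tokio::main]
async fn main() {
    log_runtime_metrics();
}
```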
Force-pushed from 53c6425 to a6e9d31
Also, target stress should be about half the number of cores, since we tend to overshoot the value in practice.
10k tasks were allocating >20 GB of RAM. That was unnecessary, since all the test data was the same anyway.
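A sketch of the fix described in that commit: share one payload across all tasks instead of materializing a copy per task. The 2 MB size and the task body are placeholders, not the actual test code.

```rust
use std::sync::Arc;

#[tokio::main]
async fn main() {
    // One shared payload instead of 10k independent copies; cloning the Arc only
    // bumps a refcount, so memory stays at one buffer's worth.
    let payload: Arc<Vec<u8>> = Arc::new(vec![0xAB_u8; 2 * 1024 * 1024]);

    let mut handles = Vec::new();
    for _ in 0..10_000 {
        let payload = Arc::clone(&payload);
        handles.push(tokio::spawn(async move {
            // Stand-in for the real transfer: just touch the shared data.
            payload.len()
        }));
    }
    for handle in handles {
        handle.await.unwrap();
    }
}
```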
This error path was coming up with some RecvError, which needs its own investigation. For now, at least show which connection id failed.

Example log before this change:
ERROR utp_rs::socket: Failed to open connection with cid err=RecvError(())

Example after:
ERROR utp_rs::socket: Failed to open connection with ConnectionId { send: 8729, recv: 8728, peer: 127.0.0.1:3400 } err=RecvError(())
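A self-contained sketch of that logging change; the stand-in types below only exist to make the example runnable, they are not the crate's real definitions:

```rust
use std::net::SocketAddr;

// Stand-ins for the real types, just to make the example self-contained.
#[derive(Debug)]
struct ConnectionId {
    send: u16,
    recv: u16,
    peer: SocketAddr,
}

#[derive(Debug)]
struct RecvError(());

fn main() {
    let cid = ConnectionId { send: 8729, recv: 8728, peer: "127.0.0.1:3400".parse().unwrap() };
    let err = RecvError(());

    // Before: the connection id was missing, so the log couldn't say which connection failed.
    eprintln!("Failed to open connection with cid err={err:?}");

    // After: include the connection id so the failure can be correlated with a peer.
    eprintln!("Failed to open connection with {cid:?} err={err:?}");
}
```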
Force-pushed from a6e9d31 to 0ca2a02
When matching the number of stressed connections to the number of cores, I was still regularly failing at 10k transfers. After targeting half the number of cores, I failed 1 out of 4 10k runs. Not a great result, but it fails a lot more at exactly the number of cores, so I guess I'll leave it at half the number of cores for now. I'm not convinced this 10k test should go in CI even though it's finding issues (it takes 10-15 minutes to run on my laptop). I'd prefer to find other ways to trigger the problems, hopefully much faster and more reliably. That's why I added #73 as follow-up work.
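A sketch of how the stress target discussed here might be derived; the name target_stress and the "never zero" floor are illustrative choices:

```rust
use std::thread::available_parallelism;

fn main() {
    // Core count as seen by the standard library; fall back to 1 if unknown.
    let cores = available_parallelism().map(|n| n.get()).unwrap_or(1);

    // Targeting exactly `cores` stressed connections still overshoots in practice,
    // so aim for roughly half the core count instead (but never zero).
    let target_stress = (cores / 2).max(1);

    println!("cores={cores} target_stress={target_stress}");
}
```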
There are a variety of issues that show up by running the test @njgheorghita wrote, which ramps up concurrent transfers.
This PR:
- resolves the bug about the retry timeout flying sky-high after a few missed packets in a row, which then triggers idle timeouts
  - The merged limit congestion control timeout amplification #70 seems to handle the most common situation of this happening
- stress_rx in the UtpSocket …
- idle timeout expired while closing... unacked=[21347] local_fin=Some(21347) remote_fin=Some(41499)
- RecvError during connection. Weird that it's not like the other connection timeouts. Example: …

This is too many things in one PR. I'll peel off parts of it for review, when the code settles in a bit.