
Improve utp Resilience #69

Draft
carver wants to merge 16 commits into master from improve-utp-resiliance

Conversation

carver commented Jun 5, 2023

A variety of issues show up when running the test @njgheorghita wrote, which ramps up concurrent transfers.

This PR:

  • Adds that concurrent transfer test, plus some cleanups to it
  • Logs a variety of helpful information
  • Removes or downgrades (to debug/trace) some logs that probably should not be info+, like the 40 ms packet insertion delay (maybe even the 400 ms one)
  • Avoids some panics on stream shutdown, preferring an error log
  • Resolves the bug where the retry timeout flies sky-high after a few missed packets in a row, which then triggers idle timeouts. The merged limit congestion control timeout amplification #70 seems to handle the most common case where this happens
  • Resolves the issue of idle timeouts at 200 concurrency (presumably for similar reasons as at 20 concurrency: dropped/unacked packets aren't retried quickly enough before the idle timeout expires). Possible solutions:
    • avoid starting new connections when not keeping up with existing ones, e.g. delay sending the SYN when experiencing stress (open question: how to measure stress? possibly the sum of SendBuffers)
    • increase the max idle timeout
  • Cut the retry timeout to less than the idle timeout (rather than equal to it), to avoid a race between retry and idle timeout
  • Clamp, and warn in the logs, if the configured max retry timeout is higher than the max idle timeout (see the sketch after this list)
  • Adds exponential backoff for the connection retry delays
  • Consider better implementation options for stress monitoring
  • If keeping the current stress monitoring architecture, be sure to garbage-collect the stress_rx in the UtpSocket
  • Why are there so many of these logs, even in passing tests? idle timeout expired while closing... unacked=[21347] local_fin=Some(21347) remote_fin=Some(41499)
  • Reduce memory usage when running a bunch of transfers in tests/socket
  • Investigate the RecvError during connection. It's odd that it doesn't look like the other connection timeouts. Example:
2023-06-10T20:37:13.195927Z  WARN uTP{send=8729 recv=8728}: utp_rs::conn: idle timeout expired while connecting, closing... unacked=[55255]
2023-06-10T20:37:13.196030Z ERROR utp_rs::socket: Failed to open connection with cid ConnectionId { send: 8729, recv: 8728, peer: 127.0.0.1:3400 } err=RecvError(())
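
For the clamp-and-warn item above, a minimal sketch of the idea. The function name is made up, and the choice to clamp to a quarter of the idle timeout is an assumption borrowed from a later commit on this branch, not the actual utp_rs code:

```rust
use std::time::Duration;
use tracing::warn;

// Hypothetical helper, not the real utp_rs API: if the configured max retry
// timeout could never fire before the idle timer, clamp it and tell the user
// instead of silently racing the two timers.
fn effective_max_retry_timeout(configured_retry: Duration, max_idle: Duration) -> Duration {
    if configured_retry >= max_idle {
        // Clamping to a quarter of the idle timeout leaves room for a few
        // retries before the idle timeout can expire (assumed ratio).
        let clamped = max_idle / 4;
        warn!(
            ?configured_retry,
            ?max_idle,
            ?clamped,
            "configured max retry timeout is >= max idle timeout; clamping"
        );
        clamped
    } else {
        configured_retry
    }
}
```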

This is too many things in one PR. I'll peel off parts of it for review when the code settles in a bit.

carver force-pushed the improve-utp-resiliance branch from 973c366 to 9a4a0ee on June 6, 2023 at 18:53
carver commented Jun 7, 2023

I was also getting connection timeouts, and noticed that the retry timeout was not increasing with each attempt (it was stuck at 1 second every time). After switching to exponential backoff, I stopped getting connection errors like:

2023-06-06T23:47:04.095345Z ERROR utp_rs::socket: Got error when trying to open connection with cid err=Kind(TimedOut)
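
For reference, a minimal sketch of the backoff shape described here. The base delay, growth cap, and function name are illustrative assumptions, not the values this PR uses:

```rust
use std::time::Duration;

// Hypothetical sketch: delay before the nth connection (SYN) retry, doubling
// each attempt instead of staying flat at a fixed 1 second.
fn connect_retry_delay(attempt: u32) -> Duration {
    let base = Duration::from_millis(250); // assumed base delay
    let max_exponent = 6; // cap growth so the delay stays bounded (~16s here)
    base * 2u32.pow(attempt.min(max_exponent))
}
```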

carver commented Jun 7, 2023

I still regularly get the "idle timeout expired while closing" error at 200 concurrent connections, which is close to what @njgheorghita said the bridge is doing. So I think this track is still worth pursuing.

carver added 3 commits June 7, 2023 10:47
Only need 200 concurrency to reproduce error

Bonus cleanup: change the high end of the range to equal the number of
concurrent streams
Bonus: show the log for the socket test starting
carver force-pushed the improve-utp-resiliance branch from 9a4a0ee to 96802e9 on June 7, 2023 at 18:14
carver commented Jun 7, 2023

Note that it's possible to make the test pass by setting the max_idle_timeout to 60s instead of 10s. This is unsatisfying for a few reasons, for example:

  • we're allowing users to configure this value. As much as possible, utp should not fail due to a (seemingly) reasonable configuration choice
  • as a sender, we can't control the receiver-side idle timeout, so we need to work either way

On the other hand, trin is already using a 32-second idle timeout and is still seeing failures when used in the bridge. Those are actually the failures we care about the most, so maybe hunting down these idle timeouts isn't the top priority. Next up for me is to understand exactly which kinds of errors we're seeing in trin.

Also, the spec doesn't specify any idle timeout that I could find, so maybe supporting a small one is unnecessary.
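
For context, the change that makes the test pass looks roughly like the sketch below. The `ConnectionConfig` struct and its default are assumptions for illustration; only the `max_idle_timeout` name comes from the discussion above, and the real utp_rs config may differ:

```rust
use std::time::Duration;

// Hypothetical config shape, not the actual utp_rs struct.
struct ConnectionConfig {
    max_idle_timeout: Duration,
}

impl Default for ConnectionConfig {
    fn default() -> Self {
        Self { max_idle_timeout: Duration::from_secs(10) } // assumed 10s default
    }
}

fn main() {
    // Raising the idle timeout from 10s to 60s makes the test pass, but the
    // sender can't control the receiver-side value, so this alone isn't a fix.
    let config = ConnectionConfig { max_idle_timeout: Duration::from_secs(60) };
    assert_eq!(config.max_idle_timeout, Duration::from_secs(60));
}
```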

carver self-assigned this Jun 8, 2023
carver force-pushed the improve-utp-resiliance branch from ffa51f3 to 824217e on June 10, 2023 at 05:03
carver added 3 commits June 9, 2023 22:30
Report whether a connection is stressed (still Connecting, or has data that
fits in the available send window but is still unsent).

Using this information, hold off on new connection attempts when stress
is higher than the number of worker threads, because we can reasonably
deduce a *local* problem keeping up, and so we shouldn't add more load
to ourselves with a new connection.

Also, report some connection stress stats in the logs.
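
A rough sketch of the gating described in this commit message. The names are made up for illustration, and the mechanism feeding the stress count (per-connection reports back to the socket) is elided:

```rust
use std::time::Duration;

// Hypothetical back-pressure check: hold off on sending a new SYN while more
// connections report stress than there are worker threads, since that suggests
// a *local* problem keeping up rather than network loss.
async fn wait_for_capacity(stressed_count: impl Fn() -> usize, worker_threads: usize) {
    while stressed_count() > worker_threads {
        tokio::time::sleep(Duration::from_millis(50)).await;
    }
}
```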
Now we can reliably handle the launching of many more streams, so let's
really stress-test the library.
A longer idle timeout has the obvious effect of reducing connection failures
due to idle timeout. This idle timeout doesn't seem to be specified
anywhere; we are just choosing it. 10s seemed reasonable, but when fighting
against tokio, which is not evenly distributing tasks, it gets less
reasonable.

The retry timeout should be small enough that, if it maxes out, there are
a few chances at reattempts before the idle timeout. Setting it to 1/4 of
the idle timeout means we should get at least 3 attempts (the 4th will
race with the idle timeout).

I was seeing regular connection failures, even with the stress limiter.
Bumping up to 5 attempts made them effectively disappear. I don't have a
strong model for how often SYNs are dropped in this test. To get an
estimate of the allowed failure rate, under the silly-ish assumption that
each attempt has an independent random chance of failing: to have the
test pass 99.9% of the time, we need each connection attempt to succeed
at least 75% of the time.
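
Spelling out the arithmetic behind that estimate (my reading: per connection, under the independence assumption): a connection fails only if all 5 attempts fail, so requiring a ~99.9% per-connection success rate gives

$$(1 - p)^5 \le 0.001 \implies 1 - p \le 0.001^{1/5} \approx 0.25 \implies p \ge 0.75,$$

i.e. each attempt needs to succeed roughly 75% of the time to keep a connection's failure rate around 0.1%.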
carver force-pushed the improve-utp-resiliance branch from 824217e to 4244f87 on June 10, 2023 at 05:31
carver commented Jun 10, 2023

@KolbyML posted an interesting idea that tokio was not doing a fair round-robin processing of tasks, inspired by: https://users.rust-lang.org/t/tokio-round-robin-50-000-async-tasks-fairly/74120/13

This was definitely worth investigating, but it appears it's not the source of our problem (the proposed problem being: a bunch of globally queued tasks getting ignored by the async runtime because there are so many notified tasks on the local queue). I had thought that the count of locally notified tasks was worker_local_schedule_count, but I was wrong...

The list of notified tasks on the local thread is *_local_queue_depth, and the global queue is injection_queue_depth. The local depth is never getting near 256, and the global queue is almost never greater than 0, even when we are experiencing timeout errors.

This is where the metrics are generated, for anyone who wants to deep dive: https://docs.rs/tokio/latest/src/tokio/runtime/scheduler/multi_thread/worker.rs.html#891-901
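
For anyone who wants to sample these numbers themselves, the dumps in the next comment look like output from the tokio-metrics crate's RuntimeMonitor. A sketch, assuming that crate and a tokio built with RUSTFLAGS="--cfg tokio_unstable" for the full set of fields:

```rust
use std::time::Duration;
use tokio_metrics::RuntimeMonitor;

#[tokio::main]
async fn main() {
    let handle = tokio::runtime::Handle::current();
    let monitor = RuntimeMonitor::new(&handle);

    // Print a RuntimeMetrics snapshot roughly once per second while the
    // workload under test runs.
    tokio::spawn(async move {
        for metrics in monitor.intervals() {
            println!("{:?}", metrics);
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
    });

    // ... launch the uTP stress test here ...
}
```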

carver commented Jun 10, 2023

Here are example metrics when the test is about to time out from being overloaded:
RuntimeMetrics { workers_count: 16, total_park_count: 96400, max_park_count: 7225, min_park_count: 3670, total_noop_count: 81511, max_noop_count: 6169, min_noop_count: 3121, total_steal_count: 19421, max_steal_count: 1484, min_steal_count: 779, total_steal_operations: 18424, max_steal_operations: 1415, min_steal_operations: 0, num_remote_schedules: 0, total_local_schedule_count: 39349, max_local_schedule_count: 3715, min_local_schedule_count: 1844, total_overflow_count: 0, max_overflow_count: 0, min_overflow_count: 0, total_polls_count: 39018, max_polls_count: 2971, min_polls_count: 1589, total_busy_duration: 6.622269162s, max_busy_duration: 532.767938ms, min_busy_duration: 318.29104ms, injection_queue_depth: 0, total_local_queue_depth: 1, max_local_queue_depth: 1, min_local_queue_depth: 0, elapsed: 1.000996484s, budget_forced_yield_count: 710, io_driver_ready_count: 119450 }

Here are some example metrics when the UtpSocket was stressed, but not so stressed that it failed (because of the backpressure against launching all connections at the same time):
RuntimeMetrics { workers_count: 16, total_park_count: 85106, max_park_count: 10173, min_park_count: 412, total_noop_count: 76496, max_noop_count: 9303, min_noop_count: 392, total_steal_count: 7067, max_steal_count: 834, min_steal_count: 25, total_steal_operations: 7041, max_steal_operations: 833, min_steal_operations: 0, num_remote_schedules: 0, total_local_schedule_count: 22777, max_local_schedule_count: 2645, min_local_schedule_count: 270, total_overflow_count: 0, max_overflow_count: 0, min_overflow_count: 0, total_polls_count: 22655, max_polls_count: 2754, min_polls_count: 65, total_busy_duration: 4.627062199s, max_busy_duration: 545.967506ms, min_busy_duration: 55.829089ms, injection_queue_depth: 0, total_local_queue_depth: 0, max_local_queue_depth: 0, min_local_queue_depth: 0, elapsed: 1.00097427s, budget_forced_yield_count: 675, io_driver_ready_count: 124233 }

There are some bigger numbers in the more-stressed log (steal operations and local schedule count), but it's hard to see why that would be the tipping point.

carver force-pushed the improve-utp-resiliance branch from 53c6425 to a6e9d31 on June 10, 2023 at 20:42
carver added 4 commits June 10, 2023 13:59
Also, target stress should be about half the number of cores, since we
tend to overshoot the value in practice.
10k tasks were allocating >20 GB of RAM. That was unnecessary, since all
the test data was the same anyway.
This error path was coming up with some RecvError, which needs its own
investigation. For now, at least show which connection id failed.

Example log before this change:
ERROR utp_rs::socket: Failed to open connection with cid err=RecvError(())

Example after:
ERROR utp_rs::socket: Failed to open connection with ConnectionId { send: 8729, recv: 8728, peer: 127.0.0.1:3400 } err=RecvError(())
carver force-pushed the improve-utp-resiliance branch from a6e9d31 to 0ca2a02 on June 10, 2023 at 21:00
carver commented Jun 10, 2023

When matching the number of stressed connections to the number of cores, I was still regularly failing at 10k transfers. After targeting half the number of cores, I failed 1 out of 4 10k-transfer runs. Not a great result, but it fails a lot more often at exactly the number of cores, so I'll leave it at half the number of cores for now.

I'm not convinced this 10k test should go in CI even though it's finding issues (it takes 10-15 minutes to run on my laptop). I'd prefer to find other ways to trigger the problems, hopefully much faster and more reliably. That's why I added #73 as follow-up work.
