
Improve utp Resilience #69

Draft
carver wants to merge 16 commits into master from improve-utp-resiliance

Conversation

carver commented Jun 5, 2023

A variety of issues show up when running the test @njgheorghita wrote, which ramps up concurrent transfers.

This PR:

  • Adds that concurrent transfer test, plus some cleanups to it
  • Logs a variety of helpful information
  • Removes or downgrades (to debug/trace) some logs that probably should not be info+, like the 40 ms packet insertion delay (maybe even the 400 ms one)
  • Avoids some panics on stream shutdown, preferring an error log
  • Resolves the bug where the retry timeout flies sky-high after a few missed packets in a row, which then triggers idle timeouts. The merged limit congestion control timeout amplification #70 seems to handle the most common case where this happens
  • Resolves the issue of idle timeouts at 200 concurrency (presumably for similar reasons as at 20 concurrency: dropped/unacked packets aren't retried quickly enough before the idle timeout expires). Possible solutions:
    • avoid starting new connections when not keeping up with existing ones, e.g. delay sending the SYN when experiencing stress (open question: how to measure stress? possibly the sum of SendBuffers)
    • increase the max idle timeout
  • Cut the retry timeout to less than the idle timeout (rather than equal to it), to avoid a race between retry and idle timeout
  • Clamp, and warn in the logs, if the configured max retry timeout is higher than the max idle timeout (see the sketch after this list)
  • Adds exponential backoff for the connection retry delays
  • Consider better implementation options for stress monitoring
  • If keeping the current stress monitoring architecture, be sure to garbage-collect the stress_rx in the UtpSocket
  • Why are there so many of these logs, even in passing tests? idle timeout expired while closing... unacked=[21347] local_fin=Some(21347) remote_fin=Some(41499)
  • Reduce memory usage when running a bunch of transfers in tests/socket
  • Investigate the RecvError during connection. It's odd that it doesn't look like the other connection timeouts. Example:
2023-06-10T20:37:13.195927Z  WARN uTP{send=8729 recv=8728}: utp_rs::conn: idle timeout expired while connecting, closing... unacked=[55255]
2023-06-10T20:37:13.196030Z ERROR utp_rs::socket: Failed to open connection with cid ConnectionId { send: 8729, recv: 8728, peer: 127.0.0.1:3400 } err=RecvError(())
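
For the clamp-and-warn item above, a minimal sketch of the idea. The function name is made up, and the choice to clamp to a quarter of the idle timeout is an assumption borrowed from a later commit on this branch, not the actual utp_rs code:

```rust
use std::time::Duration;
use tracing::warn;

// Hypothetical helper, not the real utp_rs API: if the configured max retry
// timeout could never fire before the idle timer, clamp it and tell the user
// instead of silently racing the two timers.
fn effective_max_retry_timeout(configured_retry: Duration, max_idle: Duration) -> Duration {
    if configured_retry >= max_idle {
        // Clamping to a quarter of the idle timeout leaves room for a few
        // retries before the idle timeout can expire (assumed ratio).
        let clamped = max_idle / 4;
        warn!(
            ?configured_retry,
            ?max_idle,
            ?clamped,
            "configured max retry timeout is >= max idle timeout; clamping"
        );
        clamped
    } else {
        configured_retry
    }
}
```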

This is too many things in one PR. I'll peel off parts of it for review when the code settles in a bit.

carver force-pushed the improve-utp-resiliance branch from 973c366 to 9a4a0ee on June 6, 2023 at 18:53
carver commented Jun 7, 2023

I was also getting connection timeouts, and noticed that the retry timeout was not increasing with each attempt (it was stuck at 1 second every time). After switching to exponential backoff, I stopped getting connection errors like:

2023-06-06T23:47:04.095345Z ERROR utp_rs::socket: Got error when trying to open connection with cid err=Kind(TimedOut)
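
For reference, a minimal sketch of the backoff shape described here. The base delay, growth cap, and function name are illustrative assumptions, not the values this PR uses:

```rust
use std::time::Duration;

// Hypothetical sketch: delay before the nth connection (SYN) retry, doubling
// each attempt instead of staying flat at a fixed 1 second.
fn connect_retry_delay(attempt: u32) -> Duration {
    let base = Duration::from_millis(250); // assumed base delay
    let max_exponent = 6; // cap growth so the delay stays bounded (~16s here)
    base * 2u32.pow(attempt.min(max_exponent))
}
```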

carver commented Jun 7, 2023

I still regularly get the "idle timeout expired while closing" error at 200 concurrent connections, which is close to what @njgheorghita said the bridge is doing. So I think this track is still worth pursuing.

carver added 3 commits June 7, 2023 10:47
Only need 200 concurrency to reproduce error

Bonus cleanup: change the high end of the range to equal the number of
concurrent streams
Bonus: show the log for the socket test starting
carver force-pushed the improve-utp-resiliance branch from 9a4a0ee to 96802e9 on June 7, 2023 at 18:14
carver commented Jun 7, 2023

Note that it's possible to make the test pass by setting the max_idle_timeout to 60s instead of 10s. This is unsatisfying for a few reasons, for example:

  • we're allowing users to configure this value. As much as possible, utp should not fail due to a (seemingly) reasonable configuration choice
  • as a sender, we can't control the receiver-side idle timeout, so we need to work either way

On the other hand, trin is already using a 32-second idle timeout and is still seeing failures when used in the bridge. Those are actually the failures we care about the most, so maybe hunting down these idle timeouts isn't the top priority. Next up for me is to understand exactly which kinds of errors we're seeing in trin.

Also, the spec doesn't specify any idle timeout that I could find, so maybe supporting a small one is unnecessary.
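
For context, the change that makes the test pass looks roughly like the sketch below. The `ConnectionConfig` struct and its default are assumptions for illustration; only the `max_idle_timeout` name comes from the discussion above, and the real utp_rs config may differ:

```rust
use std::time::Duration;

// Hypothetical config shape, not the actual utp_rs struct.
struct ConnectionConfig {
    max_idle_timeout: Duration,
}

impl Default for ConnectionConfig {
    fn default() -> Self {
        Self { max_idle_timeout: Duration::from_secs(10) } // assumed 10s default
    }
}

fn main() {
    // Raising the idle timeout from 10s to 60s makes the test pass, but the
    // sender can't control the receiver-side value, so this alone isn't a fix.
    let config = ConnectionConfig { max_idle_timeout: Duration::from_secs(60) };
    assert_eq!(config.max_idle_timeout, Duration::from_secs(60));
}
```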

carver self-assigned this Jun 8, 2023
carver force-pushed the improve-utp-resiliance branch from ffa51f3 to 824217e on June 10, 2023 at 05:03
carver added 3 commits June 9, 2023 22:30
Report whether a connection is stressed (still Connecting, or has data that
fits in the available send window but is still unsent).

Using this information, hold off on new connection attempts when stress
is higher than the number of worker threads, because we can reasonably
deduce a *local* problem keeping up, and so we shouldn't add more load
to ourselves with a new connection.

Also, report some connection stress stats in the logs.
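
A rough sketch of the gating described in this commit message. The names are made up for illustration, and the mechanism feeding the stress count (per-connection reports back to the socket) is elided:

```rust
use std::time::Duration;

// Hypothetical back-pressure check: hold off on sending a new SYN while more
// connections report stress than there are worker threads, since that suggests
// a *local* problem keeping up rather than network loss.
async fn wait_for_capacity(stressed_count: impl Fn() -> usize, worker_threads: usize) {
    while stressed_count() > worker_threads {
        tokio::time::sleep(Duration::from_millis(50)).await;
    }
}
```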
Now we can reliably handle the launching of many more streams, so let's
really stress-test the library.
A longer idle timeout has the obvious effect of reducing connection failures
due to idle timeout. This idle timeout doesn't seem to be specified
anywhere; we are just choosing it. 10s seemed reasonable, but when fighting
against tokio, which is not evenly distributing tasks, it gets less
reasonable.

The retry timeout should be small enough that, if it maxes out, there are
a few chances at reattempts before the idle timeout. Setting it to 1/4 of
the idle timeout means we should get at least 3 attempts (the 4th will
race with the idle timeout).

I was seeing regular connection failures, even with the stress limiter.
Bumping up to 5 attempts made them effectively disappear. I don't have a
strong model for how often SYNs are dropped in this test. To get an
estimate of the allowed failure rate, under the silly-ish assumption that
each attempt has an independent random chance of failing: to have the
test pass 99.9% of the time, we need each connection attempt to succeed
at least 75% of the time.
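
Spelling out the arithmetic behind that estimate (my reading: per connection, under the independence assumption): a connection fails only if all 5 attempts fail, so requiring a ~99.9% per-connection success rate gives

$$(1 - p)^5 \le 0.001 \implies 1 - p \le 0.001^{1/5} \approx 0.25 \implies p \ge 0.75,$$

i.e. each attempt needs to succeed roughly 75% of the time to keep a connection's failure rate around 0.1%.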
carver force-pushed the improve-utp-resiliance branch from 824217e to 4244f87 on June 10, 2023 at 05:31
carver commented Jun 10, 2023

@KolbyML posted an interesting idea that tokio was not doing a fair round-robin processing of tasks, inspired by: https://users.rust-lang.org/t/tokio-round-robin-50-000-async-tasks-fairly/74120/13

This was definitely worth investigating, but it appears it's not the source of our problem (the proposed problem being: a bunch of globally queued tasks getting ignored by the async runtime because there are so many notified tasks on the local queue). I had thought that the count of locally notified tasks was worker_local_schedule_count, but I was wrong...

The list of notified tasks on the local thread is *_local_queue_depth, and the global queue is injection_queue_depth. The local depth is never getting near 256, and the global queue is almost never greater than 0, even when we are experiencing timeout errors.

This is where the metrics are generated, for anyone who wants to deep dive: https://docs.rs/tokio/latest/src/tokio/runtime/scheduler/multi_thread/worker.rs.html#891-901
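
For anyone who wants to sample these numbers themselves, the dumps in the next comment look like output from the tokio-metrics crate's RuntimeMonitor. A sketch, assuming that crate and a tokio built with RUSTFLAGS="--cfg tokio_unstable" for the full set of fields:

```rust
use std::time::Duration;
use tokio_metrics::RuntimeMonitor;

#[tokio::main]
async fn main() {
    let handle = tokio::runtime::Handle::current();
    let monitor = RuntimeMonitor::new(&handle);

    // Print a RuntimeMetrics snapshot roughly once per second while the
    // workload under test runs.
    tokio::spawn(async move {
        for metrics in monitor.intervals() {
            println!("{:?}", metrics);
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
    });

    // ... launch the uTP stress test here ...
}
```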

carver commented Jun 10, 2023

Here are example metrics when the test is about to time out from being overloaded:
RuntimeMetrics { workers_count: 16, total_park_count: 96400, max_park_count: 7225, min_park_count: 3670, total_noop_count: 81511, max_noop_count: 6169, min_noop_count: 3121, total_steal_count: 19421, max_steal_count: 1484, min_steal_count: 779, total_steal_operations: 18424, max_steal_operations: 1415, min_steal_operations: 0, num_remote_schedules: 0, total_local_schedule_count: 39349, max_local_schedule_count: 3715, min_local_schedule_count: 1844, total_overflow_count: 0, max_overflow_count: 0, min_overflow_count: 0, total_polls_count: 39018, max_polls_count: 2971, min_polls_count: 1589, total_busy_duration: 6.622269162s, max_busy_duration: 532.767938ms, min_busy_duration: 318.29104ms, injection_queue_depth: 0, total_local_queue_depth: 1, max_local_queue_depth: 1, min_local_queue_depth: 0, elapsed: 1.000996484s, budget_forced_yield_count: 710, io_driver_ready_count: 119450 }

Here are some example metrics when the UtpSocket was stressed, but not so stressed that it failed (because of the backpressure against launching all connections at the same time):
RuntimeMetrics { workers_count: 16, total_park_count: 85106, max_park_count: 10173, min_park_count: 412, total_noop_count: 76496, max_noop_count: 9303, min_noop_count: 392, total_steal_count: 7067, max_steal_count: 834, min_steal_count: 25, total_steal_operations: 7041, max_steal_operations: 833, min_steal_operations: 0, num_remote_schedules: 0, total_local_schedule_count: 22777, max_local_schedule_count: 2645, min_local_schedule_count: 270, total_overflow_count: 0, max_overflow_count: 0, min_overflow_count: 0, total_polls_count: 22655, max_polls_count: 2754, min_polls_count: 65, total_busy_duration: 4.627062199s, max_busy_duration: 545.967506ms, min_busy_duration: 55.829089ms, injection_queue_depth: 0, total_local_queue_depth: 0, max_local_queue_depth: 0, min_local_queue_depth: 0, elapsed: 1.00097427s, budget_forced_yield_count: 675, io_driver_ready_count: 124233 }

There are some bigger numbers in the more-stressed log (steal operations and local schedule count), but it's hard to see why that would be the tipping point.

carver force-pushed the improve-utp-resiliance branch from 53c6425 to a6e9d31 on June 10, 2023 at 20:42
carver added 4 commits June 10, 2023 13:59
Also, target stress should be about half the number of cores, since we
tend to overshoot the value in practice.
10k tasks were allocating >20 GB of RAM. That was unnecessary, since all
the test data was the same anyway.
This error path was coming up with some RecvError, which needs its own
investigation. For now, at least show which connection id failed.

Example log before this change:
ERROR utp_rs::socket: Failed to open connection with cid err=RecvError(())

Example after:
ERROR utp_rs::socket: Failed to open connection with ConnectionId { send: 8729, recv: 8728, peer: 127.0.0.1:3400 } err=RecvError(())
carver force-pushed the improve-utp-resiliance branch from a6e9d31 to 0ca2a02 on June 10, 2023 at 21:00
carver commented Jun 10, 2023

When matching the number of stressed connections to the number of cores, I was still regularly failing at 10k transfers. After targeting half the number of cores, I failed 1 out of 4 10k-transfer runs. Not a great result, but it fails a lot more often at exactly the number of cores, so I'll leave it at half the number of cores for now.

I'm not convinced this 10k test should go in CI even though it's finding issues (it takes 10-15 minutes to run on my laptop). I'd prefer to find other ways to trigger the problems, hopefully much faster and more reliably. That's why I added #73 as follow-up work.
