Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed throughput issues by fixing lack of guarentee on SCTP -> DTLS packet ordering #513

Merged
merged 3 commits into from
Nov 15, 2023

Conversation

onnoowl
Copy link
Contributor

@onnoowl onnoowl commented Nov 14, 2023

Hi all,

I ran into some severe throughput issues with webrtc-rs, but I've managed to track down the problem. I was using webrtc through the Matchbox crate, and found my total throughput was limited to about 3 Mbps across several different machines and networks.

TLDR

The use of tokio::task::spawn when SCTP sends its packets to the underlying DTLS layer causes packets to send out of order. These out of order packets trigger SCTP's congestion control mechanisms, limiting total throughput to around 3 Mbps on machines I tested. Changing the code to guarantee packet ordering at the time of sending resolves the issue.

Bug Details

Here's the source of the problem:
https://github.com/webrtc-rs/webrtc/pull/363/files#diff-26a945150f33caddc941f57968f74224cba89af5a0ac4ab4eac4d803f50a6a70R524 (snippet included below)

// scpt/src/association/mod.rs:490

let limit = {
    #[cfg(test)]
    {
        1
    }
    #[cfg(not(test))]
    {
        8
    }
};

let sem = Arc::new(Semaphore::new(limit));
while !done.load(Ordering::Relaxed) {

// ...

    let net_conn = Arc::clone(&net_conn);
    let bytes_sent = Arc::clone(&bytes_sent);
    let name2 = Arc::clone(&name);
    let done2 = Arc::clone(&done);
    let sem = Arc::clone(&sem);
    sem.acquire().await.unwrap().forget();
    tokio::task::spawn(async move {
        let mut buf = BytesMut::with_capacity(16 * 1024);
        for raw in packets {
            buf.clear();
            if let Err(err) = raw.marshal_to(&mut buf) {
                log::warn!("[{}] failed to serialize a packet: {:?}", name2, err);
            } else {
                let raw = buf.as_ref();
                if let Err(err) = net_conn.send(raw.as_ref()).await {
                    log::warn!("[{}] failed to write packets on net_conn: {}", name2, err);
                    done2.store(true, Ordering::Relaxed)
                } else {
                    bytes_sent.fetch_add(raw.len(), Ordering::SeqCst);
                }
            }
            //log::debug!("[{}] sending {} bytes done", name, raw.len());
        }
        sem.add_permits(1);
    });

@KillingSpark made some lovely performance improvements in Issue: #360, Pr: #363 to remove mutex contention, but it looks like the tokio::task::spawn that was introduced can cause out of order packet delivery, which has severe consequences for SCTP's congestion control and the resulting throughput.

While I don't see throughput issues when testing locally, the moment I did any testing over the internet, I was seeing 3 Mbps throughput, sometimes up to 10, but never any higher.

To test I sent 10 megabytes of data all at once through a single stream, and timed the difference between delivery of the first and last packets. Turning on logging, I found that there would often be 70 to 100 fast re-transmission events in the SCTP logic during that 10 megabyte delivery. These fast re-transmissions kept the congestion control window very low, causing the poor throughput.

Using Wireshark I was able to confirm that no packets were being re-ordered or lost over the internet in my tests, the packets were already out of order at the time of sending. Specifically, they were out of order between the SCTP and DTLS layers. The DTLS packets were being sent with strictly increasing sequence numbers, as you would expect, but the SCTP sequence numbers were all over the place at the time of sending. Packets were frequently displaced by up to 8 positions. These frequent misorderings were causing the frequent re-transmissions.

The Solution

While simply removing the tokio::task::spawn completely resolved the issue (giving me hundreds of Mbps of throughput), I wanted to address the comment left behind by @KillingSpark.

// We schedule a new task here for a reason:
// If we don't tokio tends to run the write_loop and read_loop of one connection on the same OS thread
// This means that even though we release the lock above, the read_loop isn't able to take it, simply because it is not being scheduled by tokio
// Doing it this way, tokio schedules this to a new thread, this future is suspended, and the read_loop can make progress

Simply setting the semaphore limit to 1 should be a good solution. However, given the issues @KillingSpark found, maybe the best thing to do would be to put the packet marshaling in a tokio::task::spawn_blocking? This would help ensure that the CPU heavy work never interferes with performance in general. I've written a potential implementation with spawn_blocking in this PR for you to take a look at.

…okio::spawn. The ordering mixup triggered SCTPs congestion control, severely limitting throughput in practice.
@rainliu
Copy link
Member

rainliu commented Nov 15, 2023

Thanks for fixing the longstanding sctp throughput issue.

@rainliu
Copy link
Member

rainliu commented Nov 15, 2023

Could fix fmt/clippy errors?

Copy link

codecov bot commented Nov 15, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (d751e94) 61.57% compared to head (6c7abd1) 61.55%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #513      +/-   ##
==========================================
- Coverage   61.57%   61.55%   -0.03%     
==========================================
  Files         529      529              
  Lines       48865    48865              
  Branches    12363    12383      +20     
==========================================
- Hits        30088    30077      -11     
- Misses       9579     9581       +2     
- Partials     9198     9207       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@KillingSpark
Copy link
Contributor

KillingSpark commented Nov 15, 2023

Nice! It's very interesting that this issue only arose when actually communicating over the internet. We only tested this locally when I worked on this. Probably because the congestion window is basically infinite if you stay on your local machine.

Using spawn_blocking is certainly an option, and if it gives the wanted performance benefits that's great! The real issue is still that we can't force actual parallelism between the read and write loop tasks, which is a shame. Maybe this fact should appear somewhere in a comment as well, I don't know why I didn't include this myself back then.

I also think I did not try this version, I had a few where I moved the whole loop over the vector of raw packets into a spawn_blocking and then processed the whole marshalled vector or a channel in another loop in the write task. I think the key here is the frequent calls to spawn_blocking(...).await which allows/makes the write loop to yield the thread to the read loop frequently.

@rainliu rainliu merged commit 361acc4 into webrtc-rs:master Nov 15, 2023
4 of 5 checks passed
@onnoowl
Copy link
Contributor Author

onnoowl commented Nov 15, 2023

Thanks for merging!

spawn_blocking looks to be the best solution out of what I've tried, but it does look like it incurs some overhead, there's a performance regression for the data-channels-flow-control test.

Before before this PR:

Throughput is about 414.001 Mbps
Throughput is about 407.551 Mbps
Throughput is about 407.130 Mbps
Throughput is about 405.792 Mbps
Throughput is about 404.766 Mbps

After my this PR:

Throughput is about 260.132 Mbps
Throughput is about 260.827 Mbps
Throughput is about 263.874 Mbps
Throughput is about 267.051 Mbps
Throughput is about 268.894 Mbps

Of course for a real non-local test, the performance gains are still immense due to the original congestion control issue. But, due to the overhead added by spawn_blocking, it's definitely worth investigating for a better solution.

I tested keeping the original code but setting the semaphore limit to 1, but that had some significant issues I didn't have time to diagnose.

I might circle back to this later when I have time, but let me know if you guys have any ideas for alternatives to try.

@KillingSpark
Copy link
Contributor

@onnoowl at one point I was thinking about using two separate tokio runtimes to separate the read and write loops and guarantee paralleism. That seemed like a very intrusive change, which is why I didn't do it. I don't have things setup to test this quickly but the performance of this would be an interesting data point.

Another thing that is still in the pipeline is performance upgrades in the crc implementation which is iirc one of the last bottlenecks in the marshalling of the packets. But this depends on the maintainers of the crc crate to make a new release, which they are currently working on.

@KillingSpark
Copy link
Contributor

I forgot that we actually made the local throughput measuring into an example! I went ahead and collected some data. Sadly I don't have convenient access to a server right now to test this over the internet, which is an important factor as this PR showed. I think the results are nonetheless interesting:

Using before this PR as a Baseline performances look like this (on a m1 mac):

* limit = 1:        x1.34 throughput
* this PR:          x0.58 throughput
* forced parallel:  x0.88 throughput

It seems like just setting the limit to 1 delivers the best performance in a local scenario on macos but onoowl said that

I tested keeping the original code but setting the semaphore limit to 1, but that had some significant issues I didn't have time to diagnose.

I think I remember that the limit was chosen as 8 because on linux limiting this to 1 tanked performance or rather made it very jittery but on macos it doesn't but I am not totally sure. I think we also never found out why performance on linux behaved so weirdly. Is that what you meant by "significant issues"?

As I expected using separate runtimes to force parallelism brings back some of the performance, but it comes at an big-ish extra of complexity because we would have to ask users to provide a secondary runtime to this library.

Before this PR with limit set to the default 8

Throughput: 116586765 Bytes/s, 1779 pkts, 1779 loops
Throughput: 119273700 Bytes/s, 1820 pkts, 1820 loops
Throughput: 118618350 Bytes/s, 1810 pkts, 1810 loops

Before this PR but with limit set to 1

Throughput: 156497580 Bytes/s, 2388 pkts, 2388 loops
Throughput: 166524435 Bytes/s, 2541 pkts, 2541 loops
Throughput: 125106315 Bytes/s, 1909 pkts, 1909 loops
Throughput: 156825255 Bytes/s, 2393 pkts, 2393 loops

After this PR

Throughput: 67828725 Bytes/s, 1035 pkts, 1035 loops
Throughput: 69663705 Bytes/s, 1063 pkts, 1063 loops
Throughput: 69401565 Bytes/s, 1059 pkts, 1059 loops
Throughput: 69598170 Bytes/s, 1062 pkts, 1062 loops

With forced parallelism

  • Separate tokio::Runtime for the write_loop
  • no extra taks spawning
  • no spawn_blocking for marshalling
Throughput: 102562275 Bytes/s, 1565 pkts, 1565 loops
Throughput: 102365670 Bytes/s, 1562 pkts, 1562 loops
Throughput: 102955485 Bytes/s, 1571 pkts, 1571 loops
Throughput: 104921535 Bytes/s, 1601 pkts, 1601 loops

@KillingSpark
Copy link
Contributor

Another datapoint to consider is the effects of eliminating the crc bottleneck (by using the 3.1.0-beta.1 version of the crate and enabling the "slice16-mem-limit" feature):

Speedup for current master: ~x1.3

Throughput: 67959795 Bytes/s, 1037 pkts, 1037 loops
Throughput: 69729240 Bytes/s, 1064 pkts, 1064 loops
Throughput: 88537785 Bytes/s, 1351 pkts, 1351 loops
Throughput: 90110625 Bytes/s, 1375 pkts, 1375 loops

Speed up for forced parallel: ~x1.9

Throughput: 104069580 Bytes/s, 1588 pkts, 1588 loops
Throughput: 103479765 Bytes/s, 1579 pkts, 1579 loops
Throughput: 203879385 Bytes/s, 3111 pkts, 3111 loops
Throughput: 196408395 Bytes/s, 2997 pkts, 2997 loops

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants