feat: Faster UDP/IO on Apple platforms #1993

Open
wants to merge 33 commits into
base: main

larseggert (Contributor) commented Sep 20, 2024

This uses Apple's private sendmsg_x and recvmsg_x system calls for multi-packet UDP I/O.

CC @mxinden
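
As a rough illustration of why multi-packet calls can help (this is not code from the PR): a batched call carries several datagrams per kernel crossing, so the number of system calls shrinks with the batch size. The function name `syscalls_needed` and the numbers in the note below are made up for illustration.

```rust
/// Number of send (or receive) system calls needed to move `datagrams`
/// datagrams when each call can carry at most `batch_size` of them.
fn syscalls_needed(datagrams: usize, batch_size: usize) -> usize {
    assert!(batch_size > 0, "batch size must be at least 1");
    // Ceiling division: a final, partially filled batch still costs one call.
    (datagrams + batch_size - 1) / batch_size
}
```

With one datagram per call, 1000 datagrams cost 1000 calls; with a hypothetical batch of 64 they cost 16. This only counts kernel crossings and says nothing about per-packet work inside the kernel.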

Ralith (Collaborator) commented Sep 20, 2024

Is there interest in seeing TX support via sendmsg_x?

We found that sendmmsg-style batching offered little performance benefit and was considerably difficult to take advantage of. IIRC the _x functions on macOS have more to offer than that, though. Will this unblock segmentation offload or other incidental optimizations?

larseggert (Contributor Author) commented

Bench on main:

test large_data_10_streams  ... bench:  27,558,791 ns/iter (+/- 13,459,810) = 380 MB/s
test large_data_1_stream    ... bench:  24,324,266 ns/iter (+/- 19,219,937) = 43 MB/s
test small_data_100_streams ... bench:  19,437,900 ns/iter (+/- 20,065,941)
test small_data_1_stream    ... bench:  11,465,128 ns/iter (+/- 8,699,934)

Bench with this PR:

test large_data_10_streams  ... bench:  28,829,216 ns/iter (+/- 15,924,956) = 363 MB/s
test large_data_1_stream    ... bench:  14,354,999 ns/iter (+/- 20,039,122) = 73 MB/s
test small_data_100_streams ... bench:  14,061,741 ns/iter (+/- 17,311,517)
test small_data_1_stream    ... bench:  19,194,441 ns/iter (+/- 5,012,070)

Surprised that large_data_10_streams and small_data_1_stream are slower...

Ralith (Collaborator) commented Sep 23, 2024

Those tests tend to be extremely noisy, as the huge variance suggests. A targeted quinn-udp benchmark might be more useful.

larseggert (Contributor Author) commented

We've also found on neqo that multi-packet RX without multi-packet TX has limited benefits, since the RX batch size will be very small.

larseggert (Contributor Author) commented

I added sendmsg_x support, mostly to see what the performance difference would be. But it seems that none of the benches or tests call send with a Transmit struct where segment_size is not None?
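
For context, a minimal sketch of what a non-`None` `segment_size` implies for a batched send path: the contents buffer holds several datagrams back to back, each `segment_size` bytes long except possibly the last. `TransmitSketch` and `datagrams` are hypothetical stand-ins written for this note, not quinn-udp's actual `Transmit` type or send code.

```rust
/// Hypothetical stand-in for a transmit descriptor; only the two fields
/// relevant to segmentation are modeled here.
struct TransmitSketch<'a> {
    contents: &'a [u8],
    segment_size: Option<usize>,
}

/// Split the contents into the individual datagrams that a multi-packet
/// send call would put on the wire, one message header per datagram.
fn datagrams<'a>(t: &TransmitSketch<'a>) -> Vec<&'a [u8]> {
    match t.segment_size {
        // Every chunk is `size` bytes, except possibly a shorter final one.
        Some(size) if size > 0 => t.contents.chunks(size).collect(),
        // No (or a zero) segment size: treat the buffer as a single datagram.
        _ => vec![t.contents],
    }
}
```

A 3000-byte buffer with a segment size of 1200 would yield datagrams of 1200, 1200 and 600 bytes, so a bench that only ever passes `None` exercises just the single-datagram branch.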

@larseggert larseggert marked this pull request as ready for review September 23, 2024 09:10
mxinden (Contributor) commented Sep 23, 2024

> A targeted quinn-udp benchmark might be more useful.

How about using the throughput.rs benchmark @larseggert?

https://github.com/quinn-rs/quinn/blob/main/quinn-udp/benches/throughput.rs

larseggert (Contributor Author) commented Sep 23, 2024

With @mxinden's benchmark. Baseline:

gso_true/throughput     time:   [58.076 ms 58.230 ms 58.387 ms]
                        thrpt:  [171.27 MiB/s 171.73 MiB/s 172.19 MiB/s]

Only sendmsg_x:

gso_true/throughput     time:   [15.143 ms 15.189 ms 15.236 ms]
                        thrpt:  [656.35 MiB/s 658.36 MiB/s 660.37 MiB/s]
                 change:
                        time:   [-74.028% -73.915% -73.808%] (p = 0.00 < 0.05)
                        thrpt:  [+281.80% +283.36% +285.04%]
                        Performance has improved.

Both sendmsg_x and recvmsg_x:

gso_true/throughput     time:   [12.632 ms 12.682 ms 12.731 ms]
                        thrpt:  [785.46 MiB/s 788.53 MiB/s 791.61 MiB/s]
                 change:
                        time:   [-78.321% -78.221% -78.112%] (p = 0.00 < 0.05)
                        thrpt:  [+356.88% +359.16% +361.27%]
                        Performance has improved.

Both sendmsg_x and recvmsg_x with BATCH_SIZE of 64:

gso_true/throughput     time:   [11.640 ms 11.682 ms 11.725 ms]
                        thrpt:  [852.85 MiB/s 856.00 MiB/s 859.07 MiB/s]
                 change:
                        time:   [-80.030% -79.938% -79.844%] (p = 0.00 < 0.05)
                        thrpt:  [+396.13% +398.45% +400.75%]
                        Performance has improved.
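
As a sanity check on how to read these reports, the reported "change" figures can be reproduced, to within rounding, from the mean iteration times above, which suggests every run is being compared against the 58.230 ms baseline rather than against the previous run. The helper names below are made up for this note.

```rust
/// Percent change in iteration time relative to the baseline
/// (negative means faster).
fn time_change_pct(baseline_ms: f64, new_ms: f64) -> f64 {
    (new_ms / baseline_ms - 1.0) * 100.0
}

/// Percent change in throughput for the same amount of data moved per
/// iteration; throughput is inversely proportional to iteration time.
fn throughput_change_pct(baseline_ms: f64, new_ms: f64) -> f64 {
    (baseline_ms / new_ms - 1.0) * 100.0
}
```

For example, `throughput_change_pct(58.230, 11.682)` comes out at roughly +398%, in line with the last report, and `time_change_pct(58.230, 15.189)` at roughly -73.9%, in line with the sendmsg_x-only run.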

djc (Member) left a comment

LGTM.

Please squash all of the changes into a single commit?

mxinden (Contributor) left a comment

Impressive results. Great to see the macOS _x syscalls work for QUIC UDP I/O.

Ralith previously requested changes Sep 23, 2024
Ralith (Collaborator) left a comment

Can we enable real GSO/GRO using these interfaces?

larseggert (Contributor Author) commented

No. They are the equivalent of the mmsg Linux calls. AFAIK Apple doesn't have GSO/GRO via the socket interface.

larseggert (Contributor Author) commented

Are you waiting on anything from me on this?

mxinden (Contributor) left a comment

quinn-udp/benches/throughput.rs will need more changes to keep supporting non-Apple platforms. @larseggert I believe we will need to either run it multi-threaded or use some kind of executor, e.g. tokio. I can prepare a commit in the next couple of days. Sorry for missing this in earlier reviews.

The changes themselves look good to me.

quinn-udp/benches/throughput.rs (outdated)
@@ -27,8 +27,12 @@ pub fn criterion_benchmark(c: &mut Criterion) {
// Reverse non-blocking flag set by `UdpSocketState` to make the test non-racy
recv.set_nonblocking(false).unwrap();

let mut receive_buffer = vec![0; MAX_BUFFER_SIZE];
let mut meta = RecvMeta::default();
let mut receive_buffers = vec![[0; SEGMENT_SIZE]; BATCH_SIZE];
Contributor commented

We set the recv socket to blocking above, to prevent potential race conditions.

    recv.set_nonblocking(false).unwrap();

This is problematic on Linux, given that:

> A blocking recvmmsg() call blocks until vlen messages have been
> received or until the timeout expires. A nonblocking call reads
> as many messages as are available (up to the limit specified by
> vlen) and returns immediately.

https://man7.org/linux/man-pages/man2/recvmmsg.2.html

Given that quinn-udp does not do sendmmsg, send_state.send will never send BATCH_SIZE messages and thus recv_state.recv will never return.
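
To make the difference concrete, here is a small model of the two behaviours quoted from the man page. It uses a `std::sync::mpsc` channel as a stand-in for the socket, so it is only an illustration of the semantics and not the benchmark code or the eventual fix; the function names and the timeout are assumptions made for this sketch.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError, TryRecvError};
use std::time::{Duration, Instant};

/// Nonblocking model: take whatever is already queued, up to `vlen`
/// messages, and return immediately.
fn recv_batch_nonblocking(rx: &Receiver<Vec<u8>>, vlen: usize) -> Vec<Vec<u8>> {
    let mut batch = Vec::with_capacity(vlen);
    while batch.len() < vlen {
        match rx.try_recv() {
            Ok(datagram) => batch.push(datagram),
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
        }
    }
    batch
}

/// Blocking model: keep waiting until `vlen` messages have arrived or the
/// timeout expires. Without a timeout, a short batch would wait forever.
fn recv_batch_blocking(rx: &Receiver<Vec<u8>>, vlen: usize, timeout: Duration) -> Vec<Vec<u8>> {
    let deadline = Instant::now() + timeout;
    let mut batch = Vec::with_capacity(vlen);
    while batch.len() < vlen {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(datagram) => batch.push(datagram),
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    batch
}
```

If the sender queues three messages and the receiver asks for eight, the nonblocking variant returns the three at once, while the blocking variant sits until the timeout. With no timeout at all, that is the hang described above.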

Contributor Author commented

@mxinden do you have a suggestion for a fix?

Contributor commented

@larseggert will be fixed with larseggert#1.

Contributor Author commented

Thanks!

Also, since all the tests are green, I guess quinn is missing a unit test to detect this breakage...

Contributor commented

The pull request referenced above, which addresses this issue, has moved to #2010.

@mxinden mxinden mentioned this pull request Oct 8, 2024
djc (Member) commented Oct 9, 2024

@Ralith can you do another round on this one?

larseggert (Contributor Author) commented Oct 10, 2024

Once @mxinden's fix to the bench is in, I will rebase and squash this PR.

AndrewDryga commented Oct 10, 2024

Hey guys 👋, is there any chance Apple won't approve apps in the App Store that use those private syscalls? They are notorious for rejecting, and even de-listing, apps for using anything "undocumented". See 2.5.1 here: https://developer.apple.com/app-store/review/guidelines/

See one such case here: https://9to5mac.com/2019/11/04/electron-app-rejections/

How will they find out? Apple employs automated tools to scan apps for the usage of private APIs. If sendmsg_x and recvmsg_x are detected, the app is at risk of being flagged.

larseggert (Contributor Author) commented

The use of the private syscalls is now behind a non-default feature.

AndrewDryga commented Oct 10, 2024

@larseggert should we add a big fat warning saying that enabling this flag violates Apple's ToS, so it should only be enabled if the app is not distributed via the App Store (or notarized for the EU)?

larseggert (Contributor Author) commented

I have no opinion here - we don't distribute via the App Store.

Could you make a suggestion on what you think would be good to add?

thomaseizinger commented

> The use of the private syscalls is now behind a non-default feature.

Thank you for making it optional! It can be an issue to toggle code-paths like these with cargo features because they get aggregated across the entire dependency tree. Simply adding another dependency that also happens to depend on quinn-udp can thus activate this silently. (As a rule of thumb, cargo features should only add new APIs, not modify existing ones so this doesn't happen.)

For some prior art, curve25519-dalek had to solve similar challenges: https://github.com/dalek-cryptography/curve25519-dalek/tree/main/curve25519-dalek#backends

It looks like it is already abstracted away pretty well. Could we change this to a "plain" rustc cfg that needs to be set at build time?
I am happy for it to be opt-out. Most apps using this probably aren't in the App Store, and toggling it off is a one-liner in the build process. Ultimately, though, I have no opinion on opt-in vs. opt-out.
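
To sketch what a "plain" rustc cfg could look like (the cfg name `quinn_udp_private_apple_api` and the function below are hypothetical, chosen only for this illustration, and do not reflect what the PR does): the code path is selected by a flag that only the person building the final binary passes to rustc, so another crate's manifest cannot switch it on the way a cargo feature can.

```rust
/// Selected only when the builder passes `--cfg quinn_udp_private_apple_api`
/// to rustc AND the target is an Apple platform.
#[cfg(all(quinn_udp_private_apple_api, target_vendor = "apple"))]
fn datapath() -> &'static str {
    "private sendmsg_x/recvmsg_x"
}

/// Default everywhere else, including Apple targets built without the flag.
#[cfg(not(all(quinn_udp_private_apple_api, target_vendor = "apple")))]
fn datapath() -> &'static str {
    "portable sendmsg/recvmsg"
}
```

In this opt-in variant, a build would enable the private path with something like `RUSTFLAGS="--cfg quinn_udp_private_apple_api" cargo build`. An opt-out variant would invert the condition and use a flag that disables it instead.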


6 participants