Achieving higher throughput #291

Open
bits0rcerer opened this issue Jul 5, 2024 · 3 comments


@bits0rcerer
Contributor

From time to time I play around with io_uring for networking, especially TCP connections.
Today I did a basic comparison of good ol' tokio vs io_uring (using registered buffers, buf_ring, and recv_multishot). My goal was to receive as many bytes per second as possible.

I was pretty sure I wouldn't reach tokio's battle-tested performance in an afternoon, but even so I was surprised that tokio reached ~2 times the throughput.

Tokio: ~110 Gbit/s
io-uring: ~55 Gbit/s

Network: loopback (127.0.0.1)
Kernel: 6.9.6-zen1-1-zen
CPU: AMD Ryzen 9 5900X (24) @ 3.700GHz

I used btop to measure the throughput and #290 for the io_uring buf_ring.

Maybe someone here can point out any obvious skill issues on my side or give some general advice.

iouring_tcp_sink.rs

tokio_tcp_sink.rs

tokio_tcp_src.rs

@DXist

DXist commented Sep 15, 2024

Your io_uring-based program is single-threaded, while for tokio you've enabled the multi-threaded runtime.

What you can do (a rough setup sketch follows this list):

  • organize a shared-nothing approach for request processing:
    • one worker & io_uring per core
    • listen sockets are opened with the SO_REUSEPORT option; the kernel does connection load balancing
    • use fixed file descriptors to remove per-IO reference-counting overhead
    • use fixed buffers to remove per-IO memory-pinning overhead
    • trade energy efficiency for throughput by using io_uring submission queue polling
    • configure the network card, IRQ balancer, and task scheduling to colocate packet routing, IRQ/network-stack handling, and application processing on the same CPU core to minimize cross-CPU traffic.
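
A rough sketch of the shared-nothing layout above, assuming the io-uring and socket2 crates (socket2 needs its `all` feature for `set_reuse_port`); `spawn_workers`, the queue depth, and the backlog are placeholder names and values, and the worker loop is elided:

```rust
use std::net::{SocketAddr, TcpListener};

use io_uring::IoUring;
use socket2::{Domain, Protocol, Socket, Type};

fn spawn_workers(addr: SocketAddr, workers: usize) -> std::io::Result<()> {
    let mut handles = Vec::new();
    for _ in 0..workers {
        // Every worker opens its own listening socket on the same address;
        // with SO_REUSEPORT the kernel load-balances incoming connections
        // across these sockets.
        let socket = Socket::new(Domain::IPV4, Type::STREAM, Some(Protocol::TCP))?;
        socket.set_reuse_port(true)?;
        socket.bind(&addr.into())?;
        socket.listen(1024)?;
        let listener: TcpListener = socket.into();

        handles.push(std::thread::spawn(move || -> std::io::Result<()> {
            // One ring per worker, nothing shared with the other workers.
            let _ring = IoUring::new(256)?;
            let _listener = listener;
            // ... accept connections, register fds/buffers, drive the ring ...
            Ok(())
        }));
    }
    for handle in handles {
        handle.join().expect("worker panicked")?;
    }
    Ok(())
}
```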

@bits0rcerer
Contributor Author

Thank you for taking a look :)

  • one worker & io_uring per core

I thought I already did exactly that by creating an io_uring in every thread (also with single issuer).

  • use fixed buffers to remove per-IO memory-pinning overhead

I thought I already did that with buf_ring.
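
Note that buf_ring (the provided-buffer ring) and registered/fixed buffers are separate mechanisms: registration pins the memory once up front, and `ReadFixed`/`WriteFixed` then refer to it by index. Below is a minimal sketch of registered buffers with the io-uring crate, assuming `libc` as a dependency and `raw_fd`/`buf` as placeholders supplied by the caller:

```rust
use io_uring::{opcode, types, IoUring};
use std::os::fd::RawFd;

/// Registers `buf` as a fixed buffer and issues one ReadFixed into it.
/// In a real program registration happens once at startup, and `buf` must
/// stay alive (and un-moved) for as long as it is registered.
fn read_with_fixed_buffer(ring: &mut IoUring, raw_fd: RawFd, buf: &mut [u8]) -> std::io::Result<()> {
    let iovec = libc::iovec {
        iov_base: buf.as_mut_ptr() as *mut _,
        iov_len: buf.len(),
    };
    // Pins the memory once; later fixed-buffer ops skip the per-IO pinning.
    unsafe { ring.submitter().register_buffers(&[iovec])? };

    // ReadFixed addresses the registered buffer by index (here: 0) instead of
    // handing the kernel a fresh user pointer to pin for every operation.
    let sqe = opcode::ReadFixed::new(types::Fd(raw_fd), buf.as_mut_ptr(), buf.len() as u32, 0)
        .build()
        .user_data(0x1);
    unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    ring.submit_and_wait(1)?;
    Ok(())
}
```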

  • listen sockets are opened with the SO_REUSEPORT option; the kernel does connection load balancing

I'll definitely try that!

  • use fixed file descriptors to remove per-IO reference-counting overhead

Same here
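
A minimal sketch of fixed file descriptors with the io-uring crate, assuming `raw_fd` is an already-accepted socket and `buf_group` is the id of an existing buf_ring; the slot index and `user_data` value are placeholders:

```rust
use io_uring::{opcode, types, IoUring};
use std::os::fd::RawFd;

fn recv_multishot_on_fixed_fd(
    ring: &mut IoUring,
    raw_fd: RawFd,
    buf_group: u16,
) -> std::io::Result<()> {
    // Put the socket into slot 0 of the ring's fixed-file table.
    ring.submitter().register_files(&[raw_fd])?;

    // Address the socket by its fixed slot instead of the raw fd, so the
    // kernel skips the per-operation fd lookup and refcount.
    let sqe = opcode::RecvMulti::new(types::Fixed(0), buf_group)
        .build()
        .user_data(0x2);
    unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    ring.submit()?;
    Ok(())
}
```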

  • trade energy efficiency for throughput by using io_uring submission queue polling

Same here as well
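
A minimal sketch of turning on submission-queue polling via the io-uring crate's builder; the 2000 ms idle timeout and queue depth are arbitrary example values (SQPOLL dedicates a kernel polling thread per ring, hence the energy trade-off):

```rust
use io_uring::IoUring;

fn build_sqpoll_ring() -> std::io::Result<IoUring> {
    IoUring::builder()
        // A kernel thread polls the submission queue and goes idle after
        // 2000 ms without work; while it is awake, submitting needs no syscall.
        .setup_sqpoll(2000)
        .build(256)
}
```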

  • configure the network card, IRQ balancer, and task scheduling to colocate packet routing, IRQ/network-stack handling, and application processing on the same CPU core to minimize cross-CPU traffic.

Do you know if there are specific settings that play well with io_uring, or is that general advice?

@DXist

DXist commented Sep 23, 2024

I like this series of articles:

The first article, about the receive side, describes multi-queue NIC configuration for Receive Side Scaling.
An option to consider is to load-balance connection flows using the NIC and reserve a separate NIC queue + listening socket in each thread. This setup is similar to ScyllaDB's.
