This change demonstrates how quinn could make use of Windows
Registered IO (RIO), or other kinds of completion-based IO.
Upfront note: the change is incomplete, buggy, and will leak all memory.
Don't even think about using it as-is. This is just a quick hack to
explore integration options and achievable performance.
Integrating with registered IO requires the following changes to get
things working:
1. The endpoint now runs on its own dedicated IO thread instead of
running on the shared tokio runtime. This allows it to use any
platform-specific IO primitives it requires. In this case we are using
RIO and a custom eventloop which waits for IO readiness via a Windows
ManualResetEvent. Waiting via IOCP, or letting the thread busy-spin,
would also be possible.
The actual loop, implemented in `EndpointDriver::run`, is not that
different from the existing `EndpointDriver::poll` method.
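The shape of that loop can be sketched portably. The sketch below uses
a `Condvar`-based event as a stand-in for the Windows ManualResetEvent,
and the `run` function only hints at what `EndpointDriver::run` does;
the actual completion handling is elided:

```rust
use std::sync::{Condvar, Mutex};

// Stand-in for the Windows ManualResetEvent that RIO signals
// whenever new completions are available.
struct IoEvent {
    signaled: Mutex<bool>,
    cond: Condvar,
}

impl IoEvent {
    fn new() -> Self {
        IoEvent { signaled: Mutex::new(false), cond: Condvar::new() }
    }

    // Signal the event, waking the IO thread if it is blocked.
    fn set(&self) {
        *self.signaled.lock().unwrap() = true;
        self.cond.notify_one();
    }

    // Block until the event is signaled, then reset it.
    fn wait_and_reset(&self) {
        let mut signaled = self.signaled.lock().unwrap();
        while !*signaled {
            signaled = self.cond.wait(signaled).unwrap();
        }
        *signaled = false;
    }
}

// Hypothetical shape of the dedicated driver loop: wait for the
// event, then process completions. `iterations` bounds the loop
// here so the sketch terminates; the real loop runs until the
// endpoint shuts down. Returns the number of wakeups handled.
fn run(event: &IoEvent, mut iterations: u32) -> u32 {
    let mut handled = 0;
    while iterations > 0 {
        event.wait_and_reset();
        // Here: dequeue RIO completions, feed received datagrams to
        // the endpoint, schedule new receives and pending transmits.
        handled += 1;
        iterations -= 1;
    }
    handled
}
```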
2. Reading and writing data with RIO is submission/completion
oriented, and requires buffers to be registered with kernel space for
the complete lifetime of the socket. To accommodate these requirements,
a buffer pool is allocated when the socket and endpoint are created and
reused throughout the lifetime of the endpoint. The endpoint makes sure
the maximum possible number of concurrent receive operations is
scheduled. Transmit operations are scheduled whenever there is data to
transmit and TX buffers are available.
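A much simplified sketch of that pooling scheme is shown below. Kernel
registration (`RIORegisterBuffer`) is omitted; only the up-front
allocation and the slot recycling are illustrated:

```rust
// Fixed buffer pool: one contiguous allocation whose slices would be
// registered with the kernel once and reused for the socket's
// lifetime. Slots are handed out for pending operations and returned
// once their completion is observed.
struct BufferPool {
    storage: Vec<u8>,
    slot_size: usize,
    free: Vec<usize>, // indices of currently available slots
}

impl BufferPool {
    fn new(slots: usize, slot_size: usize) -> Self {
        BufferPool {
            storage: vec![0u8; slots * slot_size],
            slot_size,
            free: (0..slots).collect(),
        }
    }

    // Take a slot for a receive or transmit operation, if any is free.
    fn acquire(&mut self) -> Option<usize> {
        self.free.pop()
    }

    // Return a slot after its completion has been processed.
    fn release(&mut self, slot: usize) {
        self.free.push(slot);
    }

    // Access the memory backing a slot.
    fn slice_mut(&mut self, slot: usize) -> &mut [u8] {
        let start = slot * self.slot_size;
        &mut self.storage[start..start + self.slot_size]
    }
}
```

With a pool like this, "schedule as many receives as possible" simply
means acquiring slots until `acquire` returns `None`.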
3. Since the rest of quinn is not aware of pinned buffers, and requires
`Vec` for outgoing datagrams and `BytesMut` to decode incoming
datagrams, all datagrams are copied once between the IO buffers and
those higher-level buffers. This could theoretically be optimized.
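The two copies can be illustrated with std types only; `Vec<u8>` stands
in for `BytesMut` on the receive side:

```rust
// RX path: the datagram arrives in a pinned, registered slot and is
// copied into an owned buffer before being handed up to quinn,
// because the slot must stay registered and be reused.
fn copy_rx(slot: &[u8], datagram_len: usize) -> Vec<u8> {
    slot[..datagram_len].to_vec()
}

// TX path: quinn produces an owned buffer; its bytes are copied into
// a free registered slot before the send is submitted. Returns the
// number of bytes placed in the slot.
fn copy_tx(datagram: &[u8], slot: &mut [u8]) -> usize {
    let len = datagram.len().min(slot.len());
    slot[..len].copy_from_slice(&datagram[..len]);
    len
}
```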
4. Since the endpoint is no longer an async task, it can no longer
receive instructions from connections through an async channel. This
change adds a custom channel implementation for that purpose, which
consists of a trivial synchronized queue and a wakeup of the endpoint
eventloop.
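A minimal sketch of such a channel follows, with the wakeup abstracted
as a callback; in the real change it would signal the event the
endpoint eventloop waits on:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Connection-to-endpoint channel: a mutex-guarded queue plus a
// wakeup hook that nudges the endpoint's dedicated IO thread.
struct EndpointChannel<T> {
    queue: Mutex<VecDeque<T>>,
    wake: Box<dyn Fn() + Send + Sync>,
}

impl<T> EndpointChannel<T> {
    fn new(wake: impl Fn() + Send + Sync + 'static) -> Self {
        EndpointChannel {
            queue: Mutex::new(VecDeque::new()),
            wake: Box::new(wake),
        }
    }

    // Called from connection tasks running on the tokio runtime.
    fn send(&self, msg: T) {
        self.queue.lock().unwrap().push_back(msg);
        (self.wake)(); // wake the endpoint eventloop
    }

    // Called from the endpoint eventloop after it was woken.
    fn drain(&self) -> Vec<T> {
        self.queue.lock().unwrap().drain(..).collect()
    }
}
```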
5. The endpoint can't use `tokio::spawn` anymore to spawn new
connections, since it is no longer running inside a tokio context.
Therefore a runtime handle needs to be explicitly propagated.
6. The socket needs to be created with the `WSA_FLAG_REGISTERED_IO`
flag. Therefore UDP sockets created via `std::net::UdpSocket`
unfortunately can't be trivially forwarded. It is debatable whether
this means the quinn library should be responsible for creating all
sockets, or whether it should still accept external sockets but
explicitly require that they have been configured with all the
necessary flags.
Most of the points outlined here would also be required to support
io_uring or AF_XDP with buffers pre-registered with the kernel, or
plain `sendmsg`/`sendmmsg` using `MSG_ZEROCOPY`, which has similar
requirements.
Performance with this approach varies. The benchmarks indicate a
throughput somewhere between 180 MB/s and 330 MB/s. Once a benchmark
run has started, it consistently reports either the low or the high
value. Some comments in the msquic repository indicate that this might
be due to RSS. Perhaps something can be improved here by making sure
the endpoint IO thread runs on the ideal core.
A follow-up POC, which could be built but isn't part of this demo,
would be to also move the `Connection` handling onto the new dedicated
IO thread.