Worker threads idling/stalling when using IOSQE_ASYNC for network I/O #426
Comments
Hi johalun,
This is with 5.14 and the latest patches from #420. Yes, this is only for socket reads/writes (possibly connect/close also). Disk I/O is already offloaded to worker threads, so there is no need for IOSQE_ASYNC there, and disk I/O performs well.
Did the number of worker threads exceed the registered limit with/without IOSQE_ASYNC now?
Btw @axboe, have you started working on the IOSQE_ASYNC_HYBRID stuff? If not, I'd like to give it a shot.
@HowHsu By all means go ahead, I'm knee deep in a bunch of other things. Thanks!
Those fixes are in mainline.
Hi @johalun, you can try the patches below; set the new flag on the sqe from userspace with sqe->flags |= (1 << 6);

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 8317c360f7a4..9e84d273c4ec 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -104,7 +104,8 @@
#define IORING_MAX_REG_BUFFERS (1U << 14)
#define SQE_COMMON_FLAGS (IOSQE_FIXED_FILE | IOSQE_IO_LINK | \
- IOSQE_IO_HARDLINK | IOSQE_ASYNC)
+ IOSQE_IO_HARDLINK | IOSQE_ASYNC | \
+ IOSQE_ASYNC_HYBRID)
#define SQE_VALID_FLAGS (SQE_COMMON_FLAGS|IOSQE_BUFFER_SELECT|IOSQE_IO_DRAIN)
@@ -725,6 +726,7 @@ enum {
REQ_F_HARDLINK_BIT = IOSQE_IO_HARDLINK_BIT,
REQ_F_FORCE_ASYNC_BIT = IOSQE_ASYNC_BIT,
REQ_F_BUFFER_SELECT_BIT = IOSQE_BUFFER_SELECT_BIT,
+ REQ_F_ASYNC_HYBRID_BIT = IOSQE_ASYNC_HYBRID_BIT,
/* first byte is taken by user flags, shift it to not overlap */
REQ_F_FAIL_BIT = 8,
@@ -762,6 +764,8 @@ enum {
REQ_F_FORCE_ASYNC = BIT(REQ_F_FORCE_ASYNC_BIT),
/* IOSQE_BUFFER_SELECT */
REQ_F_BUFFER_SELECT = BIT(REQ_F_BUFFER_SELECT_BIT),
+ /* IOSQE_ASYNC_HYBRID */
+ REQ_F_ASYNC_HYBRID = BIT(REQ_F_ASYNC_HYBRID_BIT),
/* fail rest of links */
REQ_F_FAIL = BIT(REQ_F_FAIL_BIT),
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index b270a07b285e..fac037dbcf31 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -70,6 +70,7 @@ enum {
IOSQE_IO_HARDLINK_BIT,
IOSQE_ASYNC_BIT,
IOSQE_BUFFER_SELECT_BIT,
+ IOSQE_ASYNC_HYBRID_BIT,
};
/*
@@ -87,7 +88,8 @@ enum {
#define IOSQE_ASYNC (1U << IOSQE_ASYNC_BIT)
/* select buffer from sqe->buf_group */
#define IOSQE_BUFFER_SELECT (1U << IOSQE_BUFFER_SELECT_BIT)
-
+/* first force async then arm poll */
+#define IOSQE_ASYNC_HYBRID (1U << IOSQE_ASYNC_HYBRID_BIT)
/*
* io_uring_setup() flags
*/
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9e84d273c4ec..089583cf73f7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5635,7 +5635,13 @@ static int io_arm_poll_handler(struct io_kiocb *req)
req->apoll = apoll;
req->flags |= REQ_F_POLLED;
ipt.pt._qproc = io_async_queue_proc;
- io_req_set_refcount(req);
+ /*
+ * REQ_F_REFCOUNT being set indicates we are in io-worker context, where
+ * the submission and completion refs have already been set explicitly. So
+ * there is no need to set the refcount here if that is the case.
+ */
+ if (!(req->flags & REQ_F_REFCOUNT))
+ io_req_set_refcount(req);
ret = __io_arm_poll_handler(req, &apoll->poll, &ipt, mask,
io_async_wake);
@@ -6795,8 +6801,11 @@ static void io_wq_submit_work(struct io_wq_work *work)
ret = -ECANCELED;
if (!ret) {
+ bool need_poll = req->flags & REQ_F_ASYNC_HYBRID;
+
do {
- ret = io_issue_sqe(req, 0);
+issue_sqe:
+ ret = io_issue_sqe(req, need_poll ? IO_URING_F_NONBLOCK : 0);
/*
* We can get EAGAIN for polled IO even though we're
* forcing a sync submission from here, since we can't
@@ -6804,6 +6813,30 @@ static void io_wq_submit_work(struct io_wq_work *work)
*/
if (ret != -EAGAIN)
break;
+ if (need_poll) {
+ bool armed = false;
+
+ ret = 0;
+ need_poll = false;
+
+ switch (io_arm_poll_handler(req)) {
+ case IO_APOLL_READY:
+ goto issue_sqe;
+ case IO_APOLL_ABORTED:
+ /*
+ * Somehow we failed to arm the poll infra;
+ * fall back to a normal async worker attempt.
+ */
+ break;
+ case IO_APOLL_OK:
+ armed = true;
+ break;
+ }
+
+ if (armed)
+ break;
+
+ }
cond_resched();
} while (1);
}
@@ -7003,7 +7036,8 @@ static inline void io_queue_sqe(struct io_kiocb *req)
if (unlikely(req->ctx->drain_active) && io_drain_req(req))
return;
- if (likely(!(req->flags & (REQ_F_FORCE_ASYNC | REQ_F_FAIL)))) {
+ if (likely(!(req->flags & (REQ_F_FORCE_ASYNC | REQ_F_ASYNC_HYBRID |
+ REQ_F_FAIL)))) {
__io_queue_sqe(req);
} else if (req->flags & REQ_F_FAIL) {
io_req_complete_fail_submit(req);
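(For illustration, a minimal userspace sketch of how the flag from the patch above could be set with liburing. This is a sketch only: liburing has no named constant for the new flag here, so the raw bit (1U << 6), matching IOSQE_ASYNC_HYBRID_BIT in the diff, is used directly; the helper name and parameters are illustrative.)

#include <liburing.h>

/* Sketch: queue a socket recv and request the hybrid async path by
 * setting bit 6 on the sqe, per the patch above. */
static int queue_hybrid_recv(struct io_uring *ring, int sockfd,
                             void *buf, size_t len)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    if (!sqe)
        return -1;
    io_uring_prep_recv(sqe, sockfd, buf, len, 0);
    sqe->flags |= (1U << 6);    /* IOSQE_ASYNC_HYBRID per the patch */
    return io_uring_submit(ring);
}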
@HowHsu Thanks for the fast work! Unfortunately, the behavior with hybrid is the same as with the normal IOSQE_ASYNC.
OK, I narrowed down the issue a bit more. If I only do reads and writes from/to files and sockets, performance is good with IOSQE_ASYNC_HYBRID enabled for network reads and writes. All reads and writes are offloaded to worker threads and the main thread spends very little time in io_uring_submit. It could be that it is also good with IOSQE_ASYNC (not tested yet). The issue starts when I use sendfile to send cached objects from disk to a client TCP socket. My sendfile is implemented using splice; you can see roughly how I did it here: https://github.com/johalun/uring-sendfile/blob/main/server.c#L85-L161
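(For context, a rough sketch of that splice-based sendfile approach, simplified from the linked server.c. The helper name, the pre-created pipe, and the parameters are illustrative, and splice_flags are left at 0.)

#include <liburing.h>

/* Sketch: sendfile via io_uring splice, file -> pipe -> socket,
 * using two linked splice SQEs. */
static int queue_sendfile(struct io_uring *ring, int filefd, int64_t off,
                          int pipefds[2], int sockfd, unsigned int len)
{
    struct io_uring_sqe *sqe;

    /* file -> pipe write end */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_splice(sqe, filefd, off, pipefds[1], -1, len, 0);
    sqe->flags |= IOSQE_IO_LINK;

    /* pipe read end -> socket */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_splice(sqe, pipefds[0], -1, sockfd, -1, len, 0);

    return io_uring_submit(ring);
}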
@johalun That seems to be a good clue. Pondering how best to handle that, as you're dealing with both types of files there. For your particular case, the socket write is probably trivial and can generally proceed, which means it should end up in the bounded queue. Does it make any difference if you mark the splice
Yep, that seems to be the case. Increasing unbounded workers improved performance a lot! This is with the latest patch for hybrid async, which is enabled for all I/O except splice. IOSQE_ASYNC, IOSQE_ASYNC_HYBRID or no flag makes no difference for splice. How does the sendfile(2) implementation handle this? Does it spawn a new thread for each splice call? It seems that, at least with uring right now, we need one unbound worker per concurrent splice if we don't want them waiting on each other.
On my server it seems 512 unbound workers is enough to reach max throughput while serving files from disk over HTTP using IORING_OP_SPLICE. I find it weird, though, that I can reach almost the same performance just doing reads/writes with uring with just a few worker threads. Is this just how splice is implemented, or can it be called asynchronously (within the kernel) without the worker thread having to block?
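(Side note, as a hedged sketch: if liburing's io_uring_register_iowq_max_workers() wrapper for IORING_REGISTER_IOWQ_MAX_WORKERS is available, the unbounded worker limit mentioned above can be raised per ring; values[0] is the bounded pool, values[1] the unbounded one, and 0 leaves a limit unchanged.)

#include <liburing.h>

/* Sketch: raise the unbounded io-wq worker limit to 512 for this ring,
 * leaving the bounded limit as-is. */
static int raise_unbound_workers(struct io_uring *ring)
{
    unsigned int values[2] = { 0, 512 };  /* [0] = bounded, [1] = unbounded */

    return io_uring_register_iowq_max_workers(ring, values);
}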
I'm seeing stuck CPUs with ASYNC_HYBRID
Hmm, after a clean rebuild of my app it seems to work. Could have been just a temporary weirdness with the macro having the wrong value or something. I'll run some more tests.
So for networked reads, will IOSQE_ASYNC_HYBRID always deliver better performance? Might it make sense to default to it inside the kernel then, unless the user requests sync or regular async? Though why would they ever? Maybe it makes sense to disable choices that are guaranteed to lose performance for certain ops, so users can't shoot themselves in the foot? And for direct I/O reads, is IOSQE_FORCE_ASYNC likewise always the best choice? The same logic as above might apply: why give users options that are known to be worse?
Hi Johannes, I'm just back from holiday, sorry for the delay. Hmm, could you try the latest version of async hybrid on top of |
IMHO, no. It really depends on how many CPU resources users want to use. Suppose we are doing socket reading work, each work is
Similar answer for this one: if you have scarce CPU resources, doing it in a force-async way under high IO pressure results in frequent
https://lore.kernel.org/io-uring/[email protected]/T/#u |
Yep this seems stable so far. Consistent performance and no hung cpus 👍🏻 |
As far as I know, sendfile() doesn't create any new thread to do its job; it does all the work in the original context. The io_uring
In my application I'm doing a lot of mixed disk and network I/O. Without using SQPOLL or IOSQE_ASYNC, the performance is pretty good, but profiling shows that the main thread spends a lot of time in io_write/io_read for the network I/O (calling into the TCP stack).
I wish to improve performance by also offloading network I/O to worker threads using IOSQE_ASYNC. It works in so far as the io_read and io_write are indeed moved to a worker thread; however, the performance is terrible. Even during full load, which would normally cause the main thread and worker threads to use significant CPU, the CPU usage is almost nothing at all. I don't know exactly what is going on, but it's like the worker threads are sleeping and not getting notified that there's work for them to do. The effect is most noticeable on socket reads, compared to for example write, connect or close.
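(For illustration, a minimal sketch of the kind of submission described above, with IOSQE_ASYNC forcing the socket read straight to an io-wq worker; the helper name and parameters are placeholders.)

#include <liburing.h>

/* Sketch: queue a socket read and force it to an io-wq worker with
 * IOSQE_ASYNC instead of attempting it inline first. */
static int queue_async_read(struct io_uring *ring, int sockfd,
                            void *buf, unsigned int len)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    if (!sqe)
        return -1;
    io_uring_prep_read(sqe, sockfd, buf, len, 0);
    io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
    return io_uring_submit(ring);
}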