io_uring bandwidth with cached file #472
What kernel are you using?

5.13.0-20 (also tested on 5.11.*, ~same result)
My guess here would be that your psync setup ends up parallelizing the memory copy of the fully cached file across 8 threads, which is going to be faster than using a single ring where you essentially end up doing the memory copy inline from submit. It boils down to a memory copy benchmark, and one setup has 8 threads while the other has 1... Hence I don't think you're doing anything wrong as such, the test just isn't very meaningful.
Clarifying the question: why is the io_uring bandwidth so much lower than psync here?
Just ran a similar test here, changing the io_uring case above to be 8 threads of 1G each like the psync case, which shows about the same result; the runtime is short enough that there's a bit of variance between runs (+/- 1GB/sec either side). Group 0 is psync here, group 1 is io_uring. For apples-to-apples, I'm using iodepth=1 for the io_uring case as well. It does appear to be substantially slower to use higher queue depths for this. I didn't look into that yet; my guess would be that we're just spending extra time filling memory entries pointlessly for that.
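The jobfile for that run isn't quoted in the thread; a minimal sketch of the two-group layout described (group 0 psync, group 1 io_uring at iodepth=1, 8 threads of 1G each) might look like the following, where the filename, blocksize, and the invalidate setting are assumptions:

```ini
[global]
filename=./data1/file8
rw=read
bs=4k            ; assumed blocksize
size=1g          ; 1G per job
numjobs=8        ; 8 threads, per the comment above
thread
invalidate=0     ; keep the file in page cache between runs (assumed)
group_reporting

[psync]          ; group 0
ioengine=psync

[uring]          ; group 1
stonewall        ; run after the psync group completes
ioengine=io_uring
iodepth=1        ; apples-to-apples with psync
```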
I am trying to replace a threadpool of workers (they do just reads) with io_uring. Another guess: does a single-core application need to have a thread pool of workers anyway?
I'll check in the morning, it's late here. Fio doesn't do proper batching either, which might be a concern. In general, you should not need a thread pool: you can mark requests as going async with IOSQE_ASYNC, and there's also logic to cap the max pending async thread count.
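For reference, marking a request async with liburing looks roughly like the sketch below. It's a minimal, hypothetical example (file name, buffer size, and the bare-bones error handling are assumptions); IOSQE_ASYNC tells the kernel to punt the request to its internal async workers rather than attempt it inline first:

```c
/*
 * Minimal, hypothetical sketch of forcing a read to the kernel's async
 * workers with IOSQE_ASYNC, instead of keeping a userspace thread pool.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    static char buf[4096];

    int fd = open("./data1/file8", O_RDONLY);
    if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    /* Punt straight to async workers; the kernel itself caps how many
     * pending async threads it will spin up. */
    io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);

    io_uring_submit(&ring);
    if (io_uring_wait_cqe(&ring, &cqe) == 0) {
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```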
One thing that is interesting here is that if I run with iodepth=1, I get about ~7GB/sec of bandwidth from one thread, but when I run with iodepth=128, I get only 3GB/sec. Looking at profiles, the fast case spends ~13% of the time doing memory copies, and the slow case ~55%. That doesn't make a lot of sense! The higher queue depth case should spend the same time doing copies, just reaping the benefits of the batched submits. The theory here is that the total memory range used is one page for the qd=1 case, and 128 pages for the qd=128 case (with typical 4KiB pages, that's 512KiB of buffer footprint). That just falls out of cache. It's simply an artifact of the CPU, not really an io_uring thing. If I hacked fio to use the same buffer for all requests, I bet the qd=128 case would be faster than the qd=1 case. Anyway, that's the theory. I'll dig into this and see what I can find.
Thank you. I tried …
shmhuge really helps alleviate pressure, but I think what we really need here is the ring sqe/cqe maps being in a huge page... That'll likely be a nice win overall too. Looking into it.
Ran the "always copy to the same page" case for QD=128, and it didn't change anything. Puzzled, maybe this is TLB pressure? So I added iomem=shmhuge to use a huge page as backing for the job, and now the QD=128 job runs at ~10GB/sec and the QD=1 job at ~7.5GB/sec. That's a lot more in line with what I'd expect. We're saving some time by being able to do a bunch of IOs in the same syscall, and that just yields more time to run the copy and hence higher performance.
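For reference, iomem (an alias for fio's mem option) is a stock fio knob; a sketch of the io_uring job section with the huge-page backing added, other options assumed as in the earlier sketch:

```ini
[uring]
ioengine=io_uring
iodepth=128
iomem=shmhuge    ; back I/O buffers with shared-memory huge pages
```

Note that shmhuge generally requires huge pages to be preallocated on the system first (e.g. via vm.nr_hugepages).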
I've added kernel support for using a single huge page for the rings, that should cut down on TLB pressure which I think is what is killing us in this test. I'll re-run tests on Monday with that. liburing support also exists in the 'huge' branch. Note that both of these are pretty experimental, I literally just started on the kernel side yesterday late afternoon and did the liburing changes this morning.
Can you try with …?
No changes at all, with threaded/no-thread io_uring and psync.
I have this fio jobfile (with a file (./data1/file8) created beforehand by fio). io_uring shows high latency (as expected), but the bandwidth is much less than the bandwidth of the psync method (threadpool of workers doing reads). For example, for my machine (disk util = 0%): … Increasing the number of threads in the io_uring section helps to reach about 80% of psync performance. What am I doing wrong?
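The jobfile itself isn't preserved here. Per the description, it's essentially the two-group sketch shown earlier after the apples-to-apples run, except the io_uring group runs single-threaded at a high queue depth; only that group is sketched below, with the iodepth value being an assumption:

```ini
[uring]
stonewall
ioengine=io_uring
numjobs=1        ; single thread, vs. the 8-thread psync group
iodepth=128      ; high queue depth, hence the high per-IO latency noted
```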