io_uring bandwidth with cached file #472

Open · Tindarid opened this issue Nov 5, 2021 · 14 comments

Tindarid (Contributor) commented Nov 5, 2021

I have the following fio job file (the file ./data1/file8 was created beforehand by fio).

[global]
filename=./data1/file8 ; 8G file
rw=read
invalidate=0
thread
offset=0
size=100%

[init_cache]
ioengine=sync

[psync]
wait_for_previous
group_reporting
ioengine=psync
numjobs=8
offset_increment=1g
io_size=1g

[uring]
wait_for_previous
group_reporting
ioengine=io_uring
numjobs=1
fixedbufs
iodepth=128

io_uring shows high latency (as expected), but its bandwidth is much lower than that of the psync method (a thread pool of workers doing reads). For example, on my machine (disk util = 0%):

Run status group 2 (all jobs): # psync
   READ: bw=11.7GiB/s (12.6GB/s), 11.7GiB/s-11.7GiB/s (12.6GB/s-12.6GB/s), io=8192MiB (8590MB), run=684-684msec

Run status group 3 (all jobs): # uring
   READ: bw=2904MiB/s (3045MB/s), 2904MiB/s-2904MiB/s (3045MB/s-3045MB/s), io=8192MiB (8590MB), run=2821-2821msec

Increasing the number of threads in the io_uring section helps it reach about 80% of psync performance.
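A sketch of that multi-threaded io_uring variant (assuming the same per-job split as the psync section; the values are illustrative, not copied verbatim from my runs):

[uring]
wait_for_previous
group_reporting
ioengine=io_uring
numjobs=8
offset_increment=1g
io_size=1g
fixedbufs
iodepth=128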

What am I doing wrong?

axboe (Owner) commented Nov 5, 2021

What kernel are you using?

Tindarid (Contributor, Author) commented Nov 5, 2021

> What kernel are you using?

5.13.0-20 (also tested on 5.11.*, ~same result)

axboe (Owner) commented Nov 5, 2021

My guess here would be that your psync case ends up parallelizing the mem copy of the fully cached file between 8 threads, which is going to be faster than using a single ring where you essentially end up doing the memory copy inline from submit. It boils down to a memory copy benchmark, and one setup has 8 threads while the other has 1... Hence I don't think you're doing anything wrong as such; the test just isn't very meaningful.

Tindarid (Contributor, Author) commented Nov 5, 2021

> parallelism

Run status group 1 (all jobs): # psync 1 thread
   READ: bw=3195MiB/s (3350MB/s), 3195MiB/s-3195MiB/s (3350MB/s-3350MB/s), io=8192MiB (8590MB), run=2564-2564msec

Run status group 2 (all jobs): # psync 2 threads
   READ: bw=6682MiB/s (7006MB/s), 6682MiB/s-6682MiB/s (7006MB/s-7006MB/s), io=8192MiB (8590MB), run=1226-1226msec

Run status group 3 (all jobs): # psync 4 threads
   READ: bw=11.5GiB/s (12.4GB/s), 11.5GiB/s-11.5GiB/s (12.4GB/s-12.4GB/s), io=8192MiB (8590MB), run=693-693msec

Run status group 4 (all jobs): # psync 8 threads
   READ: bw=12.0GiB/s (12.9GB/s), 12.0GiB/s-12.0GiB/s (12.9GB/s-12.9GB/s), io=8192MiB (8590MB), run=668-668msec

Run status group 5 (all jobs): # uring 1 thread
   READ: bw=3035MiB/s (3183MB/s), 3035MiB/s-3035MiB/s (3183MB/s-3183MB/s), io=8192MiB (8590MB), run=2699-2699msec

Run status group 6 (all jobs): # uring 2 thread
   READ: bw=5104MiB/s (5352MB/s), 5104MiB/s-5104MiB/s (5352MB/s-5352MB/s), io=8192MiB (8590MB), run=1605-1605msec

Run status group 7 (all jobs): # uring 4 thread
   READ: bw=7256MiB/s (7608MB/s), 7256MiB/s-7256MiB/s (7608MB/s-7608MB/s), io=8192MiB (8590MB), run=1129-1129msec

Run status group 8 (all jobs): # uring 8 thread
   READ: bw=6445MiB/s (6758MB/s), 6445MiB/s-6445MiB/s (6758MB/s-6758MB/s), io=8192MiB (8590MB), run=1271-1271msec

Clarifying the question: why does psync scale better than uring in this case?

axboe (Owner) commented Nov 5, 2021

Just ran a similar test here, changing the io_uring case above to use 8 threads of 1G each, like the psync case:

Run status group 0 (all jobs):
   READ: bw=29.3GiB/s (31.5GB/s), 29.3GiB/s-29.3GiB/s (31.5GB/s-31.5GB/s), io=8192MiB (8590MB), run=273-273msec

Run status group 1 (all jobs):
   READ: bw=29.7GiB/s (31.9GB/s), 29.7GiB/s-29.7GiB/s (31.9GB/s-31.9GB/s), io=8192MiB (8590MB), run=269-269msec

which shows about the same result; the runtime is short enough that there's a bit of variance between runs (+/- 1GB/sec either side). Group 0 is psync here, group 1 is io_uring. For an apples-to-apples comparison, this uses iodepth=1 for the io_uring case as well. It does appear to be substantially slower to use higher queue depths for this. I didn't look into that yet; my guess would be that we're just spending extra time filling memory entries pointlessly for that.
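Roughly, the [uring] section changes for that run would be (values assumed from the description above, not a copy of the actual job file):

numjobs=8
offset_increment=1g
io_size=1g
iodepth=1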

Tindarid (Contributor, Author) commented Nov 5, 2021

> the test just isn't very meaningful

I am trying to replace a thread pool of workers (which only do reads) with io_uring in a database application. The old solution doesn't use O_DIRECT and has double buffering. Benchmarks on real data show that the io_uring solution loses (and that I am doing something wrong). So my investigation ended up with this test.

Another guess: does a single-core application need a thread pool of uring instances to compete with the old solution (based on, for example, POSIX AIO)?

axboe (Owner) commented Nov 5, 2021

I'll check in the morning; it's late here. Fio doesn't do proper batching either, which might be a concern. In general, you should not need a thread pool: you can mark requests as going async with IOSQE_ASYNC, and there's also logic to cap the max pending async thread count.
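For experimenting with that from fio itself, the io_uring engine's force_async option can be used to request IOSQE_ASYNC (assuming force_async=N marks every Nth request async, as documented); a sketch:

[uring]
ioengine=io_uring
iodepth=128
force_async=1 ; ask for every request to be submitted with IOSQE_ASYNC (illustrative value)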

axboe (Owner) commented Nov 5, 2021

One thing that is interesting here is that if I run with iodepth=1, I get about 7GB/sec of bandwidth from one thread, but when I run with iodepth=128, I get only 3GB/sec. Looking at profiles, the fast case spends ~13% of the time doing memory copies, and the slow case ~55%. That doesn't make a lot of sense! The higher queue depth case should spend the same time doing copies, just reaping the benefits of the batched submits.

The theory here is that the total memory range used is one page for the qd=1 case, and it's 128 pages for the qd=128 case. That just falls out of cache. That's simply an artifact of the CPU, it's not really an io_uring thing. If I hacked fio to use the same buffer for all requests, I bet the 128 case would be faster than the qd=1 case.

Anyway, that's the theory. I'll dig into this and see what I can find.

Tindarid (Contributor, Author) commented Nov 5, 2021

Thank you.

I tried nowait and force_async and played with iodepth, but bandwidth only degrades (in this configuration). Maybe it really is a processor cache issue (but I haven't managed to find the best parameters for it: with iodepth=1 I get < 1 GB/s, with iodepth=128 I get 3 GB/s).

axboe (Owner) commented Nov 5, 2021

shmhuge really helps alleviate the pressure, but I think what we really need here is for the ring sqe/cqe maps to be in a huge page... That'll likely be a nice win overall too. Looking into it.

axboe (Owner) commented Nov 5, 2021

Ran the "always copy to the same page" case for QD=128, and it didn't change anything. Puzzled, maybe this is TLB pressure? So I added iomem=shmhuge to use a huge page as backing for the job, and now the QD=128 job runs at ~10GB/sec and the QD=1 job at ~7.5GB/sec. That's a lot more in line with what I'd expect. We're saving some time by being able to do a bunch of IOs in the same syscall, and that just yields more time to run the copy and hence higher performance.
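In job-file terms that is a single extra line in the io_uring section; a sketch (shmhuge needs huge pages configured on the system, e.g. via the vm.nr_hugepages sysctl):

[uring]
ioengine=io_uring
iodepth=128
iomem=shmhuge ; back the I/O buffers with shared-memory huge pages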

axboe (Owner) commented Nov 6, 2021

I've added kernel support for using a single huge page for the rings; that should cut down on the TLB pressure which I think is what is killing us in this test. I'll re-run tests on Monday with that. liburing support also exists in the 'huge' branch. Note that both of these are pretty experimental; I literally just started on the kernel side yesterday late afternoon and did the liburing changes this morning.

axboe (Owner) commented Nov 8, 2021

Can you try with iomem=shmhuge added to your fio job file? Curious what kind of difference you'd see with it.

Tindarid (Contributor, Author) commented Nov 9, 2021

> Can you try with iomem=shmhuge added to your fio job file? Curious what kind of difference you'd see with it.

No change at all, with threaded and non-threaded io_uring, and with psync.
