
io_submit blocks the reactor when device's request queue fills up #70

Open
tgrabiec opened this issue Oct 21, 2015 · 7 comments

Comments

@tgrabiec
Contributor

On systems with slow disks, it's possible for the block device queue to fill up, in which case io_submit will block inside get_request, blocking the reactor thread. This manifests itself as high iowait time.

This can be remedied by increasing /sys/block/$DEV/queue/nr_requests to match the concurrency level, but that has the downside of increasing request latency without improving disk utilization. It would be better to avoid overflowing the queue at the seastar level by applying back-pressure.
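
A minimal sketch of that back-pressure idea (not Seastar's actual implementation): cap the number of in-flight requests per device with a seastar::semaphore sized below nr_requests, so io_submit never has to sleep in get_request. The cap value and function name here are hypothetical.

```cpp
#include <seastar/core/file.hh>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

// Hypothetical cap: stay safely under the device's nr_requests (often 63 by
// default) so the kernel queue can never fill up.
constexpr size_t max_in_flight = 56;
static seastar::semaphore io_slots{max_in_flight};

// buf must satisfy the usual DMA alignment requirements.
seastar::future<size_t> read_with_backpressure(seastar::file f, uint64_t pos,
                                               char* buf, size_t len) {
    // Wait for a free slot instead of overflowing the kernel queue; the
    // slot is released automatically when the I/O completes.
    return seastar::with_semaphore(io_slots, 1, [f, pos, buf, len] () mutable {
        return f.dma_read(pos, buf, len);
    });
}
```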

@tgrabiec
Contributor Author

tgrabiec commented Apr 6, 2016

@glommer Is this issue fixed completely by ioqueues?

@glommer
Contributor

glommer commented Apr 6, 2016

Theoretically yes, since we now control how many requests are in flight. It can probably still happen if we run with a high iodepth, but I don't think this case is worth fixing.

@travisdowns
Contributor

Is this issue fixed completely by ioqueues?

Pretty sure the answer is "no". At least in the case where the disk is performing worse than the io-properties suggest (which seems relatively common, at least in short bursts, for both local and network-attached disks), the IO scheduler will still let many requests into the disk; concurrency can grow high enough to exceed nr_requests, causing the reactor to block.

I'm not sure if this was better before the changes in #1766, since in principle the feedback link responded very quickly before, whereas now it takes a while to respond, and by that time you have already hit nr_requests.

Since we know that hitting nr_requests is a death sentence for the reactor, maybe we should have another hard cap just below that value, so the IO scheduler never exceeds it. That sounds a lot like the old two-bucket system (though it would work in units of "requests" rather than the cost units the rest of the scheduler deals in); I don't know whether it could be done more simply than that. One possible shape of such a cap is sketched below.
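
To make that concrete, here is a rough sketch of deriving such a hard cap from the kernel's own limit at startup. The margin and helper names are invented, and this ignores the scheduler's cost-based accounting entirely.

```cpp
#include <fstream>
#include <string>

// Read the kernel's per-device limit from sysfs; returns 0 on failure.
size_t read_nr_requests(const std::string& dev) {
    std::ifstream in("/sys/block/" + dev + "/queue/nr_requests");
    size_t n = 0;
    in >> n;
    return n;
}

// Hypothetical hard cap: keep a small safety margin under nr_requests so
// io_submit never sleeps in get_request, even if the disk stalls.
size_t hard_cap_for(const std::string& dev) {
    size_t nr = read_nr_requests(dev);  // e.g. 63
    constexpr size_t margin = 4;        // safety margin, made up
    return nr > margin ? nr - margin : 1;
}
```

The IO scheduler would then refuse to dispatch another request once the in-flight count reaches hard_cap_for(dev), regardless of what its cost-based accounting says.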

@travisdowns
Contributor

Using io_uring may or may not solve this; related question: axboe/liburing#1184

@avikivity
Member

Do you actually hit the nr_requests limit? I've seen it with spinning disks, but not SSD/NVMe.

@travisdowns
Contributor

Do you actually hit the nr_requests limit? I've seen it with spinning disks, but not SSD/NVMe.

Yes, but because the disk (EBS in this case, though it also happens with local SSDs) suffers a temporary slowdown, e.g., dropping to 1% of its usual throughput for a few hundred milliseconds. During such a hiccup we quickly exceed nr_requests (63 per device), since we are doing ~3,000 IO/s: with completions nearly stalled, 63 slots fill in roughly 20 ms. These are background writes, so this would be OK (i.e., the world wouldn't stop), except that due to the reactor stall the world does stop.

@travisdowns
Contributor

In normal operation, there are only a "few" IOs in flight, so we don't hit nr_requests.

I think this is one of the flaws in the current "feed-forward rate limiting" scheduler (as opposed to a "concurrency" scheduler): it does not cope well when the characteristics of the disk change. You need to set the IO properties to an appreciable fraction of the disk's true "nominal" performance, or else you leave a lot of IO on the table; but then, if the disk is a bit slower for any reason, the number of queued IOs grows quickly and the current feedback mechanism isn't fast enough to catch it. The toy model below illustrates the difference.
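
A toy model (not Seastar's actual scheduler; the numbers are invented, roughly matching the ones above) of why the two approaches behave differently during a hiccup:

```cpp
#include <cstdio>

int main() {
    const double submit_rate = 3000;  // IO/s the rate limiter keeps admitting
    const double stalled_rate = 30;   // disk completes 1% of normal
    const double hiccup_s = 0.5;      // a 500 ms slowdown

    // Feed-forward rate limiting: admissions depend only on the configured
    // rate, not on completions, so during the hiccup in-flight IOs grow at
    // (submit_rate - completion_rate) per second.
    double in_flight = (submit_rate - stalled_rate) * hiccup_s;
    std::printf("rate limiter: ~%.0f IOs in flight (nr_requests is 63)\n",
                in_flight);

    // A concurrency limiter instead bounds in-flight IOs by construction:
    // new submissions wait for completions, whatever the disk is doing.
    const double window = 56;  // e.g. just under nr_requests
    std::printf("concurrency limiter: at most %.0f in flight\n", window);
}
```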
