Ensure all writes are flushed to disk #17065
If a data, label or uberblock write succeeds but the flush immediately after it fails, the flush error should propagate into the write, so the write returns it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
If fsync() (zil_commit()) writes successfully, but then the flush fails, fsync() should not return success, but instead should fall into a full transaction wait. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
If a flush fails, its error should be copied into all associated writes and returned. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
This modifies the IO pipeline and queue to ensure that all leaf writes are flushed to the device before returning, and any flush errors will be surfaced as write errors. To do this, we arrange for completed write ZIOs to be held in the queue. After enough have been collected, a flush is issued. Once it completes, the held writes are returned to the pipeline. See the comments in vdev_queue.c for more detail. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
Adds an extra "flushq" column pair showing the number of IOs waiting for a flush to be issued, and for a flush to complete. While this presents like a queue, it's not quite the same as the existing priority-based queues, so it needs a bit of extra ceremony to thread it all through. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
Flushing is real work on the disks, and it's useful to be able to show how long it takes. The necessary timing is normally only recorded for IO that goes through the queue, so a small amount of special-case code for FLUSH IOs is needed to make sure it gets them too. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
For years the ZIL has been managing its own flushing regime, even though it never detected flushing errors. Now that writes are implicitly flushed, this is no longer necessary and can be removed. Note that a simplified version of the LWB write chain has been maintained, as there is still an ordering requirement which this is responsible for. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
These were simple "barrier" flushes, but since the writes will now be flushed before they return, the wait points for the write ZIOs will take care of it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
These were "barrier" flushes, but the writes will now be flushed before they return, so we don't need to do this explicitly anymore. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
There's nothing left that uses it, and flushing from the "top" of a vdev doesn't have any real meaning anymore. Thank you for your service. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
I had a long conversation this morning and I have a good idea to try. If it pans out, it will turn out that most of the construction is actually wrong, and this PR will be closed and reopened in a different form (the tests, some iostats and "remove flush" parts should still be good). As such, setting to draft for now.
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Hi, there’s a lot here! The summary is: every write is now followed by a flush before it returns, flush errors are propagated back into the writes, and `fsync()` no longer lies about it.

Motivation and Context
Since the beginning, OpenZFS’ “flush” operation has always ignored errors.¹ Write errors are captured and dealt with, but if a write succeeds and the subsequent flush then fails, the operation as a whole will appear to succeed.²
This situation is fairly rare in most setups, where the storage device is physically and logically close to the OS (eg inside the same case, connected to the mainboard, etc). Here a failed device will usually not respond to the write IO at all and the block layer will signal a failure; or, if the device is merely misbehaving, the write may be slow or time out. In either case, the error is raised on the write, and OpenZFS detects it and deals with it (by retrying on a different device, returning an error to userspace, or suspending).
In a more complex environment where there are IO modules or “smart” backplanes in between, a failure between the write and the flush can happen more easily. In our customer case, the write was accepted by the device cache and success returned to OpenZFS, then the IO module would fail and the device be reset before it could be flushed (either the subsequent flush not arriving at the device, or the device’s own cache flush strategy not having time to run or complete). OpenZFS ignores flush errors so it wouldn’t be noticed at that point, but almost certainly would be at the next write.
Depending on the pool topology, the pool may suspend, or the write may simply be retried some other way. Regardless, the original write was lost but OpenZFS believed it succeeded long ago, and has signalled to its calling subsystem that all is well.
A failure like this in the `spa_sync()` data+label+uberblock+label ceremony usually goes unnoticed (though a suspend is quite likely, usually that is quickly followed by `zpool clear`, and things continue). Some or all of the data may not be on disk, but no errors have been noticed, so `zpool status` reports all is well. Even reading the data back can seem fine if the request is serviced from the ARC! It’s usually not until much later, when the pool is reimported (after a reboot) and the damaged file is read from, that the problem is seen, and by then it’s too late.

A worse version of this happens in conjunction with the ZIL. Consider `fsync()`, which is `zil_commit()` under the hood. If a write succeeds in the above way, `fsync()` will return success even though the data is not on disk. Depending on pool topology (eg SLOG device) this can still be survivable, as `spa_sync()` may be able to successfully write out the DMU version of the change; `fsync()` reported success early, but the data still made it to disk. If the write fails however, `zil_commit()` will fall back to `txg_wait_synced()`. If a similar sequence to the above then happens, with the data write appearing to succeed and then the first label write failing, when the pool finally does advance the txg (maybe after suspend+resume), `zil_commit()` will be satisfied that the data is on disk and again `fsync()` returns success; a more permanent lie this time.³

Resolving this issue is not easy.
Ultimately, the issue is that a “success” return from a write does not mean that the data is definitely on disk, as it may have only made it as far as a cache. A flush is required to ensure the drive has really committed it to stable storage (device bugs notwithstanding). Further, after a flush failure we must assume that all writes since the last successful flush are now suspect (this is exactly analogous to the POSIX `write()` and `fsync()` APIs, in practice if not by the letter).
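To make the analogy concrete, here's a minimal userspace sketch of the POSIX contract (plain C; not OpenZFS code):

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * A successful write() only means the data reached a cache somewhere;
 * only a successful fsync() says it is on stable storage. After a
 * failed fsync(), everything written since the last successful fsync()
 * must be treated as suspect.
 */
static int
write_durably(int fd, const void *buf, size_t len)
{
	if (write(fd, buf, len) != (ssize_t)len)
		return (-1);	/* write error (or short write): surfaced immediately */
	if (fsync(fd) == -1)
		return (-1);	/* flush error: the data may not be on disk */
	return (0);
}
```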
OpenZFS already has error handling for write errors in various forms - return the error, retry the IO, reallocate the IO and retry, suspend the pool, etc. These are all suitable responses to a flush failure too, as long as we can connect a flush failure back to the write.

Thus, the most straightforward approach is to ensure that every write also gets a flush, and any flush error is applied to the write before it returns.
Interestingly, the ZIL already has this concept (though not fully baked) and served as the initial guide for this work.
It would be simple enough to issue a flush to a leaf device after every write (and this can even be handled by the block layer, eg in Linux by setting `REQ_FUA` on the write). This works for correctness, but destroys performance - flush is a relatively slow operation compared to a write, even for an empty or near-empty cache.

Instead, we want to accumulate writes and, at some opportune time, issue a single flush for all of them and then return them. This is similar in concept to the ZIL, but down at the leaf vdevs.
Description
The heart of this change is in the commit `vdev_queue: hold returning writes, flush before releasing`. `vdev_queue.c` has extensive comments describing the situation, which I’ll summarise here.

Previously, after a vdev leaf write (child/delegated) completes, it gets to `ZIO_STAGE_VDEV_IO_DONE` and to `vdev_queue_io_done()`, where in-flight counts are adjusted down and the write IO (or, for an aggregated write, its component IOs) are returned to the pipeline.

We change this slightly. As the write returns to the queue, it is held on a new list, `vq_flush_list`. At some future time (`vdev_queue_flush_check()`) we issue a flush to the leaf vdev (`vdev_queue_flush_issue()`). The done handler `vdev_queue_flush_done()` loops over the held writes, updates their error with the flush error if necessary, and sends them back to the pipeline for completion.

When the flush is issued, the list of IOs is transferred to `vq_flush_active_list`, so that new returning writes can be accumulated on `vq_flush_list` again for the next flush. To do otherwise would be to complete IOs that were returned after the flush started, which may mean that they did not end up being flushed.
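As a sketch of the shape of the done handler (the list names come from the PR; the loop body and the completion call are illustrative, not the PR's actual code):

```c
/*
 * Illustrative sketch only: apply the flush result to every held
 * write, then send them back to the pipeline.
 */
static void
vdev_queue_flush_done(zio_t *fio)
{
	vdev_queue_t *vq = &fio->io_vd->vdev_queue;
	zio_t *zio;

	mutex_enter(&vq->vq_lock);
	while ((zio = list_remove_head(&vq->vq_flush_active_list)) != NULL) {
		/* A flush failure makes the write a failure too. */
		if (fio->io_error != 0 && zio->io_error == 0)
			zio->io_error = fio->io_error;
		zio_interrupt(zio);	/* resume the write's pipeline */
	}
	mutex_exit(&vq->vq_lock);
}
```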
`ZIO_STAGE_VDEV_IO_FLUSH` and `vq_lock` contention
`vq_lock` is currently one of our more heavily contended locks, so I’ve made an effort to add minimal extra overhead to it.

Since initial completion handling and error injection is done in `zio_vdev_io_done()`, it’s necessary for all IO to go through that stage. Since that’s also where the queue return `vdev_queue_io_done()` is called, it would mean that a write would go through there twice: once to be added to the flush list, and again when being reissued. That would require `vq_lock` to be taken and released twice for every returning IO. To get around that, I’ve added an additional stage, `ZIO_STAGE_VDEV_IO_FLUSH`. IOs that need flushing (non-probe writes) get that additional stage, and the dequeueing and disaggregation work, as well as the flush work, happens there, under the same lock. When they reach `ZIO_STAGE_VDEV_IO_DONE`, only the smallest amount of (unlocked) housekeeping is done.

Meanwhile, non-flushing queue IOs (reads, TRIM and probe writes) just don’t get the `ZIO_STAGE_VDEV_IO_FLUSH` stage and work as before.
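For orientation, the relative stage order looks roughly like this (illustrative only; the real definitions in `zio_impl.h` are pipeline bitmasks, elided here):

```c
/* Illustrative ordering, not the actual zio_impl.h definitions. */
enum vdev_stage_order {
	VDEV_IO_START,	/* enqueue, aggregate, issue the leaf write */
	VDEV_IO_FLUSH,	/* new: dequeue/disaggregate, hold for flush and
			 * apply flush errors, under one vq_lock acquisition */
	VDEV_IO_DONE,	/* minimal unlocked housekeeping */
	VDEV_IO_ASSESS,	/* error assessment, retry, fault handling */
};
```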
Probe writes
I’ve excluded probe writes from being flushed because they’re part of device setup and error response, both times when we’re interested in the low-level response of the device. I could be persuaded that the probe sequence should also have a flush; however, I don’t think it’s necessary (after an error, the write should fail, and setup tends to be operator-assisted). If it should flush, I would not do it this way, but rather build a flush into the probe IO sequence properly. So I think it’s out of scope for this PR.
No explicit flushes
Three commits remove existing calls to `zio_flush()`, and one more removes `zio_flush()` itself.

In most places these were simple “barrier” flushes, which are no longer necessary, since the preceding `zio_wait()` on a write tree won’t complete until the flushes are done anyway.

The ZIL change is more extensive, since it no longer needs to track which vdevs were written to and defer flushing. I’ve made an effort to retain the IO chaining for writes though, which I understand to be necessary for `fsync()` ordering semantics.

`zio_flush()` can now be removed, not just because it’s unused, but also because it’s unuseful. A flush to the “top” of a vdev without error reporting (the existing form) doesn’t give us any particular guarantees, but if we did propagate flush errors it wouldn’t be helpful either, because any device that returns a flush error (including a faulted device) will fail the entire flush operation, which doesn’t tell us anything about what that means for the writes we’re trying to flush. A far simpler and more conventional `zio_vdev_flush()` is added to flush a single leaf vdev and call a done function.
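Presumably it has roughly this shape (a hypothetical prototype for illustration; the PR's exact signature may differ):

```c
/*
 * Hypothetical sketch: flush a single leaf vdev and call `done'
 * with the result, so flush errors can be copied into the waiters.
 */
extern zio_t *zio_vdev_flush(vdev_t *vd, zio_done_func_t *done,
    void *priv);

/* e.g., from the queue, once enough writes have been held: */
zio_nowait(zio_vdev_flush(vq->vq_vdev, vdev_queue_flush_done, vq));
```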
Correctness tests
Three new tests are included. The first simulates the “device loss after IO cached” failure described at the start. We inject a no-op for the writes (succeed but do nothing), an IO error for the flush (to trigger the error recovery) and a probe failure (to fault the device and suspend the pool). Once the pool returns, if the flush error was propagated to the write, then the write should be reexecuted and so be available on disk. If not, the data will not be on disk.
The second does the same for the ZIL, which on flush error will fall back to the main txg sync path and then respond as above. Again, we check the data is on disk, and also test the result of `fsync()`, which should not return success until and unless the data is truly on disk. (This test originally comes from #16314.)

The third test is a light test of the flush→write error propagation facility itself. We inject flush errors only, then confirm via `zpool status` that write errors were recorded.

New iostats
New columns are added to `zpool iostat`. For queue stats, there’s a new `flushq` column pair that shows the number of IOs waiting to be flushed (on `vq_flush_list`) and the number “being flushed” (on `vq_flush_active_list`). For latency stats, there are overall total and disk latency figures to go with read and write.
It’s not yet clear to me how useful these are, but tbh I don’t find `zpool iostat` tremendously useful in the general case. At minimum, they let you see things happening.

Performance
As currently presented, this PR is a performance disaster. I’ve tested a variety of workloads on a 14-wide RAIDz3 of rotational devices (a long-term test system provided by a customer, with behaviour we’re very familiar with). On all our test workloads (mixed read/write, write-only, with and without `fsync()`), write throughput drops to roughly the same as if all writes were sent through the ZIL.

(Fun twist: a 100% write+fsync workload on this system has ~10% higher throughput, which appears to be because flushes are now per-device rather than across the entire pool, so a single `zil_commit()` is not waiting for the entire pool of unrelated devices to flush.)

What we want is easy to state: we want to flush as little as possible, as late as possible. Ideally, we would flush only when something is waiting, after all the writes it’s interested in have been written. The difficulty is that down at the queue level, we have no idea what the write tree above us looks like, or even if anyone is waiting for us.
I had hoped I could get away with some basic guesswork. As written, `vdev_queue_flush_check()` decides if it should flush: it waits for there to be nothing on the queue, or until the oldest IO waiting for a flush has been waiting “too long”. This timeout is managed with a pair of tuneables, `zfs_vdev_flush_timeout_ms` and `zfs_vdev_flush_sync_timeout_ms`, the idea being that sync writes almost always mean a ZIL or label/uberblock sync (our old flush points) and should be completed ASAP, while all other writes have nothing much waiting, so they can wait a little longer, though we try not to accumulate too much.
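In sketch form, the decision is something like this (`vq_flush_list` and the two tuneables are from the PR; `vq_active`, `vq_flush_sync_waiting` and `vq_flush_oldest_ts` are hypothetical stand-ins):

```c
/* Illustrative sketch of the flush decision described above. */
static boolean_t
vdev_queue_flush_check(vdev_queue_t *vq)
{
	if (list_is_empty(&vq->vq_flush_list))
		return (B_FALSE);	/* nothing waiting on a flush */

	if (vq->vq_active == 0)
		return (B_TRUE);	/* queue drained; flush now */

	/* Flush once the oldest held IO has been waiting too long. */
	uint64_t timeout_ms = vq->vq_flush_sync_waiting ?
	    zfs_vdev_flush_sync_timeout_ms : zfs_vdev_flush_timeout_ms;
	return (gethrtime() - vq->vq_flush_oldest_ts >=
	    MSEC2NSEC(timeout_ms));
}
```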
My intent was to run a battery of performance tests on a variety of workloads, changing these tuneables each time, until I found a not-terrible middle ground, and then post this PR suggesting that this could be improved a bit but wasn’t bad. Unfortunately, I had not understood just how bursty IO into a single vdev can be, and in basically all workloads the queue is empty long before any timeout is reached, and the flush happens.
I have tried various other methods, but all of them amount to responding to other events in the system to infer that a flush will be required “very soon”, because there is no way to know that all incoming IO is done.
I have some more invasive things to try. One idea is to group and count writes so we can tell when the last write in a group has completed and not flush before that. Another might be to have parent IOs signal when they’re finished issuing writes (possibly by reaching DONE), and then issue the flush when no one above us in the tree could issue another write as part of this group of work. Or maybe it’s more explicit, like ZIL writes: track all the leaf vdevs touched, and ask them all to flush directly (echoes of #16375 here).
But really, I want your ideas. If there’s a clever use of some distant value that would help us see what’s happening, or a smart way to group things, or whatever, I would love to hear from you.
All this said, I think the overall construction is still right. If we treat `vdev_queue_flush_check()` as a smart black box, the rest of the code should not be substantially different whatever we do there.

Other considerations
There’s a handful of things in the near and far future that I have not done anything about here, but have tried not to block.
dRAID
I have given no particular thought to how this interacts with dRAID, other than testing for the dRAID spare special case when checking for a leaf vdev. As far as I know, dRAID writes still end up as normal writes to normal leaf vdevs, so I do not imagine there’s any significant special handling required, if any.
Flush IO taskq
Because more individual `ZIO_TYPE_FLUSH` IOs are being issued than before, `zio_taskq[ZIO_TYPE_FLUSH]`, currently configured as `ZTI_ONE`/`ZTI_ONE`, is now being pushed harder than before.

In my testing so far this has only presented a problem with `vdev_file` vdevs, because `FLUSH` (and `TRIM`) are both `zio_execute`'d synchronously, so multiple file vdevs being flushed will be serialised (and good luck if the underlying filesystem or device is itself slow). #17064 puts both those ops on `z_vdev_file`, solving this.

Regardless, I do not know if we will end up needing to add more threads for these queues, and/or move them out to the write taskqs. I don’t think so, since flushes require almost no work, but that’s not the same as no work, and maybe the extra will be enough to cause a bottleneck there.
Direct IO queue bypass
If/when #16591 lands, direct writes will bypass the queue entirely, which is obviously incompatible with this change as-is. I suggest that in this case it will probably be enough to add an “and also flush” option (like `REQ_FUA`) to direct writes that would bypass the queue, to ensure they still get flushed.

However, a big part of the reason we might even want a queue bypass is because (a) aggregation is often pointless and (b) `vq_lock` is so heavily contended on low-latency devices. The work of managing the flush list and issuing flushes currently happens under `vq_lock` only because it’s convenient; it could comfortably run under its own lock with a little bit of reshuffling (most of it is contingent on which locks `vdev_queue_flush_check()` will need to decide what to do).

I’ve got work elsewhere to break `vq_lock` apart entirely, which would be another approach. So for now, I’m comfortable that this can be dealt with later.

Range flush
If we ever wanted to, it should be possible to support a “range flush” on a device that supports it, by maintaining a range tree along with the waiting write IOs, not unlike `vq_write_offset_tree`. Then we really could flush only what we wrote and no more.
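A hypothetical sketch of the accumulation side (`vq_flush_range` and the helper are invented for illustration, assuming the classic `range_tree_add()` interface):

```c
/*
 * Hypothetical: as each write is held for flushing, record its extent,
 * so a device supporting ranged flushes need only flush what was
 * actually written.
 */
static void
vdev_queue_flush_hold(vdev_queue_t *vq, zio_t *zio)
{
	list_insert_tail(&vq->vq_flush_list, zio);
	range_tree_add(vq->vq_flush_range, zio->io_offset, zio->io_size);
}
```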
Design and implementation notes

Why propagate flush errors into writes?
One may wonder why so much effort is being placed into propagating a flush error into an existing ZIO write tree, especially when looking at the performance difficulties caused by the lack of a clear “end of write” signal. Would it not be easier to do the flush as a separate step somehow?
The difficulty is that the IO tree is built around the idea of a single top operation, that spawns child operations, that in turn spawn child operations of their own, and on and on, until they get to the very bottom where the physical IO operations are performed. Then, the results are bubbled back up the tree, each intermediate operation considering the results of its children and deciding what to do.
Most of the error checking and recovery that makes OpenZFS what it is - for example, responding to a failed write by faulting a device and rerouting the write elsewhere - is performed in those intermediate operations. Moreover, the housekeeping associated with a write, like updating usage information or committing allocations, is also done as intermediate operations in the IO tree.
If we do the flush after the “top” write succeeds, and the flush fails, then what? We would need to somehow reverse all the “success” effects that the IO tree caused, and get into its error paths. OpenZFS has no way to describe that sort of behaviour. So we have no real choice but to convert flush errors into write errors at the lowest level, so that the right paths back through the IO tree are taken.
Maybe an alternate structure is possible, but it would require substantial rework or redesign of pretty much everything. It’s not really feasible.
History
This is the fifth iteration of this concept. The earlier ones did the same thing in different ways. I’m happy to show that code to anyone interested, but I don’t consider them worth revisiting at the moment. Mostly, I’m listing this in case you’re wondering if this would have been better done some other way. Maybe I tried that way!
- `ZIO_FLAG_FLUSH_WRITE`. A `NULL` ZIO called the “flush head” was held on the leaf `vdev_t`, and accumulated completed writes as parent ZIOs. At flush time, a new flush head ZIO was swapped in, `zio_flush()` was called on the old flush head, and its result was propagated up the tree. This was the only version for many months, but it was plagued by a refcounting issue that could not be found, and was assumed to be related to a `NULL` ZIO as a common element at the bottom of many complex write trees.
- The next iteration moved the machinery into the vdev queue, under `vq_lock`, which I knew was hot. To get access to the stat structures, it seemed worth it to move it, just carefully. This version had an additional queue state, `ZIO_QS_FLUSH`, to indicate to `vdev_queue_io_done()` whether it was receiving the IO from the write initially returning, or after being reinserted after flush. This required some annoying special-case checks in `zio_vdev_io_done()` and, worst, two `vq_lock` acquisitions, one for each step.
- While removing the explicit `zio_flush()` calls it became apparent that this would need to apply to almost all writes, and any that it might not apply to would be identifiable in other ways (eg `ZIO_FLAG_PROBE`). So `ZIO_FLAG_FLUSH_WRITE` was removed and a few things were simplified to no longer check it.
- Finally, to avoid taking `vq_lock` twice, `ZIO_STAGE_VDEV_IO_FLUSH` was added, which ended up being a big win because it allowed more special-case checks for `ZIO_QS_FLUSH` to be removed.

How Has This Been Tested?
Full ZTS run completed, including new tests, on Linux 6.1. Note that the extra flush load changes timing on some tests and can make some errors more likely to crop up (eg #17054). #17064 helps in many configurations.
Compile checked on FreeBSD 14.2.
Performance testing described above was done with a customer-specific load generator on a 14-wide RAIDz3 of rotational devices, single dataset, `recordsize=1M`. 100 threads either reading or writing (configured differently each run). Writing threads repeatedly write files of 1K-2M, with an optional `fsync()` at end of write, across thousands of directories. Reading threads simply read back whatever is around, randomly sized, flat out.

(It’s not a particularly interesting test case, and really I should re-express it in `fio` so it’s easier to share.)

I also ran a couple of rounds of this test toggling `zfs_nocacheflush` during the run, to make sure that flushes queued on either side of the toggle were handled correctly. All appears well (as in, a lost flush would result in a write IO being held forever, which didn’t happen).
Footnotes
1. The ZFS birth commit (illumos/illumos-gate@fa9e4066f0) had support for flushing devices with write caches, via the `DKIOCFLUSHWRITECACHE` ioctl. No errors are checked. The comment in `zil_flush_vdevs()` from the time shows the thinking. ↩
from from the time shows the thinking. ↩It's not entirely clear from the code history why this was acceptable for devices that do have write caches. Our best guess is that this was an oversight: between the combination of hardware, pool topology and application behaviour required to hit this, it basically didn't come up. ↩
3. Somewhat frustratingly, `zil.c` contains comments describing this exact behaviour, with further discussion in #12443 (September 2021). It appears that those involved saw the potential, but were looking at a different problem and so didn't have the context to recognise it for what it was. ↩