Ensure all writes are flushed to disk #17065
If a data, label or uberblock write succeeds but the flush immediately after it fails, the flush error should propagate into the write, so the write returns it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
If fsync() (zil_commit()) writes successfully, but then the flush fails, fsync() should not return success, but instead should fall into a full transaction wait. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
If a flush fails, its error should be copied into all associated writes and returned. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
This modifies the IO pipeline and queue to ensure that all leaf writes are flushed to the device before returning, and any flush errors will be surfaced as write errors. To do this, we arrange for completed write ZIOs to be held in the queue. After enough have been collected, a flush is issued. Once it completes, the held writes are returned to the pipeline. See the comments in vdev_queue.c for more detail. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
Adds an extra "flushq" column pair showing the number of IOs waiting for a flush to be issued, and for a flush to complete. While this presents like a queue, it's not quite the same as the existing priority-based queues, so it needs a bit of extra ceremony to thread it all through. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
Flushing is real work on the disks, and it's useful to be able to show how long it takes. The necessary timing is normally only recorded for IO that goes through the queue, so a small amount of special-case code for FLUSH IOs is needed to make sure it gets them too. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
For years the ZIL has been managing its own flushing regime, even though it never detected flushing errors. Now that writes are implicitly flushed, this is no longer necessary and can be removed. Note that a simplified version of the LWB write chain has been maintained, as there is still an ordering requirement which this is responsible for. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
These were simple "barrier" flushes, but since the writes will now be flushed before they return, the wait points for the write ZIOs will take care of it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
These were "barrier" flushes, but the writes will now be flushed before they return, so we don't need to do this explicitly anymore. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
There's nothing left that uses it, and flushing from the "top" of a vdev doesn't have any real meaning anymore. Thank you for your service. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
I had a long conversation this morning and I have a good idea to try. If it pans out, it will turn out that most of the construction is actually wrong, and this PR will be closed and reopened in a different form (the tests, some iostats and "remove flush" parts should still be good). As such, setting to draft for now.
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Hi, there’s a lot here! The summary is: every write is now followed by a flush before it returns, flush errors are propagated back into the writes, and `fsync()` no longer lies about it.

Motivation and Context
Since the beginning, OpenZFS’ “flush” operation has always ignored errors.¹ Write errors are captured and dealt with, but if a write succeeds and the subsequent flush then fails, the operation as a whole will appear to succeed.²
This situation is fairly rare in most setups, where the storage device is physically and logically close to the OS (eg inside the same case, connected to the mainboard, etc). Here a failed device will usually not respond to the write IO at all and the block layer will signal a failure; or, if the device is merely misbehaving, the write may be slow or time out. In either case, the error is raised on the write, and OpenZFS detects it and deals with it (by retrying on a different device, returning an error to userspace, or suspending).
In a more complex environment where there are IO modules or “smart” backplanes in between, a failure between the write and the flush can happen more easily. In our customer case, the write was accepted by the device cache and success returned to OpenZFS, then the IO module would fail and the device be reset before it could be flushed (either the subsequent flush not arriving at the device, or the device’s own cache flush strategy not having time to run or complete). OpenZFS ignores flush errors so it wouldn’t be noticed at that point, but almost certainly would be at the next write.
Depending on the pool topology, the pool may suspend, or the write may simply be retried some other way. Regardless, the original write was lost but OpenZFS believed it succeeded long ago, and has signalled to its calling subsystem that all is well.
A failure like this in the `spa_sync()` data+label+uberblock+label ceremony usually goes unnoticed (though a suspend is quite likely, usually that is quickly followed by `zpool clear`, and things continue). Some or all of the data may not be on disk, but no errors have been noticed, so `zpool status` reports all is well. Even reading the data back can seem fine if the request is serviced from the ARC! It’s usually not until much later, when the pool is reimported (after a reboot) and the damaged file is read from, that the problem is seen, and by then it’s too late.

A worse version of this happens in conjunction with the ZIL. Consider `fsync()`, which is `zil_commit()` under the hood. If a write succeeds in the above way, `fsync()` will return success even though the data is not on disk. Depending on pool topology (eg SLOG device) this can still be survivable, as `spa_sync()` may be able to successfully write out the DMU version of the change; `fsync()` reported success early, but the data still made it to disk. If the write fails however, `zil_commit()` will fall back to `txg_wait_synced()`. If a similar sequence to the above then happens, with the data write appearing to succeed and then the first label write failing, when the pool finally does advance the txg (maybe after suspend+resume), `zil_commit()` will be satisfied that the data is on disk and again `fsync()` returns success; a more permanent lie this time.³

Resolving this issue is not easy.
Ultimately, the issue is that a “success” return from a write does not mean that the data is definitely on disk, as it may have only made it as far as a cache. A flush is required to ensure the drive has really committed it to stable storage (device bugs notwithstanding). Further, after a flush failure we must assume that all writes since the last successful flush are now suspect (this is exactly analogous to the POSIX `write()` and `fsync()` APIs, in practice if not by the letter).
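To make the analogy concrete, here's a minimal userspace sketch of the POSIX contract (plain C; not OpenZFS code):

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * A successful write() only means the data reached a cache somewhere;
 * only a successful fsync() says it is on stable storage. After a
 * failed fsync(), everything written since the last successful fsync()
 * must be treated as suspect.
 */
static int
write_durably(int fd, const void *buf, size_t len)
{
	if (write(fd, buf, len) != (ssize_t)len)
		return (-1);	/* write error (or short write): surfaced immediately */
	if (fsync(fd) == -1)
		return (-1);	/* flush error: the data may not be on disk */
	return (0);
}
```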
OpenZFS already has error handling for write errors in various forms - return the error, retry the IO, reallocate the IO and retry, suspend the pool, etc. These are all suitable responses to a flush failure too, as long as we can connect a flush failure back to the write.

Thus, the most straightforward approach is to ensure that every write also gets a flush, and any flush error is applied to the write before it returns.
Interestingly, the ZIL already has this concept (though not fully baked) and served as the initial guide for this work.
It would be simple enough to issue a flush to a leaf device after every write (and this can even be handled by the block layer, eg in Linux by setting `REQ_FUA` on the write). This works for correctness, but destroys performance - flush is a relatively slow operation compared to a write, even for an empty or near-empty cache.

Instead, we want to accumulate writes and, at some opportune time, issue a single flush for all of them and then return them. This is similar in concept to the ZIL, but down at the leaf vdevs.
Description
The heart of this change is in the commit `vdev_queue: hold returning writes, flush before releasing`. `vdev_queue.c` has extensive comments describing the situation, which I’ll summarise here.

Previously, after a vdev leaf write (child/delegated) completes, it gets to `ZIO_STAGE_VDEV_IO_DONE` and to `vdev_queue_io_done()`, where in-flight counts are adjusted down and the write IO (or, for an aggregated write, its component IOs) are returned to the pipeline.

We change this slightly. As the write returns to the queue, it is held on a new list, `vq_flush_list`. At some future time (`vdev_queue_flush_check()`) we issue a flush to the leaf vdev (`vdev_queue_flush_issue()`). The done handler `vdev_queue_flush_done()` loops over the held writes, updates their error with the flush error if necessary, and sends them back to the pipeline for completion.

When the flush is issued, the list of IOs is transferred to `vq_flush_active_list`, so that new returning writes can be accumulated on `vq_flush_list` again for the next flush. To do otherwise would be to complete IOs that were returned after the flush started, which may mean that they did not end up being flushed.
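As a sketch of the shape of the done handler (the list names come from the PR; the loop body and the completion call are illustrative, not the PR's actual code):

```c
/*
 * Illustrative sketch only: apply the flush result to every held
 * write, then send them back to the pipeline.
 */
static void
vdev_queue_flush_done(zio_t *fio)
{
	vdev_queue_t *vq = &fio->io_vd->vdev_queue;
	zio_t *zio;

	mutex_enter(&vq->vq_lock);
	while ((zio = list_remove_head(&vq->vq_flush_active_list)) != NULL) {
		/* A flush failure makes the write a failure too. */
		if (fio->io_error != 0 && zio->io_error == 0)
			zio->io_error = fio->io_error;
		zio_interrupt(zio);	/* resume the write's pipeline */
	}
	mutex_exit(&vq->vq_lock);
}
```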
`ZIO_STAGE_VDEV_IO_FLUSH` and `vq_lock` contention
`vq_lock` is currently one of our more heavily contended locks, so I’ve made an effort to add minimal extra overhead to it.

Since initial completion handling and error injection is done in `zio_vdev_io_done()`, it’s necessary for all IO to go through that stage. Since that’s also where the queue return `vdev_queue_io_done()` is called, it would mean that a write would go through there twice: once to be added to the flush list, and again when being reissued. That would require `vq_lock` to be taken and released twice for every returning IO. To get around that, I’ve added an additional stage, `ZIO_STAGE_VDEV_IO_FLUSH`. IOs that need flushing (non-probe writes) get that additional stage, and the dequeueing and disaggregation work, as well as the flush work, happens there, under the same lock. When they reach `ZIO_STAGE_VDEV_IO_DONE`, only the smallest amount of (unlocked) housekeeping is done.

Meanwhile, non-flushing queue IOs (reads, TRIM and probe writes) just don’t get the `ZIO_STAGE_VDEV_IO_FLUSH` stage and work as before.
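For orientation, the relative stage order looks roughly like this (illustrative only; the real definitions in `zio_impl.h` are pipeline bitmasks, elided here):

```c
/* Illustrative ordering, not the actual zio_impl.h definitions. */
enum vdev_stage_order {
	VDEV_IO_START,	/* enqueue, aggregate, issue the leaf write */
	VDEV_IO_FLUSH,	/* new: dequeue/disaggregate, hold for flush and
			 * apply flush errors, under one vq_lock acquisition */
	VDEV_IO_DONE,	/* minimal unlocked housekeeping */
	VDEV_IO_ASSESS,	/* error assessment, retry, fault handling */
};
```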
Probe writes
I’ve excluded probe writes from being flushed because they’re part of device setup and error response, both times when we’re interested in the low-level response of the device. I could be persuaded that the probe sequence should also have a flush; however, I don’t think it’s necessary (after an error, the write should fail, and setup tends to be operator-assisted). If it should flush, I would not do it this way, but rather build a flush into the probe IO sequence properly. So I think it’s out of scope for this PR.
No explicit flushes
Three commits remove existing calls to `zio_flush()`, and one more removes `zio_flush()` itself.

In most places these were simple “barrier” flushes, which are no longer necessary, since the preceding `zio_wait()` on a write tree won’t complete until the flushes are done anyway.

The ZIL change is more extensive, since it no longer needs to track which vdevs were written to and defer flushing. I’ve made an effort to retain the IO chaining for writes though, which I understand to be necessary for `fsync()` ordering semantics.

`zio_flush()` can now be removed, not just because it’s unused, but also because it’s unuseful. A flush to the “top” of a vdev without error reporting (the existing form) doesn’t give us any particular guarantees, but if we did propagate flush errors it wouldn’t be helpful either, because any device that returns a flush error (including a faulted device) will fail the entire flush operation, which doesn’t tell us anything about what that means for the writes we’re trying to flush. A far simpler and more conventional `zio_vdev_flush()` is added to flush a single leaf vdev and call a done function.
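Presumably it has roughly this shape (a hypothetical prototype for illustration; the PR's exact signature may differ):

```c
/*
 * Hypothetical sketch: flush a single leaf vdev and call `done'
 * with the result, so flush errors can be copied into the waiters.
 */
extern zio_t *zio_vdev_flush(vdev_t *vd, zio_done_func_t *done,
    void *priv);

/* e.g., from the queue, once enough writes have been held: */
zio_nowait(zio_vdev_flush(vq->vq_vdev, vdev_queue_flush_done, vq));
```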
Correctness tests
Three new tests are included. The first simulates the “device loss after IO cached” failure described at the start. We inject a no-op for the writes (succeed but do nothing), an IO error for the flush (to trigger the error recovery) and a probe failure (to fault the device and suspend the pool). Once the pool returns, if the flush error was propagated to the write, then the write should be reexecuted and so be available on disk. If not, the data will not be on disk.
The second does the same for the ZIL, which on flush error will fall back to the main txg sync path and then respond as above. Again, we check the data is on disk, and also test the result of `fsync()`, which should not return success until and unless the data is truly on disk. (This test originally comes from #16314.)

The third test is a light test of the flush→write error propagation facility itself. We inject flush errors only, then confirm via `zpool status` that write errors were recorded.

New iostats
New columns are added to `zpool iostat`. For queue stats, there’s a new `flushq` column pair that shows the number of IOs waiting to be flushed (on `vq_flush_list`) and the number “being flushed” (on `vq_flush_active_list`). For latency stats, there are overall total and disk latency figures to go with read and write.
It’s not yet clear to me how useful these are, but tbh I don’t find `zpool iostat` tremendously useful in the general case. At minimum, they let you see things happening.

Performance
As currently presented, this PR is a performance disaster. I’ve tested a variety of workloads on a 14-wide RAIDz3 of rotational devices (a long-term test system provided by a customer, with behaviour we’re very familiar with). On all our test workloads (mixed read/write, write-only, with and without `fsync()`), write throughput drops to roughly the same as if all writes were sent through the ZIL.

(Fun twist: a 100% write+fsync workload on this system has ~10% higher throughput, which appears to be because flushes are now per-device rather than across the entire pool, so a single `zil_commit()` is not waiting for the entire pool of unrelated devices to flush.)

What we want is easy to state: we want to flush as little as possible, as late as possible. Ideally, we would flush only when something is waiting, after all the writes it’s interested in have been written. The difficulty is that down at the queue level, we have no idea what the write tree above us looks like, or even if anyone is waiting for us.
I had hoped I could get away with some basic guesswork. As written, `vdev_queue_flush_check()` decides if it should flush: it waits for there to be nothing on the queue, or until the oldest IO waiting for a flush has been waiting “too long”. This timeout is managed with a pair of tuneables, `zfs_vdev_flush_timeout_ms` and `zfs_vdev_flush_sync_timeout_ms`, the idea being that sync writes almost always mean a ZIL or label/uberblock sync (our old flush points) and should be completed ASAP, while all other writes have nothing much waiting, so they can wait a little longer, though we try not to accumulate too much.
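In sketch form, the decision is something like this (`vq_flush_list` and the two tuneables are from the PR; `vq_active`, `vq_flush_sync_waiting` and `vq_flush_oldest_ts` are hypothetical stand-ins):

```c
/* Illustrative sketch of the flush decision described above. */
static boolean_t
vdev_queue_flush_check(vdev_queue_t *vq)
{
	if (list_is_empty(&vq->vq_flush_list))
		return (B_FALSE);	/* nothing waiting on a flush */

	if (vq->vq_active == 0)
		return (B_TRUE);	/* queue drained; flush now */

	/* Flush once the oldest held IO has been waiting too long. */
	uint64_t timeout_ms = vq->vq_flush_sync_waiting ?
	    zfs_vdev_flush_sync_timeout_ms : zfs_vdev_flush_timeout_ms;
	return (gethrtime() - vq->vq_flush_oldest_ts >=
	    MSEC2NSEC(timeout_ms));
}
```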
My intent was to run a battery of performance tests on a variety of workloads, changing these tuneables each time, until I found a not-terrible middle ground, and then post this PR suggesting that this could be improved a bit but wasn’t bad. Unfortunately, I had not understood just how bursty IO into a single vdev can be, and in basically all workloads the queue is empty long before any timeout is reached, and the flush happens.
I have tried various other methods, but all of them amount to responding to other events in the system to infer that a flush will be required “very soon”, because there is no way to know that all incoming IO is done.
I have some more invasive things to try. One idea is to group and count writes so we can tell when the last write in a group has completed and not flush before that. Another might be to have parent IOs signal when they’re finished issuing writes (possibly by reaching DONE), and then issue the flush when no one above us in the tree could issue another write as part of this group of work. Or maybe it’s more explicit, like ZIL writes: track all the leaf vdevs touched, and ask them all to flush directly (echoes of #16375 here).
But really, I want your ideas. If there’s a clever use of some distant value that would help us see what’s happening, or a smart way to group things, or whatever, I would love to hear from you.
All this said, I think the overall construction is still right. If we treat `vdev_queue_flush_check()` as a smart black box, the rest of the code should not be substantially different whatever we do there.

Other considerations
There’s a handful of things in the near and far future that I have not done anything about here, but have tried not to block.
dRAID
I have given no particular thought to how this interacts with dRAID, other than testing for the dRAID spare special case when checking for a leaf vdev. As far as I know, dRAID writes still end up as normal writes to normal leaf vdevs, so I do not imagine there’s any significant special handling required, if any.
Flush IO taskq
Because more individual `ZIO_TYPE_FLUSH` IOs are being issued than before, `zio_taskq[ZIO_TYPE_FLUSH]`, currently configured as `ZTI_ONE`/`ZTI_ONE`, is now being pushed harder than before.

In my testing so far this has only presented a problem with `vdev_file` vdevs, because `FLUSH` (and `TRIM`) are both `zio_execute`'d synchronously, so multiple file vdevs being flushed will be serialised (and good luck if the underlying filesystem or device is itself slow). #17064 puts both those ops on `z_vdev_file`, solving this.

Regardless, I do not know if we will end up needing to add more threads for these queues, and/or move them out to the write taskqs. I don’t think so, since flushes require almost no work, but that’s not the same as no work, and maybe the extra will be enough to cause a bottleneck there.
Direct IO queue bypass
If/when #16591 lands, direct writes will bypass the queue entirely, which is obviously incompatible with this change as-is. I suggest that in this case it will probably be enough to add an “and also flush” option (like `REQ_FUA`) to direct writes that would bypass the queue, to ensure they still get flushed.

However, a big part of the reason we might even want a queue bypass is because (a) aggregation is often pointless and (b) `vq_lock` is so heavily contended on low-latency devices. The work of managing the flush list and issuing flushes currently happens under `vq_lock` only because it’s convenient; it could comfortably run under its own lock with a little bit of reshuffling (most of it is contingent on which locks `vdev_queue_flush_check()` will need to decide what to do).

I’ve got work elsewhere to break `vq_lock` apart entirely, which would be another approach. So for now, I’m comfortable that this can be dealt with later.

Range flush
If we ever wanted to, it should be possible to support a “range flush” on a device that supports it, by maintaining a range tree along with the waiting write IOs, not unlike `vq_write_offset_tree`. Then we really could flush only what we wrote and no more.
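A hypothetical sketch of the accumulation side (`vq_flush_range` and the helper are invented for illustration, assuming the classic `range_tree_add()` interface):

```c
/*
 * Hypothetical: as each write is held for flushing, record its extent,
 * so a device supporting ranged flushes need only flush what was
 * actually written.
 */
static void
vdev_queue_flush_hold(vdev_queue_t *vq, zio_t *zio)
{
	list_insert_tail(&vq->vq_flush_list, zio);
	range_tree_add(vq->vq_flush_range, zio->io_offset, zio->io_size);
}
```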
Design and implementation notes

Why propagate flush errors into writes?
One may wonder why so much effort is being placed into propagating a flush error into an existing ZIO write tree, especially when looking at the performance difficulties caused by the lack of a clear “end of write” signal. Would it not be easier to do the flush as a separate step somehow?
The difficulty is that the IO tree is built around the idea of a single top operation, that spawns child operations, that in turn spawn child operations of their own, and on and on, until they get to the very bottom where the physical IO operations are performed. Then, the results are bubbled back up the tree, each intermediate operation considering the results of its children and deciding what to do.
Most of the error checking and recovery that makes OpenZFS what it is - for example, responding to a failed write by faulting a device and rerouting the write elsewhere - is performed in those intermediate operations. Moreover, the housekeeping associated with a write, like updating usage information or committing allocations, is also done as intermediate operations in the IO tree.
If we do the flush after the “top” write succeeds, and the flush fails, then what? We would need to somehow reverse all the “success” effects that the IO tree caused, and get into its error paths. OpenZFS has no way to describe that sort of behaviour. So we have no real choice but to convert flush errors into write errors at the lowest level, so that the right paths back through the IO tree are taken.
Maybe an alternate structure is possible, but it would require substantial rework or redesign of pretty much everything. It’s not really feasible.
History
This is the fifth iteration of this concept. The earlier ones did the same thing in different ways. I’m happy to show that code to anyone interested, but I don’t consider them worth revisiting at the moment. Mostly, I’m listing this in case you’re wondering if this would have been better done some other way. Maybe I tried that way!
- `ZIO_FLAG_FLUSH_WRITE`. A `NULL` ZIO called the “flush head” was held on the leaf `vdev_t`, and accumulated completed writes as parent ZIOs. At flush time, a new flush head ZIO was swapped in, `zio_flush()` was called on the old flush head, and its result was propagated up the tree. This was the only version for many months, but it was plagued by a refcounting issue that could not be found, and was assumed to be related to a `NULL` ZIO as a common element at the bottom of many complex write trees.
- The next iteration moved the machinery into the vdev queue, under `vq_lock`, which I knew was hot. To get access to the stat structures, it seemed worth it to move it, just carefully. This version had an additional queue state, `ZIO_QS_FLUSH`, to indicate to `vdev_queue_io_done()` whether it was receiving the IO from the write initially returning, or after being reinserted after flush. This required some annoying special-case checks in `zio_vdev_io_done()` and, worst, two `vq_lock` acquisitions, one for each step.
- While removing the explicit `zio_flush()` calls it became apparent that this would need to apply to almost all writes, and any that it might not apply to would be identifiable in other ways (eg `ZIO_FLAG_PROBE`). So `ZIO_FLAG_FLUSH_WRITE` was removed and a few things were simplified to no longer check it.
- Finally, to avoid taking `vq_lock` twice, `ZIO_STAGE_VDEV_IO_FLUSH` was added, which ended up being a big win because it allowed more special-case checks for `ZIO_QS_FLUSH` to be removed.

How Has This Been Tested?
Full ZTS run completed, including new tests, on Linux 6.1. Note that the extra flush load changes timing on some tests and can make some errors more likely to crop up (eg #17054). #17064 helps in many configurations.
Compile checked on FreeBSD 14.2.
Performance testing described above was done with a customer-specific load generator on a 14-wide RAIDz3 of rotational devices, single dataset, `recordsize=1M`. 100 threads either reading or writing (configured differently each run). Writing threads repeatedly write files of 1K-2M, with an optional `fsync()` at end of write, across thousands of directories. Reading threads simply read back whatever is around, randomly sized, flat out.

(It’s not a particularly interesting test case, and really I should re-express it in `fio` so it’s easier to share.)

I also ran a couple of rounds of this test toggling `zfs_nocacheflush` during the run, to make sure that flushes queued on either side of the toggle were handled correctly. All appears well (as in, a lost flush would result in a write IO being held forever, which didn’t happen).
Footnotes
1. The ZFS birth commit (illumos/illumos-gate@fa9e4066f0) had support for flushing devices with write caches, via the `DKIOCFLUSHWRITECACHE` ioctl. No errors are checked. The comment in `zil_flush_vdevs()` from the time shows the thinking. ↩
from from the time shows the thinking. ↩It's not entirely clear from the code history why this was acceptable for devices that do have write caches. Our best guess is that this was an oversight: between the combination of hardware, pool topology and application behaviour required to hit this, it basically didn't come up. ↩
3. Somewhat frustratingly, `zil.c` contains comments describing this exact behaviour, with further discussion in #12443 (September 2021). It appears that those involved saw the potential, but were looking at a different problem and so didn't have the context to recognise it for what it was. ↩