Add a kill switch for SEEK_HOLE/SEEK_DATA #16000

rincebrain · 2024-03-16T16:28:31Z

We've had a number of varyingly rare races with trying to cheat and not copy certain file regions from userland come up.

Let's add a kill switch which should hopefully allow people to just hit it and keep running without those optimizations rather than hoping that we've caught them all This Time.

Motivation and Context

#15994 #15933 #15526 #11900 #14753

Description

I'd like a safe option while we debug things further in the future, rather than having to push out a release saying "we think this fixes it but since it's a rare edge case it's hard to be certain".

How Has This Been Tested?

Again, isn't that what CI is for?

(Really, I'm expecting to need to go modify the tests to set the tunable off to pass, but I wanted to see if anything else breaks first.)

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

rrevans · 2024-03-16T21:42:17Z

Maybe have zpl_llseek call generic_file_llseek and its equivalent on FreeBSD instead of reimplementing the logic?

rincebrain · 2024-03-16T22:03:00Z

The problem, and reason I put it here, is that I don't believe FreeBSD has an equivalent - or at a minimum there's no code currently calling it.

Specifically, in FreeBSD, I believe the codepath looks like vop_ioctl => zfs_freebsd_ioctl => zfs_ioctl => zfs_holey, so it's not as generic an interface there, and I don't obviously see a default handler in FreeBSD-specific code in the FBSD tree.

(I see references to Linux's generic handlers, but they appear to all be in code that's shared with Linux in various places.)

I could be wrong, and I'll go look once I'm either confident this doesn't break in the form I have locally which is one more commit ahead of this, or that this approach can't be correct.

shodanshok · 2024-03-17T16:44:28Z

Should zfs_dmu_offset_next_sync=0 already to that in practice (by not returning holes if the file was not fully synched in a previous txg)? If so, I would suggest reverting its default to 0

rincebrain · 2024-03-17T17:11:26Z

No. If you read #15933 you'll see this still happens even if you have that set.

Per openzfs#15933, there's currently an inconsistency in behavior in our special-case SEEK_HOLE behavior. So let's not use it for now. Signed-off-by: Rich Ercolani <[email protected]>

Signed-off-by: Rich Ercolani <[email protected]>

rincebrain · 2024-03-17T18:08:34Z

I just shoved a Linux-specific tunable that kills the non-generic_file_llseek behavior on Linux only, since that seems like a good mitigation for this for now. (I'm curious why we don't have that as the default behavior in the first place, but am not doing the code excavation right now.)

I also included a separate commit for disabling BRT by default, because I am of the opinion that we shouldn't leave git master in a known losing data state by default, and it seems likely to take more than a little bit of investigation to figure out the correct way to resolve this for both cases.

shodanshok · 2024-03-17T20:28:01Z

No. If you read #15933 you'll see this still happens even if you have that set.

I missed it, thanks.

rrevans · 2024-03-18T12:10:40Z

Good news: I can't reproduce any failures after with zfs_disable_holey=1 and zfs_bclone_enabled=0 after 30+ minutes. (As expected: all possible classes of llseek bugs w.r.t. sync/mmap races are mitigated if it can't reach dnode_is_dirty.)

Also the PR as-written looks correct to me, though I might suggest we have another think about disabling at head.

Maybe a good compromise is to both:

disable bclone and fancy llseek for files with mappings at head; leave bclone enabled.
disable fancy llseek entirely for the next release branch(es) only; leave bclone disabled too.

This way bleeding edge testers can test things we think work and continue to exercise #15571 and the rest of bclone while these problems are sorted AND we can protect normal release users and distros from future unknown problems with both features. But I'll leave it at that and let you and the other maintainers decide.

FYI- these ZTS functional tests are newly failing for me with your patch (I ran functional tests in linux.run and common.run; I don't really know how ZTS is used for ZFS tests/releases so hope this helps while you wait for CI runs):

    FAIL cp_files/cp_files_002_pos (expected PASS)
    FAIL mmap/mmap_seek_001_pos (expected PASS)

(cp_files_002 requires bclone, mmap_seek_001 is testing SEEK_HOLE/SEEK_DATA for mmap)

rincebrain · 2024-03-18T12:37:42Z

Yes, I expected those to fail, I was figuring I'd clean it up after agreeing on what to do.

And my general thinking on this, is to have a simple thing that is safe and ensures correct behavior, get that merged, and then refine it with more precision, because we've had a lot of cases of "I'm sure this'll work fine" optimizations being wrong, so I'd like to start with "this absolutely cannot fail", and then refine and possibly revert the refinements if they prove to be incorrect.

In part on the basis that the friction of getting a change in can be very high, so I'd like to be able to say "run this version" rather than "apply this patch to head".

rrevans · 2024-03-18T14:39:41Z

Sorry, I meant only those tests failed. The rest passed which is a good sign.

SG otherwise. I support merging this PR and hope to see PRs re-enabling incrementally at head once the root cause(s) are fixed.

amotin

I understand now it is for experiments, but lets not merge it this way by mistake. It should not change default, we are not that desperate now. It should be cross-platform, at least include FreeBSD. Plus one more cosmetics:

amotin · 2024-03-18T19:34:09Z

module/os/linux/zfs/zpl_file.c

@@ -550,6 +552,7 @@ zpl_direct_IO(int rw, struct kiocb *kiocb, struct iov_iter *iter, loff_t pos)
 static loff_t
 zpl_llseek(struct file *filp, loff_t offset, int whence)
 {
+if (!(zfs_disable_holey)) {


To not mess with formatting his could be moved into if below. And lets remove extra parentheses.

rincebrain · 2024-03-18T21:30:05Z

None of it was a mistake, Alexander. Since having such a kill switch doesn't resolve this here, I changed the PR to a minimum case to stop losing people's data. FreeBSD hasn't managed to reproduce either of those bugs so far, and the originally intended kill switch doesn't help with this bug, so I left it be.

…

On Mon, Mar 18, 2024, 3:38 PM Alexander Motin ***@***.***> wrote: ***@***.**** requested changes on this pull request. I understand now it is for experiments, but lets not merge it this way by mistake. It should not change default, we are not that desperate now. It should be cross-platform, at least include FreeBSD. Plus one more cosmetics: ------------------------------ In module/os/linux/zfs/zpl_file.c <#16000 (comment)>: > @@ -550,6 +552,7 @@ zpl_direct_IO(int rw, struct kiocb *kiocb, struct iov_iter *iter, loff_t pos) static loff_t zpl_llseek(struct file *filp, loff_t offset, int whence) { +if (!(zfs_disable_holey)) { To not mess with formatting his could be moved into if below. And lets remove extra parentheses. — Reply to this email directly, view it on GitHub <#16000 (review)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABUI7J2P3CFXXWKJJBURKLYY47CJAVCNFSM6AAAAABEZN5WKGVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTSNBUGAYTCNRWGE> . You are receiving this because you authored the thread.Message ID: ***@***.***>

amotin · 2024-03-18T22:35:05Z

@rincebrain You are proposing to disable a big chunk of functionality in master branch being there for many years without any understanding of the problem and killing any hope or motivation to fix it. This is NOT the way! Lets work on something more constructive!

rincebrain · 2024-03-18T22:38:11Z

I'm opening a PR that stops two rare data loss edge cases, given that we have a history of them in this functionality, with the intent we can iterate on a safe baseline.

amotin · 2024-03-19T00:14:32Z

It not just stops two DAMN RARE data loss edge cases, but also disables significant chunks of functionality. And you are doing it on master branch, where development is going on any way, and that is not used in any Linux distro AFAIK. Considering Brian haven't merged anything in few weeks (I don't know where he is), I am not certain any fix can be merged nearest days. And if we have few days/week, I would prefer something more creative rather than destructive.

Speaking about two cases, I don't think anybody argue that we should flush mapped pages before block cloning. That is the patch (or some of its variation) I'd like to see merged ASAP. After that the only question I see is what the hell is wrong with zn_flush_cached_data() on Linux? Why does it not flush everything, or what else is wrong with mmap? That is where we should focus instead of this discussion.

amotin · 2024-03-19T00:22:34Z

If you are eager to disable something, you can do it specifically in case of zn_has_cached_data() == B_TRUE and Linux. It should specifically limit the scope to mmap(), where we suspect the problem.

rincebrain · 2024-03-19T00:37:39Z

It's behind a tunable, you can just adjust it.

I too would like to see changes for that. But I'd like to have a start that isn't breaking anything.

If Brian is just gone, and we have no prospect of merging anything, then we may have nicer patches available by the time he's back. But I'd like to have a solid mitigation to start, which people can turn off if they wish, and then attempt more refined approaches from there, once it's no longer an acute issue.

And Arch and Gentoo ship git packages, off the top of my head, though since this also breaks on pre-2.2, that needs a solution too. (I thought Nix did too but I don't immediately see a listing.)

amotin · 2024-03-22T13:58:21Z

Lets close this in context of #16019 .

rincebrain mentioned this pull request Mar 16, 2024

some copied files are still corrupted (chunks replaced by zeros) #15933

Closed

Added Linux-specific hole fallback tunable

02e4699

Per openzfs#15933, there's currently an inconsistency in behavior in our special-case SEEK_HOLE behavior. So let's not use it for now. Signed-off-by: Rich Ercolani <[email protected]>

rincebrain force-pushed the killholes branch from 1fca1e2 to 02e4699 Compare March 17, 2024 18:05

Disable bclone_enabled by default until openzfs#15933 is resolved

35c7b43

Signed-off-by: Rich Ercolani <[email protected]>

amotin requested changes Mar 18, 2024

View reviewed changes

rincebrain closed this Mar 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a kill switch for SEEK_HOLE/SEEK_DATA #16000

Add a kill switch for SEEK_HOLE/SEEK_DATA #16000

rincebrain commented Mar 16, 2024

rrevans commented Mar 16, 2024

rincebrain commented Mar 16, 2024 •

edited

Loading

shodanshok commented Mar 17, 2024

rincebrain commented Mar 17, 2024

rincebrain commented Mar 17, 2024

shodanshok commented Mar 17, 2024

rrevans commented Mar 18, 2024

rincebrain commented Mar 18, 2024

rrevans commented Mar 18, 2024

amotin left a comment

amotin Mar 18, 2024

rincebrain commented Mar 18, 2024 via email

amotin commented Mar 18, 2024

rincebrain commented Mar 18, 2024

amotin commented Mar 19, 2024 •

edited

Loading

amotin commented Mar 19, 2024

rincebrain commented Mar 19, 2024

amotin commented Mar 22, 2024

Add a kill switch for SEEK_HOLE/SEEK_DATA #16000

Add a kill switch for SEEK_HOLE/SEEK_DATA #16000

Conversation

rincebrain commented Mar 16, 2024

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

rrevans commented Mar 16, 2024

rincebrain commented Mar 16, 2024 • edited Loading

shodanshok commented Mar 17, 2024

rincebrain commented Mar 17, 2024

rincebrain commented Mar 17, 2024

shodanshok commented Mar 17, 2024

rrevans commented Mar 18, 2024

rincebrain commented Mar 18, 2024

rrevans commented Mar 18, 2024

amotin left a comment

Choose a reason for hiding this comment

amotin Mar 18, 2024

Choose a reason for hiding this comment

rincebrain commented Mar 18, 2024 via email

amotin commented Mar 18, 2024

rincebrain commented Mar 18, 2024

amotin commented Mar 19, 2024 • edited Loading

amotin commented Mar 19, 2024

rincebrain commented Mar 19, 2024

amotin commented Mar 22, 2024

rincebrain commented Mar 16, 2024 •

edited

Loading

amotin commented Mar 19, 2024 •

edited

Loading