Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a kill switch for SEEK_HOLE/SEEK_DATA #16000

Closed
wants to merge 2 commits into from

Conversation

rincebrain
Copy link
Contributor

We've had a number of varyingly rare races with trying to cheat and not copy certain file regions from userland come up.

Let's add a kill switch which should hopefully allow people to just hit it and keep running without those optimizations rather than hoping that we've caught them all This Time.

Motivation and Context

#15994 #15933 #15526 #11900 #14753

Description

I'd like a safe option while we debug things further in the future, rather than having to push out a release saying "we think this fixes it but since it's a rare edge case it's hard to be certain".

How Has This Been Tested?

Again, isn't that what CI is for?

(Really, I'm expecting to need to go modify the tests to set the tunable off to pass, but I wanted to see if anything else breaks first.)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@rrevans
Copy link
Contributor

rrevans commented Mar 16, 2024

Maybe have zpl_llseek call generic_file_llseek and its equivalent on FreeBSD instead of reimplementing the logic?

@rincebrain
Copy link
Contributor Author

rincebrain commented Mar 16, 2024

The problem, and reason I put it here, is that I don't believe FreeBSD has an equivalent - or at a minimum there's no code currently calling it.

Specifically, in FreeBSD, I believe the codepath looks like vop_ioctl => zfs_freebsd_ioctl => zfs_ioctl => zfs_holey, so it's not as generic an interface there, and I don't obviously see a default handler in FreeBSD-specific code in the FBSD tree.

(I see references to Linux's generic handlers, but they appear to all be in code that's shared with Linux in various places.)

I could be wrong, and I'll go look once I'm either confident this doesn't break in the form I have locally which is one more commit ahead of this, or that this approach can't be correct.

@shodanshok
Copy link
Contributor

Should zfs_dmu_offset_next_sync=0 already to that in practice (by not returning holes if the file was not fully synched in a previous txg)? If so, I would suggest reverting its default to 0

@rincebrain
Copy link
Contributor Author

No. If you read #15933 you'll see this still happens even if you have that set.

Per openzfs#15933, there's currently an inconsistency in behavior
in our special-case SEEK_HOLE behavior.

So let's not use it for now.

Signed-off-by: Rich Ercolani <[email protected]>
@rincebrain
Copy link
Contributor Author

I just shoved a Linux-specific tunable that kills the non-generic_file_llseek behavior on Linux only, since that seems like a good mitigation for this for now. (I'm curious why we don't have that as the default behavior in the first place, but am not doing the code excavation right now.)

I also included a separate commit for disabling BRT by default, because I am of the opinion that we shouldn't leave git master in a known losing data state by default, and it seems likely to take more than a little bit of investigation to figure out the correct way to resolve this for both cases.

@shodanshok
Copy link
Contributor

No. If you read #15933 you'll see this still happens even if you have that set.

I missed it, thanks.

@rrevans
Copy link
Contributor

rrevans commented Mar 18, 2024

Good news: I can't reproduce any failures after with zfs_disable_holey=1 and zfs_bclone_enabled=0 after 30+ minutes. (As expected: all possible classes of llseek bugs w.r.t. sync/mmap races are mitigated if it can't reach dnode_is_dirty.)

Also the PR as-written looks correct to me, though I might suggest we have another think about disabling at head.

Maybe a good compromise is to both:

  1. disable bclone and fancy llseek for files with mappings at head; leave bclone enabled.
  2. disable fancy llseek entirely for the next release branch(es) only; leave bclone disabled too.

This way bleeding edge testers can test things we think work and continue to exercise #15571 and the rest of bclone while these problems are sorted AND we can protect normal release users and distros from future unknown problems with both features. But I'll leave it at that and let you and the other maintainers decide.

FYI- these ZTS functional tests are newly failing for me with your patch (I ran functional tests in linux.run and common.run; I don't really know how ZTS is used for ZFS tests/releases so hope this helps while you wait for CI runs):

    FAIL cp_files/cp_files_002_pos (expected PASS)
    FAIL mmap/mmap_seek_001_pos (expected PASS)

(cp_files_002 requires bclone, mmap_seek_001 is testing SEEK_HOLE/SEEK_DATA for mmap)

@rincebrain
Copy link
Contributor Author

Yes, I expected those to fail, I was figuring I'd clean it up after agreeing on what to do.

And my general thinking on this, is to have a simple thing that is safe and ensures correct behavior, get that merged, and then refine it with more precision, because we've had a lot of cases of "I'm sure this'll work fine" optimizations being wrong, so I'd like to start with "this absolutely cannot fail", and then refine and possibly revert the refinements if they prove to be incorrect.

In part on the basis that the friction of getting a change in can be very high, so I'd like to be able to say "run this version" rather than "apply this patch to head".

@rrevans
Copy link
Contributor

rrevans commented Mar 18, 2024

Sorry, I meant only those tests failed. The rest passed which is a good sign.

SG otherwise. I support merging this PR and hope to see PRs re-enabling incrementally at head once the root cause(s) are fixed.

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand now it is for experiments, but lets not merge it this way by mistake. It should not change default, we are not that desperate now. It should be cross-platform, at least include FreeBSD. Plus one more cosmetics:

@@ -550,6 +552,7 @@ zpl_direct_IO(int rw, struct kiocb *kiocb, struct iov_iter *iter, loff_t pos)
static loff_t
zpl_llseek(struct file *filp, loff_t offset, int whence)
{
if (!(zfs_disable_holey)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To not mess with formatting his could be moved into if below. And lets remove extra parentheses.

@rincebrain
Copy link
Contributor Author

rincebrain commented Mar 18, 2024 via email

@amotin
Copy link
Member

amotin commented Mar 18, 2024

@rincebrain You are proposing to disable a big chunk of functionality in master branch being there for many years without any understanding of the problem and killing any hope or motivation to fix it. This is NOT the way! Lets work on something more constructive!

@rincebrain
Copy link
Contributor Author

I'm opening a PR that stops two rare data loss edge cases, given that we have a history of them in this functionality, with the intent we can iterate on a safe baseline.

@amotin
Copy link
Member

amotin commented Mar 19, 2024

It not just stops two DAMN RARE data loss edge cases, but also disables significant chunks of functionality. And you are doing it on master branch, where development is going on any way, and that is not used in any Linux distro AFAIK. Considering Brian haven't merged anything in few weeks (I don't know where he is), I am not certain any fix can be merged nearest days. And if we have few days/week, I would prefer something more creative rather than destructive.

Speaking about two cases, I don't think anybody argue that we should flush mapped pages before block cloning. That is the patch (or some of its variation) I'd like to see merged ASAP. After that the only question I see is what the hell is wrong with zn_flush_cached_data() on Linux? Why does it not flush everything, or what else is wrong with mmap? That is where we should focus instead of this discussion.

@amotin
Copy link
Member

amotin commented Mar 19, 2024

If you are eager to disable something, you can do it specifically in case of zn_has_cached_data() == B_TRUE and Linux. It should specifically limit the scope to mmap(), where we suspect the problem.

@rincebrain
Copy link
Contributor Author

It's behind a tunable, you can just adjust it.

I too would like to see changes for that. But I'd like to have a start that isn't breaking anything.

If Brian is just gone, and we have no prospect of merging anything, then we may have nicer patches available by the time he's back. But I'd like to have a solid mitigation to start, which people can turn off if they wish, and then attempt more refined approaches from there, once it's no longer an acute issue.

And Arch and Gentoo ship git packages, off the top of my head, though since this also breaks on pre-2.2, that needs a solution too. (I thought Nix did too but I don't immediately see a listing.)

@amotin
Copy link
Member

amotin commented Mar 22, 2024

Lets close this in context of #16019 .

@rincebrain rincebrain closed this Mar 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants