Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zinject: add "probe" device injection type #16947

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

robn
Copy link
Member

@robn robn commented Jan 14, 2025

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

zinject can't create the scenario of a drive faulting after a failed write. The write failure can be injected, which triggers a probe, but probe IO always goes to the label regions, which are explicitly excluded from device injections. So the probe succeeds, and the write is retried, triggering another probe, over and over, forever.

This adds a new probe pseudo-IO-type for device injection, which allows a probe failure to be simulated.

Description

First, we extend the device injection type to allow more than just the "fundamental" IO types. zi_iotype was always large enough, there just wasn't a way to do more than the regular ZIO_TYPE_* values. Now it has its own values, ZINJECT_IOTYPE_* for injection IO types. For compatibility, the existing IO types are placed at the start so their numeric values match. all still matches all fundamental types, so this change should be UI and ABI-compatible.

Then, we add a new type, ZINJECT_IOTYPE_PROBE. This matches any and all ZIOs with the ZIO_FLAG_PROBE flag set, regardless of type or location on disk (including in the labels). The normal _READ and _WRITE injections won't match these, to preserve the existing behaviour.

Taken together, this we can simulate a device fault on write with:

$ zinject -d $<devcie> -e io -T write <pool>
$ zinject -d $<device> -e io -T probe <pool>

And indeed, this is exactly what the included test does.

How Has This Been Tested?

New test included.

All zinject-using tests have passed with this change in place.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

(Wow! Clean sweep!)

@robn robn changed the title Zinject probe zinject: add "probe" device injection type Jan 14, 2025
man/man8/zinject.8 Outdated Show resolved Hide resolved
module/zfs/zio_inject.c Show resolved Hide resolved
@amotin amotin added Status: Revision Needed Changes are required for the PR to be accepted Status: Code Review Needed Ready for review and testing labels Jan 14, 2025
@github-actions github-actions bot removed the Status: Revision Needed Changes are required for the PR to be accepted label Jan 14, 2025
@robn robn mentioned this pull request Jan 15, 2025
13 tasks
@tonyhutter
Copy link
Contributor

@robn when you're satisfied with your changes, would you mind squashing these commits?

@robn
Copy link
Member Author

robn commented Jan 15, 2025

@tonyhutter if/when #16950 lands, I'll rebase and then it will be just two logically-separate commits.

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have no objections once #16950 is merged.

@amotin
Copy link
Member

amotin commented Jan 16, 2025

zinject_probe test timed out on FreeBSD, possibly causing more errors after.

robn added 4 commits January 17, 2025 07:57
The old kstat helper function was barely used, I suspect in part because
it was very limited in the kinds of kstats it could gather.

This adds new functions to replace it, for each kind of thing that can
have stats: global, pool and dataset. There's options in there to get a
single stat value, or all values within a group.

Most importantly, the interface is the same for both platforms.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Removes other custom helpers and direct accesses to /proc.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
It's now a simple wrapper, so lets just call kstat direct.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
I'm about to add a new "type", and I need somewhere to put it!

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
@robn
Copy link
Member Author

robn commented Jan 16, 2025

zinject_probe test timed out on FreeBSD, possibly causing more errors after.

I'm guessing the max 10s wait for suspend is too short. Bumped it to max 30s, why not.

@amotin
Copy link
Member

amotin commented Jan 16, 2025

zinject_probe test timed out on FreeBSD, possibly causing more errors after.

I'm guessing the max 10s wait for suspend is too short. Bumped it to max 30s, why not.

I meant the CI timed out after 10 minutes. Something might be wrong with the test, so that it never completed.

Injecting a device probe failure is not possible by matching IO types,
because probe IO goes to the label regions, which is explicitly excluded
from injection. Even if it were possible, it would be awkward to do,
because a probe is sequence of reads and writes.

This commit adds a new IO "type" to match for injection, which looks for
the ZIO_FLAG_PROBE flag instead. Any probe IO will be match the
injection record and recieve the wanted error.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
@robn
Copy link
Member Author

robn commented Jan 17, 2025

I meant the CI timed out after 10 minutes. Something might be wrong with the test, so that it never completed.

I'm glad you pushed; it made me take another look at it, and took a little while to figure out (at least, I think I figured it out).

Test: /usr/local/share/zfs/zfs-tests/tests/functional/cli_root/zinject/zinject_probe (run as root) [10:00] [KILLED]
07:49:48.31 ASSERTION: Check zinject can correctly inject a probe failure.
07:49:48.43 Added handler 14 with the following properties:
07:49:48.43   pool: testpool
07:49:48.43   vdev: c1b926c8812118f1
07:49:48.44 SUCCESS: zinject -d md0 -e io -T probe testpool
07:49:48.44 Added handler 15 with the following properties:
07:49:48.44   pool: testpool
07:49:48.44   vdev: c1b926c8812118f1
07:49:48.45 SUCCESS: zinject -d md0 -e io -T write testpool
ZTS run /usr/local/share/zfs/zfs-tests/tests/functional/cli_root/zinject/zinject_probe
Solaris: WARNING: Pool 'testpool' has encountered an uncorrectable I/O failure and has been suspended.
Jan 16 07:49:48 vm2 ZFS[32666]: vdev probe failure, zpool=testpool path=/dev/md0
Jan 16 07:49:48 vm2 ZFS[32742]: catastrophic pool I/O failure, zpool=testpool

Two observations:

  • dd never finished, or it would have logged a SUCCESS line
  • the probe failure is within half a second of the injections being added

So probably, it suspended before the dd could complete; maybe before it started.

Last push tries a lot harder to control the timing, by pushing out the txg timeout, syncing the pool, doing the write (to cache) before setting the injections, and then forcing it out. It tests ok here; we'll see if it ends up being flaky or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants