Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect a slow raidz child during reads #16900

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

don-brady
Copy link
Contributor

Motivation and Context

There is a concern, and has been observed in practice, that a slow disk can bring down the overall read performance of raidz. Currently in ZFS, a slow disk is detected by comparing the disk read latency to a custom threshold value, such as 30 seconds. This can be tuned to a lower threshold but that requires understanding the context in which it will be applied. And hybrid pools can have a wide range of expected disk latencies.

What might be a better approach, is to identify the presence of a slow disk outlier based on its latency distance from the latencies of its peers.  This would offer a more dynamic solution that can adapt to different type of media and workloads.

Description

The solution proposed here comes in two parts

  1. Detecting a persistent read outlier – adapts over time under various factors (workload, environment, etc.)
  2. Placing outlier drive in a sit out period -- the raidz group can use parity to reconstruct the data that was skipped.

Detecting Outliers
The most recent latency value for each child is saved in the vdev_t. Then periodically, the samples from all the children are sorted and a statistical outlier can be detected if present. The code uses a Tukey's fence, with K = 2, for detecting extreme outliers. This rule defines extreme outliers as data points outside the fence of the third quartile plus two times the Interquartile Range (IQR). This range is the distance between the first and third quartile.

image

Sitting Out
After a vdev has encounter multiple outlier detections (> 50), it is marked for being in a sit out period that by default lasts for 10 minutes.

Each time a slow disk is placed into a sit out period, its vdev_stat.vs_slow_ios count is incremented and a zevent class ereport.fs.zfs.delay is posted.  

The length of the sit out period can be changed using the raid_read_sit_out_secs module parameter.  Setting it to zero disables slow outlier detection.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

How Has This Been Tested?

Tested with various configs, including dRAID.

For an extreme example, an HDD was used in an 8 wide SSD raidz2 and it was compared to taking the HDD offline. This was using a fio(1) streaming read workload across 4 threads to 20GB files. Both the record size and IO request size were 1MB.

Original Offline Detection
401 secs 54 secs 58 sec
204 IOPS 1521 IOPS 1415 IOPS
204 MiB/s 1521 MiB/s 1415 MiB/s

Also measured the cost over time of vdev_child_slow_outlier() where the statistical analysis occurs (every 20ms).

@vdev_child_slow_outlier_hist[ns]:  
[256, 512)           832 |@                                                   | 
[512, 1K)          14372 |@@@@@@@@@@@@@@@@@@@@@@@                             | 
[1K, 2K)           32231 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| 
[2K, 4K)           23283 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               | 
[4K, 8K)            7773 |@@@@@@@@@@@@                                        | 
[8K, 16K)           2790 |@@@@                                                | 
[16K, 32K)           411 |                                                    | 
[32K, 64K)           122 |                                                    | 
[64K, 128K)           51 |                                                    | 
[128K, 256K)          26 |                                                    | 
[256K, 512K)           5 |                                                    | 

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
*/
uint64_t two_sectors = 2ULL << zio->io_vd->vdev_top->vdev_ashift;
if (zio->io_type == ZIO_TYPE_READ && zio->io_error == 0 &&
zio->io_size >= two_sectors && zio->io_delay != 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain why do we care about the two sectors (all data columns) here?

Not accounting aggregated ZIOs makes this algorithm even more random that periodic sampling alone would do. With RAIDZ splitting ZIOs between vdevs into smaller ones, they are good candidates for aggregation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original thinking was to avoid small-ish metadata reads in the latency samples that were not aggregated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know if in practice this extra filtering is really helpful? If it is, great, let's keep it and add the reasoning about avoiding small-ish metadata reads to the comment. If we can't, perhaps we want to drop this artificial 2 sector limit.

module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
Comment on lines 3037 to 3046
latency_sort(lat_data, samples);
uint64_t fence = latency_quartiles_fence(lat_data, samples);
if (lat_data[samples - 1] > fence) {
/*
* Keep track of how many times this child has had
* an outlier read. A disk that persitently has a
* higer than peers outlier count will be considered
* a slow disk.
*/
atomic_add_64(&svd->vdev_outlier_count, 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With small number of children and with only one random sample from each I really doubt this math can be statistically meaningful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The switch to using a weighted moving average addresses I believe addresses the concern here.

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Dec 26, 2024
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
@tonyhutter
Copy link
Contributor

Initial thoughts

This is going to be really useful on large pools. We often see SCSI disks "lose their mind" for a few minutes in our larger pools, tanking performance. It's nice that this feature will basically read and rebuild the data from parity first rather than attempt to read from the bad disk. I can see us using this in production.

Will disks that are "sitting out" still be read during a scrub?

If a disk is sitting out, and a read is unable to reconstruct from the other "healthy" disks, will ZFS read from the sitting out disk as a last-ditch effort?

Could this work for mirrors as well?

What happens if the algorithm wants to sit out more disks in a raid group than there is parity?

What happens if a pool has dissimilar disks in its raid group? Like, say, 7 NVMe and 1 HDD. Will it sit out the HDD since it will be so much slower than the other drives? That might actually be a nice side effect...

It would be good to see which disks are currently sitting out via a vdev property. I would expect users would want to see this for troubleshooting performance problems and locating sick disks. I can see users wanting to manually set it on sick disks as well:

# Check if /dev/sda is currently being sit out via the algorithm
$ zpool get sitting_out tank L0 sda
NAME                PROPERTY         VALUE   SOURCE
sda                 sitting_out      on      -

# Manually set /dev/sdb to sit out
$ zpool set sitting_out=on tank L0 sdb
$ zpool get sitting_out tank L0 sdb
NAME                PROPERTY       VALUE   SOURCE
sdb                 sitting_out    on      local

You could tell if sitting_out was manually or automatically set by looking at the SOURCE value. If the user manually set the vdev to sit out, we would probably want it to persist that way rather than periodically retesting the vdev's latency.

Alternatively you could name the property penalty_box rather than sitting_out, but the terminology is your call. I'll give you that it's less wordy to say "the disk is sitting out" rather than "the disk is in the penalty box" though. Other ideas: prefer_no_reads, avoid_reads, minimize_reads.

module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
@don-brady
Copy link
Contributor Author

Latest changes...

  • Use weighted moving average for saved read latency values
  • Refactored vdev_skip_latency_outlier() -- it now only checks state of sit out expiration. Moved sit out state change logic into vdev_child_slow_outlier().
  • Removed overloading of vdev_stat_lock.
  • Allow for only one vdev_child_slow_outlier() to run at a time (closed race window).

@don-brady
Copy link
Contributor Author

Will disks that are "sitting out" still be read during a scrub?

Yes. both resilver and scrub reads will not sit out.

Could this work for mirrors as well?

Possibly but that would require a different outlier detection.

What happens if the algorithm wants to sit out more disks in a raid group than there is parity?

Currently one one outlier sit out at a time is allowed.

What happens if a pool has dissimilar disks in its raid group? Like, say, 7 NVMe and 1 HDD. Will it sit out the HDD since it will be so much slower than the other drives? That might actually be a nice side effect...

Yes, it would likely end up sitting out the HDD since it will be an outlier. I did test this exact scenario and could hear the sit out period. 😃

It would be good to see which disks are currently sitting out via a vdev property.

I was using zpool iostat -v to observe that a disk was sitting out. There is also a zevent each time we enter a sit out. It's a slow io event but we could also tag it to indicate that it is also entering a sit out. I could add a read-only vdev property for read sit out. Not sure I would want to make it writable. The design here is meant to be dynamic/adaptive. The admin can still offline a disk.

man/man4/zfs.4 Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
*/
uint64_t two_sectors = 2ULL << zio->io_vd->vdev_top->vdev_ashift;
if (zio->io_type == ZIO_TYPE_READ && zio->io_error == 0 &&
zio->io_size >= two_sectors && zio->io_delay != 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know if in practice this extra filtering is really helpful? If it is, great, let's keep it and add the reasoning about avoiding small-ish metadata reads to the comment. If we can't, perhaps we want to drop this artificial 2 sector limit.

module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Show resolved Hide resolved
@behlendorf
Copy link
Contributor

I could add a read-only vdev property for read sit out. Not sure I would want to make it writable. The design here is meant to be dynamic/adaptive. The admin can still offline a disk.

A read-only vdev property seems like the way to go. That would provide some additional visibility in to when a drive is sitting out. I don't think we'd want to make it writable for exactly the reasons you mentioned.

@don-brady
Copy link
Contributor Author

Added read-only VDEV_PROP_SIT_OUT_READS property.

@don-brady don-brady force-pushed the raidz-detect-slow-outlier branch 2 times, most recently from 1140ee3 to 3965f73 Compare January 6, 2025 21:34
@don-brady
Copy link
Contributor Author

Some recent feedback changes. Rebased to latest master and squash commits.

@tonyhutter
Copy link
Contributor

I gave this a try today using the script below. I was able to sit out the disk I delayed, but I wasn't able to remove the sit-out state afterwards. Maybe I need non-ARC reads to trigger the sitting-out -> not-sitting-out logic?

#!/bin/bash
# Run this from a zfs git workspace
#

# shorten sit out period for testing
sudo bash -c 'echo 5 > /sys/module/zfs/parameters/raidz_read_sit_out_secs'

truncate -s 200M file{0,1,2,3,4,5,6,7,8,9}
sudo ./zpool create tank raidz2 `pwd`/file{0..9}
sudo dd if=/dev/urandom bs=1M count=100 of=/tank/bigfile
sudo ./zpool export tank
sudo ./zpool import -d . tank

echo "Initial state"
# We should be sitting out
sudo ./zpool get -Hp sit_out_reads tank `pwd`/file9

# Add 500ms delay on last disk
sudo ./zinject -d `pwd`/file9 -D500:1 tank

# Do some reads
sudo dd if=/tank/bigfile of=/dev/null bs=4k 

echo "Should be sitting out"
# We should be sitting out
sudo ./zpool get -Hp sit_out_reads tank `pwd`/file9

# Clear fault injection
sudo ./zinject -c all

echo "wait for us to stop sitting out part 1"

sleep 6

# Are we still sitting out?
sudo ./zpool get -Hp sit_out_reads tank `pwd`/file9

# Do some more reads to see if we can trigger the vdev to stop sitting out
sudo dd if=/tank/bigfile of=/dev/null bs=4k 

echo "wait for us to stop sitting out part 2"
sleep 6

# Are we still sitting out?
sudo ./zpool get -Hp sit_out_reads tank `pwd`/file9

@don-brady don-brady force-pushed the raidz-detect-slow-outlier branch from 3965f73 to 77e4878 Compare January 8, 2025 01:00
* A zio->io_delay value of zero means this IO was part of
* an aggregation.
*/
if (zio->io_type == ZIO_TYPE_READ && zio->io_error == 0 &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about adding if raidz_read_sit_out_secs != 0 to this check to skip calculating the ewma when this detection is disabled. It might help a little.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Added

         * A zio->io_delay value of zero means this IO was part of
         * an aggregation.
         */
-       if (zio->io_type == ZIO_TYPE_READ && zio->io_error == 0 &&
-           zio->io_size > 0 && zio->io_delay != 0) {
+       if (raidz_read_sit_out_secs != 0 && zio->io_type == ZIO_TYPE_READ &&
+           zio->io_error == 0 && zio->io_size > 0 && zio->io_delay != 0) {
                vdev_t *vd = zio->io_vd;
                uint64_t previous_ewma = atomic_load_64(&vd->vdev_ewma_latency);
                if (previous_ewma == 0)

A single slow responding disk can affect the overall read
performance of a raidz group.  When a raidz child disk is
determined to be a persistent slow outlier, then have it
sit out during reads for a period of time. The raidz group
can use parity to reconstruct the data that was skipped.

Each time a slow disk is placed into a sit out period, its
`vdev_stat.vs_slow_ios count` is incremented and a zevent
class `ereport.fs.zfs.delay` is posted.

The length of the sit out period can be changed using the
`raid_read_sit_out_secs` module parameter.  Setting it to
zero disables slow outlier detection.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Don Brady <[email protected]>
@don-brady don-brady force-pushed the raidz-detect-slow-outlier branch from 77e4878 to 64d5138 Compare January 8, 2025 17:36
@clhedrick
Copy link

A response said that resilver reads would not be affected. I'd like to suggest that this may be a bad idea. We currently have a RAIDZ2 that is rebuilding. One of the disks is very, very slow but still responds. It's really slowing down the rebuild. I'd like the resilver to avoid reading from it. For a scrub I can see that you want to read everything, but I'm not so sure about resilver.

man/man4/zfs.4 Outdated
@@ -501,6 +501,18 @@ For testing, pause RAID-Z expansion when reflow amount reaches this value.
.It Sy raidz_io_aggregate_rows Ns = Ns Sy 4 Pq ulong
For expanded RAID-Z, aggregate reads that have more rows than this.
.
.It Sy raidz_read_sit_out_secs Ns = Ns Sy 600 Ns s Po 10 min Pc Pq ulong
Copy link
Contributor

@tonyhutter tonyhutter Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All references to 'raidz' should be renamed to 'raid', since this works with dRAID (and to match the raid_read_sit_out_secs name in the commit message)

Copy link
Contributor

@behlendorf behlendorf Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was originally, I asked Don to rename it "raidz" in this #16900 (comment) so it's consistent with some other existing names. Perhaps the best thing to do would be rename it read_sit_out_secs. Then we could also generically apply it to mirrors if someday support is added for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even, sit_out_secs to be consistent with the other renames.

Comment on lines 118 to 119
Can also be incremented when a vdev was determined to be a raidz leaf latency
outlier.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what you mean by this. Who would be incrementing zio_slow_io_ms?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about this...

     slow_ios       This indicates the number of slow I/O operations encountered
                    by this vdev.  A slow I/O is defined as an operation that
                    did not complete within the zio_slow_io_ms threshold in mil‐
                    liseconds (30000 by default).  For RAIDZ and DRAID configu‐
                    rations, this value also represents the number of times the
                    vdev was identified as an outlier and excluded from partici‐
                    pating in read I/O operations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@tonyhutter
Copy link
Contributor

For a scrub I can see that you want to read everything, but I'm not so sure about resilver.

I agree with this - we would want the resilver to finish as quickly as possible. That would mean sitting out the sick disk if we could reconstruct from parity.

@@ -104,12 +104,19 @@ Comma separated list of children of this vdev
The number of children belonging to this vdev
.It Sy read_errors , write_errors , checksum_errors , initialize_errors , trim_errors
The number of errors of each type encountered by this vdev
.It Sy sit_out_reads
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename it to just sit_out here and in other places to be brief. We could use it for other things too. For writes we now have ability to mark top-level vdevs as non-allocating in some cases, but may be we could converge those two at some point.

if (io_flags & (ZIO_FLAG_SCRUB | ZIO_FLAG_RESILVER))
return (B_FALSE);

return (vd->vdev_read_sit_out_expire >= gethrtime());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With raidz_read_sit_out_secs measured in seconds it makes no sense to measure vdev_read_sit_out_expire in nanoseconds. gethrtime() might be quite expensive on old hardware. gethrestime_sec() or something similar would be more than enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll change it to use gethrestime_sec()

Comment on lines 2320 to 2327
/*
* Scale using 16 bits with an effective alpha of 0.50
*/
const uint64_t scale = 16;
const uint64_t alpha = 32768;

return (((alpha * latest_value) + (((1ULL << scale) - alpha) *
previous_ewma)) >> scale);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why do you need 16 bit scale (you should still have few orders of magnitude before overflow, but it is just unnecessary) to implement alpha of 0.5. And why the alpha is so big? I would consider more significant dampening factor, considering we are talking about decision that will affect us for next 10 minutes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an alpha of 0.5 was an arbitrary guess. More dampening is probably better. Here's some samples from a slow HD showing an alpha of 0.5 vs 0.25, times in microsecs. I'll change it to use 0.25

     io_delay        0.5        0.25
     19875 -->     21856 |     18598
     20063 -->     20959 |     18964
     33074 -->     27017 |     22492
     24669 -->     25843 |     23036
     19052 -->     22447 |     22040
     26655 -->     24551 |     23194
     31510 -->     28031 |     25273
     33931 -->     30981 |     27437
     34881 -->     32931 |     29298
     41175 -->     37053 |     32267
     25343 -->     31198 |     30536
      6359 -->     18778 |     24492
    129904 -->     74341 |     50845
    136273 -->    105307 |     72202
     60164 -->     82736 |     69192
     12481 -->     47608 |     55015
      6220 -->     26914 |     42816
     52259 -->     39587 |     45177
     16710 -->     28148 |     38060
     83711 -->     55930 |     49473
     12421 -->     34176 |     40210
     22164 -->     28170 |     35698
     19167 -->     23668 |     31565

Comment on lines +2919 to +2922
hrtime_t now = gethrtime();
uint64_t last = atomic_load_64(&vd->vdev_last_latency_check);

if ((now - last) < MSEC2NSEC(raid_outlier_check_interval_ms) ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, I think this could be switched to seconds.

vdev_child_slow_outlier(zio_t *zio)
{
vdev_t *vd = zio->io_vd;
if (raidz_read_sit_out_secs == 0 || vd->vdev_children < LAT_SAMPLES_MIN)
Copy link
Member

@amotin amotin Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While some limit on number of samples makes sense, I am not sure it makes sense to apply it to the number of children. 3-wide RAIDZ1 may also have the same problems, but may be there we could use wider safety interval. If anywhere, the limit should be applied to a number of samples we have averaged for each children before we can trust any statistics. Though averaging hides the dispersion, so it might need more thinking how to account it better.

Comment on lines 2911 to 2915
spa_t *spa = zio->io_spa;
if (spa_load_state(spa) == SPA_LOAD_TRYIMPORT ||
spa_load_state(spa) == SPA_LOAD_RECOVER ||
(spa_load_state(spa) != SPA_LOAD_NONE &&
spa->spa_last_open_failed)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block is executed every time in production for no much reason.

Copy link
Contributor Author

@don-brady don-brady Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right. I'll remove it.

}

int samples = vd->vdev_children;
uint64_t data[LAT_SAMPLES_STACK];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope we have these 512 bytes of stack to waste.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more than I'd like, but my feeling is it should be alright. It does raise a few question though.

  1. Does the kmem_alloc() really hurt performance given that we call this at most once per raid_outlier_check_interval_ms?
  2. If it really is that expensive, it looks like we could allocate this buffer once and stash a pointer to it in the top-level raidz/draid vdev. The atomic_cas_64(&vd->vdev_last_latency_check) check on line 2923 should prevent concurrent access, and this would get right of the alloc in all cases. We'd definitely want to add a comment explaining this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

64 was chosen for benefit of draid path. Stack function depth is bounded and not that deep (below). But still, 512 is big and perhaps outside the spirt of being a good stack usage citizen.

	vdev_child_slow_outlier
	vdev_raidz_io_done
	zio_vdev_io_done
	zio_execute
	taskq_thread

q3 = latency_median_value(&data[(n+1) >> 1], n>>1);

uint64_t iqr = q3 - q1;
uint64_t fence = q3 + iqr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In comment you write fence = Q3 + 2 x (Q3 - Q1), but I don't see the 2x here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking back across branches not sure how it mutated. I'll put it back to:
return (q3 + ((q3 - q1) << 2));

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case, << 2 is 4x, not 2x. But in general you don't need to be so heavy on bit shifts. Compilers can optimize those things just fine.

Comment on lines 2985 to 2987
if (atomic_add_64_nv(&svd->vdev_outlier_count, 1) >
LAT_OUTLIER_LIMIT && svd == ovd &&
svd->vdev_read_sit_out_expire == 0) {
Copy link
Member

@amotin amotin Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why do you care about the vdev with maximum number of outliers here, considering first of them reachinf LAT_OUTLIER_LIMIT will reset all to 0? Also since this code should be executed only by one thread, I am not sure after atomic_cas_64() above vdev_outlier_count needs atomic accesses. Also vdev_read_sit_out_expire should also be always 0 here, since you've checked that above.

@tonyhutter
Copy link
Contributor

@don-brady I added a test case in my https://github.com/tonyhutter/zfs/tree/slow_vdev_sit_out branch. Commit is here: tonyhutter@8a58493

Would you mind pulling it into your patch stack? I have run it locally in a VM but have not tested in on the qemu runners yet.

also added test from Tony

Signed-off-by: Don Brady <[email protected]>
. $STF_SUITE/include/libtest.shlib

# Avoid name conflicts with the default pool
TESTPOOL2=testpool2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already defined in default.cfg.

log_must zinject -c all

# Wait for us to exit our sit out period
sleep 6
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a wait_sit_out() function to tests/zfs-tests/include/libtest.shlib along with the various other wait_* functions. Something like this should work (not tested).

#
# Wait for vdev 'sit_out' property to be cleared.
#
# $1 pool name
# $2 timeout
#
function wait_sit_out #pool timeout
{
        typeset timeout=${2:-300}
        typeset pool=${1:-$TESTPOOL}
        for (( timer = 0; timer < $timeout; timer++ )); do
                if [ "$(get_vdev_prop sit_out "$pool")" = "off" ]; then
                        return 0
                fi
                sleep 1;
        done
        
        return 1
}

and

log_must wait_sit_out $TESTPOOL2 10

}
}

/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To skip vdevs which are sitting out during a resilver we need to modify this check slightly. Each resilvering drive will add one to rr->rr_missingdata and if we're going to sit out one more drive there needs to be sufficient parity available. This updated code should do the trick. You'll need to update the raidz similarly.

Can you also add versions of tests/functional/replacement/attach_resilver.ksh, tests/functional/replacement/replace_resilver.ksh which sit out a drive during the resilver. We'll want to test raidz[1-3] and draid[1-3].

        /*
         * When the row contains a latency outlier and sufficient parity
         * exists to reconstruct the column data, then skip reading the
         * known slow child vdev as a performance optimization.
         */
        if (rr->rr_outlier_cnt > 0) {
                int available_parity = rr->rr_firstdatacol - rr->rr_missingparity;
                int required_parity = rr->rr_missingdata + 1;

                if (available_parity >= required_parity) {
                        for (int c = rr->rr_cols - 1;
                             c >= rr->rr_firstdatacol; c--) {
                                raidz_col_t *rc = &rr->rr_col[c];

                                if (rc->rc_error == 0 &&
                                    rc->rc_latency_outlier) {
                                        rr->rr_missingdata++;
                                        rc->rc_error = SET_ERROR(EAGAIN);
                                        rc->rc_skipped = 1;
                                        break;
                                }
                        }
                }
        }

@tonyhutter
Copy link
Contributor

"raidz" should be removed from the parameter name in Linux to be consistent with the FreeBSD name. Right now it is:

FreeBSD
$ sysctl -ad | grep sit_out
vfs.zfs.vdev.read_sit_out_secs: Raidz/draid slow disk sit out time period in seconds
Linux
$ ls -l /sys/module/zfs/parameters/ | grep sit_out
-rw-r--r--. 1 root root 4096 Jan  9 17:41 raidz_read_sit_out_secs

@tonyhutter
Copy link
Contributor

@don-brady this will resolve the test case changes and get it passing on FreeBSD:

DIFF
diff --git a/tests/zfs-tests/include/libtest.shlib b/tests/zfs-tests/include/libtest.shlib
index fbad1747b..975a50141 100644
--- a/tests/zfs-tests/include/libtest.shlib
+++ b/tests/zfs-tests/include/libtest.shlib
@@ -3090,6 +3090,26 @@ function wait_scrubbed #pool timeout
        done
 }
 
+# Wait for vdev 'sit_out' property to be cleared.
+#
+# $1 pool name
+# $2 vdev name
+# $3 timeout
+#
+function wait_sit_out #pool vdev timeout
+{
+       typeset timeout=${3:-300}
+       typeset vdev=$2
+       typeset pool=$1
+       for (( timer = 0; timer < $timeout; timer++ )); do
+               if [[ "$(get_vdev_prop sit_out $pool $vdev)" == "off" ]] ; then
+                   return 0
+               fi
+               sleep 1
+       done
+       false
+}
+
 # Backup the zed.rc in our test directory so that we can edit it for our test.
 #
 # Returns: Backup file name.  You will need to pass this to zed_rc_restore().
diff --git a/tests/zfs-tests/include/tunables.cfg b/tests/zfs-tests/include/tunables.cfg
index 7d667c553..1e82f6606 100644
--- a/tests/zfs-tests/include/tunables.cfg
+++ b/tests/zfs-tests/include/tunables.cfg
@@ -69,7 +69,7 @@ MULTIHOST_INTERVAL            multihost.interval              zfs_multihost_interval
 OVERRIDE_ESTIMATE_RECORDSIZE   send.override_estimate_recordsize       zfs_override_estimate_recordsize
 PREFETCH_DISABLE               prefetch.disable                zfs_prefetch_disable
 RAIDZ_EXPAND_MAX_REFLOW_BYTES  vdev.expand_max_reflow_bytes    raidz_expand_max_reflow_bytes
-RAIDZ_READ_SIT_OUT_SECS        vdev.raidz_read_sit_out_secs    raidz_read_sit_out_secs
+READ_SIT_OUT_SECS              vdev.read_sit_out_secs          raidz_read_sit_out_secs
 REBUILD_SCRUB_ENABLED          rebuild_scrub_enabled           zfs_rebuild_scrub_enabled
 REMOVAL_SUSPEND_PROGRESS       removal_suspend_progress        zfs_removal_suspend_progress
 REMOVE_MAX_SEGMENT             remove_max_segment              zfs_remove_max_segment
diff --git a/tests/zfs-tests/tests/functional/events/slow_vdev_sit_out.ksh b/tests/zfs-tests/tests/functional/events/slow_vdev_sit_out.ksh
index 27c3a8f5f..d859d1681 100755
--- a/tests/zfs-tests/tests/functional/events/slow_vdev_sit_out.ksh
+++ b/tests/zfs-tests/tests/functional/events/slow_vdev_sit_out.ksh
@@ -29,17 +29,14 @@
 #      1. Create various raidz/draid pools
 #      2. Inject delays into one of the disks
 #      3. Verify disk is set to 'sit out' for awhile.
-#      4. Wait for RAIDZ_READ_SIT_OUT_SECS and verify sit out state is lifted.
+#      4. Wait for READ_SIT_OUT_SECS and verify sit out state is lifted.
 #
 
 . $STF_SUITE/include/libtest.shlib
 
-# Avoid name conflicts with the default pool
-TESTPOOL2=testpool2
-
 function cleanup
 {
-       restore_tunable RAIDZ_READ_SIT_OUT_SECS
+       restore_tunable READ_SIT_OUT_SECS
        log_must zinject -c all
        destroy_pool $TESTPOOL2
        log_must rm -f $TEST_BASE_DIR/vdev.$$.*
@@ -50,8 +47,8 @@ log_assert "Verify sit_out works"
 log_onexit cleanup
 
 # shorten sit out period for testing
-save_tunable RAIDZ_READ_SIT_OUT_SECS
-set_tunable32 RAIDZ_READ_SIT_OUT_SECS 5
+save_tunable READ_SIT_OUT_SECS
+set_tunable32 READ_SIT_OUT_SECS 5
 
 log_must truncate -s 150M $TEST_BASE_DIR/vdev.$$.{0..9}
 
@@ -71,7 +68,7 @@ for raidtype in raidz raidz2 raidz3 draid1 draid2 draid3 ; do
 
        # Do some reads and wait for us to sit out
        for i in {1..100} ; do
-               dd if=/$TESTPOOL2/bigfile skip=$i bs=1M count=1 of=/dev/null &> /dev/null
+               dd if=/$TESTPOOL2/bigfile skip=$i bs=1M count=1 of=/dev/null
 
                sit_out=$(get_vdev_prop sit_out $TESTPOOL2 $BAD_VDEV)
                if [[ "$sit_out" == "on" ]] ; then
@@ -85,8 +82,7 @@ for raidtype in raidz raidz2 raidz3 draid1 draid2 draid3 ; do
        log_must zinject -c all
 
        # Wait for us to exit our sit out period
-       sleep 6
-       log_must test "$(get_vdev_prop sit_out $TESTPOOL2 $BAD_VDEV)" == "off"
+       log_must wait_sit_out $TESTPOOL2 $BAD_VDEV 10
        destroy_pool $TESTPOOL2
 done

@don-brady
Copy link
Contributor Author

"raidz" should be removed from the parameter name in Linux to be consistent with the FreeBSD name.

I was matching the existing name_prefix convention for the other raidz tunables (which may be wrong).

On Linux tunables should have some prefix scope since they all end up in parameters. Maybe use a name_prefix of `vdev' if you want consistency across platforms.

also changed
'raidz_read_sit_out_secs' to 'vdev_read_sit_out_secs'

Signed-off-by: Don Brady <[email protected]>
@tonyhutter
Copy link
Contributor

@don-brady I need to move slow_vdev_sit_out.ksh out of the test/functional/events dir to somewhere else. That should fix the failures on FreeBSD. Let me get back to you with a patch for that fix..

@tonyhutter
Copy link
Contributor

tonyhutter commented Jan 14, 2025

@don-brady this gets my test case working again on FreeBSD:

DIFF
diff --git a/tests/runfiles/common.run b/tests/runfiles/common.run
index 1ed9478c4..917650739 100644
--- a/tests/runfiles/common.run
+++ b/tests/runfiles/common.run
@@ -702,9 +702,11 @@ tests = ['dio_aligned_block', 'dio_async_always', 'dio_async_fio_ioengines',
     'dio_unaligned_block', 'dio_unaligned_filesize']
 tags = ['functional', 'direct']
 
-[tests/functional/events]
+[tests/functional/sit_out]
 tests = ['slow_vdev_sit_out']
-tags = ['functional', 'events']
+pre =
+post =
+tags = ['functional', 'sit_out']
 
 [tests/functional/exec]
 tests = ['exec_001_pos', 'exec_002_neg']
@@ -904,7 +906,7 @@ tags = ['functional', 'rename_dirs']
 tests = ['attach_import', 'attach_multiple', 'attach_rebuild',
     'attach_resilver', 'detach', 'rebuild_disabled_feature',
     'rebuild_multiple', 'rebuild_raidz', 'replace_import', 'replace_rebuild',
-    'replace_resilver', 'replace_resilver_sit_out' 'resilver_restart_001',
+    'replace_resilver', 'replace_resilver_sit_out', 'resilver_restart_001',
     'resilver_restart_002', 'scrub_cancel']
 tags = ['functional', 'replacement']
 
diff --git a/tests/zfs-tests/tests/Makefile.am b/tests/zfs-tests/tests/Makefile.am
index 42e7bce85..a24d17bc5 100644
--- a/tests/zfs-tests/tests/Makefile.am
+++ b/tests/zfs-tests/tests/Makefile.am
@@ -1497,7 +1497,6 @@ nobase_dist_datadir_zfs_tests_tests_SCRIPTS += \
        functional/events/events_001_pos.ksh \
        functional/events/events_002_pos.ksh \
        functional/events/setup.ksh \
-       functional/events/slow_vdev_sit_out.ksh \
        functional/events/zed_cksum_config.ksh \
        functional/events/zed_cksum_reported.ksh \
        functional/events/zed_diagnose_multiple.ksh \
@@ -1994,6 +1993,7 @@ nobase_dist_datadir_zfs_tests_tests_SCRIPTS += \
        functional/scrub_mirror/scrub_mirror_003_pos.ksh \
        functional/scrub_mirror/scrub_mirror_004_pos.ksh \
        functional/scrub_mirror/setup.ksh \
+       functional/sit_out/slow_vdev_sit_out.ksh \
        functional/slog/cleanup.ksh \
        functional/slog/setup.ksh \
        functional/slog/slog_001_pos.ksh \
diff --git a/tests/zfs-tests/tests/functional/events/slow_vdev_sit_out.ksh b/tests/zfs-tests/tests/functional/sit_out/slow_vdev_sit_out.ksh
similarity index 100%
rename from tests/zfs-tests/tests/functional/events/slow_vdev_sit_out.ksh
rename to tests/zfs-tests/tests/functional/sit_out/slow_vdev_sit_out.ksh

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants