Potential trim issue on 0.8.0 #8835
Replies: 32 comments 6 replies
-
To add to this, some 9 hours later, still running. Manually running:
does nothing, while the other devices report
After running the above:
-
@kroy-the-rabbit that would be a heck of a coincidence. According to the console messages posted, an unexpected timeout was encountered (hostbyte=DID_TIME_OUT). After aborting the command, it was converted to an IO error by the SCSI layer and then reported to ZFS. This doesn't necessarily mean there's anything wrong with the device, but something in the path resulted in the timeout. A couple of quick questions.
-
Yep. The rest of the pool other than the failed device says:
The other mirror device in that vdev says:
The "failed" device says (and doesn't seem to be moving):
That did bring the status to "untrimmed". Restarting it for the specific device resulted in a positive result in less than 3 minutes:
-
It sounds as if ZFS may have overwhelmed the controller with outstanding TRIM commands, resulting in the unexpected timeout and subsequently faulted device. You could try setting the module option
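Later replies in this thread point at `zfs_vdev_trim_max_active`, so presumably that is the option meant here. A minimal sketch of lowering it at runtime (the modprobe.d path is illustrative and may vary by distro):

```shell
# Check the current limit on concurrent TRIM commands queued per vdev
cat /sys/module/zfs/parameters/zfs_vdev_trim_max_active

# Lower it so the controller sees fewer outstanding TRIMs at once
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_vdev_trim_max_active

# Make the setting persistent across reboots
echo "options zfs zfs_vdev_trim_max_active=1" | sudo tee /etc/modprobe.d/zfs.conf
```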
-
So I didn't want to open another ticket, but I'm not sure this is the same thing. root@io:/z/vm_backup# lsb_release -a; uname -a filename: /lib/modules/4.15.0-70-generic/updates/dkms/zfs.ko
Maybe it's my disk not dealing well with trim? Maybe not. This is my backup system, so I don't really want to use it as a testbed to run more tests, but if there's something non-destructive you want me to do, I can try it. Oh, and there were no disk errors of any kind in kern.log.
-
I have issues with trim too; trimming the pool results in zvol corruption ALWAYS.
Status says it was trimmed OK: (100% trimmed, completed at Fri 13 Dec 2019 03:42:04 PM MSK)
-
echo 1 > /sys/module/zfs/parameters/zfs_vdev_trim_max_active didn't help
-
@Temtaime if it's not too much to ask, would it be possible for you to rerun your test case with sudo sh -c "echo 2048 >/sys/module/zfs/parameters/zfs_flags"? This should help us determine whether it's the software or the hardware causing the problem.
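For anyone following along, a sketch of how the debug output from that flag can be collected; the flag value 2048 is the one suggested above (presumably a TRIM-related debug flag), and "tank" is a placeholder pool name:

```shell
# Enable the debug flag behlendorf suggested
echo 2048 | sudo tee /sys/module/zfs/parameters/zfs_flags

# Ensure the in-kernel debug log is being recorded
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_dbgmsg_enable

# Rerun the failing trim, then inspect the debug ring buffer
sudo zpool trim tank
tail -n 50 /proc/spl/kstat/zfs/dbgmsg
```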
-
@behlendorf Hello. Thanks for the reply.
-
@behlendorf Long story short, I replaced the SSD and the issue is gone.
-
Sorry I didn't get back to you before. *-disk and *-disk: I have a single pool with them mirrored together, and if I run trim, it all goes boom.
-
@behlendorf Hi from stu. Happy Friday. So those drives I had a problem with are going to be freed up in the next few days, I hope, for a little while anyway. I asked tom, and he said to tag you and ask if you'd want me to do any testing with these drives. I'm inclined to agree with you that the firmware on these drives is not capable of dealing with whatever trims ZFS is throwing at it. But if you think this is a problem worth chasing and there's something I can test on these drives, lemme know; I can do whatever. A short version of what happened so you don't have to read through everything again:
-
Thanks for the update. If you're able to do a little testing with the drives, it would be helpful to create a new pool and see if you can reproduce the issue. If so, are there any IO errors logged to dmesg? What does the
-
Didn't have a lot of time; I'll get you more information, but the first set of stuff is interesting... I ran zpool trim z, went away, came back after a while... These two SSDs are mirrored together:
lrwxrwxrwx 1 root root 10 May 2 20:30 ata-SanDisk_SSD_PLUS_1000GB_191177464112-part1 -> ../../sdf1
So kern.log has IO errors on sdd, which is 190919447101:
May 3 07:45:06 io kernel: [207209.921245] ata4.00: exception Emask 0x0 SAct 0x200 SErr 0x0 action 0x0
But the other disk seems fine (from a kern.log point of view), yet the data is corrupt; the mirror couldn't get the data from either drive? There are zero errors in kern.log for the sdf drive. Reading the files zpool status -v says are bad, I get IO errors...
root@io:/dev/disk/by-id# file /z/backup/sp/spford/ford/git/o/objects/e8/6955a1e45bf3d4c3e3d7836d8c10a14389711b
-
A few more notes: I checked and there's no firmware update for these drives, so it is what it is.
Also, I guess this makes sense, just wanted to check: when doing a zpool status, I noticed that every time I did a du on my zpool, the CKSUM errors went up. I guess that's because ZFS was unable to repair the blocks, so it kept finding them bad over and over every time I tried to read that data, and the CKSUM value is just a tally of errors it has come across?
Anyway, I'll let you know if I find anything else worth mentioning, but it's starting to look like either a bad cable or something not seated right. I blew away the pool, made a new one, and am generating more data and deletes; it seems to be holding up for now. Also, I'm realizing these appear to be pretty low-grade consumer-level drives, so one can only ask for so much. Thanks for the feedback anyway.
-
I have a similar issue occurring with autotrim=on on an IBM M1015 HBA (LSI 9211-8i). I can isolate it to the HBA, as the raidz1 consists of disks on both the HBA and the MoBo SATA controller. With autotrim=on I receive error messages like this on all 4 HBA-connected devices (none on MoBo-connected disks): With autotrim=off I have so far not seen the issue appearing. A manual trim also did not invoke any issues.
Update: I ran a scrub without autotrim and did receive I/O errors (though not zio pool errors but typical blk_update_request IO errors). I ordered a new 8087 cable to check if the errors persist with a new cable. If I do not report back, consider my issue solved ;-)
@nixomose what HBA are you using?
-
All my drives are plugged directly into the motherboard, so whatever controller's on the board, which lshw says is: product: 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] [8086:1E02]
-
I can't imagine my problem is the controller, though, because other SSDs attached the same way are fine with trim. I'm sure it's the drives in my case. Never buying those again. :-)
-
Hello, it seems like I've faced a similar issue. I hope some more info could help to figure out the root cause of the issue. I'm running NixOS 19.09 with ZFS 0.8.3 (there was the same issue with 0.8.2). So, before enabling the autotrim feature, I decided to test on one device first using the
The trim would eventually finish successfully. A scrub passes successfully after that as well, and the pool seems to continue running just fine. I faced that issue more than 3 months ago, and after a successful scrub I cleaned up the errors and they never occurred again. So, that's quite interesting, and I'm wondering whether it's even a real error or some issue that leads ZFS to report one... Here is what I see in the logs at the time of running the trim command:
-
So I installed the new cable without any significant change; I still get errors like these regularly:
It does not seem to be a trim issue but another ZFS-related issue. Before converting to ZFS, I ran those disks as mdadm raid5 without any issues. Nevertheless, this seems to be isolated to the controller. 4 out of 5 disks are connected via the HBA (LSI 9211-8i) and all of them are throwing these errors. It is very unlikely that all 4 SSDs all of a sudden have hardware issues. I hope someone has an idea how to fix this, although it is not related to trim...
-
Someone on reddit mentioned this bug in response to a post I submitted inquiring whether the Linux mpt3sas driver may have been responsible for a reproducible bug with zpool trim, where using zpool trim with a single-vdev mirror pool comprised of Samsung 860 EVO SSDs was causing controller resets with an LSI 9305-16i SAS HBA. Running Gentoo Linux, with 16+ years of experience using it.
Here's the server kernel config running Linux 5.4.38 and the kernel log result of attempting to use zpool trim with zfs_vdev_max_active=2. After recovering from the first round of disk resets and having set zfs_vdev_trim_max_active=1, zpool trim degraded the mirror pool due to more disk resets and faulted one disk in the pool. Currently running ZFS 0.8.4, but this was reproducible on previous ZFS versions when I was willing to attempt testing it.
I'm not certain what to attempt from here, but I know I'm not running zpool trim on my root disk pool set until further notice, which this mirror pool certainly is not. I'm likely going to try a pool rebuild and attempt to trim with a fresh pool, and will report the results. Here's the kernel logs post zfs_vdev_trim_max_active=1
Follow-up zpool commands and general system info
-
Update to my previous comment. I migrated datasets off of the 250GB mirror pool to prepare for further testing.
Pool creation command string used:
Results... zpool trim succeeded with no device reset errors, same SSD disks and same server. Bizarre... I also tested rate-limiting trim using zpool trim -r, initially to attempt to further limit the trim intensity, expecting dmesg to fill with mpt3sas device reset errors; however, there were none after testing zpool trim -r with several different trim rate limits, or just allowing trim to proceed at full I/O speed.
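For reference, the rate-limited form mentioned here follows `zpool trim [-r rate] pool [device...]`; a sketch with placeholder pool and rate values:

```shell
# Ask ZFS to trim at roughly 100 MiB/s per device instead of full speed
sudo zpool trim -r 100M tank

# Watch progress; -t adds per-vdev trim status in recent ZFS releases
zpool status -t tank

# Suspend or cancel the trim if the controller starts resetting
sudo zpool trim -s tank
sudo zpool trim -c tank
```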
-
I believe I just ran into this issue as well. Running Ubuntu Server 20.04.01 with a fully upgraded zpool: 4 Samsung 860 EVOs in a mirror-striped setup, running a stock
Currently, running trim on each drive individually is working fine.
-
How did you run trim individually on each drive? I've set the kernel parameter zfs_vdev_trim_max_active=1 and tested this again across several ZFS versions up to 0.8.5. The issue has been more frequently reproducible for disks containing partitions, where ZFS resides only within the boundary of a disk partition; however, my "deadpool" mirror pool, where ZFS was not restricted to the boundaries of disk partitions, recently experienced the same complication, so theories about why this occurs are uncertain. It may be worth mentioning my rpool was created by CLI parted with a GPT disk label.
My 500GB mirror pool set, I believe, has Samsung firmware updates available, and the LSI 9305-16i HBA had firmware updates applied but does have a more recent firmware update available; however, when I previously had maintenance windows available, Samsung Magician was unable to update firmware for disks connected to the LSI HBA controller. I have another maintenance window planned within the coming weeks to decommission deadpool, replacing that mirror set, then expanding rpool by adding another 500GB mirror, along with a planned strategy for applying firmware updates to the SATA SSDs and HBA controller. After the updates and maintenance are complete, I'll be able to attempt to reproduce or isolate this bug further.
I have been suspicious that SATA SSDs run by a SAS HBA may have been part of the cause of the complications experienced, because of the interface capability differences between SAS and SATA; while similar, IIRC SATA is not capable of the command or bandwidth parallelism of SAS. I've not been able to (or have not attempted to) reproduce this bug using the direct SATA 6Gb/s connections provided by a consumer PC motherboard; however, soon I will have the 250GB SSD set available for installing an rpool on another PC using an ASUS Prime Z270-A motherboard.
-
Interesting. Two of the drives in my setup are partitioned to contain EFI, so I fall within both of those groupings. Unfortunately, I don't have a good solution for trimming drives individually; I just issued the trim command by hand.
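One way to trim drives one at a time is to pass an explicit device to zpool trim; the pool and device names below are placeholders:

```shell
# Trim a single member disk of the pool rather than every vdev at once
sudo zpool trim tank ata-Samsung_SSD_860_EVO_1TB_XXXXXXXX

# Block until that trim finishes (zpool wait is available in ZFS 2.0+),
# then repeat for the next device
sudo zpool wait -t trim tank
```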
-
Hi,
-
Did you upgrade the kernel by any chance during the replacement?
It looks like 5.4.152 has some changes to trimming on Samsung 860 SSDs.
There is a way to restore NCQ for the drive using kernel command line
arguments.
========================
commit caff281e2073e4e71fd0bced2385b7771b512264
Author: Kate Hsuan ***@***.***>
Date: Fri Sep 3 17:44:11 2021 +0800
libata: Add ATA_HORKAGE_NO_NCQ_ON_ATI for Samsung 860 and 870 SSD.
commit 7a8526a5cd51cf5f070310c6c37dd7293334ac49 upstream.
Many users are reporting that the Samsung 860 and 870 SSD are having
various issues when combined with AMD/ATI (vendor ID 0x1002) SATA
controllers and only completely disabling NCQ helps to avoid these
issues.
Always disabling NCQ for Samsung 860/870 SSDs regardless of the host
SATA adapter vendor will cause I/O performance degradation with well
behaved adapters. To limit the performance impact to ATI adapters,
introduce the ATA_HORKAGE_NO_NCQ_ON_ATI flag to force disable NCQ
only for these adapters.
.....
========
…On Tue, Nov 30, 2021 at 12:26 PM QBANIN ***@***.***> wrote:
Hi,
Until a week ago I had ***@***.*** with 2x Samsung 860 Evo as ZFS
RAID1 + 2x WD RED 2,5" 1TB HDD as RAID0 and 4x WD RED 3,5" 3TB as RAID10.
Recently I replaced all the 3,5" WD REDs with Seagate SAS drives, and since then I'm
getting these errors at least once a day. It happens at hour xx and minute 01 (a few
times at 2:01 and 16:01), as in the attached log. sda and sdd are Samsung 860
Evo drives. The weird thing is that for almost 2 years I had no problems
until the recent 3,5" drive replacement. Any clue why this error appears
always at minute 01 and how to fix it? SMART for all drives is clear. After
these read errors the RAID1 pool becomes degraded but works fine after a scrub,
till next time.
sudo zfs version
zfs-2.0.3-9bpo10+1
zfs-kmod-2.0.3-9bpo10+1
uname -a
Linux NAS 5.4.161-qba1 #3 SMP
Sun Nov 21 18:11:35 CET 2021 x86_64 GNU/Linux
sudo dmesg -T |grep DID
[nie lis 28 16:01:06 2021] sd 0:0:3:0: [sdd] tag#8439 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[nie lis 28 16:01:06 2021] sd 0:0:3:0: [sdd] tag#8401 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[nie lis 28 16:01:06 2021] sd 0:0:3:0: [sdd] tag#8424 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[nie lis 28 16:01:06 2021] sd 0:0:3:0: [sdd] tag#8418 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[nie lis 28 16:01:06 2021] sd 0:0:3:0: [sdd] tag#8437 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[nie lis 28 16:01:06 2021] sd 0:0:3:0: [sdd] tag#8390 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[nie lis 28 16:01:06 2021] sd 0:0:3:0: [sdd] tag#8408 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[nie lis 28 16:01:06 2021] sd 0:0:3:0: [sdd] tag#8438 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[nie lis 28 16:01:06 2021] sd 0:0:3:0: [sdd] tag#8405 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[nie lis 28 16:01:06 2021] sd 0:0:3:0: [sdd] tag#8386 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[nie lis 28 16:01:38 2021] sd 0:0:3:0: [sdd] tag#8429 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[nie lis 28 16:01:38 2021] sd 0:0:3:0: [sdd] tag#8427 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[nie lis 28 16:01:38 2021] sd 0:0:3:0: [sdd] tag#8407 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[nie lis 28 16:01:38 2021] sd 0:0:3:0: [sdd] tag#8433 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[nie lis 28 16:01:38 2021] sd 0:0:3:0: [sdd] tag#8397 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[nie lis 28 16:01:38 2021] sd 0:0:3:0: [sdd] tag#8428 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[nie lis 28 16:01:38 2021] sd 0:0:3:0: [sdd] tag#8426 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[nie lis 28 16:01:38 2021] sd 0:0:3:0: [sdd] tag#8425 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[nie lis 28 16:01:38 2021] sd 0:0:3:0: [sdd] tag#8423 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[nie lis 28 16:01:38 2021] sd 0:0:3:0: [sdd] tag#8422 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[nie lis 28 16:02:09 2021] sd 0:0:0:0: [sda] tag#8447 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[nie lis 28 16:02:09 2021] sd 0:0:0:0: [sda] tag#8442 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[nie lis 28 16:02:09 2021] sd 0:0:0:0: [sda] tag#8439 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[nie lis 28 16:02:09 2021] sd 0:0:0:0: [sda] tag#8438 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[nie lis 28 16:02:09 2021] sd 0:0:0:0: [sda] tag#8436 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[nie lis 28 16:02:09 2021] sd 0:0:0:0: [sda] tag#8427 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[nie lis 28 16:02:09 2021] sd 0:0:0:0: [sda] tag#8423 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[wto lis 30 02:01:04 2021] sd 0:0:0:0: [sda] tag#9392 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:01:04 2021] sd 0:0:0:0: [sda] tag#9387 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:01:04 2021] sd 0:0:0:0: [sda] tag#9360 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:01:04 2021] sd 0:0:0:0: [sda] tag#9362 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:01:04 2021] sd 0:0:0:0: [sda] tag#9369 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:01:04 2021] sd 0:0:0:0: [sda] tag#9401 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:01:04 2021] sd 0:0:0:0: [sda] tag#9391 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:01:04 2021] sd 0:0:0:0: [sda] tag#9382 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[wto lis 30 02:01:04 2021] sd 0:0:0:0: [sda] tag#9388 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:01:04 2021] sd 0:0:0:0: [sda] tag#9370 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:01:35 2021] sd 0:0:0:0: [sda] tag#9348 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:01:35 2021] sd 0:0:0:0: [sda] tag#9347 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:01:35 2021] sd 0:0:0:0: [sda] tag#9345 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:01:35 2021] sd 0:0:0:0: [sda] tag#9344 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:01:35 2021] sd 0:0:0:0: [sda] tag#9360 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:01:35 2021] sd 0:0:0:0: [sda] tag#9359 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[wto lis 30 02:01:35 2021] sd 0:0:0:0: [sda] tag#9358 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[wto lis 30 02:01:35 2021] sd 0:0:0:0: [sda] tag#9357 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[wto lis 30 02:01:35 2021] sd 0:0:0:0: [sda] tag#9356 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[wto lis 30 02:01:35 2021] sd 0:0:0:0: [sda] tag#9346 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:02:05 2021] sd 0:0:0:0: [sda] tag#9396 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:02:05 2021] sd 0:0:0:0: [sda] tag#9389 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:02:05 2021] sd 0:0:0:0: [sda] tag#9388 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:02:05 2021] sd 0:0:0:0: [sda] tag#9366 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[wto lis 30 02:02:05 2021] sd 0:0:0:0: [sda] tag#9371 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:02:05 2021] sd 0:0:0:0: [sda] tag#9349 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[wto lis 30 02:02:06 2021] sd 0:0:3:0: [sdd] tag#9393 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[wto lis 30 02:02:06 2021] sd 0:0:3:0: [sdd] tag#9370 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[wto lis 30 02:02:06 2021] sd 0:0:3:0: [sdd] tag#9369 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[wto lis 30 02:02:06 2021] sd 0:0:3:0: [sdd] tag#9368 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
-
@xartin This error appears with autotrim both on and off. @IvanVolosyuk I'm almost 100% certain that I upgraded to >5.4.152 a few weeks before the disk replacement. Besides, I don't use an ATI controller. I tried to trigger this issue by running trim manually, but no luck.
-
Yesterday I disabled NCQ for the 2 SSD drives with the echo 1 > /sys/block/sdX/device/queue_depth command. Today the same error was triggered by the 2 remaining SATA WD RED HDDs. Guess when? At 11:01 :( What kind of operation is Linux performing at xx:01? There's nothing scheduled in cron at this time. Could there be some kind of interference between the SAS and SATA drives?
BTW, drives attached to the LSI controller don't respect the libata.force=noncq kernel parameter (queue depth is 32 after reboot); it works only for drives attached to the mobo SATA controller.
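That behavior would make sense: libata.force only affects devices driven by the kernel's libata (ATA/AHCI) layer, while disks behind an LSI SAS HBA go through the SCSI/mpt stack instead. A sketch of both approaches, with placeholder port and device names:

```shell
# Kernel command line, AHCI/libata ports only; "3.00" is a placeholder
# port ID -- check dmesg for the ataN.NN numbering of your drive:
#   libata.force=3.00:noncq

# Runtime alternative that also works per-device: shrink the queue depth
# so NCQ is effectively unused (sdX is a placeholder device name)
echo 1 | sudo tee /sys/block/sdX/device/queue_depth
cat /sys/block/sdX/device/queue_depth   # should now read 1
```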
-
I'm hitting this issue on NixOS 21.05/21.11 with SanDisk Plus drives.
-
System information
Describe the problem you're observing
Upgraded a 0.7.13 pool to 0.8.0. Started a trim on the pool. Within minutes a disk was marked as faulted. SMART data is clean, regular scrubs are occurring, and the disk has never once thrown an error. It has less than 2000 hours on it.
I'm not saying the disk isn't bad, but it would be a heck of a coincidence.
Cleared the error on the disk, now the disk seems to be in a permanent "trimming" state.
The zpool layout is eight S3610s in four mirrors.
Describe how to reproduce the problem
Include any warning/errors/backtraces from the system logs
Permanent trimming (now 5 hours later):
Disk errors:
After clearing the faulted device: