-
We recently decided to move from Illumos (OpenIndiana, 5.11) to Linux (and ZFS on Linux / OpenZFS) for our storage systems. We're noticing that our disks report errors at a much higher rate than they did on the Illumos-based systems.
The disks we use are enterprise SSDs and HDDs, and SMART monitoring shows no issues with them. Moreover, as a matter of procedure, as soon as SMART shows the slightest issue with a disk, we proactively replace it with a new one.

Logging for the IO errors that we're seeing:
We've turned to the logs to track which errors were occurring, but they make it look like the physical disks are the culprit, as if it's unrelated to OpenZFS and our disks are simply failing. Yet, given that this started immediately after switching from Illumos to Linux on multiple machines, we'd like to rule out OpenZFS first; it feels more likely that tweaking ZFS will yield results than digging into the mpt3sas driver, which we find hard to believe is fundamentally different between Illumos and Linux. Besides, the machines are well-maintained, and machines of similar age that still run Illumos (quite a few) don't show this behaviour, which makes a physical cause (humidity, temperature, cabling, etc.) unlikely. Similar things happen to the HDDs, although their error rate seems somewhat lower (but still much higher than on Illumos).

SMART information about this disk in particular:
We tried changing some parameters, like lowering the

I'm not sure whether collecting perf data is sensible (or how we would match it with the moments the errors occur), or whether there are other tools to resort to when this kind of "random" behaviour occurs. If anyone can point me in the right direction (or to a book / blog / article) on tracing situations like these, I'm happy to learn.

Some more information about the SAS controller and the device (`sas3ircu 0 DISPLAY`):
ZFS `zpool events -v`:
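To line events like these up with the kernel log, one rough approach (only a sketch, assuming a systemd journal is available; the time window is illustrative) is to watch or query both sides over the same interval:

```sh
# follow ZFS events as they are generated, with full details and timestamps
zpool events -vf

# in another terminal, follow the kernel log at the same time
journalctl -kf

# or, after the fact, pull both views for the same window
zpool events -v | tail -n 50
journalctl -k --since "30 minutes ago"
```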
Update: after creating this post, we upgraded the OpenZFS version on the machine and continued testing; we also enabled debugging and have the following logging. Are there (still) "notorious" problems between the LSI SAS 2008 / 3008 HBA controllers and OpenZFS?

Kernel logging of some write errors that occurred:
-
In the meantime, I enabled debug logging on
to see whether, if another error occurs, anything precedes it.
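For reference, a minimal sketch of how driver-level debug logging can be raised on Linux, assuming this refers to the mpt3sas driver; the exact sysfs path and the bitmask value are assumptions to verify against the driver documentation:

```sh
# raise the mpt3sas logging level for one adapter at runtime
# (hostN is the HBA's SCSI host; 0x3f8 is just an example bitmask)
echo 0x3f8 > /sys/class/scsi_host/host0/logging_level

# or set it as a module option so it survives reboots
echo "options mpt3sas logging_level=0x3f8" > /etc/modprobe.d/mpt3sas-debug.conf

# the extra detail ends up in the kernel log
dmesg -w | grep -i mpt3sas
```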
-
Use the zfs PPA that backports newer versions, or upgrade to Ubuntu 22.04. You're using an ancient build of ZFS that no longer receives basic fixes.
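For example (a sketch; the PPA name is a placeholder for whichever backport source you trust):

```sh
# check the userland and kernel-module versions currently in use
zfs version
cat /sys/module/zfs/version

# option 1: install newer OpenZFS packages from a backport PPA (placeholder name)
sudo add-apt-repository ppa:<zfs-backport-ppa>
sudo apt update && sudo apt install zfsutils-linux zfs-dkms

# option 2: upgrade the whole system to Ubuntu 22.04, which ships a newer OpenZFS
sudo do-release-upgrade
```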
-
@maxboone Were these controllers already flashed to IT mode? If not, it would still be interesting why the same setup worked better on illumos. You could experiment with the firmware, as @bghira hinted.
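A quick way to check what the HBA is running versus what the kernel pairs with it (sketch; controller index 0 assumed, matching the `sas3ircu 0 DISPLAY` output above):

```sh
# list the controllers and dump firmware/BIOS details for the first one
sas3ircu LIST
sas3ircu 0 DISPLAY | grep -iE 'firmware|bios|controller type'

# the mpt3sas driver version Linux is using with that firmware
modinfo mpt3sas | grep -i '^version'
dmesg | grep -i mpt3sas | head
```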
-
So, first: are they all the same model of SSD/HBA/firmware in each machine, etc.? My guess would be some feature that illumos doesn't feel the need to implement but Linux does, or some I/O pattern differences. Those errors look like "SSD stopped responding for 30 seconds, so we aborted the IO" - does illumos perhaps default to a different timeout length, or not treat a single timeout as a fatal error to bubble up the OS stack?
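On Linux, the 30-second window mentioned here corresponds to the per-device SCSI command timeout, which can be inspected and raised as an experiment (sketch; `sdX` and the 60-second value are placeholders):

```sh
# current command timeout for one disk (Linux defaults to 30 seconds)
cat /sys/block/sdX/device/timeout

# temporarily raise it to see whether the aborts disappear
echo 60 > /sys/block/sdX/device/timeout

# note: this resets on reboot/hotplug; a udev rule or rc script is needed to make it stick
```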