Implement dynamic gang header sizes #17004
base: master
Conversation
module/zcommon/zfeature_common.c (outdated)
zfeature_register(SPA_FEATURE_DYNAMIC_GANG_HEADER,
    "com.klarasystems:dynamic_gang_header", "dynamic_gang_header",
    "Support for dynamically sized gang headers",
    ZFEATURE_FLAG_ACTIVATE_ON_ENABLE, ZFEATURE_TYPE_BOOLEAN, NULL,
While I also thought about this gang block issue, it immediately became obvious that it would require a read-incompatible pool feature, which I am very unhappy about. It would require updates to at least GRUB and the FreeBSD loader to not break booting, or careful compatibility settings from every user and distribution. The "activate on enable" makes it even worse, while I bet in many cases it could be avoided, for example for any pool with at least one ashift=9 vdev.
ACTIVATE_ON_ENABLE can certainly be avoided. If nothing else, we could have a check in zio_write_gang_block that only activates it if we get allocations back that let us use a header of size > 512. In theory, we could modify the feature refcount for every single gang block created and freed, but that feels like a bit much, since each time you would need to create a dmu_tx_t, assign it to the io's txg, etc. Doing it once you actually write a new-style gang block for the first time, and never deactivating it, should mean that for most pools it never gets activated, and for those with 512-byte-sector vdevs it also wouldn't get activated.
Read-incompatibility, on the other hand, is definitely unavoidable. For now, at least, people can avoid enabling the feature on their root pools, so the bootloader issue isn't as pressing.
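A minimal sketch of the check described above, assuming gangblocksize is the header size the allocator gave us; the helper that performs the actual activation is hypothetical:

    /*
     * Hypothetical check in zio_write_gang_block(): only activate the
     * feature the first time an allocation actually permits a gang
     * header larger than the legacy 512 bytes (SPA_MINBLOCKSIZE).
     */
    if (gangblocksize > SPA_MINBLOCKSIZE &&
        !spa_feature_is_active(spa, SPA_FEATURE_DYNAMIC_GANG_HEADER)) {
            /* Activate once; never deactivate. */
            spa_activate_dynamic_gang_feature(spa); /* hypothetical helper */
    }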
"avoid activating the feature" means never running zpool upgrade
for existing pools, or setting right compatibility
property on creation or at least before upgrading. At this time we don't even support disabling features that were enabled by mistake and that are not active. We should implement that at some point BTW. This might be a good motivation, if we accept the incompatibility.
It turns out a slightly bigger hammer than expected is needed here, so we definitely don't want to update this for every single gang header: 1) I forgot that this feature flag needed the MOS flag, and 2) MOS features basically have to be updated from syncing context. This means we have to spin out a synctask to do the job for us.
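For illustration, a minimal sketch of what such a synctask might look like, assuming the stock spa_feature_incr()/dsl_sync_task_nowait() interfaces; the function name and dispatch details here are assumptions, not the PR's actual code:

    /*
     * Runs in syncing context, where MOS feature refcounts may be
     * updated; dispatched from open context with something like
     * dsl_sync_task_nowait(spa_get_dsl(spa), <this func>, spa, tx).
     */
    static void
    spa_activate_dynamic_gang_sync(void *arg, dmu_tx_t *tx)
    {
            spa_t *spa = arg;

            if (!spa_feature_is_active(spa, SPA_FEATURE_DYNAMIC_GANG_HEADER))
                    spa_feature_incr(spa, SPA_FEATURE_DYNAMIC_GANG_HEADER, tx);
    }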
module/zfs/zio.c (outdated)
lsize = MIN(P2ROUNDUP(resid / (gbh_nblkptrs(gangblocksize) - g),
    spa->spa_min_alloc), resid);
The fact that on a vdev with ashift=12 we can allocate a 4KB gang header does not mean we need 31 children. If we could fit into 3, we could remain read-compatible and would not explode the number of allocations. Some RAIDZ3 will happily consume 4x more space here. What happens on vdevs with ashift=15, I don't even want to think about.
Meanwhile, while proper rounding here is indeed needed, I think it should be spelled as:
lsize = zio_roundup_alloc_size(spa, resid / (gbh_nblkptrs(gangblocksize) - g));
lsize = MIN(resid, lsize);
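For a concrete feel of what this computes, using the numbers from this thread: a 4KB gang header has room for 31 blkptrs, so the first pass over a 128KB (131072-byte) write on an ashift=12 pool computes 131072 / 31 ≈ 4228 bytes, which zio_roundup_alloc_size() rounds up to an 8KB allocation; MIN(131072, 8192) = 8192, so the write gangs into roughly sixteen 8KB leaves rather than thirty-one 4KB ones.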
Yeah, I noticed that in my testing, which is why I implemented this. The example I mentioned in the description, of a 128KiB block ganging into 64k sectors, was exactly the setup where I encountered it.
zio_roundup_alloc_size is handy; I did not know about that!
"which is why I implemented this"
Proper rounding should make it somewhat better, but will not solve it completely. RAIDZ3 still has an allocation size of one ashift-sized sector, but small allocations are too expensive on it. It would be good to allocate as few fragments as possible, but I can't think of much other than trying different sizes in a loop, unless we implement some sort of opportunistic allocation.
Ah, it sounds like your point is that instead of saying "please allocate me the smallest possible chunk" and then assuming that all of the gang leaves are of that size, it would be nice if we could take full advantage of whatever sizes we have available, in order to try to minimize the number of leaves we allocate?
I agree that would be nice, but I think that would be a separate change, since it would involve new (probably somewhat complex) logic. I don't think it would need another feature flag; the gang issue code calculates the offset of each subsequent leaf node by looking at BP_GET_PSIZE of the current one. As long as those were set appropriately, I think we could have mixed leaf sizes without breaking on-disk compatibility.
I think such a feature might want cooperation from the allocation code; rather than trying sizes in a loop, providing a size range to the allocation code would solve the problem, though that would be a lot of plumbing.
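To illustrate the compatibility point, here is a minimal sketch of the offset calculation described above, loosely modeled on the gang-tree issue loop in zio.c; the surrounding variables and the issue_gang_child helper are assumptions for illustration:

    /*
     * Each child's extent of the logical range is BP_GET_PSIZE of its
     * blkptr, so mixed leaf sizes stay readable on disk as long as
     * PSIZE is set correctly on each leaf.
     */
    uint64_t offset = 0;
    for (int g = 0; g < gbh_nblkptrs(gangblocksize); g++) {
            blkptr_t *gbp = &gbh->zg_blkptr[g];
            if (BP_IS_HOLE(gbp))
                    continue;
            issue_gang_child(zio, gbp, offset); /* hypothetical helper */
            offset += BP_GET_PSIZE(gbp);
    }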
I don't think allocation of 31 4KB chunks for a 128KB write is acceptable, even as only a starter. Compared to that, even the existing algorithm of nested gangs looks efficient.
I agree that this is a serious issue (though in that specific example you would actually end up with 16 8KB chunks, I think; that doesn't change the basic problem). But I really don't want to balloon this change with a new mechanism for dealing with maximal leaf sizing; that change is likely going to require extensive modifications to the allocator or the way we issue gang child IOs, and quite possibly both.
How would you feel about a tunable that enables the larger gang headers, set to false by default? That way even if someone upgrades their pool, they still don't get the new functionality unless they also flip the tunable. That would let us keep each commit relatively contained in scope/functionality, which makes them easier to implement and review. Plus, people who really need the better gang headers could turn them on, but until the fix is in (which would hopefully be in the same dot-release) most users would be unaffected.
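As a sketch of the kind of tunable being proposed here, following the usual ZFS_MODULE_PARAM pattern; the parameter name and the gating site are hypothetical, not the PR's actual code:

    /*
     * Hypothetical module parameter: even with the pool feature
     * enabled, gang headers larger than 512 bytes are only used when
     * this is set.
     */
    static int zfs_dynamic_gang_headers = 0;

    /* ... and in zio_write_gang_block(), something like: */
    if (!zfs_dynamic_gang_headers)
            gangblocksize = SPA_MINBLOCKSIZE; /* legacy 512-byte header */

    ZFS_MODULE_PARAM(zfs, zfs_, dynamic_gang_headers, INT, ZMOD_RW,
            "Allow gang headers larger than 512 bytes");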
I could live with it disabled, but I would not commit it this way myself. The questionable result is not worth a read-incompatible pool feature, in my opinion.
If you need an excuse to work on opportunistic allocation, maybe it could be the ZIL, which usually has only a guess of what it may want next, but in many cases could take whatever it is given. I remember there was also a problem with fragmentation, which Matt Ahrens handled by adding a module parameter limiting the maximum ZIL block size.
Signed-off-by: Paul Dagnelie <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Motivation and Context
ZFS gang block headers are currently fixed at 512 bytes. This is increasingly wasteful in the era of larger disk sector sizes. This PR allows any size allocation to work as a gang header. It also contains supporting changes to ZDB to make gang headers easier to work with.
Description
The ZDB changes come first; basically, there are just some small tweaks to make it easier to work with gang blocks. First, the compact blkptr printer now notes which DVAs have the gang bit set. There is also a fix for a bug that has been around since 2009: ZDB gang block header printing has been broken since then. The problem is that if you do a zio_read of a BP with the gang bit set, you don't get back the header, you get back the underlying data. The fix is to just not set the gang bit.

The way dynamically sized gang headers work is that the amount of space we allow ourselves to use for the gang header is equal to the size of the smallest possible allocation on the vdevs we got back when we allocated it. This is necessary to work around the fact that the ASIZE for a gang BP isn't the allocated size of the gang header, but of the entire tree under it. Because of that, when reading, claiming, or freeing, the allocated size of a gang header must be determinable from no other information than the vdev it was allocated on; we reuse the existing vdev_gang_header_asize for this. We take this minimum space and pack the block full of blkptrs, up to a tail with an embedded checksum. This allows us to store many more gang children per header, leading to much shallower gang trees if we're forced into intense ganging.

One important wrinkle is that if the pool we're using has old gang headers on it, they may be smaller than the smallest allocation of the relevant vdevs. We have a workaround for this in zio_checksum_error, which will now try the checksum again in the case of a gang block of size above 512 bytes, with the size reduced to 512. This should not pose a significant performance problem, since 1) calculating a checksum for a block that small doesn't take very long, and 2) hopefully the number of gang blocks on systems is not too high. This also only applies to systems that have both old gang headers and the feature enabled.

Much of the remaining changes are just tweaks to the gang tree orchestration logic to work with a variable number of gang block pointers. The final interesting change is to the logic that determines the size of the gang children. When only 3 gang children were possible, it didn't really matter what the sector size was; you were going to be allocating at most 3 of them, so the space waste was limited. With large sector sizes, however, it could get expensive: a 128k allocation that gangs on a system with 64k sectors would, without changes, consume 256 64k sectors, with 512 bytes of data each, wasting 128x as much space as the original allocation. The fix is to use the spa_min_ashift rather than the MINBLOCKSIZE when calculating the lsize.
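As an illustration of the header layout described above (blkptrs packed up to a checksum tail), the per-header child count works out to something like the following sketch; gbh_nblkptrs appears in the diff discussed in the review, but this particular definition is an assumption:

    static inline uint64_t
    gbh_nblkptrs(uint64_t gangblocksize)
    {
            /*
             * As many blkptrs as fit before the zio_eck_t checksum
             * tail: (512 - 40) / 128 = 3 for the classic header, and
             * (4096 - 40) / 128 = 31 for a 4KB header, matching the
             * child counts mentioned in the review above.
             */
            return ((gangblocksize - sizeof (zio_eck_t)) /
                sizeof (blkptr_t));
    }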
How Has This Been Tested?
In addition to the ZTS and zloop, extensive manual tests were performed, verifying 1) that the new ZDB functionality works, and 2) that larger gang headers are correctly used and allocated when applicable. Mixed-ashift testing occurred, as well as the full compatibility matrix of old pools and new pools without the feature enabled, running against old and new code.