Implement dynamic gang header sizes #17004
base: master
Conversation
module/zcommon/zfeature_common.c (outdated)
zfeature_register(SPA_FEATURE_DYNAMIC_GANG_HEADER,
    "com.klarasystems:dynamic_gang_header", "dynamic_gang_header",
    "Support for dynamically sized gang headers",
    ZFEATURE_FLAG_ACTIVATE_ON_ENABLE, ZFEATURE_TYPE_BOOLEAN, NULL,
While I also thought about this gang block issue, it immediately became obvious that it would require a read-incompatible pool feature, which I am very unhappy about. It would require updates to at least GRUB and the FreeBSD loader to not break booting, or careful compatibility settings from every user and distribution. The "activate on enable" makes it even worse, while I bet in many cases it could be avoided, for example for any pool with at least one ashift=9 vdev.
ACTIVATE_ON_ENABLE can certainly be avoided. If nothing else, we could have a check in zio_write_gang_block that only activates it if we get allocations back that let us use a header of size > 512. In theory, we could modify the feature refcount for every single gang block created and freed, but that feels like a bit much, since each time you would need to create a dmu_tx_t, assign it to the io's txg, etc. Doing it once you actually write a new-style gang block for the first time, and never deactivating it, should mean that for most pools it never gets activated, and for those with 512-byte-sector vdevs it also wouldn't get activated.
Read-incompatibility, on the other hand, is definitely unavoidable. For now, at least, people can avoid enabling the feature on their root pools, so the bootloader issue isn't as pressing.
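A minimal sketch of the check described above, assuming gangblocksize is the header size the allocator gave us; the helper that performs the actual activation is hypothetical:

    /*
     * Hypothetical check in zio_write_gang_block(): only activate the
     * feature the first time an allocation actually permits a gang
     * header larger than the legacy 512 bytes (SPA_MINBLOCKSIZE).
     */
    if (gangblocksize > SPA_MINBLOCKSIZE &&
        !spa_feature_is_active(spa, SPA_FEATURE_DYNAMIC_GANG_HEADER)) {
            /* Activate once; never deactivate. */
            spa_activate_dynamic_gang_feature(spa); /* hypothetical helper */
    }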
"avoid activating the feature" means never running zpool upgrade
for existing pools, or setting right compatibility
property on creation or at least before upgrading. At this time we don't even support disabling features that were enabled by mistake and that are not active. We should implement that at some point BTW. This might be a good motivation, if we accept the incompatibility.
It turns out a slightly bigger hammer than expected is needed here, so we definitely don't want to update this for every single gang header: 1) I forgot that this feature flag needed the MOS flag, and 2) MOS features basically have to be updated from syncing context. This means we have to spin out a synctask to do the job for us.
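For illustration, a minimal sketch of what such a synctask might look like, assuming the stock spa_feature_incr()/dsl_sync_task_nowait() interfaces; the function name and dispatch details here are assumptions, not the PR's actual code:

    /*
     * Runs in syncing context, where MOS feature refcounts may be
     * updated; dispatched from open context with something like
     * dsl_sync_task_nowait(spa_get_dsl(spa), <this func>, spa, tx).
     */
    static void
    spa_activate_dynamic_gang_sync(void *arg, dmu_tx_t *tx)
    {
            spa_t *spa = arg;

            if (!spa_feature_is_active(spa, SPA_FEATURE_DYNAMIC_GANG_HEADER))
                    spa_feature_incr(spa, SPA_FEATURE_DYNAMIC_GANG_HEADER, tx);
    }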
module/zfs/zio.c (outdated)
lsize = MIN(P2ROUNDUP(resid / (gbh_nblkptrs(gangblocksize) - g),
    spa->spa_min_alloc), resid);
The fact that on a vdev with ashift=12 we can allocate a 4KB gang header does not mean we need 31 children. If we could fit into 3, we could remain read-compatible and would not explode the number of allocations. Some RAIDZ3 will happily consume 4x more space here. What happens on vdevs with ashift=15, I don't even want to think about.
Meanwhile, while proper rounding here is indeed needed, I think it should be spelled as:
lsize = zio_roundup_alloc_size(spa, resid / (gbh_nblkptrs(gangblocksize) - g));
lsize = MIN(resid, lsize);
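For a concrete feel of what this computes, using the numbers from this thread: a 4KB gang header has room for 31 blkptrs, so the first pass over a 128KB (131072-byte) write on an ashift=12 pool computes 131072 / 31 ≈ 4228 bytes, which zio_roundup_alloc_size() rounds up to an 8KB allocation; MIN(131072, 8192) = 8192, so the write gangs into roughly sixteen 8KB leaves rather than thirty-one 4KB ones.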
Yeah, I noticed that in my testing, which is why I implemented this. The example I mentioned in the description, of a 128KiB block ganging into 64k sectors, was exactly the setup where I encountered it.
zio_roundup_alloc_size is handy; I did not know about that!
"which is why I implemented this"
Proper rounding should make it somewhat better, but will not solve it completely. RAIDZ3 still has an allocation size of one ashift-sized sector, but small allocations are too expensive on it. It would be good to allocate as few fragments as possible, but I can't think of much other than trying different sizes in a loop, unless we implement some sort of opportunistic allocation.
Ah, it sounds like your point is that instead of saying "please allocate me the smallest possible chunk" and then assuming that all of the gang leaves are of that size, it would be nice if we could take full advantage of whatever sizes we have available, in order to try to minimize the number of leaves we allocate?
I agree that would be nice, but I think that would be a separate change, since it would involve new (probably somewhat complex) logic. I don't think it would need another feature flag; the gang issue code calculates the offset of each subsequent leaf node by looking at BP_GET_PSIZE of the current one. As long as those were set appropriately, I think we could have mixed leaf sizes without breaking on-disk compatibility.
I think such a feature might want cooperation from the allocation code; rather than trying sizes in a loop, providing a size range to the allocation code would solve the problem, though that would be a lot of plumbing.
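To illustrate the compatibility point, here is a minimal sketch of the offset calculation described above, loosely modeled on the gang-tree issue loop in zio.c; the surrounding variables and the issue_gang_child helper are assumptions for illustration:

    /*
     * Each child's extent of the logical range is BP_GET_PSIZE of its
     * blkptr, so mixed leaf sizes stay readable on disk as long as
     * PSIZE is set correctly on each leaf.
     */
    uint64_t offset = 0;
    for (int g = 0; g < gbh_nblkptrs(gangblocksize); g++) {
            blkptr_t *gbp = &gbh->zg_blkptr[g];
            if (BP_IS_HOLE(gbp))
                    continue;
            issue_gang_child(zio, gbp, offset); /* hypothetical helper */
            offset += BP_GET_PSIZE(gbp);
    }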
I don't think allocation of 31 4KB chunks for a 128KB write is acceptable, even as only a starter. Compared to that, even the existing algorithm of nested gangs looks efficient.
I agree that this is a serious issue (though in that specific example you would actually end up with 16 8KB chunks, I think; that doesn't change the basic problem). But I really don't want to balloon this change with a new mechanism for dealing with maximal leaf sizing; that change is likely going to require extensive modifications to the allocator or the way we issue gang child IOs, and quite possibly both.
How would you feel about a tunable that enables the larger gang headers, set to false by default? That way even if someone upgrades their pool, they still don't get the new functionality unless they also flip the tunable. That would let us keep each commit relatively contained in scope/functionality, which makes them easier to implement and review. Plus, people who really need the better gang headers could turn them on, but until the fix is in (which would hopefully be in the same dot-release) most users would be unaffected.
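As a sketch of the kind of tunable being proposed here, following the usual ZFS_MODULE_PARAM pattern; the parameter name and the gating site are hypothetical, not the PR's actual code:

    /*
     * Hypothetical module parameter: even with the pool feature
     * enabled, gang headers larger than 512 bytes are only used when
     * this is set.
     */
    static int zfs_dynamic_gang_headers = 0;

    /* ... and in zio_write_gang_block(), something like: */
    if (!zfs_dynamic_gang_headers)
            gangblocksize = SPA_MINBLOCKSIZE; /* legacy 512-byte header */

    ZFS_MODULE_PARAM(zfs, zfs_, dynamic_gang_headers, INT, ZMOD_RW,
            "Allow gang headers larger than 512 bytes");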
I could live with it disabled, but I would not commit it this way myself. The questionable result is not worth a read-incompatible pool feature, in my opinion.
If you need an excuse to work on opportunistic allocation, maybe it could be the ZIL, which usually has only a guess of what it may want next, but in many cases could take whatever it is given. I remember there was also a problem with fragmentation, which Matt Ahrens handled by adding a module parameter limiting the maximum ZIL block size.
Signed-off-by: Paul Dagnelie <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Motivation and Context
ZFS gang block headers are currently fixed at 512 bytes. This is increasingly wasteful in the era of larger disk sector sizes. This PR allows any size allocation to work as a gang header. It also contains supporting changes to ZDB to make gang headers easier to work with.
Description
The ZDB changes come first; basically, there are just some small tweaks to make it easier to work with gang blocks. First, the compact blkptr printer now notes which DVAs have the gang bit set. There is also a fix for a bug that has been around since 2009: ZDB gang block header printing has been broken since then. The problem is that if you do a zio_read of a BP with the gang bit set, you don't get back the header, you get back the underlying data. The fix is to just not set the gang bit.

The way dynamically sized gang headers work is that the amount of space we allow ourselves to use for the gang header is equal to the size of the smallest possible allocation on the vdevs we got back when we allocated it. This is necessary to work around the fact that the ASIZE for a gang BP isn't the allocated size of the gang header, but of the entire tree under it. Because of that, when reading, claiming, or freeing, the allocated size of a gang header must be determinable from no other information than the vdev it was allocated on; we reuse the existing vdev_gang_header_asize for this. We take this minimum space and pack the block full of blkptrs, up to a tail with an embedded checksum. This allows us to store many more gang children per header, leading to much shallower gang trees if we're forced into intense ganging.

One important wrinkle is that if the pool we're using has old gang headers on it, they may be smaller than the smallest allocation of the relevant vdevs. We have a workaround for this in zio_checksum_error, which will now try the checksum again in the case of a gang block of size above 512 bytes, with the size reduced to 512. This should not pose a significant performance problem, since 1) calculating a checksum for a block that small doesn't take very long, and 2) hopefully the number of gang blocks on systems is not too high. This also only applies to systems that have both old gang headers and the feature enabled.

Much of the remaining changes are just tweaks to the gang tree orchestration logic to work with a variable number of gang block pointers. The final interesting change is to the logic that determines the size of the gang children. When only 3 gang children were possible, it didn't really matter what the sector size was; you were going to be allocating at most 3 of them, so the space waste was limited. With large sector sizes, however, it could get expensive: a 128k allocation that gangs on a system with 64k sectors would, without changes, consume 256 64k sectors, with 512 bytes of data each, wasting 128x as much space as the original allocation. The fix is to use the spa_min_ashift rather than the MINBLOCKSIZE when calculating the lsize.
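As an illustration of the header layout described above (blkptrs packed up to a checksum tail), the per-header child count works out to something like the following sketch; gbh_nblkptrs appears in the diff discussed in the review, but this particular definition is an assumption:

    static inline uint64_t
    gbh_nblkptrs(uint64_t gangblocksize)
    {
            /*
             * As many blkptrs as fit before the zio_eck_t checksum
             * tail: (512 - 40) / 128 = 3 for the classic header, and
             * (4096 - 40) / 128 = 31 for a 4KB header, matching the
             * child counts mentioned in the review above.
             */
            return ((gangblocksize - sizeof (zio_eck_t)) /
                sizeof (blkptr_t));
    }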
How Has This Been Tested?
In addition to the ZTS and zloop, extensive manual tests were performed, verifying 1) that the new ZDB functionality works, and 2) that larger gang headers are correctly used and allocated when applicable. Mixed-ashift testing occurred, as well as the full compatibility matrix of old pools and new pools without the feature enabled, running against old and new code.