Expand fragmentation table to reflect larger possible allocation sizes #16986

Open

pcd1193182 wants to merge 2 commits into master

Conversation

pcd1193182 (Contributor)

Motivation and Context

When you are using large recordsizes in conjunction with raidz, with incompressible data, you can pretty reliably be making 21 MB allocations. Unfortunately, the fragmentation metric in ZFS considers any metaslabs with 16 MB free chunks completely unfragmented, so you can have a metaslab report 0% fragmented and be unable to satisfy an allocation. When using the segment-based metaslab weight, this is inconvenient; when using the space-based one, it can seriously degrade performance.
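
For a rough sense of where the ~21 MB figure can come from, here is a back-of-the-envelope sketch; the 8-wide raidz2 layout in it is an assumption for illustration, not a configuration taken from this PR, and it ignores the sector rounding the real raidz allocation math does:

```c
/*
 * Illustrative arithmetic only (the vdev layout is an assumption, not
 * something named in this PR): on raidz, an incompressible record is
 * inflated by its parity columns, so a 16 MiB recordsize on an 8-wide
 * raidz2 (6 data + 2 parity) allocates roughly 16 MiB * 8 / 6.
 */
#include <stdio.h>

int
main(void)
{
	const double psize_mib = 16.0;	/* logical record size */
	const int ndata = 6;		/* data columns (assumed) */
	const int nparity = 2;		/* parity columns (assumed) */
	double asize_mib = psize_mib * (ndata + nparity) / ndata;

	printf("approx. raidz allocation: %.1f MiB\n", asize_mib);
	return (0);
}
```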

Description

We expand the fragmentation table to extend up to 1GB, and redefine the table size based on the actual table, rather than having a static define. We also tweak the one variable that depends on fragmentation directly.
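
As a minimal sketch of the second point (sizing the table from its initializer instead of a static define), under the assumption that the values below mirror the pre-change table quoted later in this thread; this is not the exact upstream diff:

```c
/*
 * Sketch only, not the exact upstream change: derive the fragmentation
 * table size from its initializer instead of a hard-coded constant.
 * The values shown correspond to the "old" column in the table quoted
 * later in this thread; the PR appends further buckets up to 1 GiB.
 */
static const int zfs_frag_table[] = {
	100, 100, 98, 95, 90, 80, 70, 60, 50, 40,
	30, 20, 15, 10, 5, 0,
	/* ... new entries for the 32 MiB .. 1 GiB buckets go here ... */
};
#define	FRAGMENTATION_TABLE_SIZE	\
	(sizeof (zfs_frag_table) / sizeof (zfs_frag_table[0]))
```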

The one caveat for this change is that on pools with small disks (less than 200GB), once a metaslab is dirtied at all it will always report as fragmented. This is because at our target of 200 metaslabs, the whole metaslab is smaller than a gigabyte, so the largest possible free segment is less than a gigabyte. This may result in some user questions, but most users probably don't have disks that small installed; at larger sizes, the problem disappears. Users may also note an increase in reported fragmentation when this change is released, but that doesn't reflect any on-disk change, just a new measurement scale.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

How Has This Been Tested?

Basic sanity testing only; passes the zfs test suite and zloop, and reports fragmentation correctly.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@pcd1193182 changed the title from "Expand fragmentation table size to reflect larger possibility space for allocation sizes" to "Expand fragmentation table to reflect larger possible allocation sizes" on Jan 23, 2025
@pcd1193182 self-assigned this on Jan 23, 2025

@amotin (Member) left a comment

Do you have any particular motivation to go as high as 1 GB? IIRC, 16 MB is a pretty hard block-size limit for ZFS now, and that is not going to change (soon). Sure, you've shown that 16 MB may not be enough, and that free ranges close to it might not represent zero fragmentation, since similarly sized allocations can produce a significant number of smaller fragments, exposing the hidden fragmentation. But I think those effects should rapidly diminish and could be neglected somewhere around 64-128 MB. Also, looking at the logic of vdev_metaslab_set_size(), it seems 512 MB is the lowest metaslab size for most cases, which makes 128 MB a sweet spot that allows almost-empty metaslabs to remain non-fragmented.
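
For readers who don't have the sizing logic in mind, a simplified paraphrase follows. The ~200-metaslab target and the 2^29..2^34-byte clamp are assumptions based on recent OpenZFS defaults, and this is not the actual vdev_metaslab_set_size() code:

```c
/*
 * Simplified paraphrase, NOT the real vdev_metaslab_set_size(): pick a
 * metaslab size that yields on the order of 200 metaslabs per top-level
 * vdev, clamped to [2^29, 2^34] bytes (512 MiB .. 16 GiB). With these
 * assumed defaults, a vdev up to ~100 GiB keeps 512 MiB metaslabs.
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t
approx_metaslab_size(uint64_t vdev_asize)
{
	const uint64_t target_count = 200;
	uint64_t ms_shift = 29;			/* 512 MiB floor */

	while (ms_shift < 34 &&
	    (vdev_asize >> (ms_shift + 1)) >= target_count)
		ms_shift++;
	return ((uint64_t)1 << ms_shift);
}

int
main(void)
{
	printf("100 GiB vdev -> %llu MiB metaslabs\n",
	    (unsigned long long)(approx_metaslab_size(100ULL << 30) >> 20));
	printf("  4 TiB vdev -> %llu MiB metaslabs\n",
	    (unsigned long long)(approx_metaslab_size(4ULL << 40) >> 20));
	return (0);
}
```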

@amotin added the "Status: Code Review Needed" label on Jan 23, 2025
@pcd1193182 (Contributor, Author)

The previous code was based on the assumption that the max allocation size was 128KiB, and they chose an "unfragmented" size of 16MiB; 128 times larger. We are now concerned about 22MiB allocations, and 1 GiB isn't even 64 times larger than that, so I feel like it provides a reasonable compromise between "This metaslab can truly satisfy any allocations we throw at it" and the practical consideration of metaslab sizes for most use cases.

@amotin (Member) commented Jan 27, 2025

I am not convinced that we need that 128x overhead; I think the 8x overhead I propose should be fine. I don't think many people actually use blocks above 1 MB, since the benefit is often pretty low, so it might still effectively be a 128x overhead. Also, it practically solves the concerns you mentioned about reporting fragmentation on freshly created small vdevs.

@allanjude (Contributor)

> I am not convinced that we need that 128x overhead; I think the 8x overhead I propose should be fine. I don't think many people actually use blocks above 1 MB, since the benefit is often pretty low, so it might still effectively be a 128x overhead. Also, it practically solves the concerns you mentioned about reporting fragmentation on freshly created small vdevs.

This is just 'how big a contiguous space is required to consider this metaslab "not fragmented at all"'. It isn't really an overhead, just what determines what each different '% fragmented' means. When it wouldn't be possible to do more than a handful of allocations before having to resort to gang blocks, it doesn't seem to make sense to score the metaslab as 'not fragmented'.

@amotin (Member) commented Jan 28, 2025

@allanjude To get the "not fragmented at all" score you'd need all the free space to be in chunks of at least 128 MB (as I propose), with nothing smaller. That means that, in the worst possible case, you should be able to allocate ~85% of the free space in maximum-size blocks before you need gang blocks. And quite likely, smaller chunks will appear during the allocation process, which will cause the fragmentation value to be recomputed. That does not sound too wrong. If you think it is not enough, 256 MB would increase it to ~92%. Do we often plan to work at higher pool utilizations?
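
To make those percentages concrete, a small illustration (mine, not from the PR), assuming ~22 MiB worst-case allocations as discussed above; the exact figures shift a little with the assumed allocation size:

```c
/*
 * Back-of-the-envelope check of the figures above: if all free space
 * sits in chunks of exactly `chunk` bytes, how many ~22 MiB allocations
 * fit per chunk, and what fraction of the space is usable before the
 * remainder forces gang blocks? (Allocation size is assumed, not taken
 * from the PR itself.)
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	const uint64_t alloc = 22ULL << 20;	/* ~22 MiB, assumed */
	const uint64_t chunks[] = {
		128ULL << 20, 256ULL << 20, 512ULL << 20
	};

	for (int i = 0; i < 3; i++) {
		uint64_t fits = chunks[i] / alloc;
		double usable = (double)(fits * alloc) / (double)chunks[i];
		printf("%3llu MiB chunks: %llu allocations, %.0f%% usable\n",
		    (unsigned long long)(chunks[i] >> 20),
		    (unsigned long long)fits, usable * 100.0);
	}
	return (0);
}
```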

@pcd1193182 (Contributor, Author) commented Jan 28, 2025

One counterargument that occurs to me is that if we can allocate 85% of the space in the metaslab before we need to gang (or move to a different metaslab), wouldn't it make some kind of sense for the fragmentation to be at around 15%? Or at least not at zero percent? Admittedly, that 85% metric is in the context of 22 MiB allocations, which are the worst-case scenario. But we can see what size allocation it would take for the space waste to be a certain percentage.

If we look at it from this perspective, the old table allowed a space waste of about 1% with blocks around a max allocation size of 176KiB. In order to have the same behavior with new max-size allocations, we would need a table that caps out at 2 GiB. The 1GB table allows 1% at around 10MiB, while a 128MiB table allows 1% at 1.2MiB. So 1GiB makes sense if we care about max-size allocations, while 128MiB is fine for most use cases.
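
One way to sanity-check those numbers (my approximation, not the PR's): with free space in chunks the size of the table's top bucket, the worst-case unusable remainder per chunk is just under one block, so ~1% waste corresponds to a block size near 1% of that top bucket:

```c
/*
 * Approximation only: worst-case waste per chunk is a bit less than one
 * block, so waste ~= block_size / top_bucket. Solving for ~1% waste
 * gives block_size ~= top_bucket / 100, which roughly reproduces the
 * 176 KiB / 1.2 MiB / 10 MiB / max-size figures quoted above.
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	const uint64_t caps_mib[] = { 16, 128, 1024, 2048 };

	for (int i = 0; i < 4; i++) {
		double block_kib = caps_mib[i] * 1024.0 / 100.0;
		printf("top bucket %4llu MiB -> ~1%% waste at ~%.0f KiB blocks\n",
		    (unsigned long long)caps_mib[i], block_kib);
	}
	return (0);
}
```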

I think it probably makes sense to compromise at 512 MiB here; if the user has vdevs of 100GiB or smaller, they might see fragmentation more quickly than before, but it won't rise above a few percent for quite a while, which feels fine to me. And by modern standards, those are quite small disks for anything but dedicated SLOG devices. Meanwhile, users doing max size allocations will still start to get notified about fragmentation at a time that feels reasonable; 512MB chunks can store 23 allocations, and use up about 96% of the free space before they have to gang.

@amotin (Member) commented Jan 29, 2025

OK. But I propose a different curve, still centered around 1 MB like yours (instead of the original 128 KB), but smoother and symmetric:

size (bytes)  old  pcd  mav  increment
         512  100  100  100          0
        1024  100  100  100          1
        2048   98   98   99          2
        4096   95   95   97          4
        8192   90   92   93          5
       16384   80   90   88          5
       32768   70   85   83          6
       65536   60   80   77          6
      131072   50   75   71          7
      262144   40   70   64          7
      524288   30   60   57          7
     1048576   20   50   50          7
     2097152   15   40   43          7
     4194304   10   35   36          7
     8388608    5   30   29          6
    16777216    0   25   23          6
    33554432    0   20   17          5
    67108864    0   15   12          5
   134217728    0   10    7          4
   268435456    0    5    3          2
   536870912    0    2    1          1
  1073741824    0    0    0          0

[chart: old vs. pcd vs. mav fragmentation values plotted against free segment size]

PS: It actually ended up at 1 GB, but we could move it one step down. :)

@pcd1193182 (Contributor, Author) commented Jan 29, 2025

> OK. But I propose a different curve, still centered around 1 MB like yours (instead of the original 128 KB), but smoother and symmetric:

I like a smoother curve; I tweaked it to fit a table that tops out at 512 MiB by just having the increments go from 0 to 10 and back. It's now centered between 512K and 1M, but I think that's fine. Does that work for you?
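
A sketch of how such a curve could be generated; the increment pattern and bucket count here are my guess at what is being described, not the values actually committed:

```c
/*
 * Illustration only, not the committed table: build fragmentation
 * values by walking increments that rise 0..10 and fall back 9..1.
 * Those 20 increments sum to 100, so 21 buckets (512 B .. 512 MiB,
 * doubling each step) run smoothly from 100 down to 0.
 */
#include <stdio.h>

int
main(void)
{
	int incs[20], n = 0, frag = 100;
	unsigned long long size = 512;

	for (int i = 0; i <= 10; i++)		/* 0, 1, ..., 10 */
		incs[n++] = i;
	for (int i = 9; i >= 1; i--)		/* 9, 8, ..., 1 */
		incs[n++] = i;

	printf("%12llu %3d\n", size, frag);
	for (int i = 0; i < n; i++) {
		frag -= incs[i];
		size <<= 1;
		printf("%12llu %3d\n", size, frag);
	}
	return (0);
}
```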

@amotin (Member) commented Jan 29, 2025

Your curve has a much higher gradient in the center than mine. That is what I also started from, but then I intentionally tried to flatten it out. I don't have any serious math behind it, merely a feeling: there is not much special going on in the center, considering we may have different workloads with different block sizes, to justify a high gradient there.

At the same time, it is weird to see no difference between the 512 and 1024 points, even though such a pool would surely be close to unusable; that is why I'd just move it one point down.
