Expand fragmentation table to reflect larger possible allocation sizes #16986
base: master
Conversation
Force-pushed from f3dff9d to e459589
Force-pushed from e459589 to d78a493

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Paul Dagnelie <[email protected]>

Force-pushed from d78a493 to c4567d1
Do you have any particular motivation to go as high as 1GB? IIRC 16MB is a pretty hard block limit for ZFS now and is not going to change (soon). Sure, you've shown that 16MB may not be enough, and free ranges close to it might not really represent zero fragmentation, since allocations of a similar size can produce a significant number of smaller fragments, exposing the hidden fragmentation. But I think those effects should rapidly diminish and could be neglected somewhere around 64-128MB. Also, looking at the logic of vdev_metaslab_set_size(), it seems 512MB is the lowest metaslab size in most cases, which makes 128MB also a sweet spot to allow almost-empty metaslabs to remain non-fragmented.
The previous code was based on the assumption that the max allocation size was 128KiB, and it chose an "unfragmented" size of 16MiB, 128 times larger. We are now concerned about 22MiB allocations, and 1GiB isn't even 64 times larger than that, so I feel like it provides a reasonable compromise between "this metaslab can truly satisfy any allocation we throw at it" and the practical consideration of metaslab sizes for most use cases.
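For readers following along, here is a minimal sketch (not the upstream code) of how this kind of fragmentation scoring works: each power-of-two bucket of free-segment size gets a fragmentation percentage, and segments at or above the table's last bucket score 0%, which is why the top end of the table is what the thread is debating. The bucket values below are loosely modeled on the historical table; the real metric additionally weights these scores by the amount of free space in each bucket.

```c
/*
 * Illustrative sketch: map a free segment's size to a "fragmentation"
 * percentage via a per-power-of-two bucket table.
 */
#include <stdio.h>
#include <stdint.h>

#define	FRAG_MIN_SHIFT	9	/* first bucket is 512 B (2^9) */

/* Simplified table, 512B .. 16MB, loosely modeled on the historical values. */
static const int frag_table[] = {
	100,	/* 512B */
	100,	/* 1K */
	98,	/* 2K */
	95,	/* 4K */
	90,	/* 8K */
	80,	/* 16K */
	70,	/* 32K */
	60,	/* 64K */
	50,	/* 128K */
	40,	/* 256K */
	30,	/* 512K */
	20,	/* 1M */
	15,	/* 2M */
	10,	/* 4M */
	5,	/* 8M */
	0,	/* 16M -- anything this large scores "not fragmented at all" */
};
#define	FRAG_TABLE_SIZE	(sizeof (frag_table) / sizeof (frag_table[0]))

/* Score a single free segment by its size in bytes. */
static int
segment_frag_score(uint64_t size)
{
	int idx = 0;

	/* Find the power-of-two bucket, clamped to the top of the table. */
	while (idx + 1 < (int)FRAG_TABLE_SIZE &&
	    size >= (1ULL << (FRAG_MIN_SHIFT + idx + 1)))
		idx++;
	return (frag_table[idx]);
}

int
main(void)
{
	uint64_t sizes[] = { 4096, 1ULL << 20, 16ULL << 20, 22ULL << 20 };

	/* A 16MiB (or 22MiB) free segment scores 0% with this table. */
	for (size_t i = 0; i < sizeof (sizes) / sizeof (sizes[0]); i++)
		printf("%8llu KiB free -> %d%% fragmented\n",
		    (unsigned long long)(sizes[i] >> 10),
		    segment_frag_score(sizes[i]));
	return (0);
}
```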
I am not convinced that we need that 128x overhead. I think the 8x overhead I propose should be fine. I don't think many people actually use blocks above 1MB, since the benefit is often pretty low, so it might still effectively be a 128x overhead. It also practically solves the concerns you mentioned about reporting fragmentation on freshly created small vdevs.
This is just 'how big a contiguous space is required to consider this metaslab "not fragmented at all"'. It isn't really an overhead, just what determines what each '% fragmented' value means. When it wouldn't be possible to do more than a handful of allocations before having to resort to gang blocks, it doesn't seem to make sense to score it as 'not fragmented'.
@allanjude To get "not fragmented at all" you'd need all the free space to be in chunks of at least 128MB (as I propose) and nothing smaller. That means in the worst possible case you should be able to allocate ~85% of the free space in maximum-size blocks before you need gang blocks. And quite likely smaller chunks will appear during allocation, which will cause the fragmentation value to be recomputed. That doesn't sound too wrong. If you think it is not enough, 256MB would increase it to ~92%. Do we often plan to work at higher pool utilizations?
One counterargument that occurs to me: if we can allocate 85% of the space in the metaslab before we need to gang (or move to a different metaslab), wouldn't it make some kind of sense for the fragmentation to be around 15%, or at least not at zero percent? Admittedly, that 85% figure is in the context of 22MiB allocations, which are the worst-case scenario. But we can look at what size allocation it would take for the space waste to be a certain percentage. From that perspective, the old table allowed a space waste of about 1% with blocks around the old max allocation size of 176KiB. To get the same behavior with new max-size allocations, we would need a table that caps out at 2GiB. The 1GiB table allows 1% waste at around 10MiB, while a 128MiB table allows 1% at 1.2MiB. So 1GiB makes sense if we care about max-size allocations, while 128MiB is fine for most use cases. I think it probably makes sense to compromise at 512MiB here; if the user has vdevs of 100GiB or smaller, they might see fragmentation sooner than before, but it won't rise above a few percent for quite a while, which feels fine to me. And by modern standards, those are quite small disks for anything but dedicated SLOG devices. Meanwhile, users doing max-size allocations will still start to get notified about fragmentation at a time that feels reasonable; 512MiB chunks can store 23 allocations and use up about 96% of the free space before they have to gang.
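A quick arithmetic sketch (not ZFS code) of the numbers being traded here: if all free space sits in chunks of a given size and we carve out fixed-size allocations, the fraction usable before ganging is floor(chunk/alloc)*alloc/chunk, and the worst-case waste is bounded by roughly alloc/chunk. The chunk and allocation sizes below are the ones quoted in the thread.

```c
/* Worst-case utilization for a given free-chunk size and allocation size. */
#include <stdio.h>
#include <stdint.h>

#define	MiB	(1024ULL * 1024)
#define	GiB	(1024 * MiB)

static void
check(uint64_t chunk, uint64_t alloc)
{
	uint64_t fits = chunk / alloc;	/* whole allocations per chunk */

	printf("chunk %5llu MiB, alloc %6.2f MiB: %3llu allocs, "
	    "%5.1f%% usable, waste bound %.2f%%\n",
	    (unsigned long long)(chunk / MiB), (double)alloc / MiB,
	    (unsigned long long)fits,
	    100.0 * (double)(fits * alloc) / (double)chunk,
	    100.0 * (double)alloc / (double)chunk);
}

int
main(void)
{
	check(16 * MiB, 176 * 1024);	/* old table top vs. old max alloc */
	check(128 * MiB, 22 * MiB);	/* the ~85% usable figure */
	check(512 * MiB, 22 * MiB);	/* proposed compromise, 23 allocs */
	check(1 * GiB, 22 * MiB);	/* this PR's table top */
	check(2 * GiB, 22 * MiB);	/* ~1% waste with max-size allocs */
	return (0);
}
```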
OK. But I propose a different curve, still centered around 1MB like yours (instead of the original 128KB), but smoother and symmetric:
PS: It actually ended up at 1GB, but we could move it one step down. :)
I like a smoother curve. I tweaked it to fit into the 512 table size by just having the increments go from 0 to 10 and back. It's now centered between 512K and 1M, but I think that's fine. Does that work for you?
Your curve has a much higher gradient in the center than mine. While that is what I also started from, I then intentionally tried to flatten it out, though I don't have any serious math behind it, merely a feeling. There is nothing especially significant going on in the center, considering we may have different workloads with different block sizes, so it shouldn't have a high gradient. At the same time, it is weird to see no difference between the 512 and 1024 points, even though such a pool would surely be close to unusable; that is why I'd just move it one point down.
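To make the "smoother, symmetric curve" idea concrete, here is a purely illustrative sketch; the exact shape and endpoints are what the thread above is still debating. Instead of hand-picking every entry, the table is built from per-bucket decrements that ramp up to a peak near the middle and back down, summing to 100 so the scores run from 100% down to 0%. The specific 0..10..0 ramp and the 512B-to-1GB span below are assumptions based on the comments above.

```c
/* Build a symmetric fragmentation curve from per-bucket decrements. */
#include <stdio.h>

#define	MIN_SHIFT	9	/* first bucket: 512 B */

int
main(void)
{
	/* Hypothetical decrements: ramp 0..10 and back, summing to 100. */
	const int steps[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
			      9, 8, 7, 6, 5, 4, 3, 2, 1, 0 };
	const int nsteps = (int)(sizeof (steps) / sizeof (steps[0]));
	int score = 100;

	/* Print one table entry per power-of-two bucket, 512B .. 1GB. */
	for (int i = 0; i <= nsteps; i++) {
		unsigned long long bucket = 1ULL << (MIN_SHIFT + i);

		printf("%10llu bytes: %3d%%\n", bucket, score);
		if (i < nsteps)
			score -= steps[i];
	}
	return (0);
}
```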
Motivation and Context
When you are using large recordsizes in conjunction with raidz and incompressible data, you can pretty reliably end up making 21 MB allocations. Unfortunately, the fragmentation metric in ZFS considers any metaslab with a 16 MB free chunk completely unfragmented, so a metaslab can report 0% fragmented and still be unable to satisfy an allocation. When using the segment-based metaslab weight, this is inconvenient; when using the space-based one, it can seriously degrade performance.
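As a back-of-the-envelope illustration (an assumed example layout, not taken from the PR): a 16 MiB record of incompressible data on a raidz2 vdev with 8 children (6 data + 2 parity) needs roughly 8/6 of its logical size on disk, which already pushes a single allocation past the old 16 MiB "unfragmented" threshold.

```c
/* Rough parity-overhead estimate for a hypothetical raidz2 layout. */
#include <stdio.h>

int
main(void)
{
	const double psize_mib = 16.0;		/* recordsize, incompressible */
	const int ndata = 6, nparity = 2;	/* assumed 8-wide raidz2 */
	double asize_mib = psize_mib * (ndata + nparity) / ndata;

	printf("~%.1f MiB allocated for a %.0f MiB record\n",
	    asize_mib, psize_mib);	/* ~21.3 MiB */
	return (0);
}
```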
Description
We expand the fragmentation table to extend up to 1GB, and redefine the table size based on the actual table, rather than having a static define. We also tweak the one variable that depends on fragmentation directly.
The one caveat for this change is that on pools with small disks (less than 200GB), once a metaslab is dirtied at all it will always report as fragmented. This is because at our target of 200 metaslabs per vdev, the whole metaslab is less than a gigabyte, so the largest possible free segment is also less than a gigabyte. This may result in some user questions, but most users probably don't have disks that small installed, and at larger sizes the problem disappears. Users may also notice an increase in reported fragmentation when this change is released, but it doesn't reflect any on-disk change, just a new measurement scale.
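A rough sketch of that caveat, assuming a flat target of ~200 metaslabs per vdev (the real sizing logic in vdev_metaslab_set_size() is more involved): below about 200 GiB of vdev space, each metaslab is smaller than the new 1 GiB top table entry, so no free segment can ever reach the 0% bucket.

```c
/* Metaslab size vs. the new table's top bucket for a few vdev sizes. */
#include <stdio.h>
#include <stdint.h>

#define	GiB		(1024ULL * 1024 * 1024)
#define	TARGET_MS	200		/* assumed metaslab count target */
#define	TABLE_TOP	(1ULL * GiB)	/* largest bucket in the new table */

int
main(void)
{
	uint64_t vdev_sizes_gib[] = { 100, 200, 400, 2000 };

	for (size_t i = 0; i < sizeof (vdev_sizes_gib) / sizeof (uint64_t);
	    i++) {
		uint64_t ms_size = vdev_sizes_gib[i] * GiB / TARGET_MS;

		printf("%5llu GiB vdev -> ~%5llu MiB metaslab%s\n",
		    (unsigned long long)vdev_sizes_gib[i],
		    (unsigned long long)(ms_size >> 20),
		    ms_size < TABLE_TOP ?
		    " (can never score 0% fragmented once dirtied)" : "");
	}
	return (0);
}
```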
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
How Has This Been Tested?
Basic sanity testing only; passes the zfs test suite and zloop, and reports fragmentation correctly.