Expand fragmentation table to reflect larger possible allocation sizes #16986

Open

pcd1193182 wants to merge 2 commits into master

Conversation

pcd1193182 (Contributor)

Motivation and Context

When you are using large recordsizes in conjunction with raidz, with incompressible data, you can pretty reliably be making 21 MB allocations. Unfortunately, the fragmentation metric in ZFS considers any metaslabs with 16 MB free chunks completely unfragmented, so you can have a metaslab report 0% fragmented and be unable to satisfy an allocation. When using the segment-based metaslab weight, this is inconvenient; when using the space-based one, it can seriously degrade performance.
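
For a rough sense of where the ~21 MB figure can come from, here is a back-of-the-envelope sketch; the 8-wide raidz2 layout in it is an assumption for illustration, not a configuration taken from this PR, and it ignores the sector rounding the real raidz allocation math does:

```c
/*
 * Illustrative arithmetic only (the vdev layout is an assumption, not
 * something named in this PR): on raidz, an incompressible record is
 * inflated by its parity columns, so a 16 MiB recordsize on an 8-wide
 * raidz2 (6 data + 2 parity) allocates roughly 16 MiB * 8 / 6.
 */
#include <stdio.h>

int
main(void)
{
	const double psize_mib = 16.0;	/* logical record size */
	const int ndata = 6;		/* data columns (assumed) */
	const int nparity = 2;		/* parity columns (assumed) */
	double asize_mib = psize_mib * (ndata + nparity) / ndata;

	printf("approx. raidz allocation: %.1f MiB\n", asize_mib);
	return (0);
}
```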

Description

We expand the fragmentation table to extend up to 1GB, and redefine the table size based on the actual table, rather than having a static define. We also tweak the one variable that depends on fragmentation directly.
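
As a minimal sketch of the second point (sizing the table from its initializer instead of a static define), under the assumption that the values below mirror the pre-change table quoted later in this thread; this is not the exact upstream diff:

```c
/*
 * Sketch only, not the exact upstream change: derive the fragmentation
 * table size from its initializer instead of a hard-coded constant.
 * The values shown correspond to the "old" column in the table quoted
 * later in this thread; the PR appends further buckets up to 1 GiB.
 */
static const int zfs_frag_table[] = {
	100, 100, 98, 95, 90, 80, 70, 60, 50, 40,
	30, 20, 15, 10, 5, 0,
	/* ... new entries for the 32 MiB .. 1 GiB buckets go here ... */
};
#define	FRAGMENTATION_TABLE_SIZE	\
	(sizeof (zfs_frag_table) / sizeof (zfs_frag_table[0]))
```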

The one caveat for this change is that on pools with small disks (less than 200GB), once a metaslab is dirtied at all it will always report as fragmented. This is because at our target of 200 metaslabs, the whole metaslab is smaller than a gigabyte, so the largest possible free segment is less than a gigabyte. This may result in some user questions, but most users probably don't have disks that small installed; at larger sizes, the problem disappears. Users may also note an increase in reported fragmentation when this change is released, but that doesn't reflect any on-disk change, just a new measurement scale.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

How Has This Been Tested?

Basic sanity testing only; passes the zfs test suite and zloop, and reports fragmentation correctly.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@pcd1193182 changed the title from "Expand fragmentation table size to reflect larger possibility space for allocation sizes" to "Expand fragmentation table to reflect larger possible allocation sizes" on Jan 23, 2025
@pcd1193182 self-assigned this on Jan 23, 2025

@amotin (Member) left a comment

Do you have any particular motivation to go as high as 1 GB? IIRC, 16 MB is a pretty hard block-size limit for ZFS now, and that is not going to change (soon). Sure, you've shown that 16 MB may not be enough, and that free ranges close to it might not represent zero fragmentation, since similarly sized allocations can produce a significant number of smaller fragments, exposing the hidden fragmentation. But I think those effects should rapidly diminish and could be neglected somewhere around 64-128 MB. Also, looking at the logic of vdev_metaslab_set_size(), it seems 512 MB is the lowest metaslab size for most cases, which makes 128 MB a sweet spot that allows almost-empty metaslabs to remain non-fragmented.
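
For readers who don't have the sizing logic in mind, a simplified paraphrase follows. The ~200-metaslab target and the 2^29..2^34-byte clamp are assumptions based on recent OpenZFS defaults, and this is not the actual vdev_metaslab_set_size() code:

```c
/*
 * Simplified paraphrase, NOT the real vdev_metaslab_set_size(): pick a
 * metaslab size that yields on the order of 200 metaslabs per top-level
 * vdev, clamped to [2^29, 2^34] bytes (512 MiB .. 16 GiB). With these
 * assumed defaults, a vdev up to ~100 GiB keeps 512 MiB metaslabs.
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t
approx_metaslab_size(uint64_t vdev_asize)
{
	const uint64_t target_count = 200;
	uint64_t ms_shift = 29;			/* 512 MiB floor */

	while (ms_shift < 34 &&
	    (vdev_asize >> (ms_shift + 1)) >= target_count)
		ms_shift++;
	return ((uint64_t)1 << ms_shift);
}

int
main(void)
{
	printf("100 GiB vdev -> %llu MiB metaslabs\n",
	    (unsigned long long)(approx_metaslab_size(100ULL << 30) >> 20));
	printf("  4 TiB vdev -> %llu MiB metaslabs\n",
	    (unsigned long long)(approx_metaslab_size(4ULL << 40) >> 20));
	return (0);
}
```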

@amotin added the "Status: Code Review Needed" label on Jan 23, 2025
@pcd1193182 (Contributor, Author)

The previous code was based on the assumption that the max allocation size was 128KiB, and they chose an "unfragmented" size of 16MiB; 128 times larger. We are now concerned about 22MiB allocations, and 1 GiB isn't even 64 times larger than that, so I feel like it provides a reasonable compromise between "This metaslab can truly satisfy any allocations we throw at it" and the practical consideration of metaslab sizes for most use cases.

@amotin (Member) commented Jan 27, 2025

I am not convinced that we need that 128x overhead; I think the 8x overhead I propose should be fine. I don't think many people actually use blocks above 1 MB, since the benefit is often pretty low, so it might still effectively be a 128x overhead. Also, it practically solves the concerns you mentioned about reporting fragmentation on freshly created small vdevs.

@allanjude (Contributor)

> I am not convinced that we need that 128x overhead; I think the 8x overhead I propose should be fine. I don't think many people actually use blocks above 1 MB, since the benefit is often pretty low, so it might still effectively be a 128x overhead. Also, it practically solves the concerns you mentioned about reporting fragmentation on freshly created small vdevs.

This is just 'how big a contiguous space is required to consider this metaslab "not fragmented at all"'. It isn't really an overhead, just what determines what each different '% fragmented' means. When it wouldn't be possible to do more than a handful of allocations before having to resort to gang blocks, it doesn't seem to make sense to score the metaslab as 'not fragmented'.

@amotin (Member) commented Jan 28, 2025

@allanjude To get the "not fragmented at all" score you'd need all the free space to be in chunks of at least 128 MB (as I propose), with nothing smaller. That means that, in the worst possible case, you should be able to allocate ~85% of the free space in maximum-size blocks before you need gang blocks. And quite likely, smaller chunks will appear during the allocation process, which will cause the fragmentation value to be recomputed. That does not sound too wrong. If you think it is not enough, 256 MB would increase it to ~92%. Do we often plan to work at higher pool utilizations?
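
To make those percentages concrete, a small illustration (mine, not from the PR), assuming ~22 MiB worst-case allocations as discussed above; the exact figures shift a little with the assumed allocation size:

```c
/*
 * Back-of-the-envelope check of the figures above: if all free space
 * sits in chunks of exactly `chunk` bytes, how many ~22 MiB allocations
 * fit per chunk, and what fraction of the space is usable before the
 * remainder forces gang blocks? (Allocation size is assumed, not taken
 * from the PR itself.)
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	const uint64_t alloc = 22ULL << 20;	/* ~22 MiB, assumed */
	const uint64_t chunks[] = {
		128ULL << 20, 256ULL << 20, 512ULL << 20
	};

	for (int i = 0; i < 3; i++) {
		uint64_t fits = chunks[i] / alloc;
		double usable = (double)(fits * alloc) / (double)chunks[i];
		printf("%3llu MiB chunks: %llu allocations, %.0f%% usable\n",
		    (unsigned long long)(chunks[i] >> 20),
		    (unsigned long long)fits, usable * 100.0);
	}
	return (0);
}
```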

@pcd1193182 (Contributor, Author) commented Jan 28, 2025

One counterargument that occurs to me is that if we can allocate 85% of the space in the metaslab before we need to gang (or move to a different metaslab), wouldn't it make some kind of sense for the fragmentation to be at around 15%? Or at least not at zero percent? Admittedly, that 85% metric is in the context of 22 MiB allocations, which are the worst-case scenario. But we can see what size allocation it would take for the space waste to be a certain percentage.

If we look at it from this perspective, the old table allowed a space waste of about 1% with blocks around a max allocation size of 176KiB. In order to have the same behavior with new max-size allocations, we would need a table that caps out at 2 GiB. The 1GB table allows 1% at around 10MiB, while a 128MiB table allows 1% at 1.2MiB. So 1GiB makes sense if we care about max-size allocations, while 128MiB is fine for most use cases.
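
One way to sanity-check those numbers (my approximation, not the PR's): with free space in chunks the size of the table's top bucket, the worst-case unusable remainder per chunk is just under one block, so ~1% waste corresponds to a block size near 1% of that top bucket:

```c
/*
 * Approximation only: worst-case waste per chunk is a bit less than one
 * block, so waste ~= block_size / top_bucket. Solving for ~1% waste
 * gives block_size ~= top_bucket / 100, which roughly reproduces the
 * 176 KiB / 1.2 MiB / 10 MiB / max-size figures quoted above.
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	const uint64_t caps_mib[] = { 16, 128, 1024, 2048 };

	for (int i = 0; i < 4; i++) {
		double block_kib = caps_mib[i] * 1024.0 / 100.0;
		printf("top bucket %4llu MiB -> ~1%% waste at ~%.0f KiB blocks\n",
		    (unsigned long long)caps_mib[i], block_kib);
	}
	return (0);
}
```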

I think it probably makes sense to compromise at 512 MiB here; if the user has vdevs of 100GiB or smaller, they might see fragmentation more quickly than before, but it won't rise above a few percent for quite a while, which feels fine to me. And by modern standards, those are quite small disks for anything but dedicated SLOG devices. Meanwhile, users doing max size allocations will still start to get notified about fragmentation at a time that feels reasonable; 512MB chunks can store 23 allocations, and use up about 96% of the free space before they have to gang.

@amotin (Member) commented Jan 29, 2025

OK. But I propose a different curve, still centered around 1 MB like yours (instead of the original 128 KB), but smoother and symmetric:

size (bytes)  old  pcd  mav  increment
         512  100  100  100          0
        1024  100  100  100          1
        2048   98   98   99          2
        4096   95   95   97          4
        8192   90   92   93          5
       16384   80   90   88          5
       32768   70   85   83          6
       65536   60   80   77          6
      131072   50   75   71          7
      262144   40   70   64          7
      524288   30   60   57          7
     1048576   20   50   50          7
     2097152   15   40   43          7
     4194304   10   35   36          7
     8388608    5   30   29          6
    16777216    0   25   23          6
    33554432    0   20   17          5
    67108864    0   15   12          5
   134217728    0   10    7          4
   268435456    0    5    3          2
   536870912    0    2    1          1
  1073741824    0    0    0          0

[chart: old vs. pcd vs. mav fragmentation values plotted against free segment size]

PS: It actually ended up at 1 GB, but we could move it one step down. :)

@pcd1193182 (Contributor, Author) commented Jan 29, 2025

> OK. But I propose a different curve, still centered around 1 MB like yours (instead of the original 128 KB), but smoother and symmetric:

I like a smoother curve; I tweaked it to fit a table that tops out at 512 MiB by just having the increments go from 0 to 10 and back. It's now centered between 512K and 1M, but I think that's fine. Does that work for you?
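
A sketch of how such a curve could be generated; the increment pattern and bucket count here are my guess at what is being described, not the values actually committed:

```c
/*
 * Illustration only, not the committed table: build fragmentation
 * values by walking increments that rise 0..10 and fall back 9..1.
 * Those 20 increments sum to 100, so 21 buckets (512 B .. 512 MiB,
 * doubling each step) run smoothly from 100 down to 0.
 */
#include <stdio.h>

int
main(void)
{
	int incs[20], n = 0, frag = 100;
	unsigned long long size = 512;

	for (int i = 0; i <= 10; i++)		/* 0, 1, ..., 10 */
		incs[n++] = i;
	for (int i = 9; i >= 1; i--)		/* 9, 8, ..., 1 */
		incs[n++] = i;

	printf("%12llu %3d\n", size, frag);
	for (int i = 0; i < n; i++) {
		frag -= incs[i];
		size <<= 1;
		printf("%12llu %3d\n", size, frag);
	}
	return (0);
}
```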

@amotin (Member) commented Jan 29, 2025

Your curve has a much higher gradient in the center than mine. That is what I also started from, but then I intentionally tried to flatten it out. I don't have any serious math behind it, merely a feeling: there is not much special going on in the center, considering we may have different workloads with different block sizes, to justify a high gradient there.

At the same time, it is weird to see no difference between the 512 and 1024 points, even though such a pool would surely be close to unusable; that is why I'd just move it one point down.
