-
128K was a sweet spot for HDDs (IIRC), and it's still a good compromise. Larger blocks can help with sequential throughput, compression ratio, and metadata overhead (fewer block pointers to track).
So, a rootfs usually sees random I/O, so it's better to use the defaults. But you can test any block size you want; it'll be interesting to look at benchmarks :)
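For anyone who wants to run those benchmarks, here is a minimal sketch; the pool name `tank`, the candidate sizes, and the fio job parameters are all assumptions of mine, not anything agreed in this thread:

```sh
# Create one dataset per candidate recordsize (pool "tank" is hypothetical).
for rs in 16K 128K 1M; do
  zfs create -o recordsize=$rs tank/bench-$rs
  # Small random reads, roughly the access pattern a rootfs mostly sees.
  fio --name=randread-$rs --directory=/tank/bench-$rs \
      --rw=randread --bs=4k --size=2g --runtime=60 --time_based \
      --ioengine=psync --group_reporting
done
```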
-
I remember talks about Sun having file system replay data, so they could run ~3 years of real-world file system operations very quickly and measure ZFS's performance, fragmentation, and so on. Do we have any such data to test ZFS against?
-
This was prompted by a query on IRC, effectively of the nature "why not set a rootfs to `recordsize=1M`?", and I was unable to name any drawbacks to it. To be honest, I never think about it much; almost everything I run is kept at the default 128K, with a few exceptions for media datasets set to 1M and PostgreSQL datasets set to 16K.
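For context, that per-dataset tuning looks like the following (the dataset names here are illustrative, not my actual layout):

```sh
# recordsize is set per dataset and only affects newly written blocks.
zfs set recordsize=1M  tank/media     # large sequential media files
zfs set recordsize=16K tank/postgres  # closer to PostgreSQL's page-sized I/O
```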
Consulting the zfsprops(7) manual page on the `recordsize` property, it mentions that it may be set up to 16 MiB, but caveats that sizes larger than 1 MiB may have a negative impact on I/O latency. This seems to heavily imply that everything up to and including 1 MiB is "basically fine" for most workloads (and you can always override the default as necessary).

I had a thought that perhaps the default was kept at 128K for backwards-compatibility concerns, especially sending to old ZFS implementations. I believe this cannot be the case, however, since `zfs send` defaults to breaking records up into units of at most 128K, and the receiving system takes on the task of optimizing the physical layout. I have successfully tested this by creating a pool with `-o version=28`: it was able to receive a dataset with 1M records fine, which became 128K records on the receiving pool (a sketch of the test follows below). Of course, the `-L` and `-w` options to `zfs send` can break this compatibility, but that's easily the territory of "it's your own fault for setting those options when the receiver doesn't support them."

With both these facets in mind, I am left to wonder why the default remains at 128K, and whether there's any reason beyond "this has been the ZFS way since 2005." Naturally, pools lacking the large_blocks feature can only remain at a 128K default, but otherwise 1M should probably be fine?
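The compatibility test mentioned above boils down to roughly this; the pool names and the device placeholder are mine, and the data-writing step is abbreviated:

```sh
# Source dataset with large records (needs the large_blocks feature).
zfs create -o recordsize=1M tank/big
# ... write some data, then snapshot it ...
zfs snapshot tank/big@snap

# Destination pool in the old pre-feature-flags format.
zpool create -o version=28 oldpool /dev/sdX

# A plain send splits 1M records into 128K units, so the old pool receives fine.
zfs send tank/big@snap | zfs receive oldpool/big

# With -L (or -w for raw sends), records are sent intact, and an old
# receiver that lacks large_blocks support will reject the stream.
zfs send -L tank/big@snap | zfs receive oldpool/big2
```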