Read prefetched buffers from L2ARC #15451
Conversation
Prefetched buffers are currently read from L2ARC if, and only if, l2arc_noprefetch is set to the non-default value of 0. This means that a streaming read which could be served from L2ARC will instead engage the main pool. For example, consider what happens when a file is sequentially read:

- the application requests contiguous data, engaging the prefetcher;
- ARC buffers are initially marked as prefetched but, as the calling application consumes data, the prefetch tag is cleared;
- these "normal" buffers become eligible for L2ARC and are copied to it;
- re-reading the same file will *not* engage L2ARC even if it contains the required buffers;
- the main pool has to suffer another sequential read load, which (because most NCQ-enabled HDDs prefer sequential loads) can dramatically increase latency for uncached random reads.

In other words, the current behavior is to write data to L2ARC (wearing it) without using that very same cache when reading the same data back. This was probably useful many years ago to preserve L2ARC read bandwidth but, with current SSD speed/size/price, it is vastly sub-optimal. Setting l2arc_noprefetch=0, while enabling L2ARC to serve these reads, means that even prefetched but unused buffers will be copied into L2ARC, further increasing wear and load for potentially not-useful data.

This patch enables prefetched buffers to be read from L2ARC even when l2arc_noprefetch=1 (the default), increasing sequential read speed and reducing load on the main pool without polluting L2ARC with not-useful (i.e. unused) prefetched data. Moreover, it clears up user confusion about L2ARC size increasing while not serving any I/O during sequential reads.

Signed-off-by: Gionatan Danti <[email protected]>
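For readers less familiar with the ARC code, here is a minimal, standalone sketch modeling the decision this patch changes. The names (`struct buf_state`, `read_from_l2arc()`) are hypothetical stand-ins invented for the example; the real OpenZFS code works on `arc_buf_hdr_t` flags inside `arc_read()` and the exact condition there is more involved.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the ARC buffer header flags consulted by
 * arc_read(); the real code uses arc_buf_hdr_t and HDR_* macros. */
struct buf_state {
	bool has_l2hdr;   /* a copy of the buffer exists on the L2ARC */
	bool l2_writing;  /* currently being written out to the L2ARC */
	bool l2_evicted;  /* already evicted from the L2ARC           */
	bool prefetch;    /* buffer was brought in by the prefetcher  */
};

/* Models the L2ARC read gate: 'patched' selects the post-patch behavior. */
static bool
read_from_l2arc(const struct buf_state *b, bool l2arc_noprefetch,
    bool patched)
{
	if (!b->has_l2hdr || b->l2_writing || b->l2_evicted)
		return false;
	if (!patched && l2arc_noprefetch && b->prefetch)
		return false;  /* old: prefetched reads bypass the L2ARC  */
	return true;           /* new: serve them from the L2ARC as well  */
}

int
main(void)
{
	struct buf_state b = { .has_l2hdr = true, .prefetch = true };

	printf("before patch: %d, after patch: %d\n",
	    read_from_l2arc(&b, true, false),
	    read_from_l2arc(&b, true, true));
	return 0;
}
```

On Linux the tunable itself is exposed as the module parameter /sys/module/zfs/parameters/l2arc_noprefetch; the sketch above only models the read decision and is not the OpenZFS implementation.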
This makes good sense to me.
I'd say it makes sense in some cases, but in general "it depends". L2ARC is not expected to be, and often is not, faster than the main pool on sequential, deeply prefetched reads. I still think sending those reads to the main pool may be beneficial. Ideally it would be best to balance depending on the load at a given time, but that is more complicated. I agree that it is a waste to write to L2ARC and not read from it, but I was thinking about the opposite approach -- not marking as L2ARC-eligible buffers that were speculatively prefetched and for which the prefetch read completed before the demand request (i.e. an ARC "hit", not an "iohit" in my new terms), same as is now done if the demand request never arrives. I.e. if the speculative prefetcher did its job well the first time, hopefully it will be able to do it again next time and we do not need L2ARC. But if not, then reading from L2ARC may make sense even if the buffer is prefetched. It may actually give us an interesting effect where the first blocks of each file will be written to and read from L2ARC, but by the time the prefetcher ramps up, the following data can be read from the main pool. :)
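A minimal standalone model of the eligibility rule proposed above, using the "hit" vs. "iohit" distinction from the comment. The enum, function name, and structure are invented for the illustration and are not OpenZFS code; the real eligibility logic lives in the ARC state handling.

```c
#include <stdbool.h>
#include <stdio.h>

/* Possible outcomes for a speculatively prefetched buffer. */
enum demand_outcome {
	DEMAND_NEVER_ARRIVED, /* prefetched, never consumed by a demand read */
	DEMAND_HIT,           /* prefetch I/O finished before the demand     */
	DEMAND_IOHIT          /* demand arrived while prefetch I/O in flight */
};

/* Proposed rule: only cache prefetched buffers on the L2ARC when the
 * prefetcher was *not* fast enough on its own. */
static bool
l2arc_eligible(bool prefetched, enum demand_outcome outcome)
{
	if (!prefetched)
		return true;           /* ordinary demand read               */
	switch (outcome) {
	case DEMAND_NEVER_ARRIVED:
		return false;          /* already skipped today              */
	case DEMAND_HIT:
		return false;          /* proposal: skip these too           */
	case DEMAND_IOHIT:
	default:
		return true;           /* prefetcher was too slow: cache it  */
	}
}

int
main(void)
{
	printf("hit eligible?   %d\n", l2arc_eligible(true, DEMAND_HIT));
	printf("iohit eligible? %d\n", l2arc_eligible(true, DEMAND_IOHIT));
	return 0;
}
```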
Well, sure, it depends on pool geometry. However, a single SATA SSD is 4-8x faster (depending on request size, LBA and fragmentation) than an HDD for sequential reads. A single NVMe drive is 20-50x faster. Moreover, a mirrored vdev is not going to double sequential read speed. How often does one have a 12-vdev main pool backed by a single SATA drive? Or a 50-vdev main pool with only a single NVMe drive? My personal experience is that large main pools command large (both in number and in size) L2ARC devices. Even smaller pools generally follow this rule (i.e. I have multiple 4-6 vdev systems with 2x SATA SSD L2ARC or 1x NVMe drive).
It would be an interesting approach, useful for very large main pools with limited L2ARC capabilities, but I don't think it repays the added complexity. HDDs are so slow that reads should not touch them if the data exists on fast SSD storage. It is not only about peak read speed, where HDDs are somewhat competitive. It really is about most HDDs prioritizing sequential (and easy-to-coalesce) reads at the expense of both writes and random reads (whose latency spikes greatly). So leaving the main pool free to serve writes and uncached reads leads to better performance and should be the main target of L2ARC. Regards.
L2ARC is always limited in capacity and/or performance, otherwise it would be primary storage. Most L2ARC-class devices are simply unable to sustain a full-speed write stream for years. The fact that ZFS throttles L2ARC writes saves their life. But it means we have to prioritize which data are worth writing to L2ARC and which are not. Maybe we could use different policies depending on how full L2ARC is -- blind, unprioritized overwriting of an already-full L2ARC makes no sense. My proposal should not be so hard to implement. Unless somebody else gets to it first, I may do it one day. At that point the condition in arc_read() would indeed not make much sense -- if we never write trash to L2ARC then we should have no trash hits there. From that perspective, hoping for further development, I approve this commit.
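Purely as an illustration of the fullness-dependent prioritization idea mentioned above -- nothing like this exists in OpenZFS today; `buf_class`, `l2arc_should_write()` and the thresholds are invented for the sketch -- a policy could get pickier about what it writes as the device fills up:

```c
#include <stdbool.h>

/* Hypothetical classes of buffers being considered for an L2ARC write. */
enum buf_class {
	BUF_DEMAND,         /* read on demand, prefetch not involved        */
	BUF_PREFETCH_IOHIT, /* prefetched, but the application had to wait  */
	BUF_PREFETCH_HIT    /* prefetched and ready before the demand read  */
};

/* Illustrative fullness-aware write policy: the fuller the L2ARC,
 * the more selective it becomes about new writes. */
static bool
l2arc_should_write(enum buf_class c, double l2_fill_ratio)
{
	if (l2_fill_ratio < 0.5)
		return true;                   /* plenty of room: take everything */
	if (l2_fill_ratio < 0.9)
		return c != BUF_PREFETCH_HIT;  /* skip the cheap-to-refetch data  */
	return c == BUF_DEMAND;                /* nearly full: demand reads only  */
}

int
main(void)
{
	/* At 95% full, only demand-read buffers would still be written. */
	return l2arc_should_write(BUF_DEMAND, 0.95) ? 0 : 1;
}
```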
First, thanks to both you and @behlendorf for accepting this commit.
My point is that the current behavior is the worst possible, because:
- data is written to L2ARC, wearing it;
- that very same data is then never read back from L2ARC.
This patch fixes the "not used for reads" part without increasing the write load on L2ARC. My personal experience is that L2ARC is often abundantly fast - it is a cache (and not primary storage) mainly due to limited capacity and resiliency. I understand that big HDD pools can sustain high streaming performance, but so can 2-4 NVMe cache drives. Regards.
Prefetched buffers are currently read from L2ARC if, and only if, l2arc_noprefetch is set to the non-default value of 0. This means that a streaming read which could be served from L2ARC will instead engage the main pool. For example, consider what happens when a file is sequentially read:

- the application requests contiguous data, engaging the prefetcher;
- ARC buffers are initially marked as prefetched but, as the calling application consumes data, the prefetch tag is cleared;
- these "normal" buffers become eligible for L2ARC and are copied to it;
- re-reading the same file will *not* engage L2ARC even if it contains the required buffers;
- the main pool has to suffer another sequential read load, which (because most NCQ-enabled HDDs prefer sequential loads) can dramatically increase latency for uncached random reads.

In other words, the current behavior is to write data to L2ARC (wearing it) without using that very same cache when reading the same data back. This was probably useful many years ago to preserve L2ARC read bandwidth but, with current SSD speed/size/price, it is vastly sub-optimal. Setting l2arc_noprefetch=0, while enabling L2ARC to serve these reads, means that even prefetched but unused buffers will be copied into L2ARC, further increasing wear and load for potentially not-useful data.

This patch enables prefetched buffers to be read from L2ARC even when l2arc_noprefetch=1 (the default), increasing sequential read speed and reducing load on the main pool without polluting L2ARC with not-useful (i.e. unused) prefetched data. Moreover, it clears up user confusion about L2ARC size increasing while not serving any I/O during sequential reads.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Gionatan Danti <[email protected]>
Closes openzfs#15451
IMHO this is perfectly sensible and it's how I've run my pools for years...
Hi, thanks for your work. Has this PR been added to the 2.2.7 release? I see it in the patch-set but not in the release notes. Thanks.