Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read prefetched buffers from L2ARC #15451

Merged
merged 1 commit into from
Oct 26, 2023
Merged

Conversation

shodanshok
Copy link
Contributor

@shodanshok shodanshok commented Oct 25, 2023

Prefetched buffers are currently read from L2ARC if, and only if, l2arc_noprefetch is set to non-default value of 0. This means that a streaming read which can be served from L2ARC will instead engage the main pool.

For example, consider what happens when a file is sequentially read:

  • application requests contiguous data, engaging the prefetcher;
  • ARC buffers are initially marked as prefetched but, as the calling application consumes data, the prefetch tag is cleared;
  • these "normal" buffers become eligible for L2ARC and are copied to it;
  • re-reading the same file will not engage L2ARC even if it contains the required buffers;
  • main pool has to suffer another sequential read load, which (due to most NCQ-enabled HDDs preferring sequential loads) can drammatically increase latency for uncached random reads.

In other words, current behavior is to write data to L2ARC (wearing it) without using the very same cache when reading back the same data. This was probably useful many years ago to preserve L2ARC read bandwidth but, with current SSD speed/size/price, it is vastly sub-optimal.

Setting l2arc_noprefetch=1, while enabling L2ARC to serve these reads, means that even prefetched but unused buffers will be copied into L2ARC, further increasing wear and load for potentially not-useful data.

This patch enable prefetched buffer to be read from L2ARC even when l2arc_noprefetch=1 (default), increasing sequential read speed and reducing load on the main pool without polluting L2ARC with not-useful (ie: unused) prefetched data. Moreover, it clear users confusion about L2ARC size increasing but not serving any IO when doing sequential reads.

Motivation and Context

Description

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Prefetched buffers are currently read from L2ARC if, and only if,
l2arc_noprefetch is set to non-default value of 0. This means that
a streaming read which can be served from L2ARC will instead engage
the main pool.

For example, consider what happens when a file is sequentially read:
- application requests contiguous data, engaging the prefetcher;
- ARC buffers are initially marked as prefetched but, as the calling
application consumes data, the prefetch tag is cleared;
- these "normal" buffers become eligible for L2ARC and are copied to it;
- re-reading the same file will *not* engage L2ARC even if it contains
the required buffers;
- main pool has to suffer another sequential read load, which (due to
most NCQ-enabled HDDs preferring sequential loads) can drammatically
increase latency for uncached random reads.

In other words, current behavior is to write data to L2ARC (wearing it)
without using the very same cache when reading back the same data. This
was probably useful many years ago to preserve L2ARC read bandwidth but,
with current SSD speed/size/price, it is vastly sub-optimal.

Setting l2arc_noprefetch=1, while enabling L2ARC to serve these reads,
means that even prefetched but unused buffers will be copied into L2ARC,
further increasing wear and load for potentially not-useful data.

This patch enable prefetched buffer to be read from L2ARC even when
l2arc_noprefetch=1 (default), increasing sequential read speed and
reducing load on the main pool without polluting L2ARC with not-useful
(ie: unused) prefetched data. Moreover, it clear users confusion about
L2ARC size increasing but not serving any IO when doing sequential
reads.

Signed-off-by: Gionatan Danti <[email protected]>
@behlendorf behlendorf added Type: Performance Performance improvement or performance problem Status: Code Review Needed Ready for review and testing labels Oct 25, 2023
Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes good sense to me.

@behlendorf behlendorf requested a review from amotin October 25, 2023 20:00
@amotin
Copy link
Member

amotin commented Oct 25, 2023

I'd say it makes sense in some cases, but in general "it depends". L2ARC is not expected and often is not faster than main pool on sequential deeply prefetched reads. I still think sending those reads to main pool may be beneficial. Ideally it would be the best to balance depending on load in specific time, but it is more complicated.

I agree that it is a waste to write L2ARC and not read it, but I was thinking about opposite approach -- not marking as L2ARC-eligible buffers that were speculatively prefetched and for which prefetch read completed before demand request (i.e. ARC "hit", not "iohit" in my new terms), same as now done if demand request never arrives. I.e. if speculative prefetcher done its job well first time, hopefully it will be able to do it again next time and we do not need L2ARC. But if not, then reading from L2ARC may have sense even if it is prefetched. It may actually give us an interesting effect when first blocks of each file will be written to and read from L2ARC, but by the time prefetcher rump up the following data can be read from main pool. :)

@shodanshok
Copy link
Contributor Author

shodanshok commented Oct 26, 2023

I'd say it makes sense in some cases, but in general "it depends". L2ARC is not expected and often is not faster than main pool on sequential deeply prefetched reads. I still think sending those reads to main pool may be beneficial. Ideally it would be the best to balance depending on load in specific time, but it is more complicated.

Well, sure, it depends on pool geometry. However, a single SATA SSD is 4-8x faster (depending on request size, LBA and fragmentation) than an HDD for sequential reads. A single NVMe drive is 20-50x faster. Moreover, a mirrored vdev is not going to double sequential read speed. How many times one has a 12 vdev main pool backed by a single SATA drive? Or a 50 vdev main pool with only a single NVMe drive? My personal experience is that large main pools command large (both in number and in size) L2ARC devices. Even smaller pool generally follow this rule (ie: I have multiple 4-6 vdev systems with 2x SATA SSD L2ARC or 1x NVMe drive).

I agree that it is a waste to write L2ARC and not read it, but I was thinking about opposite approach -- not marking as L2ARC-eligible buffers that were speculatively prefetched and for which prefetch read completed before demand request (i.e. ARC "hit", not "iohit" in my new terms), same as now done if demand request never arrives. I.e. if speculative prefetcher done its job well first time, hopefully it will be able to do it again next time and we do not need L2ARC. But if not, then reading from L2ARC may have sense even if it is prefetched. It may actually give us an interesting effect when first blocks of each file will be written to and read from L2ARC, but by the time prefetcher rump up the following data can be read from main pool. :)

It would be an interesting approach, useful for very large main pools with limited L2ARC capabilities, but I don't think it re-pays the added complexity. HDDs are so slow that reads should not touch them if data exist on fast SSD storage. It is not only about peak read speed, where HDDs are somewhat competitive. It really is about most HDD prioritizing sequential (and easy-to-coalesce) reads, in spite of both writes and random reads (whose latency greatly spikes). So leaving the main pool free to serve writes and uncached reads leads to better performance and should be the main target of L2ARC.

Regards.

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L2ARC is always limited in capacity and/or performance, otherwise it would be a primary storage. Most of L2ARC-class devices are just unable to sustain full-speed write stream for years. The fact that ZFS throttles L2ARC writes saves their life. But it means we have to prioritize what data worth writing to L2ARC and what are not. May be we could use different policies depending on how full is L2ARC -- blind unprioritized overwrite of already full L2ARC makes no sense. My proposal should not be so hard to implement. Unless somebody else get to it first I may do it one day. At that point the condition in arc_read() indeed would not have much sense -- if we never write trash to L2ARC then we should have no trash hits there. From that perspective, hoping for further development, I approve this commit.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Oct 26, 2023
@behlendorf behlendorf merged commit 043c6ee into openzfs:master Oct 26, 2023
19 checks passed
@shodanshok
Copy link
Contributor Author

shodanshok commented Oct 26, 2023

First, thank both you and @behlendorf for accepting this commit.

L2ARC is always limited in capacity and/or performance, otherwise it would be a primary storage. Most of L2ARC-class devices are just unable to sustain full-speed write stream for years. The fact that ZFS throttles L2ARC writes saves their life.

My point is that current behavior is the worst possible because:

  • with l2arc_noprefetch=1 (default) L2ARC is written but often not used for reads;
  • with l2arc_noprefetch=0 L2ARC, while serving more reads, has to potentially endure even more writes (prefetched but not used buffers)

This patch fixes the "not used for reads" part without increasing write load on L2ARC.

My personal experience is that L2ARC is often abundantly fast - it is a cache (and not a primary storage) mainly due to limited capacity and resiliency. I understand that big HDD pools can sustain high streaming performance, but so can 2-4x NVMe cache drive.

Regards.

ixhamza pushed a commit to truenas/zfs that referenced this pull request Nov 20, 2023
Prefetched buffers are currently read from L2ARC if, and only if,
l2arc_noprefetch is set to non-default value of 0. This means that
a streaming read which can be served from L2ARC will instead engage
the main pool.

For example, consider what happens when a file is sequentially read:
- application requests contiguous data, engaging the prefetcher;
- ARC buffers are initially marked as prefetched but, as the calling
application consumes data, the prefetch tag is cleared;
- these "normal" buffers become eligible for L2ARC and are copied to it;
- re-reading the same file will *not* engage L2ARC even if it contains
the required buffers;
- main pool has to suffer another sequential read load, which (due to
most NCQ-enabled HDDs preferring sequential loads) can dramatically
increase latency for uncached random reads.

In other words, current behavior is to write data to L2ARC (wearing it)
without using the very same cache when reading back the same data. This
was probably useful many years ago to preserve L2ARC read bandwidth but,
with current SSD speed/size/price, it is vastly sub-optimal.

Setting l2arc_noprefetch=1, while enabling L2ARC to serve these reads,
means that even prefetched but unused buffers will be copied into L2ARC,
further increasing wear and load for potentially not-useful data.

This patch enable prefetched buffer to be read from L2ARC even when
l2arc_noprefetch=1 (default), increasing sequential read speed and
reducing load on the main pool without polluting L2ARC with not-useful
(ie: unused) prefetched data. Moreover, it clear users confusion about
L2ARC size increasing but not serving any IO when doing sequential
reads.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Gionatan Danti <[email protected]>
Closes openzfs#15451
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Dec 12, 2023
Prefetched buffers are currently read from L2ARC if, and only if,
l2arc_noprefetch is set to non-default value of 0. This means that
a streaming read which can be served from L2ARC will instead engage
the main pool.

For example, consider what happens when a file is sequentially read:
- application requests contiguous data, engaging the prefetcher;
- ARC buffers are initially marked as prefetched but, as the calling
application consumes data, the prefetch tag is cleared;
- these "normal" buffers become eligible for L2ARC and are copied to it;
- re-reading the same file will *not* engage L2ARC even if it contains
the required buffers;
- main pool has to suffer another sequential read load, which (due to
most NCQ-enabled HDDs preferring sequential loads) can dramatically
increase latency for uncached random reads.

In other words, current behavior is to write data to L2ARC (wearing it)
without using the very same cache when reading back the same data. This
was probably useful many years ago to preserve L2ARC read bandwidth but,
with current SSD speed/size/price, it is vastly sub-optimal.

Setting l2arc_noprefetch=1, while enabling L2ARC to serve these reads,
means that even prefetched but unused buffers will be copied into L2ARC,
further increasing wear and load for potentially not-useful data.

This patch enable prefetched buffer to be read from L2ARC even when
l2arc_noprefetch=1 (default), increasing sequential read speed and
reducing load on the main pool without polluting L2ARC with not-useful
(ie: unused) prefetched data. Moreover, it clear users confusion about
L2ARC size increasing but not serving any IO when doing sequential
reads.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Gionatan Danti <[email protected]>
Closes openzfs#15451
@shodanshok shodanshok deleted the l2prefetch branch June 29, 2024 17:22
@shodanshok shodanshok mentioned this pull request Nov 9, 2024
13 tasks
@adamdmoss
Copy link
Contributor

IMHO this is perfectly sensible and it's how I've run my pools for years...

@imwhocodes
Copy link

Hi, thanks for your work

Has this PR been added to 2.2.7 release?

I see it is in the patch-set but not in the release notes

Thanks,
Luca

@AllKind
Copy link
Contributor

AllKind commented Dec 18, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested) Type: Performance Performance improvement or performance problem
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants