
---

A first look at your ZSet read numbers scaling linearly with the number of fio processes suggests that the iodepth parameter may not be working as expected. Considering that ZFS does not support asynchronous operation, I suspect libaio just executes requests one at a time per process. I haven't looked at its code, but from our look at io_uring some time ago I remember that the latter relies on tight integration of the file system with the page cache, and since ZFS does not use the page cache, it is not in a good position here. The other question is whether your actual application will use libaio, or whether you are testing something you may not really care about. If your workload will include many processes, you may want to switch from the libaio backend to psync and scale the number of processes to the planned count.

Second, your tests show dramatically better performance for the zvol than for the file system. That can have only one explanation: the API you use to do the I/O (libaio) behaves differently for them, possibly because zvols decouple execution into different threads, while libaio running against the file system does not. In general the performance of a zvol should be identical to that of a single file on a file system, since that is what zvols are inside.

Third, using ext4 on top of a zvol makes no sense, and the only reason it is faster than native ZFS in some of your tests is its better integration with the page cache and libaio. Otherwise it should be a total waste of resources.

And last, you say that your target workload will include a huge number of small files, yet at the same time you are testing ONE zvol and ONE file. ZFS has a number of optimizations to scale out performance when possible, while you are putting it in the most difficult situation: a single object with an insanely small block size. You are not testing what you likely should. Test your real workload!
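For example, a minimal fio invocation along those lines (psync backend, scaling processes rather than iodepth) might look like the following; the target path, file size, and runtime are placeholders, not values from this thread:

```bash
# one synchronous 4k random-read process per job; scale the load with --numjobs
fio --name=psync-randread \
    --ioengine=psync --direct=1 \
    --rw=randread --bs=4k \
    --filename=/tank/zset/testfile --size=32G \
    --numjobs=32 --time_based --runtime=60 \
    --group_reporting
```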

---

In addition to what @amotin mentioned:

---

@amotin Thanks for sharing your insight; I appreciate the time. I've updated the graphs with testing on 60 cores and added a FAQ section answering why 4k, why libaio, why not real workloads, and so on.
No. libaio just means that fio simply maintains up to `iodepth` outstanding asynchronous requests per job.
I tested up to 60 cores, and the levelling off is starting to show (see updated charts), which is the outcome I suspected.
Good point; this may well be the cause. At this point, …

---

Looks like déjà vu to me. Very similar discussion at #16993 (comment)

---

Hello, I am evaluating ZFS for a series of projects, each with varying storage requirements, and I am seeing some surprising results. I'm hoping someone can confirm that these numbers make sense and/or help me tune things.
The data here is produced by a script, which has all the details on pool creation and the fio tests. The rationale is explained later, but briefly: the following is data for a VDEV consisting of a single Micron 7450 PRO 7.68TB (rated for 1M 4k read IOPS), tested using fio with libaio and direct I/O on ZFS 2.3.0. The different curves in each chart correspond to the various files fio ran the tests on; a sketch of one such fio job follows the list of workloads below.
Charts (one per workload):

- 100% 4k random reads
- 4k sequential read
- 4k sequential write
- Random R/W with 90% reads
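For concreteness, one of these jobs might look roughly like the following fio job file; the path, file size, iodepth, and runtime are placeholders, and the authoritative definitions are in the linked script:

```ini
; 100% 4k random reads with direct I/O through libaio (illustrative values only)
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=32
time_based=1
runtime=60
group_reporting=1

[randread-4k]
rw=randread
; target file on whichever ZSet / ZVol_Ext4 / Disk_Ext4 mount is being tested
filename=/mnt/target/testfile
size=32G
; numjobs is supplied on the command line (--numjobs) by the test-bench sweep
```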
Questions
Appreciate any inputs!
Anticipated workload (one of many; this one is the subject of this thread)
Goals for evaluation
Test System
Test Bench
- `--numjobs={1,4,8,16,32}` to test scaling across CPU cores
- `blkdiscard` the drive for each of the 20 combinations (numjobs x device) and run the 4 workloads in the order above (a sketch of the sweep loop is shown below)
- the `blkdiscard` is to ensure that the drive's FTL is empty
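A rough sketch of how such a sweep could be scripted; the device path, target names, and mount points are placeholders, and the linked script remains the authoritative version:

```bash
#!/usr/bin/env bash
# Sweep: 5 numjobs values x 4 targets = 20 combinations, 4 workloads each.
set -euo pipefail

DEVICE=/dev/nvme0n1                        # raw NVMe drive under test (placeholder)
TARGETS=(zset zvol zvol_ext4 disk_ext4)    # assumed names for the four tested targets
WORKLOADS=(randread read write randrw)     # the 4 patterns above, as fio rw= values

for jobs in 1 4 8 16 32; do
  for target in "${TARGETS[@]}"; do
    blkdiscard "$DEVICE"                   # empty the drive's FTL before this combination
    # ...re-create the pool / dataset / zvol / ext4 filesystem for "$target" here...
    for rw in "${WORKLOADS[@]}"; do
      # --rwmixread only affects the mixed "randrw" workload; fio ignores it otherwise
      fio --name="${target}-${rw}-j${jobs}" \
          --ioengine=libaio --direct=1 --bs=4k --iodepth=32 \
          --rw="$rw" --rwmixread=90 \
          --filename="/mnt/${target}/testfile" --size=32G \
          --numjobs="$jobs" --time_based --runtime=60 --group_reporting
    done
  done
done
```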
Pool, ZSet & ZVol parameters (see full script)
ZFS module parameters
FAQ
Question: Why test with fio and libaio rather than real-world application usage?
At the moment, I am trying to understand the performance characteristics of ZFS and form a mental model. Something relatively simple like "it can do 100K IOPS/core with just basic checksumming and raidz2" is valuable to me. In Machine Learning, one avoids overfitting to the data at hand. Similarly, I try to avoid over-optimizing for my particular workload because the workloads will evolve.
In short, if I can't predict how ZFS will behave from an IOPS and throughput perspective, I won't use it.
Question: EXT4 on ZFS is a waste; why are you trying it?
In terms of filesystem features, `ZSet > ZVol_Ext4 > Disk_Ext4`. `ZVol_Ext4` offers more than `Disk_Ext4`: it can do integrity, and it can also do snapshotting. `ZSet` offers as much as `ZVol_Ext4` but with more performance and perhaps some more features.

As I form a mental model of ZFS, I like to do basic sanity checks and understand "what am I getting for what cost?". Here, ZFS burns CPU for integrity and additional features; the question is how much. Then I can decide whether the cost is worth it for my purpose.

Direct IO: `ZSet` underperforms `ZVol_Ext4` for up to 15 cores in all patterns, and for up to 25 cores in read-heavy workloads. I wouldn't have predicted that without this test. It's good to do sanity checks to confirm one's mental model. (A sketch of the three targets being compared is shown below.)
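For reference, a rough sketch of how the three targets compared above might be created; the pool name, sizes, mount points, and property values are placeholders rather than the ones from the linked script:

```bash
# Each target is created on the freshly blkdiscarded drive in turn, not all at once.

# ZSet: a native ZFS dataset (placeholder properties)
zpool create tank /dev/nvme0n1
zfs create -o recordsize=4k tank/zset

# ZVol_Ext4: ext4 on top of a zvol, so ext4 still gets ZFS integrity and snapshots underneath
zfs create -V 1T -o volblocksize=4k tank/zvol
mkfs.ext4 /dev/zvol/tank/zvol
mount /dev/zvol/tank/zvol /mnt/zvol_ext4

# Disk_Ext4: ext4 directly on the raw NVMe namespace, with no ZFS in the path (baseline)
mkfs.ext4 /dev/nvme0n1
mount /dev/nvme0n1 /mnt/disk_ext4
```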