L2ARC: Rework write throttling with DWPD rate limiting and parallel writes #18093

ixhamza · 2025-12-29T20:47:36Z

Motivation and Context

The L2ARC write throttling algorithm was designed years ago and has several limitations in modern environments:

Random sublist selection: Since ARC multilists were introduced, each L2ARC iteration picks one sublist at random, and whether anything is written depends on what that sublist happens to contain. This leads to inconsistent write patterns and uneven cache population.
Redundant tail scanning: Each L2ARC iteration scans from the tail of ARC eviction lists. Since lists are only filled from heads, we repeatedly scan the same already-processed entries, wasting CPU cycles. The scanning depth is limited by a conservative headroom setting to reduce this overhead, but it also limits write throughput.
Single feed thread: A single global feed thread served all L2ARC devices across the entire system sequentially. With multiple pools or multiple cache devices, this serialized all writes and prevented scaling throughput with additional devices.
Bursty and unstable rates: Write patterns are bursty and often don't reach target speed. After burst workloads end, L2ARC writing stops for extended periods, failing to maintain consistent write rates.
Obsolete boost mechanism: l2arc_write_boost was designed before persistent L2ARC existed. It tied write rate acceleration to ARC warmup state, but ARC being cold doesn't mean L2ARC should write faster if it's full, and ARC being warm doesn't mean L2ARC should slow down if it's empty.

Description

Even-depth multi-sublist scanning: Replace random single-sublist selection with systematic scanning across all sublists within each burst. Headroom is distributed evenly across sublists for consistent cache population.
Persistent markers: Save scan positions between iterations using per-sublist markers. This eliminates redundant tail scanning and significantly reduces CPU overhead. Large devices (capacity >= arc_c_max/2) use persistent markers; small devices retain original HEAD/TAIL behavior based on ARC warmup state.
ABD deferred free fix: Reorder destruction in arc_hdr_destroy() to destroy L1HDR before L2HDR, allowing proper ABD deferral when L2_WRITING flag is set. Prevents gang ABD panics with multiple L2ARC devices due to I/O aggregation reusing ABDs still in gang chains.
Per-device feed threads: Replace single global feed thread with per-device threads. Each L2ARC device runs independently with its own thread, enabling parallel writes. Sublist busy tracking coordinates between threads to prevent simultaneous access to the same sublist. Throughput scales linearly with cache devices.
DWPD-based rate limiting: Replace bursty algorithm with Drive Writes Per Day limiting to protect SSD endurance. New l2arc_dwpd_limit tunable (default 1) controls writes per day per device capacity. Removes l2arc_write_boost and l2arc_write_interval(). Features include:
- Budget calculated from device size × DWPD limit.
- Unused budget accumulates over 24-hour periods with automatic reset and carry-over.
- Controlled bursts (max 50MB [L2ARC_BURST_SIZE_MAX]) with adaptive intervals to achieve target rates.
- Applies after initial fill when total L2ARC capacity meets persist threshold (arc_c_max/2).
- First-pass writes are free; the DWPD budget starts fresh after the device fills.
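
To make the budget arithmetic above concrete, here is a rough sketch of how a daily budget, an average target rate, and a burst interval could follow from device size and a DWPD limit. The function names and the 1 TB device are hypothetical, and reading "50MB" as 50 MiB is an assumption; only the "device size × DWPD limit" budget and the burst cap come from the description, and the code in arc.c may compute these differently.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BURST_SIZE_MAX = 50 * 1024 * 1024  # burst cap; assumed here to mean 50 MiB


def daily_budget_bytes(device_bytes, dwpd):
    """Bytes the device may absorb per day: device size x DWPD limit."""
    return int(device_bytes * dwpd)


def target_rate_bytes_per_sec(device_bytes, dwpd):
    """Average write rate that spends the daily budget evenly over 24 hours."""
    return daily_budget_bytes(device_bytes, dwpd) / SECONDS_PER_DAY


def burst_interval_sec(device_bytes, dwpd, burst_bytes=BURST_SIZE_MAX):
    """Seconds to wait between bursts so the average matches the target rate."""
    burst = min(burst_bytes, BURST_SIZE_MAX)  # never exceed the burst cap
    return burst / target_rate_bytes_per_sec(device_bytes, dwpd)


device = 10**12  # hypothetical 1 TB cache device
print(daily_budget_bytes(device, 1), "bytes/day")
print(round(target_rate_bytes_per_sec(device, 1) / 1e6, 2), "MB/s average")
print(round(burst_interval_sec(device, 1), 2), "s between full-size bursts")
```

Under these assumptions a 1 TB device at 1 DWPD averages roughly 11.6 MB/s, so full-size bursts would be spaced a few seconds apart rather than issued back to back.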

How Has This Been Tested?

New ZTS tests: l2arc_dwpd_pos, l2arc_dwpd_multidev_pos, l2arc_dwpd_persist_pos, l2arc_parallel_writes_pos.
CI Testing
Manual testing with multiple L2ARC devices

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Quality assurance (non-breaking change which makes the code more robust against bugs)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

shodanshok · 2025-12-31T07:56:11Z

Thanks for looking into L2ARC.

When using the sublist persistent marker, how do you feel about l2arc_headroom = 0 (i.e., scan the entire sublist for eligible buffers)? We considered that in #15457 but settled on the current value of 8 instead, to avoid excessive repeated scans of a large ARC.

DeHackEd · 2025-12-31T14:17:37Z

Just as an idea, there are SSDs out there with a DWPD figure of less than 1.0, and people share SSDs for multiple purposes. Maybe the module parameter should be a percentage or other fractional unit?

ixhamza · 2026-01-02T20:06:41Z

> When using the sublist persistent marker, how do you feel about l2arc_headroom = 0 (i.e., scan the entire sublist for eligible buffers)? We considered that in #15457 but settled on the current value of 8 instead, to avoid excessive repeated scans of a large ARC.

@shodanshok - That's a very good point. With persistent markers, excessive repeated scans are largely addressed, though markers do reset after capacity/8 writes. Some headroom still makes sense since we're trying to visit all sublists in each feed. Bumping the default to a higher value might indeed be beneficial.

> Just as an idea, there are SSDs out there with a DWPD figure of less than 1.0, and people share SSDs for multiple purposes. Maybe the module parameter should be a percentage or other fractional unit?

@DeHackEd - Good point! Current implementation can't handle fractional DWPD values. Switching the module parameter to percentage-based (e.g., 50 = 0.5 DWPD, 100 = 1.0 DWPD) would be beneficial. I'll update this in the current patch.
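
As an illustration of the percentage scale proposed here, a minimal sketch of the conversion using integer math only. The function name is hypothetical, and treating 0 as "no limit" is an assumption of this sketch rather than something stated in this comment.

```python
def dwpd_budget_bytes(device_bytes, dwpd_limit_pct):
    """Daily write budget for a percentage-based limit (100 = 1.0 DWPD).

    Integer-only arithmetic, since kernel code would avoid floating point.
    Returning None for 0 ("no limit") is an assumption of this sketch.
    """
    if dwpd_limit_pct == 0:
        return None
    return device_bytes * dwpd_limit_pct // 100


# A hypothetical 400 GB device limited to 0.5 DWPD may take 200 GB per day.
print(dwpd_budget_bytes(400 * 10**9, 50))
```

Multiplying before dividing keeps precision for small percentages; a real implementation would also have to consider overflow for very large devices.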

Right now I'm investigating an accounting issue that shows up during module unload when parallel fio threads run alongside L2ARC writes. It stems from the arc_release() change in 5b6c99d, so I'm marking this PR as Draft. I'll update once it's resolved and ready for review.

ixhamza · 2026-01-05T14:42:24Z

Ready for review. Updated DWPD parameter to percentage-based for fractional support, resolved arc_release() accounting issue, and rebased to master.

behlendorf

Nice work! I only had a few minor comments. As an experiment I've also requested a review from the bot, let's see how it does.

Copilot

Pull request overview

This pull request implements a comprehensive rework of L2ARC write throttling with DWPD (Drive Writes Per Day) rate limiting and parallel writes. The changes address several limitations in the original implementation including random sublist selection, redundant tail scanning, single global feed thread, and bursty write patterns.

Key Changes:

Replaces random sublist selection with systematic multi-sublist scanning with persistent markers
Implements per-device feed threads for parallel writes across multiple L2ARC devices
Adds DWPD-based rate limiting to protect SSD endurance with configurable limits
Fixes ABD deferred free logic in arc_hdr_destroy() to prevent gang ABD panics
Updates arc_release() to handle L2_WRITING state properly

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/zfs-tests/tests/functional/trim/trim_l2arc.ksh | Adds L2ARC_DWPD_LIMIT tunable save/restore in cleanup |
| tests/zfs-tests/tests/functional/l2arc/l2arc_multidev_throughput_pos.ksh | New test verifying parallel write scaling with multiple devices |
| tests/zfs-tests/tests/functional/l2arc/l2arc_multidev_scaling_pos.ksh | New test verifying 2x throughput improvement with dual devices |
| tests/zfs-tests/tests/functional/l2arc/l2arc_dwpd_reimport_pos.ksh | New test verifying DWPD rate limiting persists across pool export/import |
| tests/zfs-tests/tests/functional/l2arc/l2arc_dwpd_ratelimit_pos.ksh | New test verifying DWPD ordering: unlimited > high > medium > low |
| tests/zfs-tests/tests/functional/cache/cache_012_pos.ksh | Adds L2ARC_DWPD_LIMIT tunable save/restore |
| tests/zfs-tests/tests/Makefile.am | Registers 4 new L2ARC test files |
| tests/zfs-tests/include/tunables.cfg | Adds L2ARC_DWPD_LIMIT tunable mapping |
| tests/runfiles/common.run | Adds 4 new tests to l2arc test suite |
| module/zfs/spa_misc.c | Removes global l2arc_start/l2arc_stop calls |
| module/zfs/arc.c | Core implementation of per-device threads, DWPD limiting, persistent markers, and ABD fixes |
| man/man4/zfs.4 | Documents new l2arc_dwpd_limit parameter and updates l2arc_write_max description |
| include/sys/spa_impl.h | Adds l2arc_info_t structure to spa_t for L2ARC state tracking |
| include/sys/arc_impl.h | Defines L2ARC_FEED_TYPES, l2arc_info_t structure, and per-device thread fields |
| include/sys/arc.h | Removes l2arc_start() and l2arc_stop() function declarations |

ixhamza · 2026-01-18T22:07:08Z

Thanks @behlendorf for your review. Pushed a commit to address your feedback and Copilot's, and rebased to master. Most of Copilot's comments were false positives, though.

ixhamza · 2026-01-19T17:33:47Z

Thanks @behlendorf for the approval. @amotin was planning to take one more look. It would be good to have a couple more people review it, or at least let it sit for a few more days before marking it as Accepted.

ixhamza · 2026-01-21T21:46:33Z

Thanks @amotin for the detailed review. I’ve pushed a commit addressing your feedback and rebased onto master.

The introduction of ARC multilists made L2ARC writing quite random, depending on whether it found something to write in a randomly selected sublist. This created inconsistent write patterns and poor utilization of available sublists leading to uneven cache population. This commit replaces random selection with systematic scanning across all sublists within each burst. Fair headroom distribution ensures even-depth traversal across all sublists until the target write size is reached. Round-robin processing with random starting points eliminates sequential bias while maintaining predictable write behavior. The systematic approach provides consistent L2ARC filling patterns and better utilization of available ARC data across all sublists.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
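
The scanning scheme this commit message describes can be sketched roughly as follows. This is only an illustration of "even headroom share, random start, round-robin over all sublists"; the function name, the list-of-sizes data model, and the byte-based headroom are hypothetical and do not reflect the actual arc.c data structures.

```python
import random


def scan_sublists(sublists, headroom, target, rng=random):
    """Pick buffers to write by visiting every sublist to a similar depth.

    sublists: list of lists of buffer sizes, ordered from the scan start.
    headroom: total bytes we are willing to scan in this burst.
    target:   bytes we want to write in this burst.
    Returns a list of (sublist_index, buffer_size) pairs.
    """
    n = len(sublists)
    if n == 0:
        return []
    per_sublist = headroom // n   # even share of the scan budget
    start = rng.randrange(n)      # random start avoids favouring sublist 0
    picked, written = [], 0
    for step in range(n):
        idx = (start + step) % n  # round-robin over all sublists
        scanned = 0
        for size in sublists[idx]:
            if written >= target or scanned + size > per_sublist:
                break
            scanned += size
            picked.append((idx, size))
            written += size
        if written >= target:
            break
    return picked


lists = [[4, 4, 4] for _ in range(4)]
print(scan_sublists(lists, headroom=32, target=100))
```

With four sublists and a headroom of 32, each sublist contributes at most 8 bytes of scanning, so no single sublist can dominate a burst.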

This commit introduces per-sublist persistent markers that eliminate redundant tail scanning between L2ARC iterations, providing significant CPU efficiency improvements. Markers are pre-allocated during device initialization and properly cleaned up during device removal. The implementation uses conditional behavior based on device capacity: small devices (capacity < arc_c) retain original HEAD/TAIL scanning based on ARC warmup state, while large devices (capacity >= arc_c) use the persistent marker approach for optimal CPU efficiency.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
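
A toy model of the persistent-marker idea, assuming a plain Python list in place of an ARC sublist and an integer index in place of the marker header that the real code keeps in the list. The class and function names are hypothetical. The capacity threshold is left as a parameter because this commit message compares against arc_c while the PR description mentions arc_c_max/2.

```python
class SublistMarker:
    """Remembers how far previous feed passes got in one sublist."""

    def __init__(self):
        self.pos = 0  # index of the next unscanned entry

    def scan(self, entries, max_entries):
        """Return up to max_entries entries starting at the saved position."""
        chunk = entries[self.pos:self.pos + max_entries]
        self.pos += len(chunk)  # the next pass resumes where this one stopped
        return chunk

    def reset(self):
        """Forget the saved position so the next pass starts over."""
        self.pos = 0


def uses_persistent_markers(device_bytes, threshold_bytes):
    """Large devices keep markers; small ones rescan from HEAD/TAIL."""
    return device_bytes >= threshold_bytes


marker = SublistMarker()
entries = list("abcdef")
print(marker.scan(entries, 2))  # first pass
print(marker.scan(entries, 3))  # second pass continues, no rescanning
```

The point of the sketch is only that a second pass does not revisit entries the first pass already examined.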

With multiple L2ARC devices, headers can be destroyed asynchronously (e.g., during zpool sync) while L2_WRITING is set. The original code destroyed L2HDR before L1HDR, causing ABDs to lose their device association (b_l2hdr.b_dev) when arc_hdr_free_abd() is called. This caused ABDs to be added to the global free-on-write list without device information. When any L2ARC device completed its write and attempted to free these orphaned ABDs, it would panic on ASSERT(!list_link_active(&abd->abd_gang_link)) because the ABD was still part of another device's vdev_queue I/O aggregation gang.

Fix by extending l2ad_mtx lock scope to cover L1HDR destruction and reordering to destroy L1HDR before L2HDR when L2_WRITING is set. This ensures arc_hdr_free_abd() can access b_l2hdr.b_dev to properly tag ABDs with their device for deferred cleanup.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

When arc_release() is called on a header with a single buffer and L2_WRITING set, the L2HDR must be preserved for ABD cleanup (similar to the arc_hdr_destroy() case). If we destroy the L2HDR here, later arc_write() will allocate a new ABD and call arc_hdr_free_abd(), which needs b_l2hdr.b_dev to properly defer ABD cleanup, causing VERIFY(HDR_HAS_L2HDR(hdr)) to fail.

Allocate a new header for the buffer in the single_buf_l2writing case (single buffer + L2_WRITING), leaving the original header with L2HDR intact. The original header becomes an "orphan" (no buffers, no b_pabd) but retains device association for ABD cleanup when l2arc_write_done() completes.

The shared buffer case (HDR_SHARED_DATA) is excluded because L2ARC makes its own transformed copy via l2arc_apply_transforms(), so the original ABD is not used by the L2 write. The header can be safely reused without allocating a new one.

For proper evictable space accounting, arc_buf_remove() must be called before remove_reference() in the single_buf_l2writing path. This ensures arc_evictable_space_increment() (during remove_reference) and arc_evictable_space_decrement() (during destruction) see the same state (b_buf=NULL), preventing accounting leaks that cause module unload to hang with non-zero esize.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Transform L2ARC from single global feed thread to per-device threads, enabling parallel writes to multiple L2ARC devices. Each device runs its own feed thread independently, improving multi-device throughput. Previously, a single thread served all devices sequentially; now each device writes concurrently. Threads are created during device addition and torn down on removal.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
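
As a loose analogy for per-device feed threads combined with the sublist busy tracking mentioned in the PR description, the sketch below gives each device its own thread and lets a thread skip any sublist another thread currently holds. All names are hypothetical, and Python locks stand in for whatever busy-tracking mechanism arc.c actually uses.

```python
import threading


def feed_device(dev_name, sublists, busy, results):
    """One feed pass for one device: skip sublists another thread holds."""
    taken = []
    for idx, sublist in enumerate(sublists):
        if not busy[idx].acquire(blocking=False):
            continue  # another device's thread is scanning this sublist
        try:
            if sublist:
                taken.append(sublist.pop())  # "write" one buffer
        finally:
            busy[idx].release()
    results[dev_name] = taken


def run_feeds(devices, sublists):
    """Start one feed thread per device and wait for all of them."""
    busy = [threading.Lock() for _ in sublists]
    results = {}
    threads = [threading.Thread(target=feed_device,
                                args=(d, sublists, busy, results))
               for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_feeds(["cache0", "cache1"], [[1, 2], [3, 4], [5, 6]]))
```

Which device ends up with which buffer depends on scheduling, but because each sublist is only touched while its lock is held, no buffer is handed to two devices.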

Add DWPD (Drive Writes Per Day) rate limiting to control L2ARC write speeds and protect SSD endurance. Write rate is constrained by the minimum of l2arc_write_max and DWPD-calculated budget. Devices accumulate unused write budget over 24-hour periods with automatic reset and carry-over. Writes occur in controlled bursts (max 50MB) with adaptive intervals to achieve target rates. Applies after initial device fill.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
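
One possible reading of "minimum of l2arc_write_max and DWPD-calculated budget" together with "24-hour periods with automatic reset and carry-over" is sketched below. The class, its fields, and in particular the rule that at most one period's worth of unused budget carries over are assumptions made for illustration; the commit message does not spell out the exact carry-over policy.

```python
SECONDS_PER_DAY = 86400


class DwpdBudget:
    """Tracks how many bytes a device may still accept in the current day."""

    def __init__(self, device_bytes, dwpd_limit_pct, now=0):
        self.daily = device_bytes * dwpd_limit_pct // 100  # 100 = 1.0 DWPD
        self.remaining = self.daily
        self.period_start = now

    def _roll_period(self, now):
        # Start a new 24-hour period for each full day elapsed. Unused budget
        # carries over, capped at one day's worth (an assumption of this sketch).
        while now - self.period_start >= SECONDS_PER_DAY:
            carry = min(self.remaining, self.daily)
            self.remaining = self.daily + carry
            self.period_start += SECONDS_PER_DAY

    def allowed(self, want_bytes, write_max, now):
        """Bytes that may be written now: min of request, write_max, budget."""
        self._roll_period(now)
        grant = max(0, min(want_bytes, write_max, self.remaining))
        self.remaining -= grant
        return grant


budget = DwpdBudget(device_bytes=1000, dwpd_limit_pct=100)
print(budget.allowed(600, 800, now=0))
print(budget.allowed(600, 800, now=10))
```

Once the day's budget is spent, further requests are granted zero bytes until the next period begins.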

Add l2arc_dwpd_limit, remove l2arc_write_boost, update related tunables.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

ixhamza · 2026-01-27T18:23:27Z

Updated as per feedback provided by @amotin in private and rebased onto master.

Add four new functional tests to validate L2ARC DWPD rate limiting and parallel write features:
- l2arc_dwpd_ratelimit_pos: Verifies DWPD rate limiting with different values (0, 100, 1000, 10000) and ordering
- l2arc_dwpd_reimport_pos: Verifies DWPD rate limiting persists after pool export/import
- l2arc_multidev_scaling_pos: Verifies parallel write scaling ratio (dual devices achieve ~2× single device throughput)
- l2arc_multidev_throughput_pos: Verifies absolute parallel write throughput scales with device count (~32MB/s per device)

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

These fields became unused when ABD was introduced in a6255b7. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

amotin

Looks good to me.

ixhamza force-pushed the NAS-133831 branch 2 times, most recently from 21a3839 to 77f842a, December 30, 2025 10:27

ixhamza marked this pull request as draft January 2, 2026 20:06

github-actions bot added the Status: Work in Progress label Jan 2, 2026

ixhamza force-pushed the NAS-133831 branch from 77f842a to 3582ab9, January 5, 2026 09:52

ixhamza marked this pull request as ready for review January 5, 2026 14:42

ixhamza force-pushed the NAS-133831 branch from 3582ab9 to 38b52b9, January 5, 2026 14:42

github-actions bot added the Status: Code Review Needed label and removed the Status: Work in Progress label Jan 5, 2026

behlendorf self-requested a review January 5, 2026 22:12

ixhamza force-pushed the NAS-133831 branch from 38b52b9 to d04a3f9, January 6, 2026 20:51

behlendorf reviewed Jan 8, 2026

behlendorf requested a review from Copilot January 8, 2026 22:36

Copilot started reviewing on behalf of behlendorf January 8, 2026 22:36

Copilot AI reviewed Jan 8, 2026

ixhamza force-pushed the NAS-133831 branch from d04a3f9 to ea65de2, January 18, 2026 22:04

ixhamza force-pushed the NAS-133831 branch from ea65de2 to 3d1a25c, January 18, 2026 22:13

behlendorf approved these changes Jan 19, 2026

behlendorf added the Status: Accepted label and removed the Status: Code Review Needed label Jan 19, 2026

amotin reviewed Jan 19, 2026

amotin reviewed Jan 20, 2026

ixhamza force-pushed the NAS-133831 branch from 3d1a25c to b79ebe3, January 21, 2026 21:44

github-actions bot removed the Status: Accepted label Jan 21, 2026

ixhamza force-pushed the NAS-133831 branch from b79ebe3 to d2d0e97, January 23, 2026 17:32

ixhamza mentioned this pull request Jan 27, 2026

L2ARC: tag deferred ABDs with device to fix multi-device parallel writes #18158

Closed

8 tasks

ixhamza added 7 commits January 27, 2026 23:04

man: Update L2ARC tunables for DWPD and parallel writes

657367c

Add l2arc_dwpd_limit, remove l2arc_write_boost, update related tunables. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

ixhamza force-pushed the NAS-133831 branch from d2d0e97 to a291152, January 27, 2026 18:04

ixhamza force-pushed the NAS-133831 branch from a291152 to 6c0e877, January 27, 2026 22:31

ixhamza added 3 commits January 28, 2026 15:31

cache_012_pos: disable compression to ensure L2ARC wrap

2f7c661

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

arc: remove unused l2df_size and l2df_type from l2arc_data_free_t

9fd541c

These fields became unused when ABD was introduced in a6255b7. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

ixhamza force-pushed the NAS-133831 branch from 6c0e877 to 9fd541c, January 28, 2026 10:32

amotin approved these changes Jan 28, 2026

amotin added the Status: Code Review Needed label Jan 28, 2026
