@ixhamza
Member

@ixhamza ixhamza commented Dec 29, 2025

Motivation and Context

The L2ARC write throttling algorithm was designed years ago and has several limitations in modern environments:

  • Random sublist selection: The introduction of ARC multilists made L2ARC writing random. Each pass picks a single sublist and writes only if that sublist happens to contain eligible buffers, leading to inconsistent write patterns and uneven cache population.
  • Redundant tail scanning: Each L2ARC iteration scans from the tail of ARC eviction lists. Since lists are only filled from heads, we repeatedly scan the same already-processed entries, wasting CPU cycles. The scanning depth is limited by a conservative headroom setting to reduce this overhead, but it also limits write throughput.
  • Single feed thread: A single global feed thread serves all L2ARC devices across the entire system sequentially. With multiple pools or multiple cache devices, this serializes all writes and prevents throughput from scaling with additional devices.
  • Bursty and unstable rates: Write patterns are bursty and often don't reach target speed. After burst workloads end, L2ARC writing stops for extended periods, failing to maintain consistent write rates.
  • Obsolete boost mechanism: l2arc_write_boost was designed before persistent L2ARC existed. It tied write rate acceleration to ARC warmup state, but ARC being cold doesn't mean L2ARC should write faster if it's full, and ARC being warm doesn't mean L2ARC should slow down if it's empty.

Description

  • Even-depth multi-sublist scanning: Replace random single-sublist selection with systematic scanning across all sublists within each burst. Headroom is distributed evenly across sublists for consistent cache population.
  • Persistent markers: Save scan positions between iterations using per-sublist markers. This eliminates redundant tail scanning and significantly reduces CPU overhead. Large devices (capacity >= arc_c_max/2) use persistent markers; small devices retain original HEAD/TAIL behavior based on ARC warmup state.
  • ABD deferred free fix: Reorder destruction in arc_hdr_destroy() to destroy L1HDR before L2HDR, allowing proper ABD deferral when L2_WRITING flag is set. Prevents gang ABD panics with multiple L2ARC devices due to I/O aggregation reusing ABDs still in gang chains.
  • Per-device feed threads: Replace single global feed thread with per-device threads. Each L2ARC device runs independently with its own thread, enabling parallel writes. Sublist busy tracking coordinates between threads to prevent simultaneous access to the same sublist. Throughput scales linearly with cache devices.
  • DWPD-based rate limiting: Replace bursty algorithm with Drive Writes Per Day limiting to protect SSD endurance. New l2arc_dwpd_limit tunable (default 1) controls writes per day per device capacity. Removes l2arc_write_boost and l2arc_write_interval(). Features include:
    • Budget calculated from device size × DWPD limit.
    • Unused budget accumulates over 24-hour periods with automatic reset and carry-over.
    • Controlled bursts (max 50MB [L2ARC_BURST_SIZE_MAX]) with adaptive intervals to achieve target rates.
    • Applies after initial fill when total L2ARC capacity meets persist threshold (arc_c_max/2).
    • First-pass writes are free; the DWPD budget starts fresh after the device fills.
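
The budget rules above can be modeled roughly as follows. This is a minimal Python sketch, not the arc.c implementation: the class and method names, the unbounded carry-over, and the interval formula are assumptions made for illustration. Only the "device size × DWPD limit" budget, the carry-over of unused budget, and the 50 MB burst cap come from the description.

```python
from dataclasses import dataclass
from typing import Tuple

DAY_SECONDS = 24 * 60 * 60
BURST_SIZE_MAX = 50 * 1024 * 1024  # 50 MB burst cap, per the description


@dataclass
class DwpdBudget:
    """Toy model of a per-device daily write budget (hypothetical names)."""
    device_bytes: int
    dwpd_limit: int = 1   # whole drive writes per day
    remaining: int = 0    # bytes still allowed in the current period

    def daily_budget(self) -> int:
        # Budget = device size x DWPD limit.
        return self.device_bytes * self.dwpd_limit

    def start_new_day(self) -> None:
        # Assumption: unused budget simply carries over into the next period.
        self.remaining += self.daily_budget()

    def next_burst(self, seconds_left_in_day: int) -> Tuple[int, float]:
        """Return (burst_bytes, interval_seconds) for the next write."""
        if self.remaining <= 0 or seconds_left_in_day <= 0:
            return 0, float(seconds_left_in_day)
        burst = min(BURST_SIZE_MAX, self.remaining)
        # Spread the remaining budget evenly over the rest of the day.
        target_rate = self.remaining / seconds_left_in_day
        return burst, burst / target_rate

    def consume(self, written: int) -> None:
        self.remaining -= written
```

In this model, a 100 GiB device at 1 DWPD with a full day ahead would issue one capped burst about every 42 seconds, which is how a small burst size plus an adaptive interval can approximate a steady target rate.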

How Has This Been Tested?

  • New ZTS tests: l2arc_dwpd_pos, l2arc_dwpd_multidev_pos, l2arc_dwpd_persist_pos, l2arc_parallel_writes_pos.
  • CI Testing
  • Manual testing with multiple L2ARC devices

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@ixhamza ixhamza force-pushed the NAS-133831 branch 2 times, most recently from 21a3839 to 77f842a on December 30, 2025 10:27
@shodanshok
Contributor

Thanks for looking into L2ARC.

When using the sublist persistent marker, how do you feel about l2arc_headroom = 0 (ie: scan the entire sublist for eligible buffers)? We considered that in #15457 but decided for the current value of 8 instead to avoid excessive repeated scans of large ARC.

@DeHackEd
Contributor

Just as an idea, there are SSDs out there with a DWPD figure of less than 1.0, and people share SSDs for multiple purposes. Maybe the module parameter should be a percentage or other fractional unit?

@ixhamza
Member Author

ixhamza commented Jan 2, 2026

> When using the sublist persistent marker, how do you feel about l2arc_headroom = 0 (ie: scan the entire sublist for eligible buffers)? We considered that in #15457 but decided for the current value of 8 instead to avoid excessive repeated scans of large ARC.

@shodanshok - That's a very good point. With persistent markers, excessive repeated scans are largely addressed, though markers do reset after capacity/8 writes. Some headroom still makes sense since we're trying to visit all sublists in each feed. Bumping the default to a higher value might indeed be beneficial.

> Just as an idea, there are SSDs out there with a DWPD figure of less than 1.0, and people share SSDs for multiple purposes. Maybe the module parameter should be a percentage or other fractional unit?

@DeHackEd - Good point! The current implementation can't handle fractional DWPD values. Switching the module parameter to a percentage-based value (e.g., 50 = 0.5 DWPD, 100 = 1.0 DWPD) would be beneficial. I'll update this in the current patch.
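
To make the percentage encoding concrete, here is a minimal Python sketch of the conversion. The function name is hypothetical, and the integer rounding is an assumption; only the mapping 50 = 0.5 DWPD and 100 = 1.0 DWPD comes from the comment above.

```python
def dwpd_daily_budget(device_bytes: int, dwpd_percent: int) -> int:
    """Bytes per day allowed for a device (hypothetical helper).

    dwpd_percent follows the encoding proposed above:
    50 means 0.5 drive writes per day, 100 means 1.0.
    """
    if dwpd_percent < 0:
        raise ValueError("dwpd_percent must be non-negative")
    # Integer math keeps fractional DWPD values exact without floats.
    return device_bytes * dwpd_percent // 100
```

Keeping the parameter an integer percentage allows values such as 0.3 DWPD (30) without introducing floating-point module parameters.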

Right now I'm investigating an accounting issue that occurs during module unload when using parallel fio threads with L2ARC writes, caused by the arc_release() change in 5b6c99d, so I'm marking this PR as Draft. Will update once it's resolved and ready for review.

@ixhamza ixhamza marked this pull request as draft January 2, 2026 20:06
@github-actions github-actions bot added the "Status: Work in Progress" (Not yet ready for general review) label Jan 2, 2026
@ixhamza ixhamza marked this pull request as ready for review January 5, 2026 14:42
@github-actions github-actions bot added the "Status: Code Review Needed" (Ready for review and testing) label and removed the "Status: Work in Progress" (Not yet ready for general review) label Jan 5, 2026
@ixhamza
Member Author

ixhamza commented Jan 5, 2026

Ready for review. Updated DWPD parameter to percentage-based for fractional support, resolved arc_release() accounting issue, and rebased to master.

Contributor

@behlendorf behlendorf left a comment

Nice work! I only had a few minor comments. As an experiment I've also requested a review from the bot, let's see how it does.

Copilot AI left a comment

Pull request overview

This pull request implements a comprehensive rework of L2ARC write throttling with DWPD (Drive Writes Per Day) rate limiting and parallel writes. The changes address several limitations in the original implementation including random sublist selection, redundant tail scanning, single global feed thread, and bursty write patterns.

Key Changes:

  • Replaces random sublist selection with systematic multi-sublist scanning with persistent markers
  • Implements per-device feed threads for parallel writes across multiple L2ARC devices
  • Adds DWPD-based rate limiting to protect SSD endurance with configurable limits
  • Fixes ABD deferred free logic in arc_hdr_destroy() to prevent gang ABD panics
  • Updates arc_release() to handle L2_WRITING state properly

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/zfs-tests/tests/functional/trim/trim_l2arc.ksh | Adds L2ARC_DWPD_LIMIT tunable save/restore in cleanup |
| tests/zfs-tests/tests/functional/l2arc/l2arc_multidev_throughput_pos.ksh | New test verifying parallel write scaling with multiple devices |
| tests/zfs-tests/tests/functional/l2arc/l2arc_multidev_scaling_pos.ksh | New test verifying 2x throughput improvement with dual devices |
| tests/zfs-tests/tests/functional/l2arc/l2arc_dwpd_reimport_pos.ksh | New test verifying DWPD rate limiting persists across pool export/import |
| tests/zfs-tests/tests/functional/l2arc/l2arc_dwpd_ratelimit_pos.ksh | New test verifying DWPD ordering: unlimited > high > medium > low |
| tests/zfs-tests/tests/functional/cache/cache_012_pos.ksh | Adds L2ARC_DWPD_LIMIT tunable save/restore |
| tests/zfs-tests/tests/Makefile.am | Registers 4 new L2ARC test files |
| tests/zfs-tests/include/tunables.cfg | Adds L2ARC_DWPD_LIMIT tunable mapping |
| tests/runfiles/common.run | Adds 4 new tests to l2arc test suite |
| module/zfs/spa_misc.c | Removes global l2arc_start/l2arc_stop calls |
| module/zfs/arc.c | Core implementation of per-device threads, DWPD limiting, persistent markers, and ABD fixes |
| man/man4/zfs.4 | Documents new l2arc_dwpd_limit parameter and updates l2arc_write_max description |
| include/sys/spa_impl.h | Adds l2arc_info_t structure to spa_t for L2ARC state tracking |
| include/sys/arc_impl.h | Defines L2ARC_FEED_TYPES, l2arc_info_t structure, and per-device thread fields |
| include/sys/arc.h | Removes l2arc_start() and l2arc_stop() function declarations |

@ixhamza
Member Author

ixhamza commented Jan 18, 2026

Thanks @behlendorf for your review. Pushed a commit to address your and Copilot's feedback and rebased to master. Most of Copilot's comments were false positives, though.

@behlendorf behlendorf added the "Status: Accepted" (Ready to integrate (reviewed, tested)) label and removed the "Status: Code Review Needed" (Ready for review and testing) label Jan 19, 2026
@ixhamza
Member Author

ixhamza commented Jan 19, 2026

Thanks @behlendorf for the approval. @amotin was planning to take one more look. It would be good to have a couple more people review it, or at least let it sit for a few more days before marking it as Accepted.

@github-actions github-actions bot removed the "Status: Accepted" (Ready to integrate (reviewed, tested)) label Jan 21, 2026
@ixhamza
Member Author

ixhamza commented Jan 21, 2026

Thanks @amotin for the detailed review. I’ve pushed a commit addressing your feedback and rebased onto master.

The introduction of ARC multilists made L2ARC writing quite random,
depending on whether it found something to write in a randomly selected
sublist. This created inconsistent write patterns and poor utilization
of available sublists leading to uneven cache population.

This commit replaces random selection with systematic scanning across
all sublists within each burst. Fair headroom distribution ensures
even-depth traversal across all sublists until the target write size
is reached. Round-robin processing with random starting points eliminates
sequential bias while maintaining predictable write behavior.

The systematic approach provides consistent L2ARC filling patterns
and better utilization of available ARC data across all sublists.
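
The scanning scheme described in this commit message can be sketched as a toy Python model. It is not the arc.c code: the function name is hypothetical, sublists are plain lists of buffer sizes with the tail at the end, headroom is counted in scanned bytes, and every scanned buffer is assumed eligible for writing.

```python
import random
from typing import List, Optional, Tuple


def scan_sublists(sublists: List[List[int]], target_bytes: int,
                  headroom_bytes: int,
                  rng: Optional[random.Random] = None) -> List[Tuple[int, int]]:
    """Pick buffers round-robin from all sublists at even depth.

    Returns (sublist_index, buffer_size) pairs in the order picked.
    """
    rng = rng or random.Random()
    n = len(sublists)
    if n == 0:
        return []
    per_list_headroom = headroom_bytes // n   # fair headroom share
    start = rng.randrange(n)                  # random starting sublist
    order = [(start + i) % n for i in range(n)]
    scanned = [0] * n   # bytes scanned so far in each sublist
    depth = [0] * n     # how far from the tail each sublist has been read
    picked: List[Tuple[int, int]] = []
    written = 0
    progress = True
    while written < target_bytes and progress:
        progress = False
        for idx in order:
            sub = sublists[idx]
            if depth[idx] >= len(sub) or scanned[idx] >= per_list_headroom:
                continue  # sublist exhausted or its headroom share is used up
            size = sub[-1 - depth[idx]]  # next buffer, counting from the tail
            depth[idx] += 1
            scanned[idx] += size
            picked.append((idx, size))
            written += size
            progress = True
            if written >= target_bytes:
                break
    return picked
```

Because each round takes at most one buffer per sublist, no sublist is traversed more than one step deeper than any other, and the random starting index avoids always favoring the first sublist when the target is reached mid-round.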

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

This commit introduces per-sublist persistent markers that eliminate
redundant tail scanning between L2ARC iterations, providing significant
CPU efficiency improvements. Markers are pre-allocated during device
initialization and properly cleaned up during device removal.

The implementation uses conditional behavior based on device capacity:
small devices (capacity < arc_c) retain original HEAD/TAIL scanning
based on ARC warmup state, while large devices (capacity >= arc_c)
use the persistent marker approach for optimal CPU efficiency.
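
A minimal Python sketch of the marker idea, under stated assumptions: the class and function names are hypothetical, the marker is modeled as an index rather than a list node, and eviction from the tail is not modeled. The capacity threshold is passed in as a parameter, following the `capacity >= arc_c` rule quoted in this commit message.

```python
from typing import List


def use_persistent_markers(device_bytes: int, threshold_bytes: int) -> bool:
    # Large devices use markers; small ones keep rescanning from the tail.
    return device_bytes >= threshold_bytes


class Sublist:
    """Toy sublist: index 0 is the tail (oldest), new buffers go to the head."""

    def __init__(self) -> None:
        self.items: List[int] = []
        self.marker = 0  # index of the first buffer not yet scanned

    def insert_head(self, buf: int) -> None:
        self.items.append(buf)

    def scan(self, max_items: int, persistent: bool) -> List[int]:
        # Without a marker, every pass restarts at the tail and revisits
        # the same buffers; with one, it resumes where the last pass ended.
        start = self.marker if persistent else 0
        batch = self.items[start:start + max_items]
        if persistent:
            self.marker = start + len(batch)
        return batch
```

With `persistent=True`, two consecutive scans of the same sublist return disjoint batches, which is the CPU saving the commit message refers to.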

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

With multiple L2ARC devices, headers can be destroyed asynchronously
(e.g., during zpool sync) while L2_WRITING is set. The original code
destroyed L2HDR before L1HDR, causing ABDs to lose their device
association (b_l2hdr.b_dev) when arc_hdr_free_abd() is called.

This caused ABDs to be added to the global free-on-write list without
device information. When any L2ARC device completed its write and
attempted to free these orphaned ABDs, it would panic on
ASSERT(!list_link_active(&abd->abd_gang_link)) because the ABD was
still part of another device's vdev_queue I/O aggregation gang.

Fix by extending l2ad_mtx lock scope to cover L1HDR destruction and
reordering to destroy L1HDR before L2HDR when L2_WRITING is set. This
ensures arc_hdr_free_abd() can access b_l2hdr.b_dev to properly tag
ABDs with their device for deferred cleanup.
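
The effect of the destruction order can be illustrated with a toy Python model. Everything here is a simplification for illustration: the types and function names are hypothetical stand-ins, strings replace ABDs and devices, and locking (the extended l2ad_mtx scope) is not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Header:
    abd: Optional[str]      # stands in for the L1 data buffer
    l2_dev: Optional[str]   # stands in for b_l2hdr.b_dev; None once L2HDR is gone
    l2_writing: bool = False


@dataclass
class FreeOnWrite:
    # (abd, device) pairs waiting for the device's in-flight write to finish.
    entries: List[Tuple[Optional[str], Optional[str]]] = field(default_factory=list)


def free_abd(hdr: Header, deferred: FreeOnWrite) -> None:
    if hdr.l2_writing:
        # Deferred free: tag the ABD with whatever device is still attached.
        deferred.entries.append((hdr.abd, hdr.l2_dev))
    hdr.abd = None


def destroy(hdr: Header, deferred: FreeOnWrite, l1_first: bool) -> None:
    if l1_first:
        free_abd(hdr, deferred)   # device association still available
        hdr.l2_dev = None
    else:
        hdr.l2_dev = None         # association lost before the ABD is freed
        free_abd(hdr, deferred)
```

In this model, destroying the L2 side first leaves a deferred entry with no device, which corresponds to the orphaned ABDs described above; destroying the L1 side first records the owning device.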

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

When arc_release() is called on a header with a single buffer and
L2_WRITING set, the L2HDR must be preserved for ABD cleanup (similar
to the arc_hdr_destroy() case). If we destroy the L2HDR here, later
arc_write() will allocate a new ABD and call arc_hdr_free_abd(),
which needs b_l2hdr.b_dev to properly defer ABD cleanup, causing
VERIFY(HDR_HAS_L2HDR(hdr)) to fail.

Allocate a new header for the buffer in the single_buf_l2writing
case (single buffer + L2_WRITING), leaving the original header with
L2HDR intact. The original header becomes an "orphan" (no buffers, no
b_pabd) but retains device association for ABD cleanup when
l2arc_write_done() completes.

The shared buffer case (HDR_SHARED_DATA) is excluded because L2ARC
makes its own transformed copy via l2arc_apply_transforms(), so the
original ABD is not used by the L2 write. The header can be safely
reused without allocating a new one.

For proper evictable space accounting, arc_buf_remove() must be
called before remove_reference() in the single_buf_l2writing path.
This ensures arc_evictable_space_increment() (during remove_reference)
and arc_evictable_space_decrement() (during destruction) see the
same state (b_buf=NULL), preventing accounting leaks that cause
module unload to hang with non-zero esize.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Transform L2ARC from single global feed thread to per-device threads,
enabling parallel writes to multiple L2ARC devices. Each device runs
its own feed thread independently, improving multi-device throughput.
Previously, a single thread served all devices sequentially; now each
device writes concurrently. Threads are created during device addition
and torn down on removal.
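
A rough Python sketch of per-device feed threads, combined with the sublist busy tracking mentioned in the PR description. All names are hypothetical, and the model is deliberately simplified: buffers are popped from the sublist so that each one goes to exactly one device, whereas the real code leaves buffers in the ARC lists.

```python
import threading
from typing import Dict, List


class SublistBusy:
    """Tracks which sublists are currently being scanned by some thread."""

    def __init__(self, count: int) -> None:
        self._lock = threading.Lock()
        self._busy = [False] * count

    def try_claim(self, idx: int) -> bool:
        with self._lock:
            if self._busy[idx]:
                return False
            self._busy[idx] = True
            return True

    def release(self, idx: int) -> None:
        with self._lock:
            self._busy[idx] = False


def feed_device(name: str, sublists: List[List[int]], busy: SublistBusy,
                written: Dict[str, List[int]], passes: int) -> None:
    for _ in range(passes):
        for idx, sub in enumerate(sublists):
            if not busy.try_claim(idx):
                continue  # another device's thread holds this sublist
            try:
                if sub:
                    written[name].append(sub.pop())
            finally:
                busy.release(idx)


def run_feeds(device_names: List[str], sublists: List[List[int]],
              passes: int = 4) -> Dict[str, List[int]]:
    busy = SublistBusy(len(sublists))
    written: Dict[str, List[int]] = {name: [] for name in device_names}
    # One feed thread per device, started together and joined at the end.
    threads = [threading.Thread(target=feed_device,
                                args=(name, sublists, busy, written, passes))
               for name in device_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return written
```

A thread that finds a sublist busy skips it instead of blocking, so devices keep making progress on other sublists.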

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Add DWPD (Drive Writes Per Day) rate limiting to control L2ARC write
speeds and protect SSD endurance. Write rate is constrained by the
minimum of l2arc_write_max and DWPD-calculated budget. Devices
accumulate unused write budget over 24-hour periods with automatic reset
and carry-over. Writes occur in controlled bursts (max 50MB) with
adaptive intervals to achieve target rates. Applies after initial device
fill.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Add l2arc_dwpd_limit, remove l2arc_write_boost, update related tunables.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

@ixhamza
Member Author

ixhamza commented Jan 27, 2026

Updated as per feedback provided by @amotin in private and rebased onto master.

Add four new functional tests to validate L2ARC DWPD rate limiting and
parallel write features:

- l2arc_dwpd_ratelimit_pos: Verifies DWPD rate limiting with different
  values (0, 100, 1000, 10000) and ordering
- l2arc_dwpd_reimport_pos: Verifies DWPD rate limiting persists after
  pool export/import
- l2arc_multidev_scaling_pos: Verifies parallel write scaling ratio
  (dual devices achieve ~2× single device throughput)
- l2arc_multidev_throughput_pos: Verifies absolute parallel write
  throughput scales with device count (~32MB/s per device)

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

These fields became unused when ABD was introduced in a6255b7.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Member

@amotin amotin left a comment

Looks good to me.

@amotin amotin added the "Status: Code Review Needed" (Ready for review and testing) label Jan 28, 2026