L2ARC: Rework write throttling with DWPD rate limiting and parallel writes #18093
|
Thanks for looking into L2ARC. When using the sublist persistent marker, how do you feel about |
|
Just as an idea, there are SSDs out there with a DWPD figure of less than 1.0, and people share SSDs for multiple purposes. Maybe the module parameter should be a percentage or other fractional unit? |
@shodanshok - That's a very good point. With persistent markers, excessive repeated scans are largely addressed, though markers do reset after capacity/8 writes. Some headroom still makes sense since we're trying to visit all sublists in each feed. Bumping the default to a higher value might indeed be beneficial.
@DeHackEd - Good point! Current implementation can't handle fractional DWPD values. Switching the module parameter to percentage-based (e.g., 50 = 0.5 DWPD, 100 = 1.0 DWPD) would be beneficial. I'll update this in the current patch. Right now I'm investigating an accounting issue that occurs during module unload when using parallel fio threads with L2ARC writes due to |
|
Ready for review. Updated DWPD parameter to percentage-based for fractional support, resolved |
behlendorf
left a comment
Nice work! I only had a few minor comments. As an experiment I've also requested a review from the bot, let's see how it does.
tests/zfs-tests/tests/functional/l2arc/l2arc_dwpd_ratelimit_pos.ksh
Pull request overview
This pull request implements a comprehensive rework of L2ARC write throttling with DWPD (Drive Writes Per Day) rate limiting and parallel writes. The changes address several limitations in the original implementation including random sublist selection, redundant tail scanning, single global feed thread, and bursty write patterns.
Key Changes:
- Replaces random sublist selection with systematic multi-sublist scanning with persistent markers
- Implements per-device feed threads for parallel writes across multiple L2ARC devices
- Adds DWPD-based rate limiting to protect SSD endurance with configurable limits
- Fixes ABD deferred free logic in arc_hdr_destroy() to prevent gang ABD panics
- Updates arc_release() to handle L2_WRITING state properly
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/zfs-tests/tests/functional/trim/trim_l2arc.ksh | Adds L2ARC_DWPD_LIMIT tunable save/restore in cleanup |
| tests/zfs-tests/tests/functional/l2arc/l2arc_multidev_throughput_pos.ksh | New test verifying parallel write scaling with multiple devices |
| tests/zfs-tests/tests/functional/l2arc/l2arc_multidev_scaling_pos.ksh | New test verifying 2x throughput improvement with dual devices |
| tests/zfs-tests/tests/functional/l2arc/l2arc_dwpd_reimport_pos.ksh | New test verifying DWPD rate limiting persists across pool export/import |
| tests/zfs-tests/tests/functional/l2arc/l2arc_dwpd_ratelimit_pos.ksh | New test verifying DWPD ordering: unlimited > high > medium > low |
| tests/zfs-tests/tests/functional/cache/cache_012_pos.ksh | Adds L2ARC_DWPD_LIMIT tunable save/restore |
| tests/zfs-tests/tests/Makefile.am | Registers 4 new L2ARC test files |
| tests/zfs-tests/include/tunables.cfg | Adds L2ARC_DWPD_LIMIT tunable mapping |
| tests/runfiles/common.run | Adds 4 new tests to l2arc test suite |
| module/zfs/spa_misc.c | Removes global l2arc_start/l2arc_stop calls |
| module/zfs/arc.c | Core implementation of per-device threads, DWPD limiting, persistent markers, and ABD fixes |
| man/man4/zfs.4 | Documents new l2arc_dwpd_limit parameter and updates l2arc_write_max description |
| include/sys/spa_impl.h | Adds l2arc_info_t structure to spa_t for L2ARC state tracking |
| include/sys/arc_impl.h | Defines L2ARC_FEED_TYPES, l2arc_info_t structure, and per-device thread fields |
| include/sys/arc.h | Removes l2arc_start() and l2arc_stop() function declarations |
|
Thanks @behlendorf for your review. Pushed a commit to address yours and Copilot's feedback and rebased to master. Most of Copilot's feedback was false positive though. |
|
Thanks @behlendorf for the approval. @amotin was planning to take one more look. It would be good to have a couple more people review it, or at least let it sit for a few more days before marking it as Accepted. |
tests/zfs-tests/tests/functional/l2arc/l2arc_multidev_scaling_pos.ksh
tests/zfs-tests/tests/functional/l2arc/l2arc_multidev_throughput_pos.ksh
|
Thanks @amotin for the detailed review. I’ve pushed a commit addressing your feedback and rebased onto |
The introduction of ARC multilists made L2ARC writing quite random, depending on whether it found something to write in a randomly selected sublist. This created inconsistent write patterns and poor utilization of available sublists leading to uneven cache population. This commit replaces random selection with systematic scanning across all sublists within each burst. Fair headroom distribution ensures even-depth traversal across all sublists until the target write size is reached. Round-robin processing with random starting points eliminates sequential bias while maintaining predictable write behavior. The systematic approach provides consistent L2ARC filling patterns and better utilization of available ARC data across all sublists. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
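The scanning strategy in this commit message can be illustrated with a minimal sketch. It assumes a toy model in which each sublist is reduced to a byte counter; `sublist_t` and `scan_pass()` are hypothetical names for illustration, not identifiers from arc.c, and the even split of the remaining target is only one plausible reading of "fair headroom distribution".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: a sublist is just the number of bytes eligible for
 * L2ARC. Real ARC multilists hold buffer headers, not counters.
 */
typedef struct sublist {
	uint64_t eligible;	/* bytes still available to write */
} sublist_t;

/*
 * Visit every sublist once, round-robin from `start`, taking at most an
 * equal share of the remaining target from each one. Returns the number
 * of bytes collected, which never exceeds `target`.
 */
static uint64_t
scan_pass(sublist_t *sl, size_t n, size_t start, uint64_t target)
{
	uint64_t written = 0;

	assert(n > 0 && start < n);
	for (size_t i = 0; i < n && written < target; i++) {
		sublist_t *s = &sl[(start + i) % n];
		uint64_t left = n - i;	/* sublists not yet visited */
		/* Fair share: remaining target split evenly, rounded up. */
		uint64_t share = (target - written + left - 1) / left;
		uint64_t take = s->eligible < share ? s->eligible : share;

		s->eligible -= take;
		written += take;
	}
	return (written);
}
```

A caller would pick `start` at random for each burst (for example `rand() % n`) so that no sublist is systematically favoured. When one sublist runs short, its unused share is redistributed over the sublists that follow it in the same pass.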
This commit introduces per-sublist persistent markers that eliminate redundant tail scanning between L2ARC iterations, providing significant CPU efficiency improvements. Markers are pre-allocated during device initialization and properly cleaned up during device removal. The implementation uses conditional behavior based on device capacity: small devices (capacity < arc_c) retain original HEAD/TAIL scanning based on ARC warmup state, while large devices (capacity >= arc_c) use the persistent marker approach for optimal CPU efficiency. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
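The idea of resuming from a persistent marker instead of rescanning from the tail can be sketched as follows. This is a simplified model under stated assumptions: the sublist is an array of flags, `scan_marker_t` and `scan_from_marker()` are hypothetical names, and the wrap-around at the end of the array stands in for whatever reset policy the actual patch uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical marker: remembers where the previous scan of a sublist stopped. */
typedef struct scan_marker {
	size_t next;	/* index of the next entry to examine */
} scan_marker_t;

/*
 * Examine up to `budget` entries of one sublist, resuming at the marker
 * rather than at the tail. eligible[i] != 0 means entry i could be written
 * to L2ARC. Returns the number of eligible entries found.
 */
static size_t
scan_from_marker(const int *eligible, size_t len, scan_marker_t *m,
    size_t budget)
{
	size_t found = 0;

	assert(len > 0 && m->next < len);
	for (size_t i = 0; i < budget; i++) {
		if (eligible[m->next])
			found++;
		/* Advance the marker; wrap to the start after the last entry. */
		m->next = (m->next + 1) % len;
	}
	return (found);
}
```

Because the marker survives between calls, two consecutive scans with a budget of three entries cover six distinct entries, whereas a scan that always restarts from the same end would examine the same three entries twice.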
With multiple L2ARC devices, headers can be destroyed asynchronously (e.g., during zpool sync) while L2_WRITING is set. The original code destroyed L2HDR before L1HDR, causing ABDs to lose their device association (b_l2hdr.b_dev) when arc_hdr_free_abd() is called. This caused ABDs to be added to the global free-on-write list without device information. When any L2ARC device completed its write and attempted to free these orphaned ABDs, it would panic on ASSERT(!list_link_active(&abd->abd_gang_link)) because the ABD was still part of another device's vdev_queue I/O aggregation gang. Fix by extending l2ad_mtx lock scope to cover L1HDR destruction and reordering to destroy L1HDR before L2HDR when L2_WRITING is set. This ensures arc_hdr_free_abd() can access b_l2hdr.b_dev to properly tag ABDs with their device for deferred cleanup. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
When arc_release() is called on a header with a single buffer and L2_WRITING set, the L2HDR must be preserved for ABD cleanup (similar to the arc_hdr_destroy() case). If we destroy the L2HDR here, later arc_write() will allocate a new ABD and call arc_hdr_free_abd(), which needs b_l2hdr.b_dev to properly defer ABD cleanup, causing VERIFY(HDR_HAS_L2HDR(hdr)) to fail. Allocate a new header for the buffer in the single_buf_l2writing case (single buffer + L2_WRITING), leaving the original header with L2HDR intact. The original header becomes an "orphan" (no buffers, no b_pabd) but retains device association for ABD cleanup when l2arc_write_done() completes. The shared buffer case (HDR_SHARED_DATA) is excluded because L2ARC makes its own transformed copy via l2arc_apply_transforms(), so the original ABD is not used by the L2 write. The header can be safely reused without allocating a new one. For proper evictable space accounting, arc_buf_remove() must be called before remove_reference() in the single_buf_l2writing path. This ensures arc_evictable_space_increment() (during remove_reference) and arc_evictable_space_decrement() (during destruction) see the same state (b_buf=NULL), preventing accounting leaks that cause module unload to hang with non-zero esize. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Transform L2ARC from single global feed thread to per-device threads, enabling parallel writes to multiple L2ARC devices. Each device runs its own feed thread independently, improving multi-device throughput. Previously, a single thread served all devices sequentially; now each device writes concurrently. Threads are created during device addition and torn down on removal. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Add DWPD (Drive Writes Per Day) rate limiting to control L2ARC write speeds and protect SSD endurance. Write rate is constrained by the minimum of l2arc_write_max and DWPD-calculated budget. Devices accumulate unused write budget over 24-hour periods with automatic reset and carry-over. Writes occur in controlled bursts (max 50MB) with adaptive intervals to achieve target rates. Applies after initial device fill. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
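The arithmetic behind the rate limit can be sketched roughly as below. The function names are hypothetical, the percentage unit follows the discussion earlier in this thread (100 = 1.0 DWPD), and treating 0 as "unlimited" is an assumption drawn from the test descriptions. Budget carry-over between 24-hour periods is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define	SECONDS_PER_DAY	86400ULL
#define	BURST_MAX	(50ULL << 20)	/* "max 50MB" from the commit message */

/* Daily write budget in bytes; dwpd_pct is a percentage (100 = 1.0 DWPD). */
static uint64_t
dwpd_daily_budget(uint64_t capacity, uint64_t dwpd_pct)
{
	/* Divide first to avoid overflow; loses under 100 bytes of precision. */
	return (capacity / 100 * dwpd_pct);
}

/* Sustained rate in bytes/second: the minimum of write_max and the DWPD rate. */
static uint64_t
dwpd_write_rate(uint64_t capacity, uint64_t dwpd_pct, uint64_t write_max)
{
	uint64_t rate;

	if (dwpd_pct == 0)	/* assumption: 0 disables the DWPD limit */
		return (write_max);
	rate = dwpd_daily_budget(capacity, dwpd_pct) / SECONDS_PER_DAY;
	return (rate < write_max ? rate : write_max);
}

/* Milliseconds to wait between bursts of `burst` bytes to average `rate`. */
static uint64_t
burst_interval_ms(uint64_t burst, uint64_t rate)
{
	assert(rate > 0 && burst <= BURST_MAX);
	return (burst * 1000 / rate);
}
```

For example, a device of 8,640,000,000 bytes at a limit of 100 gets a budget of one full capacity per day, which works out to 100,000 bytes per second. Writing that in 50 MiB bursts means one burst roughly every 524 seconds, which is why the interval has to adapt to the configured rate.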
Add l2arc_dwpd_limit, remove l2arc_write_boost, update related tunables. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
|
Updated as per feedback provided by @amotin in private and rebased onto master. |
Add four new functional tests to validate L2ARC DWPD rate limiting and parallel write features:
- l2arc_dwpd_ratelimit_pos: Verifies DWPD rate limiting with different values (0, 100, 1000, 10000) and ordering
- l2arc_dwpd_reimport_pos: Verifies DWPD rate limiting persists after pool export/import
- l2arc_multidev_scaling_pos: Verifies parallel write scaling ratio (dual devices achieve ~2× single device throughput)
- l2arc_multidev_throughput_pos: Verifies absolute parallel write throughput scales with device count (~32MB/s per device)
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
These fields became unused when ABD was introduced in a6255b7. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
amotin
left a comment
Looks good to me.
Motivation and Context
The L2ARC write throttling algorithm was designed years ago and has several limitations in modern environments:
- l2arc_write_boost was designed before persistent L2ARC existed. It tied write rate acceleration to ARC warmup state, but ARC being cold doesn't mean L2ARC should write faster if it's full, and ARC being warm doesn't mean L2ARC should slow down if it's empty.
Description
- … (capacity >= arc_c_max/2) use persistent markers; small devices retain original HEAD/TAIL behavior based on ARC warmup state.
- … arc_hdr_destroy() to destroy L1HDR before L2HDR, allowing proper ABD deferral when L2_WRITING flag is set. Prevents gang ABD panics with multiple L2ARC devices due to I/O aggregation reusing ABDs still in gang chains.
- … l2arc_dwpd_limit tunable (default 1) controls writes per day per device capacity. Removes l2arc_write_boost and l2arc_write_interval(). Features include:
- … device size × DWPD limit.
- … L2ARC_BURST_SIZE_MAX]) with adaptive intervals to achieve target rates.
- … arc_c_max/2).
How Has This Been Tested?
- … l2arc_dwpd_pos, l2arc_dwpd_multidev_pos, l2arc_dwpd_persist_pos, l2arc_parallel_writes_pos.
Types of changes
Checklist:
Signed-off-by.