-
Notifications
You must be signed in to change notification settings - Fork 822
[Executorch] make slice_copy parallel #15830
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Executorch] make slice_copy parallel #15830
Conversation
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15830
Note: Links to docs will display an error until the docs builds have been completed. ❌ 2 New FailuresAs of commit 4c5d92d with merge base 9cd8402 ( NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) ghstack-source-id: 323355686 Pull Request resolved: #15830
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request adds parallel processing to the slice_copy operation in ExecutorTorch to improve performance during large prefills in LLMs, where slice_copy can take 5-10% of execution time (primarily from rope implementation slicing).
Key Changes:
- Added multithreading support to
compute_slicefunction with workload-based thresholds - Parallel execution distributes work across leading dimensions using
parallel_for - Single-threaded fallback maintained for smaller workloads
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| kernels/portable/cpu/util/targets.bzl | Adds threadpool dependency required for parallel execution support |
| kernels/portable/cpu/util/slice_util.cpp | Implements parallel slice_copy with multithreading when leading_dims ≥ 8 and total_elements ≥ 32768, maintaining single-threaded fallback for smaller workloads |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (use_multithreading) { | ||
| // Use parallel_for to distribute work across leading dimensions | ||
| // Calculate grain size based on number of elements per leading dimension | ||
| const int64_t elements_per_leading_dim = length * trailing_dims; |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable elements_per_leading_dim is calculated but never used. It appears this was intended for grain size calculation but MIN_LEADING_DIMS_FOR_MT is used instead. Consider removing this unused variable or using it to calculate a more dynamic grain size based on workload characteristics.
| const int64_t elements_per_leading_dim = length * trailing_dims; |
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) ghstack-source-id: 324784683
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 324975499 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 325110790 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 325206643 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 325251276 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 325262631 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 326932760 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 326986639 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 327095994 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 327095994 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 327106882 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 327110500 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 327113306 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 327186826 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 327614758 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 327630241 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 327688163 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
1a73b51
into
gh/kimishpatel/214/base
Pull Request resolved: #15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 327688163 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
Pull Request resolved: pytorch#15830 When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. ghstack-source-id: 327688163 Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
Stack from ghstack (oldest at bottom):
When doing large prefills in LLMs, slice_copy takes about 5-10% time.
Mainly coming from slicing in the rope implementation.
Differential Revision: D85532081