Add a fast path for _clone_dim_order #15815
Dr. CI (automated comment, updates every 15 minutes): artifacts and rendered test results are at hud.pytorch.org/pr/pytorch/executorch/15815.
1 new failure as of commit 421c6dc with merge base e774b77.
Branch updated: 3f1cb30 to 929d52b.
@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D86993338.
Branch updated: 929d52b to 421c6dc.
Note that the moshi failure is pre-existing.
Gasoonjia left a comment:
LGTM!
Summary
Add a direct memcpy fast path for the portable _clone_dim_order op, which can be a performance bottleneck. I'd like to optimize these ops out of the graph more aggressively, but this fast path should significantly reduce their performance impact.
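For readers unfamiliar with the idea, here is a minimal standalone sketch of what a memcpy fast path for a dim-order clone can look like. It is not the code from this PR: `TensorView`, `strides_from_dim_order`, and `clone_dim_order_sketch` are hypothetical names, the element type is fixed to `float`, and the fast-path condition (input and output dim orders are identical) is an assumption about when a flat copy is valid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical minimal tensor view; NOT ExecuTorch's Tensor type.
struct TensorView {
  float* data;
  std::vector<int64_t> sizes;
  std::vector<uint8_t> dim_order;  // dimensions listed outermost to innermost
};

// Total number of elements described by the sizes.
inline std::size_t numel(const TensorView& t) {
  std::size_t n = 1;
  for (int64_t s : t.sizes) n *= static_cast<std::size_t>(s);
  return n;
}

// Strides (in elements) implied by the dim order: the last entry of
// dim_order is the fastest-varying dimension in memory.
inline std::vector<std::size_t> strides_from_dim_order(const TensorView& t) {
  std::vector<std::size_t> strides(t.sizes.size());
  std::size_t running = 1;
  for (std::size_t i = t.dim_order.size(); i-- > 0;) {
    const uint8_t d = t.dim_order[i];
    strides[d] = running;
    running *= static_cast<std::size_t>(t.sizes[d]);
  }
  return strides;
}

// Copies `in` into `out`, honoring each tensor's dim order.
// Returns true if the memcpy fast path was taken.
inline bool clone_dim_order_sketch(const TensorView& in, TensorView& out) {
  assert(in.sizes == out.sizes);

  // Fast path (assumed condition): identical dim orders mean identical
  // memory layouts, so a single flat copy is enough.
  if (in.dim_order == out.dim_order) {
    std::memcpy(out.data, in.data, numel(in) * sizeof(float));
    return true;
  }

  // Slow path: walk every logical index and remap it through both stride sets.
  const auto in_strides = strides_from_dim_order(in);
  const auto out_strides = strides_from_dim_order(out);
  std::vector<std::size_t> idx(in.sizes.size(), 0);
  const std::size_t n = numel(in);
  for (std::size_t linear = 0; linear < n; ++linear) {
    std::size_t in_off = 0, out_off = 0;
    for (std::size_t d = 0; d < idx.size(); ++d) {
      in_off += idx[d] * in_strides[d];
      out_off += idx[d] * out_strides[d];
    }
    out.data[out_off] = in.data[in_off];
    // Advance the logical index, last dimension fastest.
    for (std::size_t d = idx.size(); d-- > 0;) {
      if (++idx[d] < static_cast<std::size_t>(in.sizes[d])) break;
      idx[d] = 0;
    }
  }
  return false;
}
```

The slow path does per-element index arithmetic, which is the kind of work that gets much more expensive when the compiler does not optimize it; the fast path replaces all of it with one bulk copy.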
Test plan
Existing correctness tests for the _clone_dim_order implementation should cover this change.
For performance, I ran a quick test on an x86 server using a tensor of shape (1, 128, 256, 256) in the default dim order. This is intended as a quick smoke test, not a proper benchmark. I included numbers for both optimized and debug builds. Optimized builds matter more, but very long debug runs can be painful during development.
[Optimized Build]
Before: 27.9 ms
After: 6.4 ms
[Debug Build]
Before: 5947.01 ms
After: 7.2 ms
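A rough timing harness along these lines could be used for a similar smoke test. This is only a sketch, not the script used for the numbers above: `time_ms` and `time_contiguous_copy_ms` are hypothetical helpers, the `float` element type is an assumption (the dtype is not stated here), and it times a plain memcpy of a (1, 128, 256, 256) buffer rather than the op itself.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: average wall-clock milliseconds per call of `fn`
// over `iters` timed runs, after one untimed warm-up call.
template <typename Fn>
double time_ms(Fn&& fn, int iters) {
  fn();  // warm-up so first-touch page faults are not measured
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) fn();
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count() / iters;
}

// Element count of a (1, 128, 256, 256) tensor.
constexpr std::size_t kNumel = std::size_t{1} * 128 * 256 * 256;

// Times a flat copy of kNumel floats (float is an assumed dtype).
double time_contiguous_copy_ms(int iters) {
  std::vector<float> src(kNumel, 1.0f), dst(kNumel, 0.0f);
  return time_ms(
      [&] { std::memcpy(dst.data(), src.data(), kNumel * sizeof(float)); },
      iters);
}
```

Swapping the lambda body for a call into the op under test would give before/after numbers comparable in spirit to those listed; absolute values will depend on the machine and build flags, and an optimizing compiler may need the destination buffer to be read afterwards to keep the copy from being elided.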