-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Enable vectorized minmax_element using Neon on ARM64 #5949
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Implement the namespace _Sorting algorithms using Neon, and enable _VECTORIZED_MINMAX on ARM64 targets.
|
I have only enabled _VECTORIZED_MINMAX_ELEMENT in the first instance, as it seemed to make some sense to enable the other _Sorting algorithms in separate PRs. This PR does not vectorize (u)int64_t on ARM64 as this was not faster than the scalar code. The benchmark results are below:
|
AlexGuteniev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a maintainer, so I'm not completely confident in all these suggestions, but still confident enough to give them.
stl/src/vector_algorithms.cpp
Outdated
| }; | ||
|
|
||
| #ifdef _M_ARM64 | ||
| struct _Traits_8_neon : _Traits_8_base, _Traits_neon_base { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should not be providing_Traits_8_neon, and also should not provide _8 functions.
When/if minmax or is_sorted_until are vectorized. 8 traits can be added with only function needed there.
Because we should strive to avoid dead code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've remove _Traits_8_neon, and guarded the definitions of the _8 functions with #ifndef _M_ARM64.
Related: I also realize it didn't make sense to define minmax and is_sorted_until for the time being, so I have guarded those too.
I've left declarations of the _8 functions as-is in xutility / algorithm, as I figured we don't mind unused and undefined declarations - but let me know if we want to wrap those in guards also!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's confusing to have declarations of functions that are never defined.
|
Aha I see that (Stupidly I didn't realize that floating point functions were not exercised under VSO_0000000_vector_algorithms tests, sorry!) |
stl/src/vector_algorithms.cpp
Outdated
| const auto _V_pos = _Traits::_Get_v_pos(_Idx_min); | ||
| #else | ||
| const auto _V_pos = _Traits::_Get_v_pos(_Cur_idx_min, _H_pos); | ||
| #endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can simplify and always use _Idx_min.
Not sure if we need to do this here or as a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to change here if we think it makes more sense than doing it separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think let's do this here.
Should be fixed - it's a shame we don't have |
stl/src/vector_algorithms.cpp
Outdated
| static unsigned long _Get_first_h_pos(unsigned long _Mask) { | ||
| unsigned long _H_pos; | ||
| // CodeQL [SM02313] _H_pos is always initialized: element exists, so _Mask != 0. | ||
| _BitScanForward(&_H_pos, _Mask); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be _tzcnt_u32.
We assume that AVX2 implies BMI and BMI2.
I decided not to bother for uncommon code path back when I added AVX2 here, but since we have to, we can take advantage of it.
Note that SSE should stay _BitScanForward, with SSE4.2 we only assume popcnt from bit manipulations.
stl/src/vector_algorithms.cpp
Outdated
| static unsigned long _Get_last_h_pos(unsigned long _Mask) { | ||
| unsigned long _H_pos; | ||
| // CodeQL [SM02313] _H_pos is always initialized: element exists, so _Mask != 0. | ||
| _BitScanReverse(&_H_pos, _Mask); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this can be 31 - _lzcnt_u32.
We assume that AVX2 implies BMI and BMI2.
And we can bring no, this one isn't good._H_pos -= sizeof(_Cur_max_val) - 1; // Correct from highest val bit to lowest inside _Get_last_h_pos, so that for _lzcnt_u32 and for ARM64 the negations would cancel out.
AlexGuteniev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it looks good to me, though I haven't looked up what these intrinsics do.
Let's spam even more const though.
stl/src/vector_algorithms.cpp
Outdated
| uint64x2_t _Swapped = vextq_u64(_Cur_u, _Cur_u, 1); | ||
| uint64x2_t _Mask_lt = vcltq_u64(_Swapped, _Cur_u); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const maybe.
We generally try to add const for local such variables to aid understanding that variables are not modified, so that non-const stand out.
stl/src/vector_algorithms.cpp
Outdated
| uint64x2_t _Swapped = vextq_u64(_Cur_u, _Cur_u, 1); | ||
| uint64x2_t _Mask_gt = vcgtq_u64(_Swapped, _Cur_u); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And const here
stl/src/vector_algorithms.cpp
Outdated
|
|
||
| // CodeQL [SM02313] _H_pos is always initialized: element exists, so _Mask != 0. | ||
| _BitScanForward(&_H_pos, _Mask); | ||
| unsigned long _H_pos = _Traits::_Get_first_h_pos(_Mask); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wasn't const because of the inconvenient _BitScanForward, but now it is not used right here, we can add const.
stl/src/vector_algorithms.cpp
Outdated
|
|
||
| // CodeQL [SM02313] _H_pos is always initialized: we just tested `if (_Mask != 0)`. | ||
| _BitScanForward(&_H_pos, _Mask); | ||
| unsigned long _H_pos = _Traits::_Get_first_h_pos(_Mask); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wasn't const because of the inconvenient _BitScanForward, but now it is not used right here, we can add const.
stl/src/vector_algorithms.cpp
Outdated
|
|
||
| const auto _Is_less = _Traits::_Cmp_gt(_Right, _Left); | ||
| unsigned long _Mask = _Traits::_Mask(_Traits::_Mask_cast(_Is_less)); | ||
| auto _Mask = _Traits::_Mask(_Traits::_Mask_cast(_Is_less)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this is pre-existing, there should have been const before this change.
Presumably, was copied from another occurrence this way, where _Mask is potentially modified.
stl/src/vector_algorithms.cpp
Outdated
|
|
||
| // CodeQL [SM02313] _H_pos is always initialized: we just tested `if (_Mask != 0)`. | ||
| _BitScanForward(&_H_pos, _Mask); | ||
| unsigned long _H_pos = _Traits::_Get_first_h_pos(_Mask); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto const
stl/src/vector_algorithms.cpp
Outdated
| const auto _Is_less = _Traits::_Cmp_gt(_Right, _Left); | ||
| unsigned long _Mask = | ||
| _Traits::_Mask(_mm256_and_si256(_Traits::_Mask_cast(_Is_less), _Tail_mask)); | ||
| auto _Mask = _Traits::_Mask(_mm256_and_si256(_Traits::_Mask_cast(_Is_less), _Tail_mask)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto const
stl/src/vector_algorithms.cpp
Outdated
| #ifdef _M_ARM64 | ||
| if (_Byte_length(_First, _Last) >= 16) { | ||
| return _Minmax_impl<_Mode, typename _Traits::_Neon, _Sign>(_First, _Last); | ||
| } | ||
| #elif !defined(_M_ARM64EC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also observe that providing ARM64 in minmax and is_sorted_until dispatches looks premature, as this PR does not try to enable them. But I don't see any problem with that, as the functions seen by the linker, like __std_minmax_1 for ARM64, are not provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes good point - I have removed these all the same and added macro guards around the minmax and is_sorted_until dispatches, I agree these should get added later. I think I was originally trying to reduce the number of macro guards, but there's already quite a few now!
Nice - should have added all of these const-qualifiers :) |
|
Curious how Clang gets only modest speedup, but still gets speedup.
|
Clang does not auto-vectorize, both are scalar code - but Clang keeps the current minimum in a register whereas MSVC reloads it every iteration of the main loop. The extra load on the critical path makes MSVC a lot slower. |
Oh, the same problem it has on x86 and x64 too! May worth reporting on DevCom though, as this occurrence causes ridiculous slowdown. |
Sure, I will open a ticket on DevCom :) |
Implement the namespace _Sorting algorithms using Neon, and enable _VECTORIZED_MINMAX_ELEMENT on ARM64 targets.