Conversation

Contributor

@hazzlim hazzlim commented Dec 8, 2025

Implement the _Sorting namespace algorithms using Neon, and enable _VECTORIZED_MINMAX_ELEMENT on ARM64 targets.

@hazzlim hazzlim requested a review from a team as a code owner December 8, 2025 17:36
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Dec 8, 2025
Contributor Author

hazzlim commented Dec 8, 2025

I have only enabled _VECTORIZED_MINMAX_ELEMENT in the first instance, as it seemed sensible to enable the other _Sorting algorithms in separate PRs.

This PR does not vectorize (u)int64_t on ARM64, as the vectorized version was not faster than the scalar code.

The benchmark results are below:

| Name | MSVC Speedup | Clang Speedup |
| --- | --- | --- |
| bm<uint8_t, Op::Min>/8021 | 24.735 | 9.268 |
| bm<uint8_t, Op::Min>/63 | 5.182 | 2.995 |
| bm<uint8_t, Op::Max>/8021 | 24.695 | 9.561 |
| bm<uint8_t, Op::Max>/63 | 4.896 | 2.976 |
| bm<uint8_t, Op::Both>/8021 | 19.184 | 7.811 |
| bm<uint8_t, Op::Both>/63 | 1.977 | 1.841 |
| bm<uint16_t, Op::Min>/8021 | 12.053 | 4.524 |
| bm<uint16_t, Op::Min>/31 | 3.052 | 2.089 |
| bm<uint16_t, Op::Max>/8021 | 11.808 | 4.756 |
| bm<uint16_t, Op::Max>/31 | 2.933 | 2.047 |
| bm<uint16_t, Op::Both>/8021 | 5.426 | 4.052 |
| bm<uint16_t, Op::Both>/31 | 1.413 | 1.521 |
| bm<uint32_t, Op::Min>/8021 | 6.133 | 1.908 |
| bm<uint32_t, Op::Min>/15 | 1.544 | 1.094 |
| bm<uint32_t, Op::Max>/8021 | 6.074 | 1.92 |
| bm<uint32_t, Op::Max>/15 | 1.53 | 1.132 |
| bm<uint32_t, Op::Both>/8021 | 3.146 | 2.877 |
| bm<uint32_t, Op::Both>/15 | 0.869 | 1.195 |
| bm<int8_t, Op::Min>/8021 | 24.735 | 9.211 |
| bm<int8_t, Op::Min>/63 | 5.222 | 2.778 |
| bm<int8_t, Op::Max>/8021 | 25.244 | 9.286 |
| bm<int8_t, Op::Max>/63 | 5.417 | 2.889 |
| bm<int8_t, Op::Both>/8021 | 11.538 | 11.25 |
| bm<int8_t, Op::Both>/63 | 1.989 | 1.76 |
| bm<int16_t, Op::Min>/8021 | 11.953 | 4.667 |
| bm<int16_t, Op::Min>/31 | 3.029 | 1.872 |
| bm<int16_t, Op::Max>/8021 | 11.808 | 4.571 |
| bm<int16_t, Op::Max>/31 | 3.123 | 1.882 |
| bm<int16_t, Op::Both>/8021 | 6.582 | 5.729 |
| bm<int16_t, Op::Both>/31 | 1.414 | 1.541 |
| bm<int32_t, Op::Min>/8021 | 6.25 | 1.88 |
| bm<int32_t, Op::Min>/15 | 1.6 | 1.135 |
| bm<int32_t, Op::Max>/8021 | 6.133 | 1.867 |
| bm<int32_t, Op::Max>/15 | 1.674 | 1.094 |
| bm<int32_t, Op::Both>/8021 | 3.222 | 1.784 |
| bm<int32_t, Op::Both>/15 | 0.877 | 0.903 |
| bm<float, Op::Min>/8021 | 8.928 | 4.364 |
| bm<float, Op::Min>/15 | 1.87 | 1.358 |
| bm<float, Op::Max>/8021 | 9.111 | 4.267 |
| bm<float, Op::Max>/15 | 2.062 | 1.371 |
| bm<float, Op::Both>/8021 | 5.227 | 1.626 |
| bm<float, Op::Both>/15 | 0.913 | 0.7 |
| bm<double, Op::Min>/8021 | 4.426 | 2.029 |
| bm<double, Op::Min>/7 | 0.929 | 0.731 |
| bm<double, Op::Max>/8021 | 4.563 | 2.133 |
| bm<double, Op::Max>/7 | 0.977 | 0.725 |
| bm<double, Op::Both>/8021 | 2.583 | 0.786 |
| bm<double, Op::Both>/7 | 0.445 | 0.402 |

Contributor

@AlexGuteniev AlexGuteniev left a comment

I'm not a maintainer, so I'm not completely confident in all these suggestions, but still confident enough to give them.

};

#ifdef _M_ARM64
struct _Traits_8_neon : _Traits_8_base, _Traits_neon_base {
Contributor

I think we should not be providing _Traits_8_neon, and also should not provide the _8 functions.

When/if minmax or is_sorted_until are vectorized, the _8 traits can be added with only the functions needed there.

We should strive to avoid dead code.

Contributor Author

I've removed _Traits_8_neon, and guarded the definitions of the _8 functions with #ifndef _M_ARM64.

Related: I also realized it didn't make sense to define minmax and is_sorted_until for the time being, so I have guarded those too.

I've left declarations of the _8 functions as-is in xutility / algorithm, as I figured we don't mind unused and undefined declarations - but let me know if we want to wrap those in guards also!

Member

I think it's confusing to have declarations of functions that are never defined.

@StephanTLavavej StephanTLavavej added performance Must go faster ARM64 Related to the ARM64 architecture labels Dec 8, 2025
@github-project-automation github-project-automation bot moved this from Initial Review to Work In Progress in STL Code Reviews Dec 8, 2025
Contributor Author

hazzlim commented Dec 9, 2025

Aha, I see that VSO_0000000_vector_algorithms_floats is failing... I will take a look there.

(Stupidly, I didn't realize that floating-point functions were not exercised under the VSO_0000000_vector_algorithms tests, sorry!)

const auto _V_pos = _Traits::_Get_v_pos(_Idx_min);
#else
const auto _V_pos = _Traits::_Get_v_pos(_Cur_idx_min, _H_pos);
#endif
Contributor

I think we can simplify and always use _Idx_min.

Not sure if we need to do this here or as a follow up.

Contributor

Ditto below.

Contributor Author

Happy to change here if we think it makes more sense than doing it separately?

Contributor

I think let's do this here.

Contributor Author

hazzlim commented Dec 9, 2025

> Aha, I see that VSO_0000000_vector_algorithms_floats is failing... I will take a look there.
>
> (Stupidly, I didn't realize that floating-point functions were not exercised under the VSO_0000000_vector_algorithms tests, sorry!)

Should be fixed - it's a shame we don't have -flax-vector-conversions=false for Neon on MSVC 😢
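(For context, a small illustration of the kind of mismatch that bites here - hypothetical code, not from the diff: under MSVC every Neon vector type shares the same underlying __n128 representation, so returning a value of the "wrong" element type compiles silently, while Clang in its strict vector-conversion mode requires the reinterpret to be spelled out.)

```cpp
#include <arm_neon.h>

uint16x8_t _Example_mask(const uint8x16_t _Left, const uint8x16_t _Right) {
    const uint8x16_t _Eq = vceqq_u8(_Left, _Right); // byte-wise equality mask
    // MSVC accepts `return _Eq;` here; Clang's strict mode needs the explicit reinterpret:
    return vreinterpretq_u16_u8(_Eq);
}
```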

static unsigned long _Get_first_h_pos(unsigned long _Mask) {
    unsigned long _H_pos;
    // CodeQL [SM02313] _H_pos is always initialized: element exists, so _Mask != 0.
    _BitScanForward(&_H_pos, _Mask);
Contributor

This can be _tzcnt_u32.
We assume that AVX2 implies BMI and BMI2.

I decided not to bother for this uncommon code path back when I added AVX2 here, but since we have to change it now anyway, we can take advantage of it.

Note that SSE should stay _BitScanForward, with SSE4.2 we only assume popcnt from bit manipulations.
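For reference, a minimal sketch of that suggestion (assuming we are fine relying on BMI in the AVX2 traits, as stated above) - it also makes the CodeQL suppression unnecessary, since _tzcnt_u32 is well-defined even for a zero input:

```cpp
static unsigned long _Get_first_h_pos(const unsigned long _Mask) {
    return _tzcnt_u32(_Mask); // trailing-zero count == index of the lowest set bit
}
```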

static unsigned long _Get_last_h_pos(unsigned long _Mask) {
    unsigned long _H_pos;
    // CodeQL [SM02313] _H_pos is always initialized: element exists, so _Mask != 0.
    _BitScanReverse(&_H_pos, _Mask);
Contributor

@AlexGuteniev AlexGuteniev Dec 9, 2025

And this can be 31 - _lzcnt_u32.
We assume that AVX2 implies BMI and BMI2.

And we could bring `_H_pos -= sizeof(_Cur_max_val) - 1; // Correct from highest val bit to lowest` inside _Get_last_h_pos, so that for _lzcnt_u32 and for ARM64 the negations would cancel out... no, on second thought, this one isn't good.
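A corresponding minimal sketch (same BMI assumption; note that _lzcnt_u32(0) would return 32, but the caller guarantees _Mask != 0 here):

```cpp
static unsigned long _Get_last_h_pos(const unsigned long _Mask) {
    return 31 - _lzcnt_u32(_Mask); // leading-zero count converted to the index of the highest set bit
}
```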

@StephanTLavavej StephanTLavavej moved this from Work In Progress to Initial Review in STL Code Reviews Dec 9, 2025
@StephanTLavavej StephanTLavavej self-assigned this Dec 10, 2025
Contributor

@AlexGuteniev AlexGuteniev left a comment

Now it looks good to me, though I haven't looked up what these intrinsics do.
Let's spam even more const though.

Comment on lines 2450 to 2451
uint64x2_t _Swapped = vextq_u64(_Cur_u, _Cur_u, 1);
uint64x2_t _Mask_lt = vcltq_u64(_Swapped, _Cur_u);
Contributor

const maybe.

We generally try to add const for such local variables, to make it clear that they are not modified, so that the non-const ones stand out.
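In other words, just (the same two lines, const-qualified):

```cpp
const uint64x2_t _Swapped = vextq_u64(_Cur_u, _Cur_u, 1);
const uint64x2_t _Mask_lt = vcltq_u64(_Swapped, _Cur_u);
```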

Comment on lines 2457 to 2458
uint64x2_t _Swapped = vextq_u64(_Cur_u, _Cur_u, 1);
uint64x2_t _Mask_gt = vcgtq_u64(_Swapped, _Cur_u);
Contributor

And const here


- // CodeQL [SM02313] _H_pos is always initialized: element exists, so _Mask != 0.
- _BitScanForward(&_H_pos, _Mask);
+ unsigned long _H_pos = _Traits::_Get_first_h_pos(_Mask);
Contributor

@AlexGuteniev AlexGuteniev Dec 10, 2025

This wasn't const because of the inconvenient _BitScanForward, but now that it is not used right here, we can add const.


- // CodeQL [SM02313] _H_pos is always initialized: we just tested `if (_Mask != 0)`.
- _BitScanForward(&_H_pos, _Mask);
+ unsigned long _H_pos = _Traits::_Get_first_h_pos(_Mask);
Contributor

@AlexGuteniev AlexGuteniev Dec 10, 2025

This wasn't const because of the inconvenient _BitScanForward, but now that it is not used right here, we can add const.


  const auto _Is_less = _Traits::_Cmp_gt(_Right, _Left);
- unsigned long _Mask = _Traits::_Mask(_Traits::_Mask_cast(_Is_less));
+ auto _Mask = _Traits::_Mask(_Traits::_Mask_cast(_Is_less));
Contributor

And this is pre-existing; there should have been const even before this change.
Presumably it was copied from another occurrence, where _Mask is potentially modified.


- // CodeQL [SM02313] _H_pos is always initialized: we just tested `if (_Mask != 0)`.
- _BitScanForward(&_H_pos, _Mask);
+ unsigned long _H_pos = _Traits::_Get_first_h_pos(_Mask);
Contributor

ditto const

  const auto _Is_less = _Traits::_Cmp_gt(_Right, _Left);
- unsigned long _Mask =
-     _Traits::_Mask(_mm256_and_si256(_Traits::_Mask_cast(_Is_less), _Tail_mask));
+ auto _Mask = _Traits::_Mask(_mm256_and_si256(_Traits::_Mask_cast(_Is_less), _Tail_mask));
Contributor

Ditto const

Comment on lines 3286 to 3290
#ifdef _M_ARM64
    if (_Byte_length(_First, _Last) >= 16) {
        return _Minmax_impl<_Mode, typename _Traits::_Neon, _Sign>(_First, _Last);
    }
#elif !defined(_M_ARM64EC)
Contributor

I also observe that providing ARM64 paths in the minmax and is_sorted_until dispatches looks premature, as this PR does not try to enable them. But I don't see any problem with that, as the functions seen by the linker, like __std_minmax_1 for ARM64, are not provided.

Contributor Author

Ah yes, good point - I have removed these all the same and added macro guards around the minmax and is_sorted_until dispatches; I agree these should get added later. I think I was originally trying to reduce the number of macro guards, but there are already quite a few now!
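(Roughly the following shape, purely illustrative - the exact guard placement is in the updated diff; the point is that ARM64 builds see no minmax / is_sorted_until dispatch until the corresponding __std_* exports exist:)

```cpp
#ifndef _M_ARM64 // ARM64 dispatch for minmax and is_sorted_until to be added in a follow-up PR,
                 // together with the Neon implementations and the exports they call into.
    // ... existing x86/x64 dispatch, unchanged ...
#endif // ^^^ !defined(_M_ARM64) ^^^
```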

Contributor Author

hazzlim commented Dec 10, 2025

> Now it looks good to me, though I haven't looked up what these intrinsics do. Let's spam even more const though.

Nice - I should have added all of these const qualifiers :)

Contributor

AlexGuteniev commented Dec 11, 2025

Curious how Clang gets only a modest speedup, but still gets a speedup.
Does Clang auto-vectorize somehow? Does MSVC do something dumb here?

> | Name | MSVC Speedup | Clang Speedup |
> | --- | --- | --- |
> | bm<uint8_t, Op::Min>/8021 | 24.735 | 9.268 |

Contributor Author

hazzlim commented Dec 11, 2025

> Curious how Clang gets only a modest speedup, but still gets a speedup. Does Clang auto-vectorize somehow? Does MSVC do something dumb here?
>
> | Name | MSVC Speedup | Clang Speedup |
> | --- | --- | --- |
> | bm<uint8_t, Op::Min>/8021 | 24.735 | 9.268 |

Clang does not auto-vectorize; both are scalar code. But Clang keeps the current minimum in a register, whereas MSVC reloads it every iteration of the main loop. The extra load on the critical path makes MSVC a lot slower.
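To illustrate (a representative scalar loop, not the actual library code): the codegen difference is whether the current minimum value (`*_Res` below) stays in a register across iterations or is reloaded from memory on every trip, adding a load to the critical path.

```cpp
// Precondition: _First != _Last. Shaped like the classic min_element loop.
const unsigned char* _Min_element_scalar(const unsigned char* _First, const unsigned char* _Last) {
    const unsigned char* _Res = _First;
    for (const unsigned char* _Ptr = _First + 1; _Ptr != _Last; ++_Ptr) {
        if (*_Ptr < *_Res) { // Clang hoists *_Res into a register across iterations;
                             // MSVC reloads it from memory here on every trip.
            _Res = _Ptr;
        }
    }
    return _Res;
}
```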

@AlexGuteniev
Contributor

> MSVC reloads it every iteration of the main loop

Oh, the same problem it has on x86 and x64 too!

May be worth reporting on DevCom though, as this occurrence causes a ridiculous slowdown.

Contributor Author

hazzlim commented Dec 11, 2025

> > MSVC reloads it every iteration of the main loop
>
> Oh, the same problem it has on x86 and x64 too!
>
> May be worth reporting on DevCom though, as this occurrence causes a ridiculous slowdown.

Sure, I will open a ticket on DevCom :)
