Conversation

Contributor

@hazzlim hazzlim commented Dec 8, 2025

Implement the _Sorting namespace algorithms using Neon, and enable _VECTORIZED_MINMAX_ELEMENT on ARM64 targets.

@hazzlim hazzlim requested a review from a team as a code owner December 8, 2025 17:36
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Dec 8, 2025
Contributor Author

hazzlim commented Dec 8, 2025

I have only enabled _VECTORIZED_MINMAX_ELEMENT in the first instance, as it seemed sensible to enable the other _Sorting algorithms in separate PRs.

This PR does not vectorize (u)int64_t on ARM64, as the vectorized version was not faster than the scalar code.

The benchmark results are below:

| Name | MSVC Speedup | Clang Speedup |
| --- | --- | --- |
| bm<uint8_t, Op::Min>/8021 | 24.735 | 9.268 |
| bm<uint8_t, Op::Min>/63 | 5.182 | 2.995 |
| bm<uint8_t, Op::Max>/8021 | 24.695 | 9.561 |
| bm<uint8_t, Op::Max>/63 | 4.896 | 2.976 |
| bm<uint8_t, Op::Both>/8021 | 19.184 | 7.811 |
| bm<uint8_t, Op::Both>/63 | 1.977 | 1.841 |
| bm<uint16_t, Op::Min>/8021 | 12.053 | 4.524 |
| bm<uint16_t, Op::Min>/31 | 3.052 | 2.089 |
| bm<uint16_t, Op::Max>/8021 | 11.808 | 4.756 |
| bm<uint16_t, Op::Max>/31 | 2.933 | 2.047 |
| bm<uint16_t, Op::Both>/8021 | 5.426 | 4.052 |
| bm<uint16_t, Op::Both>/31 | 1.413 | 1.521 |
| bm<uint32_t, Op::Min>/8021 | 6.133 | 1.908 |
| bm<uint32_t, Op::Min>/15 | 1.544 | 1.094 |
| bm<uint32_t, Op::Max>/8021 | 6.074 | 1.92 |
| bm<uint32_t, Op::Max>/15 | 1.53 | 1.132 |
| bm<uint32_t, Op::Both>/8021 | 3.146 | 2.877 |
| bm<uint32_t, Op::Both>/15 | 0.869 | 1.195 |
| bm<int8_t, Op::Min>/8021 | 24.735 | 9.211 |
| bm<int8_t, Op::Min>/63 | 5.222 | 2.778 |
| bm<int8_t, Op::Max>/8021 | 25.244 | 9.286 |
| bm<int8_t, Op::Max>/63 | 5.417 | 2.889 |
| bm<int8_t, Op::Both>/8021 | 11.538 | 11.25 |
| bm<int8_t, Op::Both>/63 | 1.989 | 1.76 |
| bm<int16_t, Op::Min>/8021 | 11.953 | 4.667 |
| bm<int16_t, Op::Min>/31 | 3.029 | 1.872 |
| bm<int16_t, Op::Max>/8021 | 11.808 | 4.571 |
| bm<int16_t, Op::Max>/31 | 3.123 | 1.882 |
| bm<int16_t, Op::Both>/8021 | 6.582 | 5.729 |
| bm<int16_t, Op::Both>/31 | 1.414 | 1.541 |
| bm<int32_t, Op::Min>/8021 | 6.25 | 1.88 |
| bm<int32_t, Op::Min>/15 | 1.6 | 1.135 |
| bm<int32_t, Op::Max>/8021 | 6.133 | 1.867 |
| bm<int32_t, Op::Max>/15 | 1.674 | 1.094 |
| bm<int32_t, Op::Both>/8021 | 3.222 | 1.784 |
| bm<int32_t, Op::Both>/15 | 0.877 | 0.903 |
| bm<float, Op::Min>/8021 | 8.928 | 4.364 |
| bm<float, Op::Min>/15 | 1.87 | 1.358 |
| bm<float, Op::Max>/8021 | 9.111 | 4.267 |
| bm<float, Op::Max>/15 | 2.062 | 1.371 |
| bm<float, Op::Both>/8021 | 5.227 | 1.626 |
| bm<float, Op::Both>/15 | 0.913 | 0.7 |
| bm<double, Op::Min>/8021 | 4.426 | 2.029 |
| bm<double, Op::Min>/7 | 0.929 | 0.731 |
| bm<double, Op::Max>/8021 | 4.563 | 2.133 |
| bm<double, Op::Max>/7 | 0.977 | 0.725 |
| bm<double, Op::Both>/8021 | 2.583 | 0.786 |
| bm<double, Op::Both>/7 | 0.445 | 0.402 |

Contributor

@AlexGuteniev AlexGuteniev left a comment

I'm not a maintainer, so I'm not completely confident in all these suggestions, but still confident enough to give them.

};

#ifdef _M_ARM64
struct _Traits_8_neon : _Traits_8_base, _Traits_neon_base {
Contributor

I think we should not be providing _Traits_8_neon, and also should not provide the _8 functions.

When/if minmax or is_sorted_until are vectorized, the _8 traits can be added with only the functions needed there.

We should strive to avoid dead code.

Contributor Author

I've removed _Traits_8_neon, and guarded the definitions of the _8 functions with #ifndef _M_ARM64.

Related: I also realized it didn't make sense to define minmax and is_sorted_until for the time being, so I have guarded those too.

I've left declarations of the _8 functions as-is in xutility / algorithm, as I figured we don't mind unused and undefined declarations - but let me know if we want to wrap those in guards also!

Member

I think it's confusing to have declarations of functions that are never defined.

@StephanTLavavej StephanTLavavej added performance Must go faster ARM64 Related to the ARM64 architecture labels Dec 8, 2025
@github-project-automation github-project-automation bot moved this from Initial Review to Work In Progress in STL Code Reviews Dec 8, 2025
Contributor Author

hazzlim commented Dec 9, 2025

Aha, I see that VSO_0000000_vector_algorithms_floats is failing... I will take a look there.

(Stupidly, I didn't realize that floating-point functions were not exercised under the VSO_0000000_vector_algorithms tests, sorry!)

const auto _V_pos = _Traits::_Get_v_pos(_Idx_min);
#else
const auto _V_pos = _Traits::_Get_v_pos(_Cur_idx_min, _H_pos);
#endif
Contributor

I think we can simplify and always use _Idx_min.

Not sure if we need to do this here or as a follow up.

Contributor

Ditto below.

Contributor Author

Happy to change here if we think it makes more sense than doing it separately?

Contributor

I think let's do this here.

Contributor Author

hazzlim commented Dec 9, 2025

> Aha, I see that VSO_0000000_vector_algorithms_floats is failing... I will take a look there.
>
> (Stupidly, I didn't realize that floating-point functions were not exercised under the VSO_0000000_vector_algorithms tests, sorry!)

Should be fixed - it's a shame we don't have -flax-vector-conversions=false for Neon on MSVC 😢
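(For context, a small illustration of the kind of mismatch that bites here - hypothetical code, not from the diff: under MSVC every Neon vector type shares the same underlying __n128 representation, so returning a value of the "wrong" element type compiles silently, while Clang in its strict vector-conversion mode requires the reinterpret to be spelled out.)

```cpp
#include <arm_neon.h>

uint16x8_t _Example_mask(const uint8x16_t _Left, const uint8x16_t _Right) {
    const uint8x16_t _Eq = vceqq_u8(_Left, _Right); // byte-wise equality mask
    // MSVC accepts `return _Eq;` here; Clang's strict mode needs the explicit reinterpret:
    return vreinterpretq_u16_u8(_Eq);
}
```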

static unsigned long _Get_first_h_pos(unsigned long _Mask) {
    unsigned long _H_pos;
    // CodeQL [SM02313] _H_pos is always initialized: element exists, so _Mask != 0.
    _BitScanForward(&_H_pos, _Mask);
Contributor

This can be _tzcnt_u32.
We assume that AVX2 implies BMI and BMI2.

I decided not to bother for this uncommon code path back when I added AVX2 here, but since we have to change it now anyway, we can take advantage of it.

Note that SSE should stay _BitScanForward, with SSE4.2 we only assume popcnt from bit manipulations.
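For reference, a minimal sketch of that suggestion (assuming we are fine relying on BMI in the AVX2 traits, as stated above) - it also makes the CodeQL suppression unnecessary, since _tzcnt_u32 is well-defined even for a zero input:

```cpp
static unsigned long _Get_first_h_pos(const unsigned long _Mask) {
    return _tzcnt_u32(_Mask); // trailing-zero count == index of the lowest set bit
}
```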

static unsigned long _Get_last_h_pos(unsigned long _Mask) {
    unsigned long _H_pos;
    // CodeQL [SM02313] _H_pos is always initialized: element exists, so _Mask != 0.
    _BitScanReverse(&_H_pos, _Mask);
Contributor

@AlexGuteniev AlexGuteniev Dec 9, 2025

And this can be 31 - _lzcnt_u32.
We assume that AVX2 implies BMI and BMI2.

And we could bring `_H_pos -= sizeof(_Cur_max_val) - 1; // Correct from highest val bit to lowest` inside _Get_last_h_pos, so that for _lzcnt_u32 and for ARM64 the negations would cancel out... no, on second thought, this one isn't good.
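A corresponding minimal sketch (same BMI assumption; note that _lzcnt_u32(0) would return 32, but the caller guarantees _Mask != 0 here):

```cpp
static unsigned long _Get_last_h_pos(const unsigned long _Mask) {
    return 31 - _lzcnt_u32(_Mask); // leading-zero count converted to the index of the highest set bit
}
```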

@StephanTLavavej StephanTLavavej moved this from Work In Progress to Initial Review in STL Code Reviews Dec 9, 2025
@StephanTLavavej StephanTLavavej self-assigned this Dec 10, 2025
Contributor

@AlexGuteniev AlexGuteniev left a comment

Now it looks good to me, though I haven't looked up what these intrinsics do.
Let's spam even more const though.

Comment on lines 2450 to 2451
uint64x2_t _Swapped = vextq_u64(_Cur_u, _Cur_u, 1);
uint64x2_t _Mask_lt = vcltq_u64(_Swapped, _Cur_u);
Contributor

const maybe.

We generally try to add const for such local variables, to make it clear that they are not modified, so that the non-const ones stand out.
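In other words, just (the same two lines, const-qualified):

```cpp
const uint64x2_t _Swapped = vextq_u64(_Cur_u, _Cur_u, 1);
const uint64x2_t _Mask_lt = vcltq_u64(_Swapped, _Cur_u);
```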

Comment on lines 2457 to 2458
uint64x2_t _Swapped = vextq_u64(_Cur_u, _Cur_u, 1);
uint64x2_t _Mask_gt = vcgtq_u64(_Swapped, _Cur_u);
Contributor

And const here


- // CodeQL [SM02313] _H_pos is always initialized: element exists, so _Mask != 0.
- _BitScanForward(&_H_pos, _Mask);
+ unsigned long _H_pos = _Traits::_Get_first_h_pos(_Mask);
Contributor

@AlexGuteniev AlexGuteniev Dec 10, 2025

This wasn't const because of the inconvenient _BitScanForward, but now that it is not used right here, we can add const.


- // CodeQL [SM02313] _H_pos is always initialized: we just tested `if (_Mask != 0)`.
- _BitScanForward(&_H_pos, _Mask);
+ unsigned long _H_pos = _Traits::_Get_first_h_pos(_Mask);
Contributor

@AlexGuteniev AlexGuteniev Dec 10, 2025

This wasn't const because of the inconvenient _BitScanForward, but now that it is not used right here, we can add const.


  const auto _Is_less = _Traits::_Cmp_gt(_Right, _Left);
- unsigned long _Mask = _Traits::_Mask(_Traits::_Mask_cast(_Is_less));
+ auto _Mask = _Traits::_Mask(_Traits::_Mask_cast(_Is_less));
Contributor

And this is pre-existing; there should have been const even before this change.
Presumably it was copied from another occurrence, where _Mask is potentially modified.


- // CodeQL [SM02313] _H_pos is always initialized: we just tested `if (_Mask != 0)`.
- _BitScanForward(&_H_pos, _Mask);
+ unsigned long _H_pos = _Traits::_Get_first_h_pos(_Mask);
Contributor

ditto const

  const auto _Is_less = _Traits::_Cmp_gt(_Right, _Left);
- unsigned long _Mask =
-     _Traits::_Mask(_mm256_and_si256(_Traits::_Mask_cast(_Is_less), _Tail_mask));
+ auto _Mask = _Traits::_Mask(_mm256_and_si256(_Traits::_Mask_cast(_Is_less), _Tail_mask));
Contributor

Ditto const

Comment on lines 3286 to 3290
#ifdef _M_ARM64
    if (_Byte_length(_First, _Last) >= 16) {
        return _Minmax_impl<_Mode, typename _Traits::_Neon, _Sign>(_First, _Last);
    }
#elif !defined(_M_ARM64EC)
Contributor

I also observe that providing ARM64 paths in the minmax and is_sorted_until dispatches looks premature, as this PR does not try to enable them. But I don't see any problem with that, as the functions seen by the linker, like __std_minmax_1 for ARM64, are not provided.

Contributor Author

Ah yes, good point - I have removed these all the same and added macro guards around the minmax and is_sorted_until dispatches; I agree these should get added later. I think I was originally trying to reduce the number of macro guards, but there are already quite a few now!
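(Roughly the following shape, purely illustrative - the exact guard placement is in the updated diff; the point is that ARM64 builds see no minmax / is_sorted_until dispatch until the corresponding __std_* exports exist:)

```cpp
#ifndef _M_ARM64 // ARM64 dispatch for minmax and is_sorted_until to be added in a follow-up PR,
                 // together with the Neon implementations and the exports they call into.
    // ... existing x86/x64 dispatch, unchanged ...
#endif // ^^^ !defined(_M_ARM64) ^^^
```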

Contributor Author

hazzlim commented Dec 10, 2025

> Now it looks good to me, though I haven't looked up what these intrinsics do. Let's spam even more const though.

Nice - I should have added all of these const qualifiers :)

Contributor

AlexGuteniev commented Dec 11, 2025

Curious how Clang gets only a modest speedup, but still gets a speedup.
Does Clang auto-vectorize somehow? Does MSVC do something dumb here?

> | Name | MSVC Speedup | Clang Speedup |
> | --- | --- | --- |
> | bm<uint8_t, Op::Min>/8021 | 24.735 | 9.268 |

Contributor Author

hazzlim commented Dec 11, 2025

> Curious how Clang gets only a modest speedup, but still gets a speedup. Does Clang auto-vectorize somehow? Does MSVC do something dumb here?
>
> | Name | MSVC Speedup | Clang Speedup |
> | --- | --- | --- |
> | bm<uint8_t, Op::Min>/8021 | 24.735 | 9.268 |

Clang does not auto-vectorize; both are scalar code. But Clang keeps the current minimum in a register, whereas MSVC reloads it every iteration of the main loop. The extra load on the critical path makes MSVC a lot slower.
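To illustrate (a representative scalar loop, not the actual library code): the codegen difference is whether the current minimum value (`*_Res` below) stays in a register across iterations or is reloaded from memory on every trip, adding a load to the critical path.

```cpp
// Precondition: _First != _Last. Shaped like the classic min_element loop.
const unsigned char* _Min_element_scalar(const unsigned char* _First, const unsigned char* _Last) {
    const unsigned char* _Res = _First;
    for (const unsigned char* _Ptr = _First + 1; _Ptr != _Last; ++_Ptr) {
        if (*_Ptr < *_Res) { // Clang hoists *_Res into a register across iterations;
                             // MSVC reloads it from memory here on every trip.
            _Res = _Ptr;
        }
    }
    return _Res;
}
```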

@AlexGuteniev
Contributor

> MSVC reloads it every iteration of the main loop

Oh, the same problem it has on x86 and x64 too!

May be worth reporting on DevCom though, as this occurrence causes a ridiculous slowdown.

Contributor Author

hazzlim commented Dec 11, 2025

> > MSVC reloads it every iteration of the main loop
>
> Oh, the same problem it has on x86 and x64 too!
>
> May be worth reporting on DevCom though, as this occurrence causes a ridiculous slowdown.

Sure, I will open a ticket on DevCom :)
