Conversation

@hazzlim
Contributor

@hazzlim hazzlim commented Dec 8, 2025

Implement the namespace _Sorting algorithms using Neon, and enable _VECTORIZED_MINMAX_ELEMENT on ARM64 targets.
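For readers unfamiliar with the approach, here is a minimal sketch of a Neon-accelerated minimum search for `uint8_t`. It is not the code in this PR: the function name `sketch_min_element_u8` and the macro `SKETCH_HAS_NEON` are hypothetical, and the two-pass strategy (find the minimum value with `vminq_u8`/`vminvq_u8`, then locate its first occurrence with `std::find`) is a simplification chosen for brevity. On non-ARM64 targets the sketch falls back to a plain scalar loop.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

#if defined(__aarch64__) || defined(_M_ARM64)
#include <arm_neon.h>
#define SKETCH_HAS_NEON 1 // hypothetical macro, used only by this sketch
#else
#define SKETCH_HAS_NEON 0
#endif

// Hypothetical helper, not part of the STL: returns a pointer to the first
// smallest element in [first, last), or last if the range is empty.
inline const std::uint8_t* sketch_min_element_u8(
    const std::uint8_t* first, const std::uint8_t* last) {
    if (first == last) {
        return last;
    }

    std::uint8_t best      = *first;
    const std::uint8_t* p  = first;

#if SKETCH_HAS_NEON
    if (last - first >= 16) {
        // Keep a running lane-wise minimum over 16-byte blocks.
        uint8x16_t acc = vld1q_u8(p);
        p += 16;
        for (; last - p >= 16; p += 16) {
            acc = vminq_u8(acc, vld1q_u8(p));
        }
        // Horizontal reduction: smallest of the 16 lanes.
        best = vminvq_u8(acc);
    }
#endif

    // Scalar tail (or the whole range when Neon is unavailable).
    for (; p != last; ++p) {
        best = std::min(best, *p);
    }

    // Second pass: locate the first occurrence of the minimum value.
    return std::find(first, last, best);
}
```

The second pass keeps the sketch short at the cost of reading the range twice; an implementation that tracks positions alongside values in the vector loop would avoid that.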

@hazzlim hazzlim requested a review from a team as a code owner December 8, 2025 17:36
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Dec 8, 2025
@hazzlim
Contributor Author

hazzlim commented Dec 8, 2025

I have enabled only _VECTORIZED_MINMAX_ELEMENT for now, as it seemed sensible to enable the other _Sorting algorithms in separate PRs.

This PR does not vectorize (u)int64_t on ARM64, as the vectorized version was not faster than the scalar code.

The benchmark results are below:

| Name | MSVC Speedup | Clang Speedup |
|---|---|---|
| `bm<uint8_t, Op::Min>/8021` | 24.735 | 9.268 |
| `bm<uint8_t, Op::Min>/63` | 5.182 | 2.995 |
| `bm<uint8_t, Op::Max>/8021` | 24.695 | 9.561 |
| `bm<uint8_t, Op::Max>/63` | 4.896 | 2.976 |
| `bm<uint8_t, Op::Both>/8021` | 19.184 | 7.811 |
| `bm<uint8_t, Op::Both>/63` | 1.977 | 1.841 |
| `bm<uint16_t, Op::Min>/8021` | 12.053 | 4.524 |
| `bm<uint16_t, Op::Min>/31` | 3.052 | 2.089 |
| `bm<uint16_t, Op::Max>/8021` | 11.808 | 4.756 |
| `bm<uint16_t, Op::Max>/31` | 2.933 | 2.047 |
| `bm<uint16_t, Op::Both>/8021` | 5.426 | 4.052 |
| `bm<uint16_t, Op::Both>/31` | 1.413 | 1.521 |
| `bm<uint32_t, Op::Min>/8021` | 6.133 | 1.908 |
| `bm<uint32_t, Op::Min>/15` | 1.544 | 1.094 |
| `bm<uint32_t, Op::Max>/8021` | 6.074 | 1.92 |
| `bm<uint32_t, Op::Max>/15` | 1.53 | 1.132 |
| `bm<uint32_t, Op::Both>/8021` | 3.146 | 2.877 |
| `bm<uint32_t, Op::Both>/15` | 0.869 | 1.195 |
| `bm<int8_t, Op::Min>/8021` | 24.735 | 9.211 |
| `bm<int8_t, Op::Min>/63` | 5.222 | 2.778 |
| `bm<int8_t, Op::Max>/8021` | 25.244 | 9.286 |
| `bm<int8_t, Op::Max>/63` | 5.417 | 2.889 |
| `bm<int8_t, Op::Both>/8021` | 11.538 | 11.25 |
| `bm<int8_t, Op::Both>/63` | 1.989 | 1.76 |
| `bm<int16_t, Op::Min>/8021` | 11.953 | 4.667 |
| `bm<int16_t, Op::Min>/31` | 3.029 | 1.872 |
| `bm<int16_t, Op::Max>/8021` | 11.808 | 4.571 |
| `bm<int16_t, Op::Max>/31` | 3.123 | 1.882 |
| `bm<int16_t, Op::Both>/8021` | 6.582 | 5.729 |
| `bm<int16_t, Op::Both>/31` | 1.414 | 1.541 |
| `bm<int32_t, Op::Min>/8021` | 6.25 | 1.88 |
| `bm<int32_t, Op::Min>/15` | 1.6 | 1.135 |
| `bm<int32_t, Op::Max>/8021` | 6.133 | 1.867 |
| `bm<int32_t, Op::Max>/15` | 1.674 | 1.094 |
| `bm<int32_t, Op::Both>/8021` | 3.222 | 1.784 |
| `bm<int32_t, Op::Both>/15` | 0.877 | 0.903 |
| `bm<float, Op::Min>/8021` | 8.928 | 4.364 |
| `bm<float, Op::Min>/15` | 1.87 | 1.358 |
| `bm<float, Op::Max>/8021` | 9.111 | 4.267 |
| `bm<float, Op::Max>/15` | 2.062 | 1.371 |
| `bm<float, Op::Both>/8021` | 5.227 | 1.626 |
| `bm<float, Op::Both>/15` | 0.913 | 0.7 |
| `bm<double, Op::Min>/8021` | 4.426 | 2.029 |
| `bm<double, Op::Min>/7` | 0.929 | 0.731 |
| `bm<double, Op::Max>/8021` | 4.563 | 2.133 |
| `bm<double, Op::Max>/7` | 0.977 | 0.725 |
| `bm<double, Op::Both>/8021` | 2.583 | 0.786 |
| `bm<double, Op::Both>/7` | 0.445 | 0.402 |

@StephanTLavavej StephanTLavavej added performance Must go faster ARM64 Related to the ARM64 architecture labels Dec 8, 2025
@github-project-automation github-project-automation bot moved this from Initial Review to Work In Progress in STL Code Reviews Dec 8, 2025
@StephanTLavavej StephanTLavavej moved this from Work In Progress to Initial Review in STL Code Reviews Dec 9, 2025
@StephanTLavavej StephanTLavavej self-assigned this Dec 10, 2025
@StephanTLavavej StephanTLavavej removed their assignment Jan 6, 2026
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Jan 6, 2026
Labels

ARM64 Related to the ARM64 architecture performance Must go faster

Projects

Status: Ready To Merge

5 participants