Skip to content

⚡ Thunderbolt: Softmax — Combine FMA and asymmetric unroll#43

Open
bugparty wants to merge 1 commit into
mainfrom
thunderbolt-softmax-fma-825981312052491380
Open

⚡ Thunderbolt: Softmax — Combine FMA and asymmetric unroll#43
bugparty wants to merge 1 commit into
mainfrom
thunderbolt-softmax-fma-825981312052491380

Conversation

@bugparty
Copy link
Copy Markdown
Owner

@bugparty bugparty commented May 26, 2026

💡 What:
Implemented softmax_v6 which uses asymmetric unrolling (8x for max reduction and normalization, 4x for exponentiation) and combined FMA range reduction constants (exp256_ps_v3).

🎯 Why:
The inner exponential step is register heavy (YMM spills if unrolled 8x). But the max reduction and normalization are compute-light and leaving them at 4x underutilizes execution ports. Combining ln(2) into a single FMA removes latency off the critical path for exp256.

🏗️ How:

  • exp256_ps_v3 uses _mm256_fnmadd_ps for x - n * ln(2).
  • softmax_v6 unrolls loops independently based on their register pressure and latency profiles.

📊 Impact:

  • softmax_v5: 4.12 GFLOPS at N=1048576 (Fixed Memory)
  • softmax_v6: 4.27 GFLOPS at N=1048576 (Fixed Memory)
  • Correctness maintained within 1e-4 tolerance vs reference.

🖥️ Tested on:
Intel(R) Xeon(R) CPU @ 2.30GHz (Haswell+ target), Ubuntu, GCC 13.

🔬 How to reproduce:

make ml_kernel_test ml_kernel_bench
./build/ml_kernels/ml_kernel_test
DISABLE_CPU_BINDING=1 ./build/ml_kernels/ml_kernel_bench --sizes 1048576 --filter softmax_v6

PR created automatically by Jules for task 825981312052491380 started by @bugparty

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced softmax_v6 implementation featuring asymmetric loop unrolling across computation phases and optimized FMA-based range reduction.
  • Documentation

    • Added AVX2 softmax optimization guidance including unrolling strategies, register pressure analysis, and benchmark comparisons.
  • Tests

    • Added validation tests and performance benchmarks for the new softmax variant.

Review Change Stack

- `ml_kernels/include/ml_kernels/softmax.h`: Added `exp256_ps_v3` combining `r = x - n * ln(2)` into a single `_mm256_fnmadd_ps`. Added `softmax_v6` using asymmetric unrolling (8x for max/norm, 4x for exp) to increase ILP without register spilling.
- `ml_kernels/src/kernel_bench.cpp`: Registered `SoftmaxV6Benchmark`.
- `ml_kernels/src/test_naive_ops.cpp`: Added correctness test `test_softmax_v6`.

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
@google-labs-jules
Copy link
Copy Markdown
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

📝 Walkthrough

Walkthrough

This PR adds softmax_v6, an optimized AVX2 softmax implementation using asymmetric loop unrolling (8x for max reduction and normalization, 4x for exponentiation) with a new exp256_ps_v3 helper using combined FMA for range reduction. The change includes correctness tests, performance benchmarks, and tuning documentation.

Changes

AVX2 Softmax v6 Implementation and Validation

Layer / File(s) Summary
AVX2 softmax implementation and tuning strategy
ml_kernels/include/ml_kernels/softmax.h, .jules/thunderbolt.md
exp256_ps_v3 optimizes AVX2 exponentiation via integer rounding and single combined ln(2) FMA. softmax_v6 computes softmax using three asymmetrically unrolled SIMD phases: 8x for max reduction, 4x for exp+sum, 8x for normalization, with scalar tail handling and early zero-sum return. Documentation records the unrolling strategy and register-pressure tradeoffs.
Correctness test for softmax_v6
ml_kernels/src/test_naive_ops.cpp
test_softmax_v6() validates softmax_v6 against naive reference on fixed inputs, checking per-element tolerance (1e-4) and output sum constraint. Test is integrated into main suite.
Performance benchmark for softmax_v6
ml_kernels/src/kernel_bench.cpp
SoftmaxV6Benchmark class extends existing benchmark harness to measure softmax_v6 throughput alongside other variants using shared pooled buffers and infrastructure.

Possibly related PRs

  • bugparty/cpu_math_kernels_pri#31: The main PR's new AVX2 exp/softmax implementation (exp256_ps_v3 + softmax_v6) and its matching benchmark/test are direct continuations of the retrieved PR's exp256_ps_v2 + softmax_v5 work in softmax.h, kernel_bench.cpp, and test_naive_ops.cpp.
  • bugparty/cpu_math_kernels_pri#7: Both PRs add/extend test coverage around ml_kernels::softmax_naive correctness (the main PR's test_softmax_v6() compares softmax_v6 vs softmax_naive, while the retrieved PR introduces standalone softmax_naive tests), so they're connected through shared softmax_naive test logic/use.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A softmax spins with vectors wide,
Eight rolls for max, then four inside,
One combine fuse, constants fold,
SIMD loops unroll their story bold!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly captures the main changes: combining FMA (fused multiply-add) for exp256_ps_v3 and asymmetric unrolling for softmax_v6.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch thunderbolt-softmax-fma-825981312052491380

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Around line 398-425: The function definitions (e.g., exp256_ps_v3) put the
opening brace on the same line as the signature, violating the project's brace
style; update each new kernel function (including exp256_ps_v3 and the other
functions added in the 433-556 region) so the function opening brace is on its
own line immediately after the signature (i.e., change "inline __m256
exp256_ps_v3(__m256 x) {" to place "{" on the next line), and ensure formatting
matches existing functions in the file.

In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 337-342: The function-body brace style is wrong for the new
benchmark methods; move the opening braces for both the name() and run() method
definitions onto their own lines so they follow the project's "braces on their
own lines" rule. Locate the definitions of name() and run() (the methods
returning "softmax_v6" and calling ml_kernels::softmax_v6(...), respectively)
and change them so each has the opening "{" on the next line before the method
body.

In `@ml_kernels/src/test_naive_ops.cpp`:
- Around line 184-211: The function definition for test_softmax_v6 uses the
opening brace on the same line; change it to the project brace style by placing
the opening brace on its own line (i.e., move the "{" of the test_softmax_v6
function to the next line) so the function body follows the project's "Keep
braces on their own lines for function bodies" guideline and adjust surrounding
whitespace accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5b7465ea-56c7-48a9-a911-0844ba8ddaaa

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and ec25de9.

📒 Files selected for processing (4)
  • .jules/thunderbolt.md
  • ml_kernels/include/ml_kernels/softmax.h
  • ml_kernels/src/kernel_bench.cpp
  • ml_kernels/src/test_naive_ops.cpp

Comment on lines +398 to +425
inline __m256 exp256_ps_v3(__m256 x) {
x = _mm256_max_ps(x, _mm256_set1_ps(-87.3f));
__m256 x_log2e = _mm256_mul_ps(x, _mm256_set1_ps(1.4426950408889634f));

__m256i n_int = _mm256_cvtps_epi32(x_log2e);
__m256 n = _mm256_cvtepi32_ps(n_int);

// ⚡ Thunderbolt: combine ln(2) into a single FMA for throughput
__m256 r = _mm256_fnmadd_ps(n, _mm256_set1_ps(0.6931471805599453f), x);

__m256 c1 = _mm256_set1_ps(1.0f);
__m256 c2 = _mm256_set1_ps(1.0f / 2.0f);
__m256 c3 = _mm256_set1_ps(1.0f / 6.0f);
__m256 c4 = _mm256_set1_ps(1.0f / 24.0f);
__m256 c5 = _mm256_set1_ps(1.0f / 120.0f);

__m256 p = _mm256_fmadd_ps(c5, r, c4);
p = _mm256_fmadd_ps(p, r, c3);
p = _mm256_fmadd_ps(p, r, c2);
p = _mm256_fmadd_ps(p, r, c1);
p = _mm256_fmadd_ps(p, r, c1);

__m256i exp_shift = _mm256_add_epi32(n_int, _mm256_set1_epi32(127));
__m256i exp_shifted = _mm256_slli_epi32(exp_shift, 23);
__m256 exp2n = _mm256_castsi256_ps(exp_shifted);

return _mm256_mul_ps(p, exp2n);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Move function opening braces to their own lines in new kernels.

Both newly added function definitions place { on the same line as the signature; this violates the project’s C/C++ function-body brace style.

As per coding guidelines, "Keep braces on their own lines for function bodies".

Also applies to: 433-556

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/include/ml_kernels/softmax.h` around lines 398 - 425, The function
definitions (e.g., exp256_ps_v3) put the opening brace on the same line as the
signature, violating the project's brace style; update each new kernel function
(including exp256_ps_v3 and the other functions added in the 433-556 region) so
the function opening brace is on its own line immediately after the signature
(i.e., change "inline __m256 exp256_ps_v3(__m256 x) {" to place "{" on the next
line), and ensure formatting matches existing functions in the file.

Comment on lines +337 to +342
const char *name() const override { return "softmax_v6"; }

void run() override {
ml_kernels::softmax_v6(inputs_[current_idx_].data(), outputs_[current_idx_].data(), inputs_[0].size());
current_idx_ = (current_idx_ + 1) % pool_size_;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Apply function-body brace style in new benchmark methods.

The added name() and run() definitions should place opening braces on their own lines.

As per coding guidelines, "Keep braces on their own lines for function bodies".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/kernel_bench.cpp` around lines 337 - 342, The function-body
brace style is wrong for the new benchmark methods; move the opening braces for
both the name() and run() method definitions onto their own lines so they follow
the project's "braces on their own lines" rule. Locate the definitions of name()
and run() (the methods returning "softmax_v6" and calling
ml_kernels::softmax_v6(...), respectively) and change them so each has the
opening "{" on the next line before the method body.

Comment on lines +184 to +211
void test_softmax_v6() {
std::cout << "Running test_softmax_v6..." << std::endl;
std::vector<float> input = {
-2.0f, -0.5f, 1.0f, 3.0f,
0.0f, 0.0f, 0.0f, 0.0f,
100.0f, 100.0f, -100.0f, -100.0f,
5.0f, -5.0f, 2.0f, -2.0f,
1.1f, 1.2f, 1.3f, 1.4f,
-1.1f, -1.2f, -1.3f, -1.4f,
10.0f, 20.0f, 30.0f, 40.0f,
-10.0f, -20.0f, -30.0f, -40.0f
};

std::vector<float> output_naive(input.size(), 0.0f);
std::vector<float> output_v6(input.size(), 0.0f);

ml_kernels::softmax_naive(input.data(), output_naive.data(), input.size());
ml_kernels::softmax_v6(input.data(), output_v6.data(), input.size());

float sum = 0.0f;
for (std::size_t i = 0; i < input.size(); ++i) {
assert(std::fabs(output_naive[i] - output_v6[i]) < 1e-4f);
sum += output_v6[i];
}
assert(std::fabs(sum - 1.0f) < 1e-4f);

std::cout << "test_softmax_v6 passed!" << std::endl;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use project brace style for the new test function definition.

test_softmax_v6 should place the function opening brace on its own line.

As per coding guidelines, "Keep braces on their own lines for function bodies".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` around lines 184 - 211, The function
definition for test_softmax_v6 uses the opening brace on the same line; change
it to the project brace style by placing the opening brace on its own line
(i.e., move the "{" of the test_softmax_v6 function to the next line) so the
function body follows the project's "Keep braces on their own lines for function
bodies" guideline and adjust surrounding whitespace accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant