⚡ Thunderbolt: Softmax — Combine FMA and asymmetric unroll by bugparty · Pull Request #43 · bugparty/cpu_math_kernels_pri

bugparty · 2026-05-26T19:48:14Z

💡 What:
Implemented softmax_v6 which uses asymmetric unrolling (8x for max reduction and normalization, 4x for exponentiation) and combined FMA range reduction constants (exp256_ps_v3).

🎯 Why:
The inner exponential step is register heavy (YMM spills if unrolled 8x). But the max reduction and normalization are compute-light and leaving them at 4x underutilizes execution ports. Combining ln(2) into a single FMA removes latency off the critical path for exp256.

🏗️ How:

exp256_ps_v3 uses _mm256_fnmadd_ps for x - n * ln(2).
softmax_v6 unrolls loops independently based on their register pressure and latency profiles.

📊 Impact:

softmax_v5: 4.12 GFLOPS at N=1048576 (Fixed Memory)
softmax_v6: 4.27 GFLOPS at N=1048576 (Fixed Memory)
Correctness maintained within 1e-4 tolerance vs reference.

🖥️ Tested on:
Intel(R) Xeon(R) CPU @ 2.30GHz (Haswell+ target), Ubuntu, GCC 13.

🔬 How to reproduce:

make ml_kernel_test ml_kernel_bench
./build/ml_kernels/ml_kernel_test
DISABLE_CPU_BINDING=1 ./build/ml_kernels/ml_kernel_bench --sizes 1048576 --filter softmax_v6

PR created automatically by Jules for task 825981312052491380 started by @bugparty

Summary by CodeRabbit

Release Notes

New Features
- Introduced softmax_v6 implementation featuring asymmetric loop unrolling across computation phases and optimized FMA-based range reduction.
Documentation
- Added AVX2 softmax optimization guidance including unrolling strategies, register pressure analysis, and benchmark comparisons.
Tests
- Added validation tests and performance benchmarks for the new softmax variant.

- `ml_kernels/include/ml_kernels/softmax.h`: Added `exp256_ps_v3` combining `r = x - n * ln(2)` into a single `_mm256_fnmadd_ps`. Added `softmax_v6` using asymmetric unrolling (8x for max/norm, 4x for exp) to increase ILP without register spilling. - `ml_kernels/src/kernel_bench.cpp`: Registered `SoftmaxV6Benchmark`. - `ml_kernels/src/test_naive_ops.cpp`: Added correctness test `test_softmax_v6`. Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>

google-labs-jules · 2026-05-26T19:48:16Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

coderabbitai · 2026-05-26T19:48:35Z

📝 Walkthrough

Walkthrough

This PR adds softmax_v6, an optimized AVX2 softmax implementation using asymmetric loop unrolling (8x for max reduction and normalization, 4x for exponentiation) with a new exp256_ps_v3 helper using combined FMA for range reduction. The change includes correctness tests, performance benchmarks, and tuning documentation.

Changes

AVX2 Softmax v6 Implementation and Validation

Layer / File(s)	Summary
AVX2 softmax implementation and tuning strategy `ml_kernels/include/ml_kernels/softmax.h`, `.jules/thunderbolt.md`	`exp256_ps_v3` optimizes AVX2 exponentiation via integer rounding and single combined ln(2) FMA. `softmax_v6` computes softmax using three asymmetrically unrolled SIMD phases: 8x for max reduction, 4x for exp+sum, 8x for normalization, with scalar tail handling and early zero-sum return. Documentation records the unrolling strategy and register-pressure tradeoffs.
Correctness test for softmax_v6 `ml_kernels/src/test_naive_ops.cpp`	`test_softmax_v6()` validates `softmax_v6` against naive reference on fixed inputs, checking per-element tolerance (1e-4) and output sum constraint. Test is integrated into main suite.
Performance benchmark for softmax_v6 `ml_kernels/src/kernel_bench.cpp`	`SoftmaxV6Benchmark` class extends existing benchmark harness to measure `softmax_v6` throughput alongside other variants using shared pooled buffers and infrastructure.

Possibly related PRs

bugparty/cpu_math_kernels_pri#31: The main PR's new AVX2 exp/softmax implementation (exp256_ps_v3 + softmax_v6) and its matching benchmark/test are direct continuations of the retrieved PR's exp256_ps_v2 + softmax_v5 work in softmax.h, kernel_bench.cpp, and test_naive_ops.cpp.
bugparty/cpu_math_kernels_pri#7: Both PRs add/extend test coverage around ml_kernels::softmax_naive correctness (the main PR's test_softmax_v6() compares softmax_v6 vs softmax_naive, while the retrieved PR introduces standalone softmax_naive tests), so they're connected through shared softmax_naive test logic/use.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A softmax spins with vectors wide,
Eight rolls for max, then four inside,
One combine fuse, constants fold,
SIMD loops unroll their story bold! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly captures the main changes: combining FMA (fused multiply-add) for exp256_ps_v3 and asymmetric unrolling for softmax_v6.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch thunderbolt-softmax-fma-825981312052491380

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Around line 398-425: The function definitions (e.g., exp256_ps_v3) put the
opening brace on the same line as the signature, violating the project's brace
style; update each new kernel function (including exp256_ps_v3 and the other
functions added in the 433-556 region) so the function opening brace is on its
own line immediately after the signature (i.e., change "inline __m256
exp256_ps_v3(__m256 x) {" to place "{" on the next line), and ensure formatting
matches existing functions in the file.

In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 337-342: The function-body brace style is wrong for the new
benchmark methods; move the opening braces for both the name() and run() method
definitions onto their own lines so they follow the project's "braces on their
own lines" rule. Locate the definitions of name() and run() (the methods
returning "softmax_v6" and calling ml_kernels::softmax_v6(...), respectively)
and change them so each has the opening "{" on the next line before the method
body.

In `@ml_kernels/src/test_naive_ops.cpp`:
- Around line 184-211: The function definition for test_softmax_v6 uses the
opening brace on the same line; change it to the project brace style by placing
the opening brace on its own line (i.e., move the "{" of the test_softmax_v6
function to the next line) so the function body follows the project's "Keep
braces on their own lines for function bodies" guideline and adjust surrounding
whitespace accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5b7465ea-56c7-48a9-a911-0844ba8ddaaa

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and ec25de9.

📒 Files selected for processing (4)

.jules/thunderbolt.md
ml_kernels/include/ml_kernels/softmax.h
ml_kernels/src/kernel_bench.cpp
ml_kernels/src/test_naive_ops.cpp

coderabbitai · 2026-05-26T19:59:52Z

+inline __m256 exp256_ps_v3(__m256 x) {
+    x = _mm256_max_ps(x, _mm256_set1_ps(-87.3f));
+    __m256 x_log2e = _mm256_mul_ps(x, _mm256_set1_ps(1.4426950408889634f));
+
+    __m256i n_int = _mm256_cvtps_epi32(x_log2e);
+    __m256 n = _mm256_cvtepi32_ps(n_int);
+
+    // ⚡ Thunderbolt: combine ln(2) into a single FMA for throughput
+    __m256 r = _mm256_fnmadd_ps(n, _mm256_set1_ps(0.6931471805599453f), x);
+
+    __m256 c1 = _mm256_set1_ps(1.0f);
+    __m256 c2 = _mm256_set1_ps(1.0f / 2.0f);
+    __m256 c3 = _mm256_set1_ps(1.0f / 6.0f);
+    __m256 c4 = _mm256_set1_ps(1.0f / 24.0f);
+    __m256 c5 = _mm256_set1_ps(1.0f / 120.0f);
+
+    __m256 p = _mm256_fmadd_ps(c5, r, c4);
+    p = _mm256_fmadd_ps(p, r, c3);
+    p = _mm256_fmadd_ps(p, r, c2);
+    p = _mm256_fmadd_ps(p, r, c1);
+    p = _mm256_fmadd_ps(p, r, c1);
+
+    __m256i exp_shift = _mm256_add_epi32(n_int, _mm256_set1_epi32(127));
+    __m256i exp_shifted = _mm256_slli_epi32(exp_shift, 23);
+    __m256 exp2n = _mm256_castsi256_ps(exp_shifted);
+
+    return _mm256_mul_ps(p, exp2n);
+}


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Move function opening braces to their own lines in new kernels.

Both newly added function definitions place { on the same line as the signature; this violates the project’s C/C++ function-body brace style.

As per coding guidelines, "Keep braces on their own lines for function bodies".

Also applies to: 433-556

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/include/ml_kernels/softmax.h` around lines 398 - 425, The function definitions (e.g., exp256_ps_v3) put the opening brace on the same line as the signature, violating the project's brace style; update each new kernel function (including exp256_ps_v3 and the other functions added in the 433-556 region) so the function opening brace is on its own line immediately after the signature (i.e., change "inline __m256 exp256_ps_v3(__m256 x) {" to place "{" on the next line), and ensure formatting matches existing functions in the file.

coderabbitai · 2026-05-26T19:59:52Z

+    const char *name() const override { return "softmax_v6"; }
+
+    void run() override {
+        ml_kernels::softmax_v6(inputs_[current_idx_].data(), outputs_[current_idx_].data(), inputs_[0].size());
+        current_idx_ = (current_idx_ + 1) % pool_size_;
+    }


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Apply function-body brace style in new benchmark methods.

The added name() and run() definitions should place opening braces on their own lines.

As per coding guidelines, "Keep braces on their own lines for function bodies".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/src/kernel_bench.cpp` around lines 337 - 342, The function-body brace style is wrong for the new benchmark methods; move the opening braces for both the name() and run() method definitions onto their own lines so they follow the project's "braces on their own lines" rule. Locate the definitions of name() and run() (the methods returning "softmax_v6" and calling ml_kernels::softmax_v6(...), respectively) and change them so each has the opening "{" on the next line before the method body.

coderabbitai · 2026-05-26T19:59:52Z

+void test_softmax_v6() {
+    std::cout << "Running test_softmax_v6..." << std::endl;
+    std::vector<float> input = {
+        -2.0f, -0.5f, 1.0f, 3.0f,
+        0.0f, 0.0f, 0.0f, 0.0f,
+        100.0f, 100.0f, -100.0f, -100.0f,
+        5.0f, -5.0f, 2.0f, -2.0f,
+        1.1f, 1.2f, 1.3f, 1.4f,
+        -1.1f, -1.2f, -1.3f, -1.4f,
+        10.0f, 20.0f, 30.0f, 40.0f,
+        -10.0f, -20.0f, -30.0f, -40.0f
+    };
+
+    std::vector<float> output_naive(input.size(), 0.0f);
+    std::vector<float> output_v6(input.size(), 0.0f);
+
+    ml_kernels::softmax_naive(input.data(), output_naive.data(), input.size());
+    ml_kernels::softmax_v6(input.data(), output_v6.data(), input.size());
+
+    float sum = 0.0f;
+    for (std::size_t i = 0; i < input.size(); ++i) {
+        assert(std::fabs(output_naive[i] - output_v6[i]) < 1e-4f);
+        sum += output_v6[i];
+    }
+    assert(std::fabs(sum - 1.0f) < 1e-4f);
+
+    std::cout << "test_softmax_v6 passed!" << std::endl;
+}


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use project brace style for the new test function definition.

test_softmax_v6 should place the function opening brace on its own line.

As per coding guidelines, "Keep braces on their own lines for function bodies".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/src/test_naive_ops.cpp` around lines 184 - 211, The function definition for test_softmax_v6 uses the opening brace on the same line; change it to the project brace style by placing the opening brace on its own line (i.e., move the "{" of the test_softmax_v6 function to the next line) so the function body follows the project's "Keep braces on their own lines for function bodies" guideline and adjust surrounding whitespace accordingly.

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

⚡ Thunderbolt: Softmax — Combine FMA and asymmetric unroll#43

⚡ Thunderbolt: Softmax — Combine FMA and asymmetric unroll#43
bugparty wants to merge 1 commit into
mainfrom
thunderbolt-softmax-fma-825981312052491380

bugparty commented May 26, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

google-labs-jules Bot commented May 26, 2026

Uh oh!

coderabbitai Bot commented May 26, 2026 •

edited

Loading

Walkthrough

Changes

Possibly related PRs

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 26, 2026

Uh oh!

coderabbitai Bot May 26, 2026

Uh oh!

coderabbitai Bot May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bugparty commented May 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

google-labs-jules Bot commented May 26, 2026

Uh oh!

coderabbitai Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Possibly related PRs

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bugparty commented May 26, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 26, 2026 •

edited

Loading