⚡ Thunderbolt: Softmax — Combine FMA and asymmetric unroll#43
Conversation
- `ml_kernels/include/ml_kernels/softmax.h`: Added `exp256_ps_v3` combining `r = x - n * ln(2)` into a single `_mm256_fnmadd_ps`. Added `softmax_v6` using asymmetric unrolling (8x for max/norm, 4x for exp) to increase ILP without register spilling. - `ml_kernels/src/kernel_bench.cpp`: Registered `SoftmaxV6Benchmark`. - `ml_kernels/src/test_naive_ops.cpp`: Added correctness test `test_softmax_v6`. Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
📝 WalkthroughWalkthroughThis PR adds ChangesAVX2 Softmax v6 Implementation and Validation
Possibly related PRs
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Around line 398-425: The function definitions (e.g., exp256_ps_v3) put the
opening brace on the same line as the signature, violating the project's brace
style; update each new kernel function (including exp256_ps_v3 and the other
functions added in the 433-556 region) so the function opening brace is on its
own line immediately after the signature (i.e., change "inline __m256
exp256_ps_v3(__m256 x) {" to place "{" on the next line), and ensure formatting
matches existing functions in the file.
In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 337-342: The function-body brace style is wrong for the new
benchmark methods; move the opening braces for both the name() and run() method
definitions onto their own lines so they follow the project's "braces on their
own lines" rule. Locate the definitions of name() and run() (the methods
returning "softmax_v6" and calling ml_kernels::softmax_v6(...), respectively)
and change them so each has the opening "{" on the next line before the method
body.
In `@ml_kernels/src/test_naive_ops.cpp`:
- Around line 184-211: The function definition for test_softmax_v6 uses the
opening brace on the same line; change it to the project brace style by placing
the opening brace on its own line (i.e., move the "{" of the test_softmax_v6
function to the next line) so the function body follows the project's "Keep
braces on their own lines for function bodies" guideline and adjust surrounding
whitespace accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5b7465ea-56c7-48a9-a911-0844ba8ddaaa
📒 Files selected for processing (4)
.jules/thunderbolt.mdml_kernels/include/ml_kernels/softmax.hml_kernels/src/kernel_bench.cppml_kernels/src/test_naive_ops.cpp
| inline __m256 exp256_ps_v3(__m256 x) { | ||
| x = _mm256_max_ps(x, _mm256_set1_ps(-87.3f)); | ||
| __m256 x_log2e = _mm256_mul_ps(x, _mm256_set1_ps(1.4426950408889634f)); | ||
|
|
||
| __m256i n_int = _mm256_cvtps_epi32(x_log2e); | ||
| __m256 n = _mm256_cvtepi32_ps(n_int); | ||
|
|
||
| // ⚡ Thunderbolt: combine ln(2) into a single FMA for throughput | ||
| __m256 r = _mm256_fnmadd_ps(n, _mm256_set1_ps(0.6931471805599453f), x); | ||
|
|
||
| __m256 c1 = _mm256_set1_ps(1.0f); | ||
| __m256 c2 = _mm256_set1_ps(1.0f / 2.0f); | ||
| __m256 c3 = _mm256_set1_ps(1.0f / 6.0f); | ||
| __m256 c4 = _mm256_set1_ps(1.0f / 24.0f); | ||
| __m256 c5 = _mm256_set1_ps(1.0f / 120.0f); | ||
|
|
||
| __m256 p = _mm256_fmadd_ps(c5, r, c4); | ||
| p = _mm256_fmadd_ps(p, r, c3); | ||
| p = _mm256_fmadd_ps(p, r, c2); | ||
| p = _mm256_fmadd_ps(p, r, c1); | ||
| p = _mm256_fmadd_ps(p, r, c1); | ||
|
|
||
| __m256i exp_shift = _mm256_add_epi32(n_int, _mm256_set1_epi32(127)); | ||
| __m256i exp_shifted = _mm256_slli_epi32(exp_shift, 23); | ||
| __m256 exp2n = _mm256_castsi256_ps(exp_shifted); | ||
|
|
||
| return _mm256_mul_ps(p, exp2n); | ||
| } |
There was a problem hiding this comment.
Move function opening braces to their own lines in new kernels.
Both newly added function definitions place { on the same line as the signature; this violates the project’s C/C++ function-body brace style.
As per coding guidelines, "Keep braces on their own lines for function bodies".
Also applies to: 433-556
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ml_kernels/include/ml_kernels/softmax.h` around lines 398 - 425, The function
definitions (e.g., exp256_ps_v3) put the opening brace on the same line as the
signature, violating the project's brace style; update each new kernel function
(including exp256_ps_v3 and the other functions added in the 433-556 region) so
the function opening brace is on its own line immediately after the signature
(i.e., change "inline __m256 exp256_ps_v3(__m256 x) {" to place "{" on the next
line), and ensure formatting matches existing functions in the file.
| const char *name() const override { return "softmax_v6"; } | ||
|
|
||
| void run() override { | ||
| ml_kernels::softmax_v6(inputs_[current_idx_].data(), outputs_[current_idx_].data(), inputs_[0].size()); | ||
| current_idx_ = (current_idx_ + 1) % pool_size_; | ||
| } |
There was a problem hiding this comment.
Apply function-body brace style in new benchmark methods.
The added name() and run() definitions should place opening braces on their own lines.
As per coding guidelines, "Keep braces on their own lines for function bodies".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ml_kernels/src/kernel_bench.cpp` around lines 337 - 342, The function-body
brace style is wrong for the new benchmark methods; move the opening braces for
both the name() and run() method definitions onto their own lines so they follow
the project's "braces on their own lines" rule. Locate the definitions of name()
and run() (the methods returning "softmax_v6" and calling
ml_kernels::softmax_v6(...), respectively) and change them so each has the
opening "{" on the next line before the method body.
| void test_softmax_v6() { | ||
| std::cout << "Running test_softmax_v6..." << std::endl; | ||
| std::vector<float> input = { | ||
| -2.0f, -0.5f, 1.0f, 3.0f, | ||
| 0.0f, 0.0f, 0.0f, 0.0f, | ||
| 100.0f, 100.0f, -100.0f, -100.0f, | ||
| 5.0f, -5.0f, 2.0f, -2.0f, | ||
| 1.1f, 1.2f, 1.3f, 1.4f, | ||
| -1.1f, -1.2f, -1.3f, -1.4f, | ||
| 10.0f, 20.0f, 30.0f, 40.0f, | ||
| -10.0f, -20.0f, -30.0f, -40.0f | ||
| }; | ||
|
|
||
| std::vector<float> output_naive(input.size(), 0.0f); | ||
| std::vector<float> output_v6(input.size(), 0.0f); | ||
|
|
||
| ml_kernels::softmax_naive(input.data(), output_naive.data(), input.size()); | ||
| ml_kernels::softmax_v6(input.data(), output_v6.data(), input.size()); | ||
|
|
||
| float sum = 0.0f; | ||
| for (std::size_t i = 0; i < input.size(); ++i) { | ||
| assert(std::fabs(output_naive[i] - output_v6[i]) < 1e-4f); | ||
| sum += output_v6[i]; | ||
| } | ||
| assert(std::fabs(sum - 1.0f) < 1e-4f); | ||
|
|
||
| std::cout << "test_softmax_v6 passed!" << std::endl; | ||
| } |
There was a problem hiding this comment.
Use project brace style for the new test function definition.
test_softmax_v6 should place the function opening brace on its own line.
As per coding guidelines, "Keep braces on their own lines for function bodies".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ml_kernels/src/test_naive_ops.cpp` around lines 184 - 211, The function
definition for test_softmax_v6 uses the opening brace on the same line; change
it to the project brace style by placing the opening brace on its own line
(i.e., move the "{" of the test_softmax_v6 function to the next line) so the
function body follows the project's "Keep braces on their own lines for function
bodies" guideline and adjust surrounding whitespace accordingly.
💡 What:
Implemented
softmax_v6which uses asymmetric unrolling (8x for max reduction and normalization, 4x for exponentiation) and combined FMA range reduction constants (exp256_ps_v3).🎯 Why:
The inner exponential step is register heavy (YMM spills if unrolled 8x). But the max reduction and normalization are compute-light and leaving them at 4x underutilizes execution ports. Combining
ln(2)into a single FMA removes latency off the critical path forexp256.🏗️ How:
exp256_ps_v3uses_mm256_fnmadd_psforx - n * ln(2).softmax_v6unrolls loops independently based on their register pressure and latency profiles.📊 Impact:
softmax_v5: 4.12 GFLOPS at N=1048576 (Fixed Memory)softmax_v6: 4.27 GFLOPS at N=1048576 (Fixed Memory)🖥️ Tested on:
Intel(R) Xeon(R) CPU @ 2.30GHz (Haswell+ target), Ubuntu, GCC 13.
🔬 How to reproduce:
PR created automatically by Jules for task 825981312052491380 started by @bugparty
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests