Arithmetic legalization for WaveASM backend#1078
Merged
Hardcode84 merged 60 commits intoiree-org:mainfrom Mar 18, 2026
Merged
Conversation
38660a7 to
0d70a67
Compare
0d70a67 to
22c6ee3
Compare
ftynse
reviewed
Mar 10, 2026
ftynse
reviewed
Mar 10, 2026
ftynse
reviewed
Mar 10, 2026
| // type (SGPR, VGPR, immediate) and produces any register type. Legalization | ||
| // replaces these with concrete SALU/VALU ops. | ||
| class ArithPseudoOp<string mnemonic, list<Trait> traits = []> | ||
| : WAVEASMOp<"arith." # mnemonic, !listconcat([Pure, WaveASM_ArithmeticOp], traits)>; |
Contributor
There was a problem hiding this comment.
Question: are these pure or can they set some flags in the background (which would be modeled through side effects)?
Contributor
Author
There was a problem hiding this comment.
They should be pure, but our current vcc handling is messy
ftynse
reviewed
Mar 10, 2026
ftynse
reviewed
Mar 10, 2026
ftynse
reviewed
Mar 10, 2026
Contributor
|
This looks reasonable. I had to review each commit in isolation, which will make it harder for you to address changes. Next time, you can target the PR at the branch of the base PR. Two major concerns, one carried over from the base PR:
|
baf9abc to
97223a6
Compare
ftynse
approved these changes
Mar 16, 2026
Contributor
ftynse
left a comment
There was a problem hiding this comment.
This is still vastly undertested and scarcely documented, but is an improvement over the initial state.
Add `water_waveasm_lowering_pipeline` that lowers high-level MLIR to LLVM dialect via water-opt (memref decomposition, lower-affine, int-range-optimizations, gpu/amdgpu→rocdl, vector→llvm), producing IR suitable for waveasm-translate consumption. Hook this into compile.py under `use_water_backend + backend=asm` so it takes priority over the legacy ASM flow. Relax the wave_runtime validation for this combination. Step 2 (waveasm-translate consuming LLVM dialect) is a TODO. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Tests the copy kernel through the new water_waveasm_lowering_pipeline path (use_water_backend=True, backend="asm"). Currently compile_to_mlir only — verifies the lowering to LLVM dialect succeeds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Introduce --waveasm-translate-from-llvm pass that creates a ProgramOp from llvm.func kernels and strictly rejects any unhandled op. This is the entry point for incremental LLVM dialect support. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Handle the 12 LLVM/ROCDL ops needed for the copy kernel: - llvm.mlir.constant → v_mov_b32 - rocdl.workitem.id.x → precolored v0 - llvm.sext/zext/trunc (i32↔i64) → identity on 32-bit GPU - llvm.icmp → v_cmp_* (VCC implicit) - llvm.select → v_cndmask_b32 - llvm.mul → v_mul_lo_u32 - rocdl.make.buffer.rsrc → map to SRD from prologue - llvm.getelementptr (ptr<7>) → decompose into (SRD, voffset) - llvm.load/store (ptr<7>) → buffer_load_ushort/buffer_store_short The lit test exercises the full copy kernel path end-to-end. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Implement Step 2 of water_waveasm_lowering_pipeline: invoke waveasm-translate with --waveasm-translate-from-llvm, optimization passes, regalloc, and --emit-assembly. Step 3 assembles and links to HSACO via clang (skipped when compile_to_mlir=True). Also add use_buffer_ops=True to test_copy_water_waveasm (required for the buffer ops path). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
End-to-end pass that translates LLVM dialect gpu.module to gpu.binary: runs the full WaveASM pipeline (translation, optimization, regalloc, waitcnt, hazard mitigation), emits AMDGCN assembly, assembles and links to HSACO via ROCDL utilities, and embeds the binary inline in the IR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
… final step - TranslateFromLLVM now inserts waveasm.program inside the gpu.module that contains the kernel, instead of at module top-level. The gpu.module is preserved (not erased). - GPUModuleToBinary is now a pure final step: expects already-optimized waveasm.program ops inside gpu.module, emits assembly, assembles + links to HSACO, and replaces each gpu.module with gpu.binary. - The full pipeline is now externally composable as individual passes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…gram Keep the original llvm.func in gpu.module so gpu.launch_func verification passes, while placing waveasm.program alongside it with a mangled name (e.g. test__waveasm). The kernel_name attribute preserves the original name for assembly emission. GPUModuleToBinary erases everything when replacing gpu.module with gpu.binary. Also wires up the 3-step water-opt → waveasm-translate → water-opt pipeline and restricts the waveasm e2e test to CDNA4. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Map workgroup ID intrinsics to precolored system SGPRs, mirroring the existing gpu.block_id handler from the MLIR-level translation path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Materialize poison values as zero immediates since they represent undefined values with no meaningful content. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
When a GEP's base is another GEP result (not a direct buffer resource), add the offsets together using v_add_u32 and propagate the SRD. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
When GEPs operate on bare pointers (!llvm.ptr) before make.buffer.rsrc, propagate the mapper entry and accumulate the byte offset. The offset is then added to the voffset when creating buffer GEPs on the resulting ptr<7>. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Add handleAdd (v_add_u32) and extract handlePoison into a proper function matching the handleMul/handleConstant pattern. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Generalize extract/pack ops to accept any register type (VGPR, AGPR,
SGPR) instead of vector registers only. This enables SGPR pair
splitting and merging without dedicated ops.
Rewrite ArithLegalization to handle both i32 and i64 widths:
- i64 add: carry chain (s_add_u32 + s_addc_u32 / v_add_co_u32 + v_addc_co_u32)
- i64 mul: schoolbook decomposition (mul_lo, mul_hi, cross terms)
- i64 cmp: eq/ne via XOR+OR+compare-to-zero (ordered comparisons error)
- i64 select: split, select each half, merge
- trunc i64->i32: extract lo sub-register
- sext i32->i64: {lo, ashr(lo, 31)}
- zext i32->i64: {lo, 0}
The pass now reports errors on unsupported widths (anything not i32/i64).
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Two bugs in TranslateFromLLVMDialect.cpp: 1. handleSelect passed (cond, trueVal, falseVal) to Arith_SelectOp::create but ODS declares (falseVal, trueVal, condition) — swapped condition and falseVal. This caused i64 select legalization to try splitting the cmp placeholder (imm<1>) as if it were a 64-bit value, hitting the ExtractOp verifier: "operand #0 must be any register, but got '!waveasm.imm<1>'". 2. handleConstant treated all integer constants as i32, producing vreg size 1 for i64 values. Now splits i64 constants into lo/hi v_mov_b32 + pack into vreg<2,2>. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Buffer load/store instructions use 32-bit voffset operands. When a GEP index comes from an i64 value (e.g. llvm.zext), the arith legalization widens it to vreg<2> via RAUW, which then propagates into the buffer op operand. Insert an arith.trunc at translation time so the legalization pass extracts the lo half before the value reaches buffer ops. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The register allocator does not insert copies for PackOp (it emits no assembly). When splitI64 created extract ops from a pack whose inputs were at non-contiguous physical registers, the extracted values read stale data, causing illegal memory accesses and wrong results at runtime. Fix: splitI64 now checks if the value is defined by a PackOp and returns the pack's original inputs directly, bypassing the extract/pack round-trip entirely. Filed wave-37d for the underlying regalloc issue. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Implements slt/sle/sgt/sge/ult/ule/ugt/uge for i64 operands using hi/lo decomposition: result = hiPred(hi) || (hiEq(hi) && loPred(lo)). Hi comparison uses the same signedness as the original predicate (strict variant); lo always uses unsigned. VALU path: materialize each sub-comparison via v_cmp + v_cndmask_b32, select based on hi equality, then set VCC from the final boolean. SALU path: compute hiCmp | (hiEq & loCmp) using s_and/s_or_b32. Also refactors v_cmp/s_cmp emission into emitVCmp/emitSCmp helpers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
- Carry ops (v_add_co_u32, v_addc_co_u32, etc.): emit implicit VCC as explicit operand. v_addc_co_u32/v_subb_co_u32 also append VCC as carry-in source. - v_cndmask_b32: always emit explicit VCC for VOP3 compatibility (required when literal operands force VOP3 encoding). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
i64 compares now materialize their boolean result to a VGPR instead of leaving VCC set, since intervening i64 add operations (v_add_co_u32) clobber VCC. The select legalization re-establishes VCC from the materialized boolean right before v_cndmask_b32 via ensureVCC(). Add lit test exercising the exact clobber scenario (i64 cmp → i64 add → i64 select). Made-with: Cursor Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…implify pass - Rename Arith_XxxOp to ArithXxxOp (TableGen + all C++ references) for consistent naming with rest of codebase. - Collapse getRegWidth .Case<SRegType, VRegType, PSRegType, PVRegType> into single multi-type Case with auto lambda. - Replace getResult(0) with getDst() named accessor on carry ops. - Remove dead code in legalizeCmpI64 VGPR path (unreachable after both branches return early). - Replace pre-collect + accumulate-failure pattern with post-order walk + WalkResult::interrupt() (post-order walk is safe for erasure). - Fix legalization.md terminology: "generic ops" -> "pseudo-ops", remove "Proposed" from existing ops, update table to match implementation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…galization
Error tests (6 new files):
- arith-legalization-error-{mul,cmp,select,trunc}.mlir: unsupported
operand width (4 dwords) produces diagnostic.
- arith-legalization-error-{sext,zext}.mlir: non-i32 source produces
diagnostic.
Operand wiring (arith-legalization.mlir):
- Use FileCheck capture variables ([[A:%.*]], [[B:%.*]], etc.) to
verify that correct SSA values flow into operations for i64 SALU/VALU
add carry chain, mul schoolbook, cmp eq/slt, and select.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Change lookupGEP to return std::optional<BufferPtrInfo> instead of raw pointer, and simplify lookupBaseOffset to a one-liner using DenseMap::lookup. Also collapse a redundant std::optional<Value> declaration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
resolve() now returns FailureOr<Value> instead of silently returning the original LLVM value (which gets erased, causing use-after-free). All call sites propagate the failure with descriptive diagnostics. Also extract handleBinaryOp and handleCastOp templates to eliminate duplication across add/mul and sext/zext/trunc handlers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Emit an error when both gpu.known_block_size and rocdl.reqd_work_group_size are present but disagree, instead of silently using whichever comes first. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Move the empty-assembly check into the per-program walk so that a program producing no output is caught immediately with a diagnostic on the specific program, rather than being masked by other programs that did produce output. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Poison previously always materialized as a single b32 zero regardless of the LLVM type width. Now i64 poison produces a packed pair of zero dwords, matching handleConstant behavior. Unsupported widths emit a diagnostic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The translation pass does not read data_layout, so the attribute is just noise in the test inputs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
- Assert ProgramOp body is non-empty instead of defensive check. - Fix test comment: "arith.add" -> "waveasm.arith.add". - Move local import of get_waveasm_translate to module level. - Fix print format to avoid space before ellipsis. - Replace unicode arrows with ASCII in water.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Verifies operand wiring: icmp condition, true/false values, and constant materialization all flow through correctly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The dedicated .Case<V_ADD_CO_U32, V_SUB_CO_U32, V_ADDC_CO_U32,
V_SUBB_CO_U32> handler already matches all _co_ ops, so the
mnemonic.contains("_co_") block in Default was unreachable.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Add ArithOrOp and ArithAndOp pseudo-ops, translate from llvm.or/llvm.and, and legalize to s_or_b32/b64, v_or_b32/b64, s_and_b32/b64, v_and_b32/b64. Unlike add, bitwise ops use native 64-bit instructions instead of lo/hi splitting. Fixes: llvm.or unhandled op error in dynamic copy kernel. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Blocked on regalloc supporting contiguous allocation constraints for PackOp inputs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
SGPR conditions (e.g. from s_cmp_*) were passed to v_cndmask_b32 without setting VCC. Now moved to VGPR via v_mov_b32 first, then v_cmp_ne_i32 sets VCC as expected. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Avoids a crash if the frontend ever produces multi-index GEPs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Duplicate move elimination is handled by the upstream CSE pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
- Use consistent terminology: pseudo-ops throughout, machine ops not machine instructions. - Mark pseudo-op table as implemented, not proposed. - Replace move caching with CSE reference. - Update open questions to resolved status. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
LLVM enables them selectively via amdgpu-no-workgroup-id-{y,z}
attributes, not unconditionally.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The LLVM translator relies on the gfx950 preload ABI for scalar arg mapping. Bail out early with a diagnostic for unsupported targets instead of silently generating wrong code. Default target changed from gfx942 to gfx950. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
1abbf02 to
5f338f0
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Introduces a proper legalization stage between LLVM translation and optimization in the WaveASM pipeline. Instead of making register file (SGPR/VGPR) and width (i32/i64) decisions during translation, we now emit generic pseudo-ops and legalize them in a dedicated pass.
What changed
Design doc (
waveasm/docs/legalization.md)Generic arithmetic pseudo-ops (
WaveASMArithOps.td,WaveASMTypes.td)arith.add,arith.mul,arith.cmp,arith.select,arith.trunc,arith.sext,arith.zext.arith.cmpop withCmpPredicateenum (10 predicates) instead of per-predicate ops.Arithmetic legalization pass (
ArithLegalization.cpp)v_mov_b32when VALU has two SGPR operands.LLVM translation updated (
TranslateFromLLVM.cpp)handleAdd,handleMul,handleICmp,handleSelect,handleTrunc,handleSext,handleZextnow emit pseudo-ops.ensure32Bit/ensureVGPRhelpers — legalization pass handles this.Pipeline (
water.py)--waveasm-arith-legalizationafter--waveasm-translate-from-llvm.Also includes (from prior commits on this branch):
depends on #1077