
Arithmetic legalization for WaveASM backend#1078

Merged
Hardcode84 merged 60 commits into iree-org:main from Hardcode84:llvm-asm-backend-dyn-copy
Mar 18, 2026

Conversation

@Hardcode84
Contributor

Introduces a proper legalization stage between LLVM translation and optimization in the WaveASM pipeline. Instead of making register file (SGPR/VGPR) and width (i32/i64) decisions during translation, we now emit generic pseudo-ops and legalize them in a dedicated pass.

What changed

Design doc (waveasm/docs/legalization.md)

  • Documents the three-phase legalization strategy (type → register → instruction selection).
  • Compares our approach with the real LLVM AMDGPU backend.

Generic arithmetic pseudo-ops (WaveASMArithOps.td, WaveASMTypes.td)

  • 7 new ops: arith.add, arith.mul, arith.cmp, arith.select, arith.trunc, arith.sext, arith.zext.
  • Single arith.cmp op with CmpPredicate enum (10 predicates) instead of per-predicate ops.
  • Register-file-agnostic: accept any operand type, produce any register type.

Arithmetic legalization pass (ArithLegalization.cpp)

  • Demand-driven divergence: VGPR operand → VALU path; all SGPR/imm → SALU path.
  • Constant bus enforcement: inserts v_mov_b32 when VALU has two SGPR operands.
  • i64 narrowing: wide precolored SGPRs narrowed to low sub-register.
  • Walk-based (not greedy rewriter) to avoid DCE of pure ops with no uses.
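The demand-driven split and constant-bus fix-up above can be sketched in plain Python. This is an illustrative model only: the operand-kind strings, the `pick_path`/`enforce_constant_bus` names, and the one-SGPR-read limit are assumptions taken from the description, not the actual pass API.

```python
# Illustrative model of the legalization rules above; not the WaveASM pass.
def pick_path(operand_kinds):
    """Demand-driven divergence: any VGPR operand forces the VALU path;
    an all-SGPR/immediate op stays on the SALU."""
    return "VALU" if "vgpr" in operand_kinds else "SALU"

def enforce_constant_bus(operand_kinds):
    """On the VALU path, allow only one SGPR read per instruction (the
    limit enforced here); each further SGPR operand is first copied into
    a VGPR with v_mov_b32."""
    out, fixes, seen_sgpr = [], [], False
    for kind in operand_kinds:
        if kind == "sgpr" and seen_sgpr:
            fixes.append("v_mov_b32")   # copy the extra SGPR into a VGPR
            out.append("vgpr")
        else:
            seen_sgpr = seen_sgpr or kind == "sgpr"
            out.append(kind)
    return out, fixes
```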

LLVM translation updated (TranslateFromLLVM.cpp)

  • handleAdd, handleMul, handleICmp, handleSelect, handleTrunc, handleSext, handleZext now emit pseudo-ops.
  • Removed ensure32Bit/ensureVGPR helpers — legalization pass handles this.

Pipeline (water.py)

  • Added --waveasm-arith-legalization after --waveasm-translate-from-llvm.

Also includes (from prior commits on this branch):

  • Scalar kernel arg support and dynamic dims in LLVM→WaveASM translation.
  • Copy kernel e2e test covering the full WaveASM pipeline.

depends on #1077

@Hardcode84 Hardcode84 requested a review from harsh-nod March 8, 2026 10:20
@harsh-nod harsh-nod requested a review from raikonenfnu March 8, 2026 16:31
@Hardcode84 Hardcode84 force-pushed the llvm-asm-backend-dyn-copy branch from 38660a7 to 0d70a67 on March 8, 2026 17:14
@Hardcode84 Hardcode84 force-pushed the llvm-asm-backend-dyn-copy branch from 0d70a67 to 22c6ee3 on March 10, 2026 13:47
// type (SGPR, VGPR, immediate) and produces any register type. Legalization
// replaces these with concrete SALU/VALU ops.
class ArithPseudoOp<string mnemonic, list<Trait> traits = []>
: WAVEASMOp<"arith." # mnemonic, !listconcat([Pure, WaveASM_ArithmeticOp], traits)>;
Contributor


Question: are these pure or can they set some flags in the background (which would be modeled through side effects)?

Contributor Author


They should be pure, but our current VCC handling is messy.

@ftynse
Contributor

ftynse commented Mar 10, 2026

This looks reasonable. I had to review each commit in isolation, which will make it harder for you to address changes. Next time, you can target the PR at the branch of the base PR.

Two major concerns, one carried over from the base PR:

  • the lookup logic is unsafe: it may return the original value instead of the converted one even though the owner of that value will be erased (you are also reinventing the pattern infrastructure and are bound to hit the problems we had hit there, and have since resolved);
  • lack of testing for error conditions and diagnostics.

@Hardcode84 Hardcode84 force-pushed the llvm-asm-backend-dyn-copy branch 3 times, most recently from baf9abc to 97223a6 on March 13, 2026 08:23
Contributor

@ftynse ftynse left a comment


This is still vastly undertested and scarcely documented, but is an improvement over the initial state.

Hardcode84 and others added 13 commits March 18, 2026 11:58
Add `water_waveasm_lowering_pipeline` that lowers high-level MLIR to
LLVM dialect via water-opt (memref decomposition, lower-affine,
int-range-optimizations, gpu/amdgpu→rocdl, vector→llvm), producing
IR suitable for waveasm-translate consumption.

Hook this into compile.py under `use_water_backend + backend=asm`
so it takes priority over the legacy ASM flow.  Relax the
wave_runtime validation for this combination.

Step 2 (waveasm-translate consuming LLVM dialect) is a TODO.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Tests the copy kernel through the new water_waveasm_lowering_pipeline
path (use_water_backend=True, backend="asm").  Currently compile_to_mlir
only — verifies the lowering to LLVM dialect succeeds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Introduce --waveasm-translate-from-llvm pass that creates a ProgramOp
from llvm.func kernels and strictly rejects any unhandled op. This is
the entry point for incremental LLVM dialect support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Handle the 12 LLVM/ROCDL ops needed for the copy kernel:
- llvm.mlir.constant → v_mov_b32
- rocdl.workitem.id.x → precolored v0
- llvm.sext/zext/trunc (i32↔i64) → identity on 32-bit GPU
- llvm.icmp → v_cmp_* (VCC implicit)
- llvm.select → v_cndmask_b32
- llvm.mul → v_mul_lo_u32
- rocdl.make.buffer.rsrc → map to SRD from prologue
- llvm.getelementptr (ptr<7>) → decompose into (SRD, voffset)
- llvm.load/store (ptr<7>) → buffer_load_ushort/buffer_store_short

The lit test exercises the full copy kernel path end-to-end.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Implement Step 2 of water_waveasm_lowering_pipeline: invoke
waveasm-translate with --waveasm-translate-from-llvm, optimization
passes, regalloc, and --emit-assembly. Step 3 assembles and links
to HSACO via clang (skipped when compile_to_mlir=True).

Also add use_buffer_ops=True to test_copy_water_waveasm (required
for the buffer ops path).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
End-to-end pass that translates LLVM dialect gpu.module to gpu.binary:
runs the full WaveASM pipeline (translation, optimization, regalloc,
waitcnt, hazard mitigation), emits AMDGCN assembly, assembles and links
to HSACO via ROCDL utilities, and embeds the binary inline in the IR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
… final step

- TranslateFromLLVM now inserts waveasm.program inside the gpu.module
  that contains the kernel, instead of at module top-level. The gpu.module
  is preserved (not erased).
- GPUModuleToBinary is now a pure final step: expects already-optimized
  waveasm.program ops inside gpu.module, emits assembly, assembles + links
  to HSACO, and replaces each gpu.module with gpu.binary.
- The full pipeline is now externally composable as individual passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…gram

Keep the original llvm.func in gpu.module so gpu.launch_func verification
passes, while placing waveasm.program alongside it with a mangled name
(e.g. test__waveasm). The kernel_name attribute preserves the original
name for assembly emission. GPUModuleToBinary erases everything when
replacing gpu.module with gpu.binary. Also wires up the 3-step
water-opt → waveasm-translate → water-opt pipeline and restricts the
waveasm e2e test to CDNA4.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Map workgroup ID intrinsics to precolored system SGPRs, mirroring the
existing gpu.block_id handler from the MLIR-level translation path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Materialize poison values as zero immediates since they represent
undefined values with no meaningful content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
When a GEP's base is another GEP result (not a direct buffer resource),
add the offsets together using v_add_u32 and propagate the SRD.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
When GEPs operate on bare pointers (!llvm.ptr) before make.buffer.rsrc,
propagate the mapper entry and accumulate the byte offset. The offset
is then added to the voffset when creating buffer GEPs on the resulting
ptr<7>.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
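The offset accumulation described above can be pictured as a small mapper update. The data structure and the `handle_bare_gep` name are hypothetical, made up for illustration; the real pass operates on MLIR values, not strings.

```python
# Hypothetical model: each bare-pointer GEP result maps to its underlying
# base pointer plus an accumulated byte offset, which is later added to
# the voffset once make.buffer.rsrc produces the SRD.
mapper = {}

def handle_bare_gep(result, base, index, elem_bytes):
    # Propagate the underlying base and accumulate the byte offset,
    # chaining through earlier GEP results recorded in the mapper.
    prev_base, prev_off = mapper.get(base, (base, 0))
    mapper[result] = (prev_base, prev_off + index * elem_bytes)
```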
Add handleAdd (v_add_u32) and extract handlePoison into a proper
function matching the handleMul/handleConstant pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Hardcode84 and others added 27 commits March 18, 2026 11:58
Generalize extract/pack ops to accept any register type (VGPR, AGPR,
SGPR) instead of vector registers only. This enables SGPR pair
splitting and merging without dedicated ops.

Rewrite ArithLegalization to handle both i32 and i64 widths:

- i64 add: carry chain (s_add_u32 + s_addc_u32 / v_add_co_u32 + v_addc_co_u32)
- i64 mul: schoolbook decomposition (mul_lo, mul_hi, cross terms)
- i64 cmp: eq/ne via XOR+OR+compare-to-zero (ordered comparisons error)
- i64 select: split, select each half, merge
- trunc i64->i32: extract lo sub-register
- sext i32->i64: {lo, ashr(lo, 31)}
- zext i32->i64: {lo, 0}

The pass now reports errors on unsupported widths (anything not i32/i64).

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
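The i64 decompositions listed in this commit message are plain 32-bit-lane arithmetic and can be checked against Python's big integers. This sketch models only the math (the helper names are invented; nothing here is WaveASM code):

```python
M32 = (1 << 32) - 1  # low-dword mask

def add_i64(a, b):
    # carry chain: add the lo dwords (s_add_u32 / v_add_co_u32), then add
    # the hi dwords plus the carry-out (s_addc_u32 / v_addc_co_u32)
    lo = (a & M32) + (b & M32)
    hi = ((a >> 32) + (b >> 32) + (lo >> 32)) & M32
    return (hi << 32) | (lo & M32)

def mul_i64(a, b):
    # schoolbook: mul_lo/mul_hi of the lo dwords plus the two cross terms;
    # the a_hi*b_hi term only affects bits >= 64 and is dropped
    alo, ahi, blo, bhi = a & M32, a >> 32, b & M32, b >> 32
    hi = ((alo * blo) >> 32) + alo * bhi + ahi * blo
    return ((hi & M32) << 32) | ((alo * blo) & M32)

def eq_i64(a, b):
    # eq: XOR each dword pair, OR the results, compare to zero
    return (((a ^ b) & M32) | ((a ^ b) >> 32)) == 0

def sext_i32_to_i64(x):
    # {lo, ashr(lo, 31)}: hi dword is all copies of the sign bit
    return ((M32 if x >> 31 else 0) << 32) | x

def zext_i32_to_i64(x):
    # {lo, 0}
    return x
```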
Two bugs in TranslateFromLLVMDialect.cpp:

1. handleSelect passed (cond, trueVal, falseVal) to Arith_SelectOp::create
   but ODS declares (falseVal, trueVal, condition) — swapped condition and
   falseVal. This caused i64 select legalization to try splitting the cmp
   placeholder (imm<1>) as if it were a 64-bit value, hitting the ExtractOp
   verifier: "operand #0 must be any register, but got '!waveasm.imm<1>'".

2. handleConstant treated all integer constants as i32, producing vreg size 1
   for i64 values. Now splits i64 constants into lo/hi v_mov_b32 + pack into
   vreg<2,2>.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Buffer load/store instructions use 32-bit voffset operands. When a GEP
index comes from an i64 value (e.g. llvm.zext), the arith legalization
widens it to vreg<2> via RAUW, which then propagates into the buffer op
operand. Insert an arith.trunc at translation time so the legalization
pass extracts the lo half before the value reaches buffer ops.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The register allocator does not insert copies for PackOp (it emits no
assembly). When splitI64 created extract ops from a pack whose inputs
were at non-contiguous physical registers, the extracted values read
stale data, causing illegal memory accesses and wrong results at
runtime.

Fix: splitI64 now checks if the value is defined by a PackOp and
returns the pack's original inputs directly, bypassing the extract/pack
round-trip entirely. Filed wave-37d for the underlying regalloc issue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Implements slt/sle/sgt/sge/ult/ule/ugt/uge for i64 operands using
hi/lo decomposition: result = hiPred(hi) || (hiEq(hi) && loPred(lo)).
Hi comparison uses the same signedness as the original predicate
(strict variant); lo always uses unsigned.

VALU path: materialize each sub-comparison via v_cmp + v_cndmask_b32,
select based on hi equality, then set VCC from the final boolean.
SALU path: compute hiCmp | (hiEq & loCmp) using s_and/s_or_b32.

Also refactors v_cmp/s_cmp emission into emitVCmp/emitSCmp helpers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
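The decomposition result = hiPred(hi) || (hiEq(hi) && loPred(lo)) can be modeled directly in Python, two predicates shown as examples. This is a sketch of the arithmetic only, not the emitted v_cmp/v_cndmask_b32 or s_cmp sequences:

```python
M32 = (1 << 32) - 1

def s32(x):
    # reinterpret a dword as signed, for the signed hi comparison
    return x - (1 << 32) if x & (1 << 31) else x

def ult_i64(a, b):
    ahi, bhi, alo, blo = a >> 32, b >> 32, a & M32, b & M32
    # strict unsigned hi compare, or equal hi and unsigned lo compare
    return ahi < bhi or (ahi == bhi and alo < blo)

def slt_i64(a, b):
    ahi, bhi, alo, blo = a >> 32, b >> 32, a & M32, b & M32
    # hi compare keeps the original signedness; lo is always unsigned
    return s32(ahi) < s32(bhi) or (ahi == bhi and alo < blo)
```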
- Carry ops (v_add_co_u32, v_addc_co_u32, etc.): emit implicit VCC as
  explicit operand. v_addc_co_u32/v_subb_co_u32 also append VCC as
  carry-in source.
- v_cndmask_b32: always emit explicit VCC for VOP3 compatibility
  (required when literal operands force VOP3 encoding).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
i64 compares now materialize their boolean result to a VGPR instead of
leaving VCC set, since intervening i64 add operations (v_add_co_u32)
clobber VCC.  The select legalization re-establishes VCC from the
materialized boolean right before v_cndmask_b32 via ensureVCC().

Add lit test exercising the exact clobber scenario (i64 cmp → i64 add → i64 select).

Made-with: Cursor
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…implify pass

- Rename Arith_XxxOp to ArithXxxOp (TableGen + all C++ references) for
  consistent naming with rest of codebase.
- Collapse getRegWidth .Case<SRegType, VRegType, PSRegType, PVRegType>
  into single multi-type Case with auto lambda.
- Replace getResult(0) with getDst() named accessor on carry ops.
- Remove dead code in legalizeCmpI64 VGPR path (unreachable after both
  branches return early).
- Replace pre-collect + accumulate-failure pattern with post-order walk
  + WalkResult::interrupt() (post-order walk is safe for erasure).
- Fix legalization.md terminology: "generic ops" -> "pseudo-ops",
  remove "Proposed" from existing ops, update table to match
  implementation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…galization

Error tests (6 new files):
- arith-legalization-error-{mul,cmp,select,trunc}.mlir: unsupported
  operand width (4 dwords) produces diagnostic.
- arith-legalization-error-{sext,zext}.mlir: non-i32 source produces
  diagnostic.

Operand wiring (arith-legalization.mlir):
- Use FileCheck capture variables ([[A:%.*]], [[B:%.*]], etc.) to
  verify that correct SSA values flow into operations for i64 SALU/VALU
  add carry chain, mul schoolbook, cmp eq/slt, and select.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Change lookupGEP to return std::optional<BufferPtrInfo> instead of
raw pointer, and simplify lookupBaseOffset to a one-liner using
DenseMap::lookup. Also collapse a redundant std::optional<Value>
declaration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
resolve() now returns FailureOr<Value> instead of silently returning
the original LLVM value (which gets erased, causing use-after-free).
All call sites propagate the failure with descriptive diagnostics.

Also extract handleBinaryOp and handleCastOp templates to eliminate
duplication across add/mul and sext/zext/trunc handlers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Emit an error when both gpu.known_block_size and
rocdl.reqd_work_group_size are present but disagree, instead of
silently using whichever comes first.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Move the empty-assembly check into the per-program walk so that a
program producing no output is caught immediately with a diagnostic
on the specific program, rather than being masked by other programs
that did produce output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Poison previously always materialized as a single b32 zero regardless
of the LLVM type width. Now i64 poison produces a packed pair of zero
dwords, matching handleConstant behavior. Unsupported widths emit a
diagnostic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The translation pass does not read data_layout, so the attribute is
just noise in the test inputs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
- Assert ProgramOp body is non-empty instead of defensive check.
- Fix test comment: "arith.add" -> "waveasm.arith.add".
- Move local import of get_waveasm_translate to module level.
- Fix print format to avoid space before ellipsis.
- Replace unicode arrows with ASCII in water.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Verifies operand wiring: icmp condition, true/false values, and
constant materialization all flow through correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The dedicated .Case<V_ADD_CO_U32, V_SUB_CO_U32, V_ADDC_CO_U32,
V_SUBB_CO_U32> handler already matches all _co_ ops, so the
mnemonic.contains("_co_") block in Default was unreachable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Add ArithOrOp and ArithAndOp pseudo-ops, translate from llvm.or/llvm.and,
and legalize to s_or_b32/b64, v_or_b32/b64, s_and_b32/b64, v_and_b32/b64.
Unlike add, bitwise ops use native 64-bit instructions instead of
lo/hi splitting.

Fixes: llvm.or unhandled op error in dynamic copy kernel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Blocked on regalloc supporting contiguous allocation constraints
for PackOp inputs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
SGPR conditions (e.g. from s_cmp_*) were passed to v_cndmask_b32
without setting VCC. Now moved to VGPR via v_mov_b32 first, then
v_cmp_ne_i32 sets VCC as expected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Avoids a crash if the frontend ever produces multi-index GEPs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Duplicate move elimination is handled by the upstream CSE pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
- Use consistent terminology: pseudo-ops throughout, machine ops not
  machine instructions.
- Mark pseudo-op table as implemented, not proposed.
- Replace move caching with CSE reference.
- Update open questions to resolved status.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
LLVM enables them selectively via amdgpu-no-workgroup-id-{y,z}
attributes, not unconditionally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The LLVM translator relies on the gfx950 preload ABI for scalar arg
mapping. Bail out early with a diagnostic for unsupported targets
instead of silently generating wrong code. Default target changed
from gfx942 to gfx950.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
@Hardcode84 Hardcode84 force-pushed the llvm-asm-backend-dyn-copy branch from 1abbf02 to 5f338f0 on March 18, 2026 11:05
@Hardcode84 Hardcode84 merged commit 77ffd04 into iree-org:main Mar 18, 2026
19 checks passed
@Hardcode84 Hardcode84 deleted the llvm-asm-backend-dyn-copy branch March 18, 2026 12:13