
Arithmetic legalization for WaveASM backend#1078

Merged
Hardcode84 merged 60 commits into iree-org:main from Hardcode84:llvm-asm-backend-dyn-copy
Mar 18, 2026

Conversation

@Hardcode84
Contributor

Introduces a proper legalization stage between LLVM translation and optimization in the WaveASM pipeline. Instead of making register file (SGPR/VGPR) and width (i32/i64) decisions during translation, we now emit generic pseudo-ops and legalize them in a dedicated pass.

What changed

Design doc (waveasm/docs/legalization.md)

  • Documents the three-phase legalization strategy (type → register → instruction selection).
  • Compares our approach with the real LLVM AMDGPU backend.

Generic arithmetic pseudo-ops (WaveASMArithOps.td, WaveASMTypes.td)

  • 7 new ops: arith.add, arith.mul, arith.cmp, arith.select, arith.trunc, arith.sext, arith.zext.
  • Single arith.cmp op with CmpPredicate enum (10 predicates) instead of per-predicate ops.
  • Register-file-agnostic: accept any operand type, produce any register type.

Arithmetic legalization pass (ArithLegalization.cpp)

  • Demand-driven divergence: VGPR operand → VALU path; all SGPR/imm → SALU path.
  • Constant bus enforcement: inserts v_mov_b32 when VALU has two SGPR operands.
  • i64 narrowing: wide precolored SGPRs narrowed to low sub-register.
  • Walk-based (not greedy rewriter) to avoid DCE of pure ops with no uses.
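The demand-driven split and constant-bus fix-up above can be sketched in plain Python. This is an illustrative model only: the operand-kind strings, the `pick_path`/`enforce_constant_bus` names, and the one-SGPR-read limit are assumptions taken from the description, not the actual pass API.

```python
# Illustrative model of the legalization rules above; not the WaveASM pass.
def pick_path(operand_kinds):
    """Demand-driven divergence: any VGPR operand forces the VALU path;
    an all-SGPR/immediate op stays on the SALU."""
    return "VALU" if "vgpr" in operand_kinds else "SALU"

def enforce_constant_bus(operand_kinds):
    """On the VALU path, allow only one SGPR read per instruction (the
    limit enforced here); each further SGPR operand is first copied into
    a VGPR with v_mov_b32."""
    out, fixes, seen_sgpr = [], [], False
    for kind in operand_kinds:
        if kind == "sgpr" and seen_sgpr:
            fixes.append("v_mov_b32")   # copy the extra SGPR into a VGPR
            out.append("vgpr")
        else:
            seen_sgpr = seen_sgpr or kind == "sgpr"
            out.append(kind)
    return out, fixes
```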

LLVM translation updated (TranslateFromLLVM.cpp)

  • handleAdd, handleMul, handleICmp, handleSelect, handleTrunc, handleSext, handleZext now emit pseudo-ops.
  • Removed ensure32Bit/ensureVGPR helpers — legalization pass handles this.

Pipeline (water.py)

  • Added --waveasm-arith-legalization after --waveasm-translate-from-llvm.

Also includes (from prior commits on this branch):

  • Scalar kernel arg support and dynamic dims in LLVM→WaveASM translation.
  • Copy kernel e2e test covering the full WaveASM pipeline.

depends on #1077

@Hardcode84 Hardcode84 requested a review from harsh-nod March 8, 2026 10:20
@harsh-nod harsh-nod requested a review from raikonenfnu March 8, 2026 16:31
@Hardcode84 Hardcode84 force-pushed the llvm-asm-backend-dyn-copy branch from 38660a7 to 0d70a67 on March 8, 2026 17:14
@Hardcode84 Hardcode84 force-pushed the llvm-asm-backend-dyn-copy branch from 0d70a67 to 22c6ee3 on March 10, 2026 13:47
// type (SGPR, VGPR, immediate) and produces any register type. Legalization
// replaces these with concrete SALU/VALU ops.
class ArithPseudoOp<string mnemonic, list<Trait> traits = []>
: WAVEASMOp<"arith." # mnemonic, !listconcat([Pure, WaveASM_ArithmeticOp], traits)>;
Contributor


Question: are these pure or can they set some flags in the background (which would be modeled through side effects)?

Contributor Author


They should be pure, but our current VCC handling is messy.

@ftynse
Contributor

ftynse commented Mar 10, 2026

This looks reasonable. I had to review each commit in isolation, which will make it harder for you to address changes. Next time, you can target the PR at the branch of the base PR.

Two major concerns, one carried over from the base PR:

  • the lookup logic is unsafe: it may return the original value instead of the converted one even though the owner of that value will be erased (you are also reinventing the pattern infrastructure and are bound to hit the problems we had hit there, and have since resolved);
  • lack of testing for error conditions and diagnostics.

@Hardcode84 Hardcode84 force-pushed the llvm-asm-backend-dyn-copy branch 3 times, most recently from baf9abc to 97223a6 on March 13, 2026 08:23
Contributor

@ftynse ftynse left a comment


This is still vastly undertested and scarcely documented, but is an improvement over the initial state.

Hardcode84 and others added 13 commits March 18, 2026 11:58
Add `water_waveasm_lowering_pipeline` that lowers high-level MLIR to
LLVM dialect via water-opt (memref decomposition, lower-affine,
int-range-optimizations, gpu/amdgpu→rocdl, vector→llvm), producing
IR suitable for waveasm-translate consumption.

Hook this into compile.py under `use_water_backend + backend=asm`
so it takes priority over the legacy ASM flow.  Relax the
wave_runtime validation for this combination.

Step 2 (waveasm-translate consuming LLVM dialect) is a TODO.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Tests the copy kernel through the new water_waveasm_lowering_pipeline
path (use_water_backend=True, backend="asm").  Currently compile_to_mlir
only — verifies the lowering to LLVM dialect succeeds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Introduce --waveasm-translate-from-llvm pass that creates a ProgramOp
from llvm.func kernels and strictly rejects any unhandled op. This is
the entry point for incremental LLVM dialect support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Handle the 12 LLVM/ROCDL ops needed for the copy kernel:
- llvm.mlir.constant → v_mov_b32
- rocdl.workitem.id.x → precolored v0
- llvm.sext/zext/trunc (i32↔i64) → identity on 32-bit GPU
- llvm.icmp → v_cmp_* (VCC implicit)
- llvm.select → v_cndmask_b32
- llvm.mul → v_mul_lo_u32
- rocdl.make.buffer.rsrc → map to SRD from prologue
- llvm.getelementptr (ptr<7>) → decompose into (SRD, voffset)
- llvm.load/store (ptr<7>) → buffer_load_ushort/buffer_store_short

The lit test exercises the full copy kernel path end-to-end.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Implement Step 2 of water_waveasm_lowering_pipeline: invoke
waveasm-translate with --waveasm-translate-from-llvm, optimization
passes, regalloc, and --emit-assembly. Step 3 assembles and links
to HSACO via clang (skipped when compile_to_mlir=True).

Also add use_buffer_ops=True to test_copy_water_waveasm (required
for the buffer ops path).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
End-to-end pass that translates LLVM dialect gpu.module to gpu.binary:
runs the full WaveASM pipeline (translation, optimization, regalloc,
waitcnt, hazard mitigation), emits AMDGCN assembly, assembles and links
to HSACO via ROCDL utilities, and embeds the binary inline in the IR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
… final step

- TranslateFromLLVM now inserts waveasm.program inside the gpu.module
  that contains the kernel, instead of at module top-level. The gpu.module
  is preserved (not erased).
- GPUModuleToBinary is now a pure final step: expects already-optimized
  waveasm.program ops inside gpu.module, emits assembly, assembles + links
  to HSACO, and replaces each gpu.module with gpu.binary.
- The full pipeline is now externally composable as individual passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…gram

Keep the original llvm.func in gpu.module so gpu.launch_func verification
passes, while placing waveasm.program alongside it with a mangled name
(e.g. test__waveasm). The kernel_name attribute preserves the original
name for assembly emission. GPUModuleToBinary erases everything when
replacing gpu.module with gpu.binary. Also wires up the 3-step
water-opt → waveasm-translate → water-opt pipeline and restricts the
waveasm e2e test to CDNA4.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Map workgroup ID intrinsics to precolored system SGPRs, mirroring the
existing gpu.block_id handler from the MLIR-level translation path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Materialize poison values as zero immediates since they represent
undefined values with no meaningful content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
When a GEP's base is another GEP result (not a direct buffer resource),
add the offsets together using v_add_u32 and propagate the SRD.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
When GEPs operate on bare pointers (!llvm.ptr) before make.buffer.rsrc,
propagate the mapper entry and accumulate the byte offset. The offset
is then added to the voffset when creating buffer GEPs on the resulting
ptr<7>.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
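The offset accumulation described above can be pictured as a small mapper update. The data structure and the `handle_bare_gep` name are hypothetical, made up for illustration; the real pass operates on MLIR values, not strings.

```python
# Hypothetical model: each bare-pointer GEP result maps to its underlying
# base pointer plus an accumulated byte offset, which is later added to
# the voffset once make.buffer.rsrc produces the SRD.
mapper = {}

def handle_bare_gep(result, base, index, elem_bytes):
    # Propagate the underlying base and accumulate the byte offset,
    # chaining through earlier GEP results recorded in the mapper.
    prev_base, prev_off = mapper.get(base, (base, 0))
    mapper[result] = (prev_base, prev_off + index * elem_bytes)
```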
Add handleAdd (v_add_u32) and extract handlePoison into a proper
function matching the handleMul/handleConstant pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Hardcode84 and others added 27 commits March 18, 2026 11:58
Generalize extract/pack ops to accept any register type (VGPR, AGPR,
SGPR) instead of vector registers only. This enables SGPR pair
splitting and merging without dedicated ops.

Rewrite ArithLegalization to handle both i32 and i64 widths:

- i64 add: carry chain (s_add_u32 + s_addc_u32 / v_add_co_u32 + v_addc_co_u32)
- i64 mul: schoolbook decomposition (mul_lo, mul_hi, cross terms)
- i64 cmp: eq/ne via XOR+OR+compare-to-zero (ordered comparisons error)
- i64 select: split, select each half, merge
- trunc i64->i32: extract lo sub-register
- sext i32->i64: {lo, ashr(lo, 31)}
- zext i32->i64: {lo, 0}

The pass now reports errors on unsupported widths (anything not i32/i64).

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
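The i64 decompositions listed in this commit message are plain 32-bit-lane arithmetic and can be checked against Python's big integers. This sketch models only the math (the helper names are invented; nothing here is WaveASM code):

```python
M32 = (1 << 32) - 1  # low-dword mask

def add_i64(a, b):
    # carry chain: add the lo dwords (s_add_u32 / v_add_co_u32), then add
    # the hi dwords plus the carry-out (s_addc_u32 / v_addc_co_u32)
    lo = (a & M32) + (b & M32)
    hi = ((a >> 32) + (b >> 32) + (lo >> 32)) & M32
    return (hi << 32) | (lo & M32)

def mul_i64(a, b):
    # schoolbook: mul_lo/mul_hi of the lo dwords plus the two cross terms;
    # the a_hi*b_hi term only affects bits >= 64 and is dropped
    alo, ahi, blo, bhi = a & M32, a >> 32, b & M32, b >> 32
    hi = ((alo * blo) >> 32) + alo * bhi + ahi * blo
    return ((hi & M32) << 32) | ((alo * blo) & M32)

def eq_i64(a, b):
    # eq: XOR each dword pair, OR the results, compare to zero
    return (((a ^ b) & M32) | ((a ^ b) >> 32)) == 0

def sext_i32_to_i64(x):
    # {lo, ashr(lo, 31)}: hi dword is all copies of the sign bit
    return ((M32 if x >> 31 else 0) << 32) | x

def zext_i32_to_i64(x):
    # {lo, 0}
    return x
```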
Two bugs in TranslateFromLLVMDialect.cpp:

1. handleSelect passed (cond, trueVal, falseVal) to Arith_SelectOp::create
   but ODS declares (falseVal, trueVal, condition) — swapped condition and
   falseVal. This caused i64 select legalization to try splitting the cmp
   placeholder (imm<1>) as if it were a 64-bit value, hitting the ExtractOp
   verifier: "operand #0 must be any register, but got '!waveasm.imm<1>'".

2. handleConstant treated all integer constants as i32, producing vreg size 1
   for i64 values. Now splits i64 constants into lo/hi v_mov_b32 + pack into
   vreg<2,2>.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Buffer load/store instructions use 32-bit voffset operands. When a GEP
index comes from an i64 value (e.g. llvm.zext), the arith legalization
widens it to vreg<2> via RAUW, which then propagates into the buffer op
operand. Insert an arith.trunc at translation time so the legalization
pass extracts the lo half before the value reaches buffer ops.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The register allocator does not insert copies for PackOp (it emits no
assembly). When splitI64 created extract ops from a pack whose inputs
were at non-contiguous physical registers, the extracted values read
stale data, causing illegal memory accesses and wrong results at
runtime.

Fix: splitI64 now checks if the value is defined by a PackOp and
returns the pack's original inputs directly, bypassing the extract/pack
round-trip entirely. Filed wave-37d for the underlying regalloc issue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Implements slt/sle/sgt/sge/ult/ule/ugt/uge for i64 operands using
hi/lo decomposition: result = hiPred(hi) || (hiEq(hi) && loPred(lo)).
Hi comparison uses the same signedness as the original predicate
(strict variant); lo always uses unsigned.

VALU path: materialize each sub-comparison via v_cmp + v_cndmask_b32,
select based on hi equality, then set VCC from the final boolean.
SALU path: compute hiCmp | (hiEq & loCmp) using s_and/s_or_b32.

Also refactors v_cmp/s_cmp emission into emitVCmp/emitSCmp helpers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
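The decomposition result = hiPred(hi) || (hiEq(hi) && loPred(lo)) can be modeled directly in Python, two predicates shown as examples. This is a sketch of the arithmetic only, not the emitted v_cmp/v_cndmask_b32 or s_cmp sequences:

```python
M32 = (1 << 32) - 1

def s32(x):
    # reinterpret a dword as signed, for the signed hi comparison
    return x - (1 << 32) if x & (1 << 31) else x

def ult_i64(a, b):
    ahi, bhi, alo, blo = a >> 32, b >> 32, a & M32, b & M32
    # strict unsigned hi compare, or equal hi and unsigned lo compare
    return ahi < bhi or (ahi == bhi and alo < blo)

def slt_i64(a, b):
    ahi, bhi, alo, blo = a >> 32, b >> 32, a & M32, b & M32
    # hi compare keeps the original signedness; lo is always unsigned
    return s32(ahi) < s32(bhi) or (ahi == bhi and alo < blo)
```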
- Carry ops (v_add_co_u32, v_addc_co_u32, etc.): emit implicit VCC as
  explicit operand. v_addc_co_u32/v_subb_co_u32 also append VCC as
  carry-in source.
- v_cndmask_b32: always emit explicit VCC for VOP3 compatibility
  (required when literal operands force VOP3 encoding).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
i64 compares now materialize their boolean result to a VGPR instead of
leaving VCC set, since intervening i64 add operations (v_add_co_u32)
clobber VCC.  The select legalization re-establishes VCC from the
materialized boolean right before v_cndmask_b32 via ensureVCC().

Add lit test exercising the exact clobber scenario (i64 cmp → i64 add → i64 select).

Made-with: Cursor
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…implify pass

- Rename Arith_XxxOp to ArithXxxOp (TableGen + all C++ references) for
  consistent naming with rest of codebase.
- Collapse getRegWidth .Case<SRegType, VRegType, PSRegType, PVRegType>
  into single multi-type Case with auto lambda.
- Replace getResult(0) with getDst() named accessor on carry ops.
- Remove dead code in legalizeCmpI64 VGPR path (unreachable after both
  branches return early).
- Replace pre-collect + accumulate-failure pattern with post-order walk
  + WalkResult::interrupt() (post-order walk is safe for erasure).
- Fix legalization.md terminology: "generic ops" -> "pseudo-ops",
  remove "Proposed" from existing ops, update table to match
  implementation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…galization

Error tests (6 new files):
- arith-legalization-error-{mul,cmp,select,trunc}.mlir: unsupported
  operand width (4 dwords) produces diagnostic.
- arith-legalization-error-{sext,zext}.mlir: non-i32 source produces
  diagnostic.

Operand wiring (arith-legalization.mlir):
- Use FileCheck capture variables ([[A:%.*]], [[B:%.*]], etc.) to
  verify that correct SSA values flow into operations for i64 SALU/VALU
  add carry chain, mul schoolbook, cmp eq/slt, and select.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Change lookupGEP to return std::optional<BufferPtrInfo> instead of
raw pointer, and simplify lookupBaseOffset to a one-liner using
DenseMap::lookup. Also collapse a redundant std::optional<Value>
declaration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
resolve() now returns FailureOr<Value> instead of silently returning
the original LLVM value (which gets erased, causing use-after-free).
All call sites propagate the failure with descriptive diagnostics.

Also extract handleBinaryOp and handleCastOp templates to eliminate
duplication across add/mul and sext/zext/trunc handlers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Emit an error when both gpu.known_block_size and
rocdl.reqd_work_group_size are present but disagree, instead of
silently using whichever comes first.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Move the empty-assembly check into the per-program walk so that a
program producing no output is caught immediately with a diagnostic
on the specific program, rather than being masked by other programs
that did produce output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Poison previously always materialized as a single b32 zero regardless
of the LLVM type width. Now i64 poison produces a packed pair of zero
dwords, matching handleConstant behavior. Unsupported widths emit a
diagnostic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The translation pass does not read data_layout, so the attribute is
just noise in the test inputs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
- Assert ProgramOp body is non-empty instead of defensive check.
- Fix test comment: "arith.add" -> "waveasm.arith.add".
- Move local import of get_waveasm_translate to module level.
- Fix print format to avoid space before ellipsis.
- Replace unicode arrows with ASCII in water.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Verifies operand wiring: icmp condition, true/false values, and
constant materialization all flow through correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The dedicated .Case<V_ADD_CO_U32, V_SUB_CO_U32, V_ADDC_CO_U32,
V_SUBB_CO_U32> handler already matches all _co_ ops, so the
mnemonic.contains("_co_") block in Default was unreachable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Add ArithOrOp and ArithAndOp pseudo-ops, translate from llvm.or/llvm.and,
and legalize to s_or_b32/b64, v_or_b32/b64, s_and_b32/b64, v_and_b32/b64.
Unlike add, bitwise ops use native 64-bit instructions instead of
lo/hi splitting.

Fixes: llvm.or unhandled op error in dynamic copy kernel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Blocked on regalloc supporting contiguous allocation constraints
for PackOp inputs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
SGPR conditions (e.g. from s_cmp_*) were passed to v_cndmask_b32
without setting VCC. Now moved to VGPR via v_mov_b32 first, then
v_cmp_ne_i32 sets VCC as expected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Avoids a crash if the frontend ever produces multi-index GEPs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Duplicate move elimination is handled by the upstream CSE pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
- Use consistent terminology: pseudo-ops throughout, machine ops not
  machine instructions.
- Mark pseudo-op table as implemented, not proposed.
- Replace move caching with CSE reference.
- Update open questions to resolved status.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
LLVM enables them selectively via amdgpu-no-workgroup-id-{y,z}
attributes, not unconditionally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The LLVM translator relies on the gfx950 preload ABI for scalar arg
mapping. Bail out early with a diagnostic for unsupported targets
instead of silently generating wrong code. Default target changed
from gfx942 to gfx950.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
@Hardcode84 Hardcode84 force-pushed the llvm-asm-backend-dyn-copy branch from 1abbf02 to 5f338f0 on March 18, 2026 11:05
@Hardcode84 Hardcode84 merged commit 77ffd04 into iree-org:main Mar 18, 2026
19 checks passed
@Hardcode84 Hardcode84 deleted the llvm-asm-backend-dyn-copy branch March 18, 2026 12:13