Make cuTile intrinsics 1-based by maleadt · Pull Request #155 · JuliaGPU/cuTile.jl

maleadt · 2026-03-30T11:53:39Z

This is an experiment.

Summary: Move 1-based to 0-based index lowering from the DSL (operations.jl) into a compiler rewrite pass (index_lower_pass!). The load/store/gather operations now pass through Julia's natural 1-based indices, and the pass inserts subi(elem, 1) for each index element in load_partition_view and store_partition_view calls during compilation.
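To make the shape of that rewrite concrete, here is a minimal toy sketch in Python. It is not the code in this PR: the `Op` record, the `indices` field, and the `%idx0_N` temporaries are hypothetical, and only the op names `load_partition_view`, `store_partition_view`, and `subi` come from the description above.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    result: str                 # SSA name produced by the op ("" if none)
    name: str                   # opcode
    operands: list              # non-index operands (hypothetical layout)
    indices: list = field(default_factory=list)  # 1-based index operands

# Ops whose indices must be 0-based by the time they reach the backend.
VIEW_OPS = {"load_partition_view", "store_partition_view"}

def index_lower_pass(ops):
    """Return a new op list with subi(elem, 1) inserted per index element."""
    out = []
    counter = 0
    for op in ops:
        if op.name in VIEW_OPS:
            lowered = []
            for idx in op.indices:
                tmp = f"%idx0_{counter}"      # hypothetical fresh SSA name
                counter += 1
                out.append(Op(tmp, "subi", [idx, 1]))
                lowered.append(tmp)
            op = Op(op.result, op.name, op.operands, lowered)
        out.append(op)
    return out

kernel = [
    Op("%1", "addi", ["%blockId_x", 1]),                    # bid(1)
    Op("%a", "load_partition_view", ["%pview"], ["%1"]),
    Op("", "store_partition_view", ["%pview_out", "%a"], ["%1"]),
]
lowered = index_lower_pass(kernel)
for op in lowered:
    print(op.result, op.name, op.operands, op.indices)
```

In this toy model, the DSL-level ops carry 1-based indices untouched, and the conversion shows up only as `subi` ops placed immediately before each view access.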

The motivation here is to simplify the IR we emit (the layernorm example currently emits 4x as many SASS instructions as cuTile Python does). A large part of that IR cruft comes from the repetitive +1/-1 we do when converting between 0-based and 1-based indices. For example, bid(1) returns a 1-based index (via addi(blockId_x, 1)), and each load/store call then emits its own subi(..., 1) to undo it, resulting in 3 redundant constant+subi pairs:

  %1 = addi %blockId_x, %cst_1_i32          // bid(1) = blockId_x + 1
  %2 = subi %1, %cst_1_i32_7                // load a: undo +1
  load_view_tko ... %pview[%2]
  %3 = subi %1, %cst_1_i32_9                // load b: undo +1 again
  load_view_tko ... %pview_8[%3]
  %5 = subi %1, %cst_1_i32_12               // store c: undo +1 again
  store_view_tko ... %pview_13[%5]

On this branch, the indices passed around are kept 1-based, and a pass converts them late at the load/store boundary, resulting in significantly simpler IR:

  %1 = addi %blockId_x, %cst_1_i32          // bid(1) = blockId_x + 1
  load_view_tko ... %pview[%1]              // pass lowered index, addi+subi cancelled
  load_view_tko ... %pview_7[%1]
  store_view_tko ... %pview_10[%1]

Although this is nice, it doesn't improve performance, and it makes it harder to compare our IR to cuTile Python's. So I'm not sure we want this.

maleadt · 2026-04-06T09:40:37Z

Algebraic simplifications have turned out to be powerful enough that we don't need this.
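As a rough illustration of the kind of rewrite meant here, the following toy Python sketch folds `subi(addi(x, c), c)` back to `x` and then drops the now-unused `addi`. It is an assumption-laden model, not the implementation from #156: the tuple encoding, the `simplify` function, and the shortened `load`/`store` op names are all hypothetical.

```python
def simplify(ops):
    """Fold subi(addi(x, c), c) -> x, then remove addi ops left without uses.

    Each op is a tuple (result, opcode, *args); integer args are constants.
    """
    addi_defs = {}   # result name -> (lhs, constant) for each addi seen
    alias = {}       # folded subi result -> the value it is equal to
    out = []
    for result, opcode, *args in ops:
        # Rewrite operands that refer to already-folded results.
        args = [alias.get(a, a) for a in args]
        if opcode == "subi" and args[0] in addi_defs:
            lhs, const = addi_defs[args[0]]
            if const == args[1]:
                alias[result] = lhs      # (x + c) - c == x
                continue                 # drop the subi itself
        if opcode == "addi" and isinstance(args[1], int):
            addi_defs[result] = (args[0], args[1])
        out.append((result, opcode, *args))
    # Simple dead-code elimination restricted to addi ops.
    used = {a for _, _, *args in out for a in args if isinstance(a, str)}
    return [op for op in out if op[1] != "addi" or op[0] in used]

before = [
    ("%1", "addi", "%blockId_x", 1),   # bid(1) = blockId_x + 1
    ("%2", "subi", "%1", 1),           # load a: undo +1
    ("%a", "load", "%pview", "%2"),
    ("%3", "subi", "%1", 1),           # load b: undo +1 again
    ("%b", "load", "%pview_8", "%3"),
    ("%5", "subi", "%1", 1),           # store c: undo +1 again
    ("", "store", "%pview_13", "%5"),
]
after = simplify(before)
for op in after:
    print(op)
```

In this toy model the three `subi` ops and the `addi` all disappear, and every access indexes directly with `%blockId_x`, which is why a general simplifier can make a dedicated index-lowering pass unnecessary.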

maleadt mentioned this pull request Mar 30, 2026

Add algebraic simplifications to get rid of 1-based index IR bloat #156

Merged

maleadt closed this Apr 6, 2026

maleadt wants to merge 1 commit into main from tb/normalize
