### Cast-Only (optimized 1D kernel)

The baseline uses a generic `VectorizedUnaryKernel` with per-element scalar processing and no FP8 hardware intrinsics.

| Metric | FP32→FP8 | BF16→FP8 |
| --- | --- | --- |
| Peak optimized BW | 6.26 TiB/s (86.0%) | 6.08 TiB/s (83.5%) |
| Geometric mean speedup (LLM shapes) | ~3.0x | ~2.7x |
| Geometric mean speedup (GPT-OSS) | ~3.1x | ~2.6x |
### Cast+Transpose (optimized tiled kernel)

The baseline uses the NVIDIA RTC kernel compiled via hipRTC, which selects tile sizes dynamically but lacks AMD-specific optimizations (no FP8 intrinsics, no NT stores, no occupancy-aware LOAD cap).

| Metric | FP32→FP8 | BF16→FP8 |
| --- | --- | --- |
| Peak optimized BW | 5.84 TiB/s (80.3%) | 5.87 TiB/s (80.6%) |
| Geometric mean speedup (LLM shapes) | ~1.4x | ~2.2x |
| Geometric mean speedup (GPT-OSS) | ~1.5x | ~2.1x |
## Optimizations Applied (Kept)

### Cast+Transpose Kernel (`rocm_cast_transpose.cuh`)

- **OVecT packed FP8 shared memory** — smem stores `CVec<fp8,8>` instead of `float`, a 4x smaller footprint, and avoids bank conflicts with +1 padding
- **Register transpose during load** — transposed FP8 is accumulated into `local_t[j2][iter].val[i2]` during the load phase, avoiding a separate transpose pass
- **Non-temporal stores** for `output_c` (rowwise) and `output_t` (transposed) — `__builtin_nontemporal_store` confirmed as `global_store_* ... nt` in assembly
- **gfx950 FP8 packed intrinsics** — `__builtin_amdgcn_cvt_scalef32_pk_fp8_f32` with `scale=1.0` and a pre-multiply (the intrinsic's scale parameter is E8M0 format, not an arbitrary float)
- **word_select packing** — two intrinsic calls with `word_select=false/true` pack 4 FP8 values into one `uint32`
- **Two-launch row strategy** — `STORE=8` for the bulk, then a best-fit single launch for the remainder (`STORE=4` if `rem%128==0`, `STORE=2` if `rem%64==0`, the general kernel otherwise). At most 2 launches for any M value
- **Column cascade** — `LOAD_SZ` checks for single-launch alignment, then cascades to smaller LOAD sizes for the remainder columns
- **CVec standalone vector type** — an aligned vector struct with `load()`, `nt_load()`, `store()`, and `nt_store()` methods; no dependency on TE's `Vec` infrastructure
- **BF16/FP16 LOAD capped at 8** — `LOAD=16` for BF16 uses 211 VGPRs (2 waves/SIMD); capping at `LOAD=8` uses 125 VGPRs (4 waves/SIMD), doubling occupancy
### Cast-Only Kernel (`rocm_cast.cuh`)

- **Dedicated 1D grid-stride kernel** — flat 1D indexing over M*N elements; no tiling, no cascade, a single kernel launch for any shape
- **Direct FP8 packing into OVec** — intrinsic results are written directly into the output `CVec` via `reinterpret_cast<uint32_t*>`; no intermediate `converted[]` array, which preserves the NT store hint through the compiler
- **Non-temporal stores** — `CVec::nt_store()` confirmed as `global_store_dwordx4 ... nt` in assembly (eliminating the intermediate array was required to keep the compiler from dropping the NT hint)
- **gfx950 FP8 packed intrinsics** — same as cast+transpose: 4 intrinsic pairs per 16 elements
- **Dynamic grid sizing** — `cu_count` blocks for FP32 and small BF16 tensors; `cu_count*2` for BF16 tensors >128M elements (crossover point determined empirically)
- **Scalar tail** for non-aligned element counts (rarely exercised — model dimensions are multiples of 16)
## Optimizations Tried and Rejected

### Cast+Transpose

- **Thin-M kernel** (1 thread/column, row-splitting for M<256) — only 12 blocks for N=2880 across 256 CUs (4.7% utilization); the tiled cascade was 2-3x faster for M>=64
- **Hardware transpose via `ds_read_tr8_b64`** (gfx950 v2 kernel) — identical performance to v1; the `output_t` scattered write pattern is the bottleneck, not the transpose method
- **WARP_SIZE=64** — creates `TILE_M=512` and 33 KB of smem, exceeding the practical LDS budget per workgroup
- **Multi-tile K=2** (2 column tiles per block) — helps FP32 but catastrophically hurts BF16 on small shapes by halving the block count
- **Full row cascade** (STORE 8→4→2→1→scalar) — produces up to 5 kernel launches for non-aligned M (e.g., M=496); replaced with the two-launch strategy: STORE=8 plus a best-fit remainder (STORE=4/2/general)
### Cast-Only

- **DO_TRANSPOSE template on the tiled kernel** — reused the cast+transpose kernel with transpose disabled; the tiled structure imposed unnecessary alignment constraints and cascade overhead. The dedicated 1D kernel was 10-20% faster on small shapes and eliminated all cascade issues
- **Unconditional non-temporal loads** — severe regressions across all shapes (up to -35% on 16384x4096); the L2 cache provides value for coalescing/prefetching even with read-once data on MI355X. NT loads deprioritize LRU eviction but also appear to disable hardware prefetch
- **Conditional NT loads** (runtime branch on tensor size >512MB) — LLVM merged both branch paths during optimization and dropped the `!nontemporal` metadata; only unconditional NT or template-parameterized NT can emit the `nt` flag
- **Grid size 1024 blocks** — uniformly worse than 512 or 256; more blocks means more scheduling overhead and more `atomicMax` contention on `amax`
- **Grid size 128 blocks** — large shapes collapse (1004 us vs 668 us for FP32 131072x5760); insufficient parallelism to saturate HBM
## Known Limitations

- **Cast+Transpose `output_t` scatter** — the transposed output writes to `output_t[col * num_rows + row]`, where adjacent threads write to cache lines spaced `num_rows` bytes apart; this scattered pattern caps bandwidth at ~80% of peak regardless of kernel optimizations
- **BF16 2880-col cascade** — 2880 is not divisible by `TILE_N=128` for BF16 (LOAD=8), requiring a 2-launch column cascade (2816 + 64 cols)
- **Cast+Transpose MoE regression for non-aligned M** — shapes like 320x2880 and 496x2880 trigger the general-kernel fallback, which is slower than the baseline RTC kernel's single-launch approach; production MoE workloads use `multi_cast_transpose` (batched), which amortizes this
- **ECC overhead** — HBM3E uses on-die ECC, consuming ~6.25% of raw bandwidth for parity metadata, so the theoretical ceiling for any streaming kernel is ~93.75% of advertised peak
- **Scale multiply overhead** — the FP8 packed intrinsic's scale parameter uses E8M0 (power-of-2 exponent) format, not arbitrary float; we must pre-multiply by scale and pass 1.0 to the intrinsic, adding 16-128 extra `v_mul_f32` instructions per tile
## Improvements to cast_transpose and cast for FP8 delayed scaling

Introduced ROCm-specific cast and cast+transpose functions tuned for MI350- and MI300-series GPUs.

For memory-bound kernels:

- Cast Only: 2.85x average speedup
- Cast+Transpose: 2.0x average speedup

This PR contains benchmarking scripts, so it was branched off of #507.