Environment
GPU: AMD Radeon AI PRO R9700 (32GB GDDR6, gfx1201, RDNA4)
ROCm: 7.2.1
OS: Ubuntu 24.04 LTS
Frameworks affected: vLLM, SGLang, ROCm TransformerEngine
Problem
AMD's official product guide advertises "128 AI accelerators with FP8
support." Yet on gfx1201, ROCm 7.2.1 dequantizes all FP8 weights to FP32
with no warning, so the AI accelerators do zero FP8 work. Throughput is
~18-22 tok/s instead of the expected ~35-40 tok/s.
Root Cause
gfx1201 is missing from the _ARCH_TO_DEVICE mapping in
aiter/ops/triton/utils/arch_info.py, causing a silent FP32 fallback.
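To illustrate the failure mode, here is a minimal sketch of how a dict-based
arch dispatch produces a silent fallback. The mapping contents and function
names below are hypothetical reconstructions, not the verbatim AITER source;
only the `_ARCH_TO_DEVICE` name and the missing `gfx1201` key come from the
report.

```python
# Hypothetical reconstruction of the dispatch in
# aiter/ops/triton/utils/arch_info.py (entries illustrative).
_ARCH_TO_DEVICE = {
    "gfx942": "MI300X",
    "gfx950": "MI350X",
    # "gfx1201" (RDNA4) is absent, so the lookup below never matches it.
}

def get_device(arch: str) -> "str | None":
    # dict.get() returns None for unknown arches instead of raising,
    # which is exactly what makes the FP32 fallback silent: no error,
    # no warning, just a missed FP8 kernel path.
    return _ARCH_TO_DEVICE.get(arch)

def fp8_supported(arch: str) -> bool:
    return get_device(arch) is not None

print(fp8_supported("gfx942"))   # True
print(fp8_supported("gfx1201"))  # False -> weights dequantized to FP32
```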
Fix (community validated)
'gfx1201': 'MI350X'
RDNA4 implements the same FP8 E4M3FN format as MI350X, so the existing
Triton kernel path works correctly. The change is non-breaking for
existing CDNA deployments.
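The E4M3FN claim is checkable from the format definition alone: 1 sign bit,
4 exponent bits (bias 7), 3 mantissa bits, and no infinities ("FN" = finite
only), giving a maximum finite value of 448. A self-contained sketch of that
arithmetic, independent of any GPU or library:

```python
def e4m3fn_max() -> float:
    """Largest finite value representable in FP8 E4M3FN."""
    bias = 7
    # Exponent field 0b1111 with mantissa 0b111 encodes NaN in E4M3FN,
    # so the largest finite value is exponent 0b1111 (unbiased 8) with
    # mantissa 0b110: 2**8 * (1 + 6/8).
    return 2.0 ** (0b1111 - bias) * (1 + 6 / 8)

print(e4m3fn_max())  # 448.0
```

Since both RDNA4 and MI350X target this same numeric format, routing gfx1201
to the MI350X kernel configurations changes no arithmetic, only dispatch.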
Request: Official ETA for merging this two-line fix into AITER mainline.