Sync with Microsoft ONNX Runtime - 21032026#983

Open
Jaswanth51 wants to merge 7 commits into ovep-develop from sync_msft_21032026
Conversation

@Jaswanth51

Daily backmerge from ORT main to ovep-develop. Do NOT squash or rebase - use merge commit only.

patryk-kaiser-ARM and others added 7 commits March 17, 2026 10:13
…icrosoft#26773)

**Description**
This PR integrates the Arm® KleidiAI™ SME2 BF16 kernel through the MLAS SBGEMM
path.

Rework of microsoft#24346

**Motivation and Context**
This kernel provides performance improvements on SME-enabled devices.

---------

Signed-off-by: Patryk Kaiser <patryk.kaiser@arm.com>
Upgrading the dependency to resolve CVE-2026-27904, which is raising
component governance issues in internal Microsoft builds of ORT.

Co-authored-by: Kevin Taha <kevintaha@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icrosoft#27694)

### Description

Fix the conditions guarding the following definitions:
- DAWN_ENABLE_VULKAN
- DAWN_ENABLE_D3D12
…#27688)

This deletes three per-head-size .cu files and merges their content into a
single file to avoid a cross-file dependency during CUDA compilation.

Currently, the masked_multihead_attention_kernel template is implemented in
decoder_masked_multihead_attention_impl.cu. The other three .cu files
use the masked_multihead_attention_kernel template but do not include the
implementation. That causes problems when they are built in the CUDA plugin
EP.
microsoft#27671)

## Description

This PR fixes longstanding MLAS issues that were causing
`NhwcTransformerTests.*` and `QDQTransformerTests.*` failures in
quantized convolution paths (see
microsoft#27670). The failures
were not in the graph transformers themselves; they came from incorrect
qgemm dispatch selection and broken backend kernel behavior in specific
AVX2-VNNI and AMX paths.

The fix removes incorrect `U8U8` dispatch upgrades, avoids a broken
AVX2-VNNI row-panel fallback, and corrects the AMX `U8S8` 32-row kernel
path. It also adds MLAS regression coverage for the conv-shaped qgemm
dimensions that exposed the problems.

## Summary of Changes

### Dispatch Selection Fixes

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/platform.cpp` | Remove three incorrect assignments that upgraded `GemmU8U8Dispatch` to `U8S8` dispatch objects in the AVXVNNI, AVX512VNNI, and AMX feature paths. |
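To illustrate why the removed dispatch upgrades were unsafe, here is a minimal scalar sketch (invented for illustration, not MLAS code): a `U8S8` kernel reads the B operand as signed int8, so unsigned B values of 128 or more flip sign and corrupt the accumulator when `U8U8` data is routed through it.

```cpp
#include <cassert>
#include <cstdint>

// Scalar dot products modeling the two sign conventions. These are
// illustrative stand-ins, not the MLAS kernels themselves.
int32_t dot_u8u8(const uint8_t* a, const uint8_t* b, int k) {
    int32_t acc = 0;
    for (int i = 0; i < k; ++i)
        acc += int32_t(a[i]) * int32_t(b[i]);          // B treated as unsigned
    return acc;
}

int32_t dot_u8s8(const uint8_t* a, const uint8_t* b, int k) {
    int32_t acc = 0;
    for (int i = 0; i < k; ++i)
        acc += int32_t(a[i]) * int32_t(int8_t(b[i]));  // B reinterpreted as signed
    return acc;
}
```

Feeding the same unsigned B data through both shows the divergence: a B value of 200 wraps to -56 under the signed interpretation, so the two results disagree.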

### AVX2-VNNI Kernel Fix

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/qgemm_kernel_avx2.cpp` | Reduce `StrideM` from `6` to `4` for the `U8U8`, `S8S8`, and `S8U8` AVX2-VNNI qgemm dispatch objects so they never enter the legacy `>4` row fallback path. |
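A small sketch of the row-panel arithmetic (a hypothetical helper, not the MLAS driver) shows why reducing `StrideM` avoids the fallback: with `StrideM = 6` the kernel can receive 5- or 6-row panels, which hit the legacy path, while `StrideM = 4` caps every panel at 4 rows.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Decompose M rows into panels of at most stride_m rows, mimicking how
// a GEMM driver hands row panels to the kernel. Illustrative only.
std::vector<int> row_panels(int m, int stride_m) {
    std::vector<int> panels;
    for (int row = 0; row < m; row += stride_m)
        panels.push_back(std::min(stride_m, m - row));
    return panels;
}
```

For `M = 6`, a stride of 6 yields a single 6-row panel (the known-bad case), while a stride of 4 yields a 4-row and a 2-row panel, both within the range the VNNI packing handles safely.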

### AMX Kernel Fix

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/qgemm_kernel_amx.cpp` | Replace the broken pipelined `CountM >= 32` `U8S8` AMX fast path with the same per-K tile update pattern already used by the working smaller-row path. |
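The per-K tile update pattern can be sketched in scalar form (a reference model under assumed row-major layouts, not the actual AMX intrinsics): for each K block, the A and B tiles are multiplied and the product is added into the C accumulator before advancing to the next block.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Reference scalar model of the per-K tile update. Row-major A (MxK),
// B (KxN), C (MxN) are assumed for illustration.
void qgemm_tile_ref(const uint8_t* A, const int8_t* B, int32_t* C,
                    int M, int N, int K, int KBlock) {
    for (int i = 0; i < M * N; ++i) C[i] = 0;
    for (int k0 = 0; k0 < K; k0 += KBlock) {            // per-K tile loop
        const int kend = std::min(k0 + KBlock, K);
        for (int m = 0; m < M; ++m)
            for (int n = 0; n < N; ++n) {
                int32_t acc = 0;
                for (int k = k0; k < kend; ++k)
                    acc += int32_t(A[m * K + k]) * int32_t(B[k * N + n]);
                C[m * N + n] += acc;                    // update C once per K block
            }
    }
}
```

Because C is updated after every K block, the result is independent of the blocking factor, which is the invariant the broken pipelined 32-row path violated.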

### Regression Coverage

| File | Change |
|------|--------|
| `onnxruntime/test/mlas/unittest/test_qgemm_fixture.h` | Add MLAS qgemm regression cases for conv-like shapes `6x30x207` and `169x30x207` in packed/non-packed and int32/fp32 output variants. |

## Root Cause

There were three separate MLAS correctness issues:

1. `platform.cpp` was incorrectly overwriting `GemmU8U8Dispatch` with
`U8S8` dispatch objects when newer CPU features were detected. That
caused `U8U8` conv workloads to run through the wrong dispatch path.
2. The AVX2-VNNI qgemm dispatch objects advertised an `M` stride of `6`,
but the assembly kernel only handled VNNI packing safely up to 4 rows.
For 5- or 6-row panels it fell back to an older AVX2 path with
incompatible packing and sign assumptions.
3. The AMX `U8S8` qgemm kernel had a bug in its `CountM >= 32` fast
path. The smaller-row AMX path was correct, but the 32-row pipelined
update logic produced wrong accumulators for conv-shaped workloads and
caused the remaining QDQ/NHWC failures on AMX-capable hosts.

## Why This Fix

- The `platform.cpp` cleanup restores the intended `U8U8` dispatch
selection on feature-rich x86 hosts.
- The AVX2-VNNI stride change is a targeted mitigation that avoids the
known-bad legacy fallback until that assembly path is corrected.
- The AMX kernel change keeps the AMX `U8S8` dispatch enabled, but
replaces the broken 32-row implementation with a proven update pattern
that matches the working smaller-row path.
- The new MLAS regression tests cover the exact conv-derived qgemm
shapes that exposed the bug, so future dispatch or kernel changes will
fail at the MLAS layer before surfacing as transformer test regressions.

## Testing

- `cd build/cuda/Release && ./onnxruntime_mlas_test --gtest_filter='QGemmU8S8_*169xN30xK207*:*QGemmU8S8_*6xN30xK207*'`
- `cd build/cuda/Release && ./onnxruntime_test_all --gtest_filter='NhwcTransformerTests.*:QDQTransformerTests.*'`
- Verified that the filtered transformer suite passes with AMX `U8S8`
dispatch enabled.

## Motivation and Context

These test failures had been present for a long time and were initially
attributed to transformer rewrites because they surfaced in NHWC and QDQ
test suites. Investigation showed that the optimized graphs were
structurally correct and that the failures came from lower-level MLAS
qgemm execution instead. Fixing the behavior in MLAS is the right layer
because it restores correctness for both direct qgemm coverage and
higher-level quantized conv paths.

## Checklist

- [x] Tests added/updated
- [x] No breaking changes
- [x] CI passes

## Description

This PR fixes clang-specific build failures that show up in both the
standalone clang build and the CUDA clang build. It keeps the
build-system changes targeted, prefers source fixes where the warnings
indicate real type or declaration issues, and avoids broader warning
suppression than necessary for the CUDA provider target.

## Summary of Changes

### Build System

| File | Change |
|------|--------|
| `cmake/CMakeLists.txt` | Stop forwarding `-Wshorten-64-to-32` through CUDA host compilation where the GNU host compiler does not recognize it. |
| `cmake/onnxruntime_providers_cuda.cmake` | Add targeted clang `-Wno-error` handling for warning classes that are currently triggered by CUDA provider code and third-party CUDA headers under clang. |

### CPU / Common clang fixes

| File | Change |
|------|--------|
| `onnxruntime/core/common/cpuid_info.cc` | Replace the clang-incompatible `__builtin_cpu_supports("waitpkg")` path with a CPUID-bit check for TPAUSE detection. |
| `onnxruntime/test/framework/allocation_planner_test.cc` | Refactor `typeid` assertions to avoid clang's potentially-evaluated-expression warning while keeping test coverage unchanged. |
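The `typeid` refactor can be sketched like this (names invented for illustration; the real test code differs): clang's `-Wpotentially-evaluated-expression` fires when the `typeid` operand contains a side-effecting expression such as a function call, and binding the result to a named variable first avoids the warning while keeping the dynamic-type check.

```cpp
#include <cassert>
#include <typeinfo>

struct Base { virtual ~Base() = default; };
struct Derived final : Base {};

// Side-effecting accessor, standing in for whatever call the test made
// inside the typeid operand.
inline Base& make_derived() {
    static Derived instance;
    return instance;
}

// Before (clang warns):  typeid(make_derived()) == typeid(Derived)
// After: name the result, then take typeid of the lvalue.
inline bool is_derived() {
    Base& obj = make_derived();
    return typeid(obj) == typeid(Derived);
}
```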

### CUDA provider and contrib fixes

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/utils/dump_cuda_tensor.h` | Mark the
`IConsoleDumper` overrides explicitly while leaving CUDA-only overloads
unchanged. |
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc` | Use
`template` on the dependent `GetAttrOrDefault` call so clang parses it
correctly. |
| `onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_api.cc` |
Make narrowing conversions to flash-attention parameter fields explicit.
|
| `onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc` | Make the
`nbits_` conversion explicit when calling the CUDA helper. |
| `onnxruntime/contrib_ops/cuda/quantization/moe_quantization.cc` |
Restrict the GCC-only warning pragma so clang does not treat it as an
unknown warning option. |
|
`onnxruntime/contrib_ops/cuda/transformers/generation_device_helper.cc`
| Fix explicit state-field assignments to use the actual `int` field
type. |
| `onnxruntime/core/providers/cuda/cuda_mempool_arena.h` | Remove an
unused private field that clang flagged in the CUDA provider build. |
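The dependent-name fix can be sketched as follows (`NodeInfo` and its member are stand-ins, not the real ORT API): inside a template, a member template called through a dependent object needs the `template` keyword, or clang parses the `<` as a less-than operator.

```cpp
#include <cassert>
#include <string>

// Stand-in for an attribute-bearing node-info type.
struct NodeInfo {
    template <typename T>
    T GetAttrOrDefault(const std::string& /*name*/, T def) const { return def; }
};

template <typename Info>
int read_window(const Info& info) {
    // Without `template`, clang rejects: info.GetAttrOrDefault<int>(...)
    return info.template GetAttrOrDefault<int>("local_window_size", -1);
}
```

GCC historically accepted the unqualified form in some contexts, which is why the missing disambiguator only surfaced under clang.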

## Testing

Tested CPU and CUDA 12.8 builds in Azure Linux with
- clang 18.1.8
- gcc 13.2
- cmake 4.2.3

Example for CPU build:
```
export CC=clang
export CXX=clang++
bash build.sh --config RelWithDebInfo --parallel --cmake_extra_defines onnxruntime_BUILD_UNIT_TESTS=ON
```
## Motivation and Context

Clang is stricter than GCC/MSVC in a few areas that affect this tree:
CUDA host flag forwarding, explicit narrowing, dependent template
parsing, warnings emitted from third-party CUDA headers, and RTTI/typeid
expressions in tests. The goal here is to keep the staged fix minimal
and maintainable by correcting real source issues where practical and
confining warning downgrades to the CUDA provider target where
third-party header noise is currently unavoidable.
Jaswanth51 requested a review from ankitm3k March 20, 2026 21:00