Sync with Microsoft ONNX Runtime - 21032026#983

Open
Jaswanth51 wants to merge 7 commits into ovep-develop from sync_msft_21032026
Conversation

@Jaswanth51

Daily backmerge from ORT main to ovep-develop. Do NOT squash or rebase - use merge commit only.

patryk-kaiser-ARM and others added 7 commits March 17, 2026 10:13
…icrosoft#26773)

**Description**
This PR integrates the Arm® KleidiAI™ SME2 BF16 kernel through the MLAS SBGEMM
path.

Rework of microsoft#24346

**Motivation and Context**
This kernel provides performance improvements on SME-enabled devices.

---------

Signed-off-by: Patryk Kaiser <patryk.kaiser@arm.com>
Upgrading the dependency to resolve CVE-2026-27904, which is raising
component governance issues in internal Microsoft builds of ORT.

Co-authored-by: Kevin Taha <kevintaha@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icrosoft#27694)

### Description

Fix the conditions guarding the following definitions:
- DAWN_ENABLE_VULKAN
- DAWN_ENABLE_D3D12
…#27688)

This deletes three per-head-size .cu files and merges their content into a
single file to avoid a cross-file dependency during CUDA compilation.

Currently, the masked_multihead_attention_kernel template is implemented in
decoder_masked_multihead_attention_impl.cu. The other three .cu files
use the masked_multihead_attention_kernel template but do not include the
implementation. That causes problems when they are built in the CUDA plugin
EP.
microsoft#27671)

## Description

This PR fixes longstanding MLAS issues that were causing
`NhwcTransformerTests.*` and `QDQTransformerTests.*` failures in
quantized convolution paths (see
microsoft#27670). The failures
were not in the graph transformers themselves; they came from incorrect
qgemm dispatch selection and broken backend kernel behavior in specific
AVX2-VNNI and AMX paths.

The fix removes incorrect `U8U8` dispatch upgrades, avoids a broken
AVX2-VNNI row-panel fallback, and corrects the AMX `U8S8` 32-row kernel
path. It also adds MLAS regression coverage for the conv-shaped qgemm
dimensions that exposed the problems.

## Summary of Changes

### Dispatch Selection Fixes

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/platform.cpp` | Remove three incorrect assignments that upgraded `GemmU8U8Dispatch` to `U8S8` dispatch objects in the AVXVNNI, AVX512VNNI, and AMX feature paths. |
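To illustrate why the removed dispatch upgrades were unsafe, here is a minimal scalar sketch (invented for illustration, not MLAS code): a `U8S8` kernel reads the B operand as signed int8, so unsigned B values of 128 or more flip sign and corrupt the accumulator when `U8U8` data is routed through it.

```cpp
#include <cassert>
#include <cstdint>

// Scalar dot products modeling the two sign conventions. These are
// illustrative stand-ins, not the MLAS kernels themselves.
int32_t dot_u8u8(const uint8_t* a, const uint8_t* b, int k) {
    int32_t acc = 0;
    for (int i = 0; i < k; ++i)
        acc += int32_t(a[i]) * int32_t(b[i]);          // B treated as unsigned
    return acc;
}

int32_t dot_u8s8(const uint8_t* a, const uint8_t* b, int k) {
    int32_t acc = 0;
    for (int i = 0; i < k; ++i)
        acc += int32_t(a[i]) * int32_t(int8_t(b[i]));  // B reinterpreted as signed
    return acc;
}
```

Feeding the same unsigned B data through both shows the divergence: a B value of 200 wraps to -56 under the signed interpretation, so the two results disagree.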

### AVX2-VNNI Kernel Fix

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/qgemm_kernel_avx2.cpp` | Reduce `StrideM` from `6` to `4` for the `U8U8`, `S8S8`, and `S8U8` AVX2-VNNI qgemm dispatch objects so they never enter the legacy `>4` row fallback path. |
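A small sketch of the row-panel arithmetic (a hypothetical helper, not the MLAS driver) shows why reducing `StrideM` avoids the fallback: with `StrideM = 6` the kernel can receive 5- or 6-row panels, which hit the legacy path, while `StrideM = 4` caps every panel at 4 rows.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Decompose M rows into panels of at most stride_m rows, mimicking how
// a GEMM driver hands row panels to the kernel. Illustrative only.
std::vector<int> row_panels(int m, int stride_m) {
    std::vector<int> panels;
    for (int row = 0; row < m; row += stride_m)
        panels.push_back(std::min(stride_m, m - row));
    return panels;
}
```

For `M = 6`, a stride of 6 yields a single 6-row panel (the known-bad case), while a stride of 4 yields a 4-row and a 2-row panel, both within the range the VNNI packing handles safely.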

### AMX Kernel Fix

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/qgemm_kernel_amx.cpp` | Replace the broken pipelined `CountM >= 32` `U8S8` AMX fast path with the same per-K tile update pattern already used by the working smaller-row path. |
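The per-K tile update pattern can be sketched in scalar form (a reference model under assumed row-major layouts, not the actual AMX intrinsics): for each K block, the A and B tiles are multiplied and the product is added into the C accumulator before advancing to the next block.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Reference scalar model of the per-K tile update. Row-major A (MxK),
// B (KxN), C (MxN) are assumed for illustration.
void qgemm_tile_ref(const uint8_t* A, const int8_t* B, int32_t* C,
                    int M, int N, int K, int KBlock) {
    for (int i = 0; i < M * N; ++i) C[i] = 0;
    for (int k0 = 0; k0 < K; k0 += KBlock) {            // per-K tile loop
        const int kend = std::min(k0 + KBlock, K);
        for (int m = 0; m < M; ++m)
            for (int n = 0; n < N; ++n) {
                int32_t acc = 0;
                for (int k = k0; k < kend; ++k)
                    acc += int32_t(A[m * K + k]) * int32_t(B[k * N + n]);
                C[m * N + n] += acc;                    // update C once per K block
            }
    }
}
```

Because C is updated after every K block, the result is independent of the blocking factor, which is the invariant the broken pipelined 32-row path violated.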

### Regression Coverage

| File | Change |
|------|--------|
| `onnxruntime/test/mlas/unittest/test_qgemm_fixture.h` | Add MLAS qgemm regression cases for conv-like shapes `6x30x207` and `169x30x207` in packed/non-packed and int32/fp32 output variants. |

## Root Cause

There were three separate MLAS correctness issues:

1. `platform.cpp` was incorrectly overwriting `GemmU8U8Dispatch` with
`U8S8` dispatch objects when newer CPU features were detected. That
caused `U8U8` conv workloads to run through the wrong dispatch path.
2. The AVX2-VNNI qgemm dispatch objects advertised an `M` stride of `6`,
but the assembly kernel only handled VNNI packing safely up to 4 rows.
For 5- or 6-row panels it fell back to an older AVX2 path with
incompatible packing and sign assumptions.
3. The AMX `U8S8` qgemm kernel had a bug in its `CountM >= 32` fast
path. The smaller-row AMX path was correct, but the 32-row pipelined
update logic produced wrong accumulators for conv-shaped workloads and
caused the remaining QDQ/NHWC failures on AMX-capable hosts.

## Why This Fix

- The `platform.cpp` cleanup restores the intended `U8U8` dispatch
selection on feature-rich x86 hosts.
- The AVX2-VNNI stride change is a targeted mitigation that avoids the
known-bad legacy fallback until that assembly path is corrected.
- The AMX kernel change keeps the AMX `U8S8` dispatch enabled, but
replaces the broken 32-row implementation with a proven update pattern
that matches the working smaller-row path.
- The new MLAS regression tests cover the exact conv-derived qgemm
shapes that exposed the bug, so future dispatch or kernel changes will
fail at the MLAS layer before surfacing as transformer test regressions.

## Testing

- `cd build/cuda/Release && ./onnxruntime_mlas_test --gtest_filter='QGemmU8S8_*169xN30xK207*:*QGemmU8S8_*6xN30xK207*'`
- `cd build/cuda/Release && ./onnxruntime_test_all --gtest_filter='NhwcTransformerTests.*:QDQTransformerTests.*'`
- Verified that the filtered transformer suite passes with AMX `U8S8`
dispatch enabled.

## Motivation and Context

These test failures had been present for a long time and were initially
attributed to transformer rewrites because they surfaced in NHWC and QDQ
test suites. Investigation showed that the optimized graphs were
structurally correct and that the failures came from lower-level MLAS
qgemm execution instead. Fixing the behavior in MLAS is the right layer
because it restores correctness for both direct qgemm coverage and
higher-level quantized conv paths.

## Checklist

- [x] Tests added/updated
- [x] No breaking changes
- [x] CI passes

## Description

This PR fixes clang-specific build failures that show up in both the
standalone clang build and the CUDA clang build. It keeps the
build-system changes targeted, prefers source fixes where the warnings
indicate real type or declaration issues, and avoids broader warning
suppression than necessary for the CUDA provider target.

## Summary of Changes

### Build System

| File | Change |
|------|--------|
| `cmake/CMakeLists.txt` | Stop forwarding `-Wshorten-64-to-32` through CUDA host compilation where the GNU host compiler does not recognize it. |
| `cmake/onnxruntime_providers_cuda.cmake` | Add targeted clang `-Wno-error` handling for warning classes that are currently triggered by CUDA provider code and third-party CUDA headers under clang. |

### CPU / Common clang fixes

| File | Change |
|------|--------|
| `onnxruntime/core/common/cpuid_info.cc` | Replace the clang-incompatible `__builtin_cpu_supports("waitpkg")` path with a CPUID-bit check for TPAUSE detection. |
| `onnxruntime/test/framework/allocation_planner_test.cc` | Refactor `typeid` assertions to avoid clang's potentially-evaluated-expression warning while keeping test coverage unchanged. |
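The `typeid` refactor can be sketched like this (names invented for illustration; the real test code differs): clang's `-Wpotentially-evaluated-expression` fires when the `typeid` operand contains a side-effecting expression such as a function call, and binding the result to a named variable first avoids the warning while keeping the dynamic-type check.

```cpp
#include <cassert>
#include <typeinfo>

struct Base { virtual ~Base() = default; };
struct Derived final : Base {};

// Side-effecting accessor, standing in for whatever call the test made
// inside the typeid operand.
inline Base& make_derived() {
    static Derived instance;
    return instance;
}

// Before (clang warns):  typeid(make_derived()) == typeid(Derived)
// After: name the result, then take typeid of the lvalue.
inline bool is_derived() {
    Base& obj = make_derived();
    return typeid(obj) == typeid(Derived);
}
```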

### CUDA provider and contrib fixes

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/utils/dump_cuda_tensor.h` | Mark the
`IConsoleDumper` overrides explicitly while leaving CUDA-only overloads
unchanged. |
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc` | Use
`template` on the dependent `GetAttrOrDefault` call so clang parses it
correctly. |
| `onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_api.cc` |
Make narrowing conversions to flash-attention parameter fields explicit.
|
| `onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc` | Make the
`nbits_` conversion explicit when calling the CUDA helper. |
| `onnxruntime/contrib_ops/cuda/quantization/moe_quantization.cc` |
Restrict the GCC-only warning pragma so clang does not treat it as an
unknown warning option. |
|
`onnxruntime/contrib_ops/cuda/transformers/generation_device_helper.cc`
| Fix explicit state-field assignments to use the actual `int` field
type. |
| `onnxruntime/core/providers/cuda/cuda_mempool_arena.h` | Remove an
unused private field that clang flagged in the CUDA provider build. |
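The dependent-name fix can be sketched as follows (`NodeInfo` and its member are stand-ins, not the real ORT API): inside a template, a member template called through a dependent object needs the `template` keyword, or clang parses the `<` as a less-than operator.

```cpp
#include <cassert>
#include <string>

// Stand-in for an attribute-bearing node-info type.
struct NodeInfo {
    template <typename T>
    T GetAttrOrDefault(const std::string& /*name*/, T def) const { return def; }
};

template <typename Info>
int read_window(const Info& info) {
    // Without `template`, clang rejects: info.GetAttrOrDefault<int>(...)
    return info.template GetAttrOrDefault<int>("local_window_size", -1);
}
```

GCC historically accepted the unqualified form in some contexts, which is why the missing disambiguator only surfaced under clang.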

## Testing

Tested CPU and CUDA 12.8 builds in Azure Linux with
- clang 18.1.8
- gcc 13.2
- cmake 4.2.3

Example for CPU build:
```
export CC=clang
export CXX=clang++
bash build.sh --config RelWithDebInfo --parallel --cmake_extra_defines onnxruntime_BUILD_UNIT_TESTS=ON
```
## Motivation and Context

Clang is stricter than GCC/MSVC in a few areas that affect this tree:
CUDA host flag forwarding, explicit narrowing, dependent template
parsing, warnings emitted from third-party CUDA headers, and RTTI/typeid
expressions in tests. The goal here is to keep the staged fix minimal
and maintainable by correcting real source issues where practical and
confining warning downgrades to the CUDA provider target where
third-party header noise is currently unavoidable.
Jaswanth51 requested a review from ankitm3k March 20, 2026 21:00