Sync with Microsoft ONNX Runtime - 21032026 #983
Open
Jaswanth51 wants to merge 7 commits into ovep-develop
…icrosoft#26773)
**Description**
This PR integrates the Arm® KleidiAI™ SME2 BF16 kernel through the MLAS SBGEMM path. Rework of microsoft#24346.
**Motivation and Context**
This kernel provides performance improvements on SME-enabled devices.
Signed-off-by: Patryk Kaiser <patryk.kaiser@arm.com>
Upgrading dependency to resolve CVE-2026-27904, which is lighting up some component governance issues with internal-MSFT builds of ORT. Co-authored-by: Kevin Taha <kevintaha@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icrosoft#27694)
### Description
Fix the condition of the following definitions:
- DAWN_ENABLE_VULKAN
- DAWN_ENABLE_D3D12
…#27688) This deletes three per-head-size .cu files and merges their content into a single file to avoid a cross-file dependency during CUDA compilation. Currently, the masked_multihead_attention_kernel template is implemented in decoder_masked_multihead_attention_impl.cu. The other three .cu files use the masked_multihead_attention_kernel template but do not include its implementation, which causes problems when they are built in the CUDA plugin EP.
microsoft#27671)
## Description
This PR fixes longstanding MLAS issues that were causing `NhwcTransformerTests.*` and `QDQTransformerTests.*` failures in quantized convolution paths (see microsoft#27670). The failures were not in the graph transformers themselves; they came from incorrect qgemm dispatch selection and broken backend kernel behavior in specific AVX2-VNNI and AMX paths.
The fix removes incorrect `U8U8` dispatch upgrades, avoids a broken AVX2-VNNI row-panel fallback, and corrects the AMX `U8S8` 32-row kernel path. It also adds MLAS regression coverage for the conv-shaped qgemm dimensions that exposed the problems.
## Summary of Changes
### Dispatch Selection Fixes
| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/platform.cpp` | Remove three incorrect assignments that upgraded `GemmU8U8Dispatch` to `U8S8` dispatch objects in the AVXVNNI, AVX512VNNI, and AMX feature paths. |
### AVX2-VNNI Kernel Fix
| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/qgemm_kernel_avx2.cpp` | Reduce `StrideM` from `6` to `4` for the `U8U8`, `S8S8`, and `S8U8` AVX2-VNNI qgemm dispatch objects so they never enter the legacy `>4` row fallback path. |
### AMX Kernel Fix
| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/qgemm_kernel_amx.cpp` | Replace the broken pipelined `CountM >= 32` `U8S8` AMX fast path with the same per-K tile update pattern already used by the working smaller-row path. |
### Regression Coverage
| File | Change |
|------|--------|
| `onnxruntime/test/mlas/unittest/test_qgemm_fixture.h` | Add MLAS qgemm regression cases for conv-like shapes `6x30x207` and `169x30x207` in packed/non-packed and int32 or fp32 variants. |
## Root Cause
There were three separate MLAS correctness issues:
1. `platform.cpp` was incorrectly overwriting `GemmU8U8Dispatch` with `U8S8` dispatch objects when newer CPU features were detected. That caused `U8U8` conv workloads to run through the wrong dispatch path.
2. The AVX2-VNNI qgemm dispatch objects advertised an `M` stride of `6`, but the assembly kernel only handled VNNI packing safely up to 4 rows. For 5- or 6-row panels it fell back to an older AVX2 path with incompatible packing and sign assumptions.
3. The AMX `U8S8` qgemm kernel had a bug in its `CountM >= 32` fast path. The smaller-row AMX path was correct, but the 32-row pipelined update logic produced wrong accumulators for conv-shaped workloads and caused the remaining QDQ/NHWC failures on AMX-capable hosts.
## Why This Fix
- The `platform.cpp` cleanup restores the intended `U8U8` dispatch selection on feature-rich x86 hosts.
- The AVX2-VNNI stride change is a targeted mitigation that avoids the known-bad legacy fallback until that assembly path is corrected.
- The AMX kernel change keeps the AMX `U8S8` dispatch enabled, but replaces the broken 32-row implementation with a proven update pattern that matches the working smaller-row path.
- The new MLAS regression tests cover the exact conv-derived qgemm shapes that exposed the bug, so future dispatch or kernel changes will fail at the MLAS layer before surfacing as transformer test regressions.
## Testing
- `cd build/cuda/Release && ./onnxruntime_mlas_test --gtest_filter='QGemmU8S8_*169xN30xK207*:*QGemmU8S8_*6xN30xK207*'`
- `cd build/cuda/Release && ./onnxruntime_test_all --gtest_filter='NhwcTransformerTests.*:QDQTransformerTests.*'`
- Verified that the filtered transformer suite passes with AMX `U8S8` dispatch enabled.
## Motivation and Context
These test failures had been present for a long time and were initially attributed to transformer rewrites because they surfaced in NHWC and QDQ test suites. Investigation showed that the optimized graphs were structurally correct and that the failures came from lower-level MLAS qgemm execution instead. Fixing the behavior in MLAS is the right layer because it restores correctness for both direct qgemm coverage and higher-level quantized conv paths.
## Checklist
- [x] Tests added/updated
- [x] No breaking changes
- [x] CI passes
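To illustrate why routing a `U8U8` workload through a `U8S8` dispatch object is a correctness problem and not just a performance one, here is a minimal sketch of a naive reference qgemm. It is not MLAS code: `ReferenceQGemm` and `CheckSignednessMismatch` are hypothetical names, and the input values are invented for the illustration. The only detail taken from the PR is the conv-like shape `6x30x207` (M=6, N=30, K=207).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive reference qgemm: C = A * B with int32 accumulation.
// A is always read as uint8. BType chooses how the raw bytes of B are read:
// uint8_t models a U8U8 kernel, int8_t models a U8S8 kernel.
// This is an illustrative stand-in, not the MLAS implementation.
template <typename BType>
std::vector<int32_t> ReferenceQGemm(const std::vector<uint8_t>& a,
                                    const std::vector<uint8_t>& b_bytes,
                                    std::size_t m, std::size_t n, std::size_t k) {
  std::vector<int32_t> c(m * n, 0);
  for (std::size_t row = 0; row < m; ++row) {
    for (std::size_t col = 0; col < n; ++col) {
      int32_t acc = 0;
      for (std::size_t i = 0; i < k; ++i) {
        // Reinterpret the same byte as unsigned or signed depending on BType.
        const auto b_val = static_cast<BType>(b_bytes[i * n + col]);
        acc += static_cast<int32_t>(a[row * k + i]) * static_cast<int32_t>(b_val);
      }
      c[row * n + col] = acc;
    }
  }
  return c;
}

// Runs the 6x30x207 shape with the same bytes under both interpretations.
inline void CheckSignednessMismatch() {
  const std::size_t m = 6, n = 30, k = 207;
  std::vector<uint8_t> a(m * k, 1);
  std::vector<uint8_t> b(k * n, 200);  // byte 200 is -56 when read as int8
  const auto u8u8 = ReferenceQGemm<uint8_t>(a, b, m, n, k);
  const auto u8s8 = ReferenceQGemm<int8_t>(a, b, m, n, k);
  assert(u8u8[0] == 207 * 200);  // unsigned interpretation
  assert(u8s8[0] == 207 * -56);  // signed interpretation of the same bytes
}
```

Any B byte above 127 produces a different accumulator under the two interpretations, which is consistent with the PR's observation that the wrong dispatch path surfaced as numerical test failures.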
## Description
This PR fixes clang-specific build failures that show up in both the
standalone clang build and the CUDA clang build. It keeps the
build-system changes targeted, prefers source fixes where the warnings
indicate real type or declaration issues, and avoids broader warning
suppression than necessary for the CUDA provider target.
## Summary of Changes
### Build System
| File | Change |
|------|--------|
| `cmake/CMakeLists.txt` | Stop forwarding `-Wshorten-64-to-32` through CUDA host compilation where the GNU host compiler does not recognize it. |
| `cmake/onnxruntime_providers_cuda.cmake` | Add targeted clang `-Wno-error` handling for warning classes that are currently triggered by CUDA provider code and third-party CUDA headers under clang. |
### CPU / Common clang fixes
| File | Change |
|------|--------|
| `onnxruntime/core/common/cpuid_info.cc` | Replace the clang-incompatible `__builtin_cpu_supports("waitpkg")` path with the CPUID-bit check for TPAUSE detection. |
| `onnxruntime/test/framework/allocation_planner_test.cc` | Refactor `typeid` assertions to avoid clang's potentially-evaluated-expression warning while keeping test coverage unchanged. |
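As a rough sketch of what a CPUID-bit check for TPAUSE can look like: the WAITPKG feature (TPAUSE, UMONITOR, UMWAIT) is reported in CPUID leaf 7, subleaf 0, ECX bit 5. The helper names below (`HasWaitpkgBit`, `CpuSupportsTpause`) are hypothetical, and the actual change in `cpuid_info.cc` may use ONNX Runtime's own CPUID wrapper instead of `<cpuid.h>`.

```cpp
#include <cstdint>
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#define SKETCH_HAS_CPUID_H 1
#endif

// CPUID.(EAX=07H, ECX=0):ECX bit 5 reports WAITPKG support.
constexpr uint32_t kWaitpkgBit = 1u << 5;

// Pure bit test, kept separate so it can be checked on any platform.
constexpr bool HasWaitpkgBit(uint32_t leaf7_ecx) {
  return (leaf7_ecx & kWaitpkgBit) != 0;
}

// Queries the CPU on x86 with GCC or clang; reports false elsewhere.
inline bool CpuSupportsTpause() {
#ifdef SKETCH_HAS_CPUID_H
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  // __get_cpuid_count returns 0 when the requested leaf is not supported.
  if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) == 0) {
    return false;
  }
  return HasWaitpkgBit(ecx);
#else
  return false;
#endif
}
```

Reading the bit directly avoids depending on which feature-name strings a given compiler accepts in `__builtin_cpu_supports`.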
### CUDA provider and contrib fixes
| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/utils/dump_cuda_tensor.h` | Mark the `IConsoleDumper` overrides explicitly while leaving CUDA-only overloads unchanged. |
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc` | Use `template` on the dependent `GetAttrOrDefault` call so clang parses it correctly. |
| `onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_api.cc` | Make narrowing conversions to flash-attention parameter fields explicit. |
| `onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc` | Make the `nbits_` conversion explicit when calling the CUDA helper. |
| `onnxruntime/contrib_ops/cuda/quantization/moe_quantization.cc` | Restrict the GCC-only warning pragma so clang does not treat it as an unknown warning option. |
| `onnxruntime/contrib_ops/cuda/transformers/generation_device_helper.cc` | Fix explicit state-field assignments to use the actual `int` field type. |
| `onnxruntime/core/providers/cuda/cuda_mempool_arena.h` | Remove an unused private field that clang flagged in the CUDA provider build. |
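Two of the fix patterns in this table, the `template` disambiguator on a dependent member-template call and an explicit narrowing conversion, can be sketched as follows. `FakeInfo` and `ReadNumHeads` are hypothetical stand-ins written for this illustration; the attribute name and default value are invented, and the real `GetAttrOrDefault` signature in ONNX Runtime may differ.

```cpp
#include <cstdint>

// Hypothetical stand-in for a kernel-info type that exposes a member template.
struct FakeInfo {
  template <typename T>
  constexpr T GetAttrOrDefault(const char* /*name*/, T default_value) const {
    return default_value;  // the sketch always falls back to the default
  }
};

template <typename Info>
constexpr int ReadNumHeads(const Info& info) {
  // The type of `info` depends on the template parameter, so the `template`
  // keyword is needed to tell the parser that `<` starts a template argument
  // list rather than a less-than comparison.
  const int64_t value =
      info.template GetAttrOrDefault<int64_t>("num_heads", int64_t{8});
  // Spell out the 64-bit to 32-bit narrowing instead of relying on an
  // implicit conversion that -Wshorten-64-to-32 would flag.
  return static_cast<int>(value);
}
```

Some compilers accept the call without the `template` keyword as an extension, which is one way code like this can build cleanly elsewhere and still fail under clang.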
## Testing
Tested CPU and CUDA 12.8 builds on Azure Linux with:
- clang 18.1.8
- gcc 13.2
- cmake 4.2.3
Example for CPU build:
```
export CC=clang
export CXX=clang++
bash build.sh --config RelWithDebInfo --parallel --cmake_extra_defines onnxruntime_BUILD_UNIT_TESTS=ON
```
## Motivation and Context
Clang is stricter than GCC/MSVC in a few areas that affect this tree:
CUDA host flag forwarding, explicit narrowing, dependent template
parsing, warnings emitted from third-party CUDA headers, and RTTI/typeid
expressions in tests. The goal here is to keep the staged fix minimal
and maintainable by correcting real source issues where practical and
confining warning downgrades to the CUDA provider target where
third-party header noise is currently unavoidable.
Daily backmerge from ORT main to ovep-develop. Do NOT squash or rebase - use merge commit only.