Merged
Commits
2452 commits
fbba40a
Model Package Support (#27786)
chilo-ms Apr 8, 2026
c159603
Remove unnecessary model package test (#28015)
chilo-ms Apr 9, 2026
c6ff87f
fix target_ids out of boundary in TreeEnsemble* (#27951)
xadupre Apr 9, 2026
9e3614b
[webgpu] Set `is_channels_last` to true by default in `ComputeMatMul`…
Jiawei-Shao Apr 9, 2026
9d7e6d5
[CPU/CUDA ep] Improve DeformConv op performance (#27824)
ShirasawaSama Apr 9, 2026
127704c
Validate g_idx values in MatMulNBits to prevent OOB read (#27582)
vraspar Apr 9, 2026
9c90e79
Add bounds validation for LinearClassifier coefficients (#27989)
vraspar Apr 9, 2026
cd20524
Add pre-commit git hook to run lintrunner on staged files (#28013)
vraspar Apr 9, 2026
3ac6040
webgpu support for qwen3.5 (#27996)
guschmue Apr 9, 2026
efc89b5
mlas/arm64: add BF16 fast-math conv kernels for NCHW/NCHWc paths (#27…
milpuz01 Apr 9, 2026
3fad293
ICM fixes (3/n) (#27925)
hariharans29 Apr 9, 2026
7afe4c2
[Plugin EP] Port graph capture/replay APIs (#27958)
adrianlizarraga Apr 10, 2026
ce91376
Modify scale & offset of WhereDummyDq (#27109)
qti-hungjuiw Apr 10, 2026
58a87dc
Add CUDA Graph support for the CUDA plugin EP (#28002)
tianleiwu Apr 10, 2026
bbb0cd0
Fix plugin EP profiling timestamp skew on macOS (#27994)
tianleiwu Apr 10, 2026
5e55544
[VitisAI] fix dangling pointer (#27949)
amd-genmingz Apr 10, 2026
87b0643
centralise feed authentication for ADO pipelines (#27997)
eserscor Apr 10, 2026
c36c422
[WebGPU EP] Fuse QMoE 1-token decode path to reduce GPU dispatches (#…
qjia7 Apr 10, 2026
3471a5c
Bump lodash from 4.17.23 to 4.18.1 in /js/web (#27990)
dependabot[bot] Apr 10, 2026
ce92643
fix a security issue in SVM* (#27950)
xadupre Apr 10, 2026
a2742d0
[QNN-EP] Add additional guards for file mapping (#27871)
quic-calvnguy Apr 10, 2026
177001c
Bump fast-xml-parser from 4.5.3 to 4.5.6 in /js/react_native/e2e (#28…
dependabot[bot] Apr 10, 2026
2012ae0
Bump next from 16.1.5 to 16.2.3 in /js/web/test/e2e/exports/testcases…
dependabot[bot] Apr 10, 2026
fd4dea3
Bump yaml from 2.7.0 to 2.8.3 in /js/react_native/e2e (#27863)
dependabot[bot] Apr 10, 2026
0bae67d
[Optimizer] Fix ConstantFolding crash with missing optional outputs (…
TsofnatMaman Apr 12, 2026
daa74b4
ICM fixes (2/n) (#27922)
hariharans29 Apr 13, 2026
50129c9
[CI] fix: bad relative ref to setup-feeds template (#28050)
sanaa-hamel-microsoft Apr 13, 2026
ffbc5e8
Handle int overflow in rnn (#28003)
tianleiwu Apr 13, 2026
f7113bd
[CUDA EP Plugin] ResourceAcountant integration (#28028)
yuslepukhin Apr 14, 2026
63dfc33
fix: idempotent feed setup (#28034)
sanaa-hamel-microsoft Apr 14, 2026
6f5d9c0
Bump lodash from 4.17.23 to 4.18.1 in /js/react_native/e2e (#27966)
dependabot[bot] Apr 14, 2026
97e0a00
[CUDA] QuantizeLinear and DequantizeLinear opset 25 (#28046)
tianleiwu Apr 14, 2026
6c2c99c
Move WASM builds to large VMs (#28069)
eserscor Apr 15, 2026
58f99ea
[C] fix: ADO CI spuriously failing format check due to picking up NuG…
sanaa-hamel-microsoft Apr 15, 2026
dff81de
[CI] chore: remove QNN and python DML from release meta-pipeline (#28…
sanaa-hamel-microsoft Apr 15, 2026
57161cc
Bump picomatch from 4.0.2 to 4.0.4 in /js/web/test/e2e/exports/testca…
dependabot[bot] Apr 15, 2026
e39541c
Add tensor size validation for MatMulBnb4 to prevent OOB read via K/N…
vraspar Apr 15, 2026
22c18ab
Bump electron from 38.1.2 to 39.8.5 in /js/web (#27974)
dependabot[bot] Apr 15, 2026
f002c14
Bump follow-redirects from 1.15.6 to 1.16.0 in /onnxruntime/test/wasm…
dependabot[bot] Apr 15, 2026
3bca941
Fix FlashAttentionDecodeSplitVx indirect dispatch input ordering (#27…
jchen10 Apr 15, 2026
e19d953
Bump follow-redirects from 1.15.6 to 1.16.0 in /js/web (#28056)
dependabot[bot] Apr 15, 2026
9e2364d
Add more checks and add functional coverage for bifurcation_detector …
yuslepukhin Apr 15, 2026
f9c83ae
Fix the python pipeline (#28084)
eserscor Apr 16, 2026
0e73535
Fix heap OOB write in MaxPoolGrad via indices bounds validation (#27903)
vraspar Apr 16, 2026
eb7a1bf
[WebGPU EP] Reduce forward declaration boilerplate in kernel registra…
edgchen1 Apr 16, 2026
b7804b0
Update global-agent comments to use version-agnostic GitHub URL (#28061)
Copilot Apr 16, 2026
3b8482c
Added a dimension check in `MatMulComputeHelper` to ensure the inner…
yuslepukhin Apr 16, 2026
d4c0e3e
removes webgpu from onnxruntime-foundry-nuget package (#27929)
prathikr Apr 16, 2026
badb3dc
Bump picomatch from 2.3.1 to 2.3.2 in /js (#27846)
dependabot[bot] Apr 16, 2026
d23b8d4
Bump picomatch from 2.3.1 to 2.3.2 in /js/common (#27847)
dependabot[bot] Apr 16, 2026
1c99080
Bump picomatch from 2.3.1 to 2.3.2 in /js/react_native/e2e (#27849)
dependabot[bot] Apr 16, 2026
0ba449e
Bump picomatch from 2.3.1 to 2.3.2 in /js/web (#27850)
dependabot[bot] Apr 16, 2026
5192b90
Validate token_id bounds in NGramRepeatBlock to prevent OOB write (#2…
vraspar Apr 16, 2026
37298f7
Bump vite from 6.4.1 to 6.4.2 in /js/web/test/e2e/exports/testcases/v…
dependabot[bot] Apr 16, 2026
de0b292
Do not create numpy on top of Tensor non-owning buffer (#28088)
yuslepukhin Apr 16, 2026
d7aec68
Remove old custom EP load infrastructure (#28095)
skottmckay Apr 16, 2026
cafbd7c
[CI] fix: use internal feed for NPM on ADO (#28079)
sanaa-hamel-microsoft Apr 16, 2026
7fdc60e
Fix CoreML EP issue with external weight path handling. (#28062)
skottmckay Apr 16, 2026
1db7c25
Set SOVERSION 1 for Apple builds to match Linux library versioning (#…
Copilot Apr 17, 2026
7063e74
fix(ml): use SafeInt checked arithmetic in ML operator coefficient si…
tianleiwu Apr 17, 2026
94f32ec
[CORE]: Improve filesystem error messages during Linux device discove…
theHamsta Apr 17, 2026
4dd5d36
Fix int32 overflow in CUDA Gather kernel for large tensors (#28108)
justinchuby Apr 17, 2026
3f74b3c
Update worker thread pool to use time based wait. (#27916)
sushraja-msft Apr 17, 2026
a208df8
Use SafeInt for size arithmetic in CPU tensor operators to prevent ov…
tianleiwu Apr 17, 2026
c5bc801
Use weights_only for torch.load checkpoints (#28097)
tianleiwu Apr 17, 2026
e306308
Disable QDQ related transformers for WebNN EP (#28020)
miaobin Apr 17, 2026
9e1621d
Address build failure on ARM (#28120)
yuslepukhin Apr 17, 2026
a779077
WebGPU plugin EP pipeline updates (#28121)
edgchen1 Apr 17, 2026
d354e56
Use cuBLAS status APIs for error strings (#27946)
tianleiwu Apr 18, 2026
f018066
Add INT8, INT16, and UINT8 type support for CUDA TopK operator (#27862)
elwhyjay Apr 19, 2026
2070b28
Fix packaging pipeline TSA Upload warnings (#28127)
skottmckay Apr 20, 2026
7a33d4b
[NvTensorRTRTX EP] Implement GetHardwareDeviceIncompatibilityDetails …
umangb-09 Apr 20, 2026
8ca8dd3
ICM fixes (1/n) (#27906)
hariharans29 Apr 20, 2026
2adf329
Fix doxygen issues (#28122)
yuslepukhin Apr 20, 2026
fb13eb3
CoreML: Add support for Pad with 'reflect' for ML Program (#28073)
skottmckay Apr 20, 2026
7c56fa8
Add seqlens_k bounds validation in GroupQueryAttention to prevent GEM…
vraspar Apr 21, 2026
1905249
Fix RemoveDuplicateCastTransformer incorrectly eliminating lossy Cast…
Copilot Apr 21, 2026
dce77a3
Fix lack of auth on python packaging (#28118)
eserscor Apr 21, 2026
43d7cf4
[webgpu] support arbitrary input_channel size for im2col (#27038)
wenqinI Apr 21, 2026
b20f8fc
Implement GetAvaiableResource() callback (#28103)
yuslepukhin Apr 21, 2026
a3d6452
Enable API doc generation builds in PR checks (#28124)
edgchen1 Apr 21, 2026
564cabf
Apply model size limit before converting size_t input to int32_t (#28…
skottmckay Apr 22, 2026
7337814
Test Driver update (#28190)
eserscor Apr 22, 2026
3729b61
Add MatMul double implementation, inspect optimizers add test coverag…
yuslepukhin Apr 23, 2026
c5e6bd8
Fix string tensor deserialization in ORT format models (#28133)
Copilot Apr 23, 2026
786a34b
Refine specification of CMake C/C++ standard version requirement (#28…
edgchen1 Apr 23, 2026
c2afa72
Allow comments in `build.py` argument files (#27624)
edgchen1 Apr 23, 2026
4265122
Fix overflow in CopyCpuTensor for sub-byte types (#28171)
adrastogi Apr 23, 2026
645f489
Removes duplicate macos dylib and fixes packaging for the Microsoft.M…
baijumeswani Apr 23, 2026
9ec5c19
Address parameter validation across all EPs for Convo kernels (#28142)
yuslepukhin Apr 23, 2026
c7c99ce
Update TRT CI to replace Linux_TRT_Minimal_CUDA_Test_CI pipeline (#28…
tianleiwu Apr 23, 2026
a02f769
[MLAS] Enable FP16 for Gelu (#26815)
akote123 Apr 23, 2026
480dd76
Make sure in memory initialziers are not propagated externally (#28189)
yuslepukhin Apr 23, 2026
f38dec1
[Core] Add correctness tests for SpaceToDepth and MobileClip Attentio…
hariharans29 Apr 24, 2026
5743f71
Add exponential-backoff option for thread pool spin loop (#28096)
tianleiwu Apr 24, 2026
25d58f0
fix: add missing BackendKernelSelectorConfig parameter to QNBitGemmPa…
Rishi-Dave Apr 24, 2026
2ea847c
Add workflow to replace Windows GPU Doc Gen CI Pipeline (#28192)
tianleiwu Apr 24, 2026
d762130
Fix heap OOB write in EmbedLayerNormalizationShapeInference (#28176)
Copilot Apr 24, 2026
404fed7
[Plugin WebGPU EP] Fix API initialization error handling (#28211)
edgchen1 Apr 24, 2026
5dd7f15
[CoreML EP] Add HardSigmoid support (#28182)
maxwbuckley Apr 24, 2026
14b4d48
Fix missing include for SetRawDataInTensorProto in NV TensorRT RTX te…
ishwar-raut1 Apr 24, 2026
51e8e69
[CI] Remove win gpu doc gen and trt cuda mini CI pipelines (#28218)
tianleiwu Apr 24, 2026
2b25baf
Add packaging pipeline for CUDA plugin EP (#28152)
tianleiwu Apr 24, 2026
8b6b0b6
Add LabelEncoder CUDA execution provider for numeric types (#28045)
Copilot Apr 25, 2026
954dbce
webgpu: Refactor SubgroupMatrixMatMulNBits to vendor-agnostic config …
qjia7 Apr 25, 2026
997c479
GQA unfused attention with FP32 QK accumulation (fixes #28195) (#28198)
tianleiwu Apr 25, 2026
0992717
Add regression test for quantize_static with in-memory ModelProto (#2…
Rishi-Dave Apr 25, 2026
3c94f1c
[WebGPU] Fix MHA to ignore past key/value when no present outputs req…
vraspar Apr 27, 2026
f6eb50f
Fix heap OOB read in RNN operator via sequence_lens=0 (#28052)
vraspar Apr 27, 2026
9d2bb58
chore(ci): `NIGHTLY_BUILD` env-var is now set via pipeline parameter …
sanaa-hamel-microsoft Apr 27, 2026
1b1d453
ICM fixes (5/n) (#27971)
hariharans29 Apr 27, 2026
cf805d3
Address CI build failure (#28243)
yuslepukhin Apr 27, 2026
6e418b6
[VitisAI] pass base timestamp for vitisai profiling (#27808)
andrew0917 Apr 28, 2026
d00e297
security: replace unrestricted setattr with allowlist in Python backe…
titaiwangms Apr 28, 2026
45f5aba
[CUDA] PagedAttention: add SM<80 fp16 fallback via memory-efficient a…
elwhyjay Apr 28, 2026
2900ff7
chore(ci): temporarily remove react-native from NPM required publish …
sanaa-hamel-microsoft Apr 28, 2026
a53d6d7
[CoreML EP] Add QuickGelu support (#28184)
maxwbuckley Apr 28, 2026
7a795ed
Improve SparseTensors public API input validation as well as sparse u…
yuslepukhin Apr 28, 2026
3ae38b2
fix out of boundary vector per class in SVM (#27952)
xadupre Apr 28, 2026
8861ecd
Fix universal package version validation comment and add SHA prefix (…
edgchen1 Apr 28, 2026
1727b70
Add position_ids bounds validation to WebGPU/JS RotaryEmbedding kerne…
titaiwangms Apr 29, 2026
81802a9
Replace unsafe `reinterpret_cast` with C API calls in `include/onnxru…
edgchen1 Apr 29, 2026
f97b8c4
WebGPU: Support Split-K with batch size > 1 (#28151)
Jiawei-Shao Apr 29, 2026
037c02d
Add aarch64 wheel build to CUDA 13 Python packaging pipelines (#27760)
Copilot Apr 29, 2026
6f47410
webgpu: merge batchA into M dimension when batchB==1 (#28197)
xhcao Apr 29, 2026
8a77597
[WebNN] Rename roundingType to outputShapeRounding for pool2d ops (#2…
Honry Apr 29, 2026
ddea107
[OVEP] Updating OV version to 2026.1.0 (#28170)
preetha-intel Apr 29, 2026
df2b677
Fix cpuinfo init on Linux without CPU sysfs lists (#28230)
tianleiwu Apr 29, 2026
9a41944
Add update_inplace overload accepting OrtValue for device-to-device c…
Copilot Apr 29, 2026
abb284d
[WebGPU] Add GridSample operator (#28264)
TomCrypto Apr 29, 2026
11e3072
[Cuda] Upgrade cutlass to 4.4.2 (#28276)
tianleiwu Apr 29, 2026
99e811d
[React Native] Add react-native.config.js and Expo plugin MainApplica…
dccarmo Apr 30, 2026
464d8e9
ICM fixes (6/n) (#28255)
hariharans29 Apr 30, 2026
62f742f
Add RISC-V Vector (RVV) support for CPU Execution Provider (#28261)
velonica0 Apr 30, 2026
e295e63
Fix NPM packaging (#28238)
eserscor Apr 30, 2026
4aaf852
[NvTensorRTRTX EP] Fix build breaks with TRT-RTX 1.5+ and 1.6+ (#28263)
umangb-09 Apr 30, 2026
6f23504
WebGPU plugin EP Python packaging (#28226)
edgchen1 Apr 30, 2026
9c2b0c3
Use CUDART_VERSION reduction compatibility in GQA attention (#28296)
Copilot Apr 30, 2026
7a86574
Propagate python authentication to docker contexts (#28295)
eserscor Apr 30, 2026
5ca4e83
Language bindings for Hardware Device and EP Compatibility APIs (#28128)
xhan65 May 1, 2026
9ca299d
Implement CUDA EP Plugin profiling API (#28216)
yuslepukhin May 1, 2026
5f7e9d0
Fix session logger use-after-free during EP teardown under VERBOSE lo…
Copilot May 1, 2026
76655a0
Fix NVTX profiling: restore missing node_compute_range_.Begin() call …
Copilot May 1, 2026
d6c363c
[OVEP] OpenVINO EP 1.26.0 Development Release Updates (#28297)
ankitm3k May 1, 2026
6e19374
Fix CUDA 13 build error in gqa_unfused_attention.cu (#28309)
tianleiwu May 1, 2026
8dd4a06
Include license file in built distributions (#27783)
julia-thorn May 1, 2026
de2bc90
Add QNN Plugin EP repo link to README (#28225)
vraspar May 1, 2026
5f2f848
fix(ci): incorrect relative template includes for setup-feeds (#28312)
sanaa-hamel-microsoft May 1, 2026
9b30f30
remove weights_are_all_positive_ from TreeEnsemble (#27552)
xadupre May 1, 2026
d02a0fd
Fix DoubleQDQPairsRemover adding spurious dimension to scalar scale/z…
ssam18 May 1, 2026
60ce9cc
Relax GQA seqlens_k shape validation for backward compat with older m…
vraspar May 1, 2026
55c5c82
GridSample: harden float->int64 casts against NaN/Inf/out-of-range co…
GopalakrishnanN May 2, 2026
1e4ee66
add support for DFT with onesided=True and inverse=True (irfft) (#27028)
simonbyrne May 2, 2026
9d1492a
Add option to memory map .ORT model loads (#28164)
Kevin-Taha May 2, 2026
9582867
docs: add opset version notation legend to OperatorKernels.md (#28143)
Rishi-Dave May 3, 2026
763c4e3
Support ONNX overloaded functions (IR version 10+) (#28275)
gramalingam May 3, 2026
b81e0c6
Bump version for 1.27.0 (#28324)
eserscor May 4, 2026
96786ef
Bump plugin-ep-webgpu/VERSION_NUMBER to 0.2.0. (#28322)
edgchen1 May 4, 2026
f1c96d5
refactor(ci): simplify build date/time metadata propagation (#28294)
sanaa-hamel-microsoft May 4, 2026
40c9f85
Add plugin-ep-webgpu/RELEASE.md (#28321)
edgchen1 May 4, 2026
4ca6b22
Eliminate Legacy MHA Unfused path from ONNX Attention; unify on 3-tie…
titaiwangms May 4, 2026
3454f86
Fix BitShift UB when shift amount >= bit width (#28272)
Copilot May 4, 2026
a1aa3bb
adds foundry local packaging to webgpu plugin ep packaging pipeline (…
prathikr May 4, 2026
8a09501
Suppress -Wmaybe-uninitialized for onnxruntime_pybind11_state under p…
Rishi-Dave May 5, 2026
7529033
Fix ReshapeFusion dropping allowzero on inferred 0-sized intermediate…
titaiwangms May 5, 2026
b81f3f8
fix: make sympy an optional runtime dependency (#28141)
Rishi-Dave May 5, 2026
ebee606
fix(ci): test pipeline didn't correctly specify `ReleaseVersionSuffix…
sanaa-hamel-microsoft May 5, 2026
ef44604
chore: rename `ort_api_1_to_26` to `ort_api_1_to_27` (#28341)
sanaa-hamel-microsoft May 5, 2026
1f25783
Fix CUDA Attention dispatch: skip MEA when head_size != v_head_size i…
justinchuby May 5, 2026
5e38dfe
Fix CApi tests on S390x (#28074)
AlekseiNikiforovIBM May 5, 2026
07b8f39
Bump postcss from 8.5.3 to 8.5.13 in /js/web/test/e2e/exports/testcas…
dependabot[bot] May 5, 2026
c85ec49
fix(ci): 'rc' qualifier ignored when packaging `onnxruntime-node` (#2…
sanaa-hamel-microsoft May 5, 2026
80a2352
Fix round_prefer_ceil nearest mode for negative halfway values in Res…
Copilot May 5, 2026
a237323
Add CUDA plugin EP Python package pipeline (#28299)
tianleiwu May 5, 2026
513b9bf
[WebGPU plugin EP] NuGet packaging (#28313)
edgchen1 May 5, 2026
ee5158e
Fill CUDA Cast operator opset gap: extend registration from opset 23 …
Copilot May 5, 2026
28bcc9c
Fill CUDA opset gap for ReduceMax and ReduceMin (18 → 20) (#27755)
Copilot May 5, 2026
b8f21f1
Fill RNN CUDA operator opset gap (14 → 22) (#27743)
Copilot May 5, 2026
8aec1a5
Fill Reshape CUDA operator opset gap from 23 to 25 (#27742)
Copilot May 6, 2026
3e21761
Bump brace-expansion in /js/react_native/e2e (#27894)
dependabot[bot] May 6, 2026
673c332
Fill CUDA EP opset gaps for Round and Equal operators (#27754)
Copilot May 6, 2026
3b007a6
webgpu: Support QKV bias in FlashAttention for MultiHeadAttention (#2…
qjia7 May 6, 2026
dbc55db
Prevent double-free in OrtModelEditorApi ownership transfer (#28123)
vraspar May 6, 2026
5f071fb
Add M-tile loop with dispatch capping for Intel Xe2/3-LPG (#28250)
jchen10 May 6, 2026
470977a
[CoreML EP] Support pre-opset-13 Split via 'split' attribute (#28270)
maxwbuckley May 6, 2026
19738c5
[Plugin EP] Add OrtEp::OnSessionInitializationEnd() (#28319)
adrianlizarraga May 6, 2026
e3c34da
Refactor and modernize StringNormalizer. (#28320)
yuslepukhin May 6, 2026
bf76a0b
feat(quantization): add calibration cache to quantize_static (#28221)
Rishi-Dave May 7, 2026
0a341b0
[WebGPU plugin EP packaging] Remove explicit ORT package dependency (…
edgchen1 May 7, 2026
ec55d3c
Fix Subgraph_t issues with TRT RTX ver 1.5.x (#28361)
umangb-09 May 7, 2026
505e0c3
Suppress test warnings in transformers tests and fix CUDA CI (#28391)
tianleiwu May 7, 2026
058ce95
Fix CPU Attention softcap/attn_mask ordering (onnx#7867, #7913) + con…
titaiwangms May 7, 2026
245118a
[CUDA Plugin EP] Add NuGet packaging pipeline (#28378)
tianleiwu May 7, 2026
b8ba448
[C#] Add EP tests for Cuda Plugin (#28375)
yuslepukhin May 7, 2026
46273ea
[Plugin EP] Session options getters (#28377)
adrianlizarraga May 8, 2026
db03660
Remove Universal Package publishing from plugin EP packaging pipeline…
edgchen1 May 8, 2026
fe8bf3a
Fix dangling pointer of temporary return value (#28419)
rajatmonga May 8, 2026
3f49243
Fix CUDA plugin EP packaging test pipeline: host network and add cudn…
tianleiwu May 8, 2026
a2cd643
Fix loader-refcount leak in ProviderLibrary::Load on GetProvider miss…
BoarQing May 8, 2026
aa92574
[CoreML EP] Add FusedConv support (#28289)
maxwbuckley May 9, 2026
3cc4cef
Add FLOAT8E8M0 data type support in ONNX Runtime (#28381)
Copilot May 9, 2026
82f3a83
Add PEP 561 `py.typed` marker to the `onnxruntime` package (#28438)
tairenpiao May 9, 2026
4fec745
Fix CrossEntropyLoss block to support multi-output models (#28232)
Rishi-Dave May 10, 2026
8024503
Fix session use-after-free when UserLoggingFunction is used (#28314)
tianleiwu May 10, 2026
335a217
feat(quantization): add ActivationRestrictedAsymmetric option (#28237)
Rishi-Dave May 11, 2026
672c0f3
[WebGPU] Fix numerical stability issue in QMoE op (#28434)
xenova May 11, 2026
ecc41e5
Don't pin SelectorActionTransformer replacement nodes to CPU (#28288)
maxwbuckley May 11, 2026
19f5d41
Remove data hash from hashing code, to prevent unbounded growth issue…
JonathanC-ARM May 11, 2026
df11457
Harden OneHot operator input validation and output size computation (…
GopalakrishnanN May 11, 2026
0cd3de8
[WebGPU Plugin EP] Packaging pipelines minor updates (#28460)
edgchen1 May 11, 2026
1b64fba
Handle empty initializers gracefully in optimizer passes (#27976)
GopalakrishnanN May 11, 2026
c6d9a02
Fix Clang + libc++ compilation and portability issues (#28049)
mustjab May 11, 2026
67d9400
Address tokenizer shortcomings (#28428)
yuslepukhin May 12, 2026
c1a61b4
Fix CUDA plugin EP test pipeline (#28453)
tianleiwu May 12, 2026
51c546b
Validate NodeArg creation (#28429)
yuslepukhin May 12, 2026
83e402e
[GQA] Make present_key/present_value outputs optional and add Gemma4 …
apsonawane May 12, 2026
d82a06d
[CUDA] PagedAttention: use exact max_query_len on FA path (#28409)
elwhyjay May 12, 2026
8cfb63e
Add CPU Cast op support for Float8E8M0 (#28435)
tianleiwu May 12, 2026
ece097d
Add DiT attention fusion for F5-TTS and diffusion transformer models …
Rishi-Dave May 12, 2026
8d24011
Fix runtime-unresolvable type annotations in Session and InferenceSes…
Rishi-Dave May 12, 2026
a49da87
Enable QuickGeluFusion on WebGPU EP and fix fp16 shader (#28410)
xiaofeihan1 May 12, 2026
756a523
Fix sigmoid transformation in TreeEnsembleClassifier for all-positive…
Copilot May 12, 2026
a12b611
[ Cuda] ConvTranspose-22 (#27710)
tianleiwu May 12, 2026
dac5a14
Add CPU QMoE 2-bit support and LUT GEMM fast path (#28185)
tianleiwu May 12, 2026
fafe564
[WebGPU] Fix SkipSimplifiedLayerNormalization bias (#28427)
xenova May 12, 2026
183e7a9
fix: skip DQ->MatMulNBits fusion when weight/scale initializer is sha…
Rishi-Dave May 12, 2026
72bce1d
[WebGPU] Correct MatMul bias input indexing in `MatMul::ComputeIntern…
Copilot May 12, 2026
239ef9e
Use symbol file to export symbols in onnxruntime library on AIX platf…
ayappanec May 12, 2026
480aad0
Fill CUDA EP opset gap for Sin and Cos operators (7→22) (#27756)
Copilot May 12, 2026
a8260cd
[webgpu] Optimize LinearAttention Op with subgroup (#28412)
daijh May 12, 2026
469bc9f
Address Resize kernel shortcomings (#28402)
yuslepukhin May 12, 2026
0e72188
Add RVV (RISC-V Vector Extension) optimized convolution and pooling k…
velonica0 May 12, 2026
73d5416
Add LabelEncoder_4 fast path for session creation (#27526)
SamuelLess May 13, 2026
4e377a1
Use weights_only=True for remaining torch.load() calls (#28421)
adrianlizarraga May 13, 2026
ad11f7c
Fix OOB reads in SoftmaxCrossEntropyLoss via label bounds validation …
vraspar May 13, 2026
a363e89
Bound total output allocation size in Tile kernel (#28070)
GopalakrishnanN May 13, 2026
e8ae6ce
Add float zero point support for 2-bit LUT GEMM in MatMulNBits (#28354)
vraspar May 13, 2026
9b3fb24
Fixes for unicode path handling (#28390)
jnagi-intel May 13, 2026
0139499
Fill opset gap for RandomNormal, RandomNormalLike, RandomUniform, Ran…
Copilot May 13, 2026
5f72956
Address latent bug in LabelEncoder (#28496)
yuslepukhin May 14, 2026
938b607
Optimize FlashAttention for M4 Max (20x speedup) (#27780)
xenova May 14, 2026
4bc2094
Address string attribute issue (#28494)
yuslepukhin May 14, 2026
fb2b467
Merge main with latest official onnxruntime from 2026 May 14
pedrovgs May 14, 2026
0a2d3bd
Update CircleCI macOS resource class to M4 Pro
pedrovgs May 14, 2026
3ce0d38
Disable shellcheck in lint workflow for our fork
pedrovgs May 14, 2026
ab6ea29
Revert "Disable shellcheck in lint workflow for our fork"
pedrovgs May 14, 2026
e6ccf73
Update CircleCI Xcode image to 16.3.0 for M4 Pro compatibility
pedrovgs May 14, 2026
fb4af06
Fix CircleCI resource class name to m4pro.medium
pedrovgs May 14, 2026
ae97eb8
Add CMAKE_POLICY_VERSION_MINIMUM=3.5 for CMake 4.x compat
pedrovgs May 14, 2026
c2bb76e
Remove CircleCI config, GH Actions handles xcframework build
pedrovgs May 14, 2026
The diff you're trying to view is too large. We only load the first 3000 changed files.
267 changes: 267 additions & 0 deletions .agents/skills/cuda-attention-kernel-patterns/SKILL.md
@@ -0,0 +1,267 @@
---
name: cuda-attention-kernel-patterns
description: Patterns and pitfalls for the ONNX domain Attention operator (opset 23/24) CUDA implementation. Use when modifying the dispatch cascade in core/providers/cuda/llm/attention.cc, writing mask/bias CUDA kernels, debugging attention test routing, or adding features to the ONNX Attention op. NOT for contrib domain MultiHeadAttention/GroupQueryAttention.
---

# ONNX Domain Attention (Opset 23/24) CUDA Patterns

Reusable knowledge from ONNX Attention CUDA development in ORT.

> **Scope**: This skill covers the **ONNX domain** `Attention` operator (opset 23/24)
> implemented at `core/providers/cuda/llm/attention.cc`. This is **separate from** the
> contrib domain `MultiHeadAttention` / `GroupQueryAttention` at `contrib_ops/cuda/bert/`.
> They share some underlying kernels (CUTLASS FMHA, Flash Attention) and infrastructure
> (`attention_softmax.h`) but have **different dispatch logic, parameter structs, and eligibility checks**.
>
> - **Shared infrastructure**: CUTLASS FMHA kernel, Flash kernel, unified unfused kernel
> (`unfused_attention.cu`), `attention_softmax.h`, `attention_impl.cu` (contrib only)
> - **ONNX-specific**: Dispatch cascade in `attention.cc`, `ConvertAttnMaskToBias`,
> `mask_filter_value` cap, parameter bridge to contrib structs, `attention_mask_impl.cu`
> - **Contrib-specific**: Own dispatch in contrib MHA/GQA ops, uses `contrib::AttentionParameters`
> directly, has XQA kernel, past-present buffer sharing

## 1. Runner Dispatch Cascade

CUDA attention dispatches in priority order: **Flash → MEA (Memory Efficient) → Unified Unfused Attention**.

```
// onnxruntime/core/providers/cuda/llm/attention.cc — ComputeInternal()
Flash eligible?   → RunFlashAttention()
  ↓ no
MEA eligible?     → RunMemoryEfficientAttention()
  ↓ no
Unified Unfused   → RunUnfusedAttention()
                    (handles both MHA and GQA via reshape-Q trick)
```

**Flash eligibility**: fp16/bf16 only, SM≥8.0 (Ampere+), `head_size == v_head_size`, `head_size <= 256`, no `output_qk`, `attn_mask == nullptr`. Uses `mha_fwd` / `mha_fwd_kvcache`.

**MEA eligibility**: SM50+/53+/80+ by dtype, `head_size <= 1024` and divisible by 8 (enforced by `has_memory_efficient_attention`), no `output_qk`. GQA additionally requires `head_size == v_head_size` (for `LaunchUngroup`); decode also requires it (for `LaunchConcatNewToPastKV`). Bias stride must satisfy `total_sequence_length % 4 == 0`. GQA with FP32 is excluded (LaunchUngroup only has fp16/bf16 instantiations). Supports `softcap + attn_mask` — CUTLASS applies softcap before bias in kernel tiles, matching ONNX spec ordering (onnx/onnx#7867, supersedes the now-closed onnx/onnx#7865 issue).

**Unified Unfused Attention**: Always available as the final fallback. Handles both MHA (`num_heads == kv_num_heads`, group=1) and GQA (`num_heads != kv_num_heads`, group>1) via a reshape-Q trick with stride-based cuBLAS batched GEMM (no K/V head replication). Uses FP32 QK scratch for precision. Supports all features:
- softcap + attn_mask (spec-correct ordering)
- output_qk (kQK mode: copies raw QK before softcap/mask mutations)
- past_key + past_value with `head_size != v_head_size` (separate K/V concat)
- causal masking, nonpad_kv_seqlen, all dtypes (fp16/bf16/fp32)
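
The cascade above can be sketched as a pure C++ function, which is sometimes handy for predicting which runner a test shape will hit. This is a simplified illustration, not the code in `attention.cc`: `AttentionCase`, `SelectRunner`, and the other names are hypothetical, the dtype-to-SM mapping in `MinSmForMea` is an assumption (the text only says "SM50+/53+/80+ by dtype"), and the bias-stride check is assumed to apply only when a mask or bias is present.

```cpp
// Hypothetical, simplified model of the eligibility rules listed above.
enum class DType { kFloat32, kFloat16, kBFloat16 };
enum class Runner { kFlash, kMemoryEfficient, kUnfused };

struct AttentionCase {
  DType dtype;
  int sm;                     // compute capability * 10, e.g. 80 for SM 8.0
  int num_heads;
  int kv_num_heads;
  int head_size;
  int v_head_size;
  int total_sequence_length;
  bool has_attn_mask;         // a mask or bias is present
  bool has_past;              // decode with past_key / past_value
  bool output_qk;
};

// ASSUMPTION: fp32 -> SM50, fp16 -> SM53, bf16 -> SM80.
inline int MinSmForMea(DType t) {
  switch (t) {
    case DType::kFloat32: return 50;
    case DType::kFloat16: return 53;
    default: return 80;
  }
}

inline bool FlashEligible(const AttentionCase& c) {
  return c.dtype != DType::kFloat32 && c.sm >= 80 &&
         c.head_size == c.v_head_size && c.head_size <= 256 &&
         !c.output_qk && !c.has_attn_mask;
}

inline bool MeaEligible(const AttentionCase& c) {
  const bool is_gqa = c.num_heads != c.kv_num_heads;
  if (c.sm < MinSmForMea(c.dtype)) return false;
  if (c.head_size > 1024 || c.head_size % 8 != 0) return false;
  if (c.output_qk) return false;
  // GQA (LaunchUngroup) and decode (LaunchConcatNewToPastKV) need equal head sizes.
  if ((is_gqa || c.has_past) && c.head_size != c.v_head_size) return false;
  // LaunchUngroup has no fp32 instantiation.
  if (is_gqa && c.dtype == DType::kFloat32) return false;
  // ASSUMPTION: the bias row-stride check only matters when a bias exists.
  if (c.has_attn_mask && c.total_sequence_length % 4 != 0) return false;
  return true;
}

// Flash first, then MEA, then the always-available unfused fallback.
inline Runner SelectRunner(const AttentionCase& c) {
  if (FlashEligible(c)) return Runner::kFlash;
  if (MeaEligible(c)) return Runner::kMemoryEfficient;
  return Runner::kUnfused;
}
```

Under these assumptions, an fp16 MHA case on SM 8.0 with no mask selects Flash; adding a mask moves it to MEA; and a `total_sequence_length` of 11 with a mask drops it to the unfused runner.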

## 2. CUTLASS kLog2e Overflow

CUTLASS `iterative_softmax` multiplies all attention scores by `kLog2e ≈ 1.4427` internally (for `exp2f` instead of `expf`). For float/bf16:

```
mask_filter_value = std::numeric_limits<float>::lowest() ≈ -3.40e+38
-3.40e+38 × 1.4427 ≈ -4.91e+38 → overflows fp32 → -inf
```

When all values become `-inf`, CUTLASS's special-case path produces `s_prime=0` → `1/s_prime=inf` → `0 × inf = NaN`.

**Fix**: Cap `mask_filter_value` to `-1.0e+30f` in `ConvertAttnMaskToBias`. This value is safe: `1e30 × 1.4427 ≈ 1.4e30 << FLT_MAX`, and `exp(-1e30) ≈ 0` (effectively masked).

**fp16 is NOT affected**: `lowest() = -65504`, and `-65504 × 1.4427 ≈ -94500` stays within fp32 range.

This cap is ONLY applied in MEA paths. The unfused path uses `lowest()` directly (its softmax subtracts max first, avoiding overflow).

**Subtlety**: When bias is present (`kSupportsBias=true`), CUTLASS pre-applies `p.scale` to QK (line 858) and uses `scaling=1.0f` in the softmax loop (line 981). So the full `kLog2e` multiplier hits the bias-dominated values — the overflow is head_size-independent. Without bias, `scaling = p.scale * kLog2e = kLog2e/sqrt(head_size)`, which is much smaller.
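
The arithmetic above can be checked on the host in plain C++. The sketch below is not CUTLASS code; `ScaleForExp2` and `NormalizeFullyMasked` are hypothetical helpers that only reproduce the fp32 rescale by `kLog2e` and the final `0 × (1/s_prime)` step described in this section.

```cpp
#include <cmath>
#include <limits>

constexpr float kLog2e = 1.4426950408889634f;

// Mimics the internal rescale: score * log2(e), evaluated in fp32.
inline float ScaleForExp2(float score) { return score * kLog2e; }

inline float Fp32Lowest() { return std::numeric_limits<float>::lowest(); }

constexpr float kCappedMaskFilterValue = -1.0e+30f;  // the cap described above
constexpr float kFp16Lowest = -65504.0f;             // fp16 lowest(), held in fp32

// When every score is -inf, the accumulated output is 0 and s_prime is 0,
// so the final normalization is 0 * (1 / 0) = 0 * inf = NaN.
inline float NormalizeFullyMasked() {
  const float inv_s_prime = std::numeric_limits<float>::infinity();  // 1 / s_prime, s_prime == 0
  const float accumulated = 0.0f;
  return accumulated * inv_s_prime;
}
```

With IEEE-754 fp32 arithmetic, `ScaleForExp2(Fp32Lowest())` is `-inf`, while the capped value and the fp16 `lowest()` both stay finite after scaling.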

## 3. Bias Alignment

CUTLASS FMHA requires the attention bias row stride to satisfy minimum alignment. The bias has shape `[B, H, S, T]` where `T = total_sequence_length` is the row stride.

```cpp
constexpr int min_bias_align = 4; // elements, not bytes
if (parameters.total_sequence_length % min_bias_align != 0) {
  mea_eligible = false;  // fall through to unfused
}
```

**Impact on tests**: If a test uses `total_sequence_length` not divisible by 4 (e.g., past=5 + new=6 = 11), MEA is rejected and unfused handles it. To test MEA with bias, ensure `total_sequence_length % 4 == 0`.
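
When choosing test shapes, a small helper can make the alignment rule explicit. The names `BiasStrideAligned` and `RoundUpToBiasAlign` below are hypothetical and exist only for illustration; the constant mirrors `min_bias_align` from the snippet above.

```cpp
constexpr int kMinBiasAlign = 4;  // elements, not bytes

// True when the bias row stride satisfies the alignment rule.
inline bool BiasStrideAligned(int total_sequence_length) {
  return total_sequence_length % kMinBiasAlign == 0;
}

// Smallest aligned length >= n; useful for picking an MEA-eligible test shape.
inline int RoundUpToBiasAlign(int n) {
  return (n + kMinBiasAlign - 1) / kMinBiasAlign * kMinBiasAlign;
}
```

For the example in the text, past=5 plus new=6 gives 11, which is not aligned; rounding up suggests a total length of 12 for a test that is meant to exercise MEA with bias.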

## 4. Softcap Ordering

ONNX Attention opset 23/24 spec ordering (per onnx/onnx#7867, which superseded
the now-closed onnx/onnx#7865 issue, and onnx/onnx#7913 which swapped
`qk_matmul_output_mode` values 1 and 2 to align with the corrected pipeline):

```
scale * (Q @ K^T) # stage 0: raw scaled QK
|
softcap (if > 0) # stage 1: tanh(qk / softcap) * softcap
|
+ attn_bias / + attn_mask # stage 2: additive (mask -inf survives to stage 3)
|
softmax # stage 3
|
@ V
```

`qk_matmul_output_mode` integer values follow pipeline stage order:
0 = raw scale*QK, 1 = post-softcap (pre-mask), 2 = post-mask/bias (pre-softmax),
3 = post-softmax.

CUDA implementation status (all spec-correct):
- **MEA (CUTLASS)**: `kernel_forward.h` applies softcap inside the score-compute
tile loop BEFORE `attn_bias` is added.
- **Flash**: `mha_fwd` / `mha_fwd_kvcache` handle softcap natively; reject
explicit `attn_mask`, so ordering with float mask is moot for this path.
- **Unfused**: `UnfusedSoftmaxKernel` does `QK -> scale -> softcap -> add bias -> softmax`
(all fused).

CPU implementation status: `core/providers/cpu/llm/attention.cc::ComputeAttentionProbs<T>`
applies softcap BEFORE the mask add (post-fix; pre-fix it inverted the order
and leaked probability through masked positions).

Why this ordering matters: a -inf in `attn_mask` must survive to softmax. If
softcap were applied AFTER the mask-add, then `tanh(-inf/softcap) * softcap = -softcap`
(a finite value), and softmax would assign non-zero weight to the masked
position — leaking poison V values into the output. The CUDA-side guard tests
at `test_onnx_attention/test_gqa.py:1501` and `:1761`, and the CPU-side guards
at `TestONNXAttentionCPUSoftcapMaskOrdering` in the same file, exercise this
property by combining small softcap, a -inf mask entry, and a poison V value.
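
The leak described above is easy to reproduce with a scalar reference. The following is a minimal C++ sketch, not the ORT kernels: `SpecOrder` applies softcap before the additive mask, `InvertedOrder` applies it after, and all function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// tanh(x / cap) * cap when cap > 0, identity otherwise.
inline float Softcap(float x, float cap) {
  return cap > 0.0f ? std::tanh(x / cap) * cap : x;
}

// Numerically stable softmax: subtract the row max before exponentiating.
inline std::vector<float> Softmax(std::vector<float> s) {
  const float m = *std::max_element(s.begin(), s.end());
  float sum = 0.0f;
  for (float& v : s) {
    v = std::exp(v - m);
    sum += v;
  }
  for (float& v : s) v /= sum;
  return s;
}

// Spec order: softcap, then additive mask, then softmax.
inline std::vector<float> SpecOrder(std::vector<float> qk,
                                    const std::vector<float>& mask, float cap) {
  for (std::size_t i = 0; i < qk.size(); ++i) qk[i] = Softcap(qk[i], cap) + mask[i];
  return Softmax(qk);
}

// Inverted order: mask first, then softcap. A -inf mask entry becomes -cap.
inline std::vector<float> InvertedOrder(std::vector<float> qk,
                                        const std::vector<float>& mask, float cap) {
  for (std::size_t i = 0; i < qk.size(); ++i) qk[i] = Softcap(qk[i] + mask[i], cap);
  return Softmax(qk);
}
```

With scaled scores `{1, 2}`, mask `{0, -inf}`, and `softcap = 1`, the spec order gives the masked position exactly zero weight, while the inverted order gives it a positive weight because `tanh(-inf) * 1 = -1` is finite.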

## 5. Grid-Stride Loops for CUDA Kernels

Always cap grid size to prevent exceeding `gridDim.x` limits, and use grid-stride loops for large workloads:

```cpp
constexpr int64_t kMaxGridDimX = 65535;
int threads = static_cast<int>(std::min(static_cast<int64_t>(max_threads_per_block), total));
int64_t blocks = (total + threads - 1) / threads;
unsigned int grid_size = static_cast<unsigned int>(std::min(blocks, kMaxGridDimX));

MyKernel<<<grid_size, threads, 0, stream>>>(...);

// Inside the kernel:
for (int64_t idx = blockIdx.x * blockDim.x + threadIdx.x;
     idx < total;
     idx += static_cast<int64_t>(gridDim.x) * blockDim.x) {
  // work
}
```

**Never** cast `int64_t` block count directly to `unsigned int` without capping — it silently truncates.

Always call `CUDA_CALL(cudaGetLastError())` after kernel launches in standalone helper functions. This is the established pattern in the file (see `ConcatPastToPresent`, `PastPresentBufferShare`).
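
The host-side arithmetic can be exercised without a GPU. The sketch below is plain C++ under stated assumptions: `LaunchDims`, `ComputeLaunchDims`, `UncappedCast`, and `EmulateGridStride` are hypothetical names, `total` is assumed to be positive, and `unsigned int` is assumed to be 32 bits wide. It shows the silent truncation that the cap prevents and emulates the grid-stride loop on the CPU to confirm that each element is visited exactly once.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int64_t kMaxGridDimX = 65535;

struct LaunchDims {
  unsigned int grid;
  int threads;
};

// Same arithmetic as the launch snippet above; assumes total > 0.
inline LaunchDims ComputeLaunchDims(int64_t total, int max_threads_per_block) {
  const int threads = static_cast<int>(std::min<int64_t>(max_threads_per_block, total));
  const int64_t blocks = (total + threads - 1) / threads;
  return {static_cast<unsigned int>(std::min(blocks, kMaxGridDimX)), threads};
}

// What happens without the cap: the value wraps modulo 2^32.
inline unsigned int UncappedCast(int64_t blocks) {
  return static_cast<unsigned int>(blocks);
}

// CPU emulation of the grid-stride loop: returns how often each index is visited.
inline std::vector<int> EmulateGridStride(int64_t total, LaunchDims d) {
  std::vector<int> visits(static_cast<std::size_t>(total), 0);
  const int64_t stride = static_cast<int64_t>(d.grid) * d.threads;
  for (int64_t tid = 0; tid < stride; ++tid) {
    for (int64_t idx = tid; idx < total; idx += stride) {
      ++visits[static_cast<std::size_t>(idx)];
    }
  }
  return visits;
}
```

A block count of `2^32 + 5` cast directly to `unsigned int` becomes 5, so most of the work would never be launched. With the cap, the grid is clamped to 65535 and the stride loop covers the remainder.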

## 6. Fully-Masked Batches

All-false bool masks or `seqlens_k=0` produce NaN in CUTLASS MEA.

**Additive-bias path** (bool mask converted to bias): Fixed by capping `mask_filter_value` to `-1e+30f` (see section 2). CUTLASS then naturally computes uniform softmax → mean(V).

**Nonpad path** (`seqlens_k=0`): CUTLASS skips all K/V positions → `s_prime=0` → NaN. Fixed by `ZeroOutputForFullyMaskedBatches` kernel which zeros output for batches where `seqlens_k[b] == 0`. Note: this produces zeros, not mean(V) — a cross-EP consistency TODO exists.

**CPU/Unfused behavior**: `mask_filter_value = lowest()` (not `-inf`). All masked values are equal → `softmax(equal) = 1/N` → output = mean(V). This is the spec reference.
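
The difference between a finite `lowest()` and a true `-inf` can be shown with a subtract-max softmax in plain C++. This is an illustrative sketch with hypothetical helper names (`StableSoftmax`, `WeightedSum`), not the ORT implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Subtract-max softmax, as used by the CPU and unfused paths described above.
inline std::vector<float> StableSoftmax(std::vector<float> s) {
  const float m = *std::max_element(s.begin(), s.end());
  float sum = 0.0f;
  for (float& v : s) {
    v = std::exp(v - m);  // equal finite inputs give exp(0) = 1
    sum += v;
  }
  for (float& v : s) v /= sum;
  return s;
}

// Attention output for one row with scalar V entries: sum_i w[i] * v[i].
inline float WeightedSum(const std::vector<float>& w, const std::vector<float>& v) {
  float out = 0.0f;
  for (std::size_t i = 0; i < w.size(); ++i) out += w[i] * v[i];
  return out;
}
```

Four scores all equal to `lowest()` yield weights of exactly 0.25 each, so the output is mean(V). Replacing them with `-inf` makes `v - m` evaluate to `-inf - (-inf) = NaN`, which is why a finite filter value is used.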

## 7. Test Runner Targeting

Use `ScopedEnvironmentVariables` to force specific CUDA runners:

```cpp
// Force MEA (disable Flash)
ScopedEnvironmentVariables scoped_env({
    {"ORT_DISABLE_FLASH_ATTENTION", "1"},
});

// Force Unfused (disable both Flash and MEA).
// Use this in a separate test or scope: `scoped_env` cannot be declared twice in one scope.
ScopedEnvironmentVariables scoped_env({
    {"ORT_DISABLE_FLASH_ATTENTION", "1"},
    {"ORT_DISABLE_MEMORY_EFFICIENT_ATTENTION", "1"},
});
```

**Always verify which runner a test actually hits.** A test designed for MEA may silently fall to unfused if:
- `total_sequence_length % 4 != 0` (bias alignment)
- `head_size != v_head_size` (decode path)
- fp32 dtype with GQA (`LaunchUngroup` supports fp16/bf16 only)
- fp32 dtype on SM < 80

Enable verbose logging to confirm: `LOGS_DEFAULT(VERBOSE) << "ONNX Attention: using ..."`.
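
When planning a test, the checklist above can be written down as a small predicate. This is a hypothetical test-planning aid, not the dispatch code in `attention.cc`: the struct, its field names, and the reading of "GQA" as `q_num_heads != kv_num_heads` are assumptions, and the helper only restates the four listed conditions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of a test configuration.
struct AttentionTestShape {
  int total_sequence_length;
  int head_size;
  int v_head_size;
  int q_num_heads;
  int kv_num_heads;
  bool is_fp32;
  int sm;  // compute capability as an integer, e.g. 75, 80, 90
};

// Returns the listed reasons why a test meant for MEA might fall to unfused.
// An empty result does not prove MEA is used; confirm with verbose logging.
inline std::vector<std::string> MeaFallbackReasons(const AttentionTestShape& s) {
  assert(s.total_sequence_length > 0 && s.q_num_heads > 0 && s.kv_num_heads > 0);
  std::vector<std::string> reasons;
  if (s.total_sequence_length % 4 != 0) {
    reasons.push_back("total_sequence_length % 4 != 0 (bias alignment)");
  }
  if (s.head_size != s.v_head_size) {
    reasons.push_back("head_size != v_head_size (decode path)");
  }
  if (s.is_fp32 && s.q_num_heads != s.kv_num_heads) {
    reasons.push_back("fp32 with GQA");
  }
  if (s.is_fp32 && s.sm < 80) {
    reasons.push_back("fp32 on SM < 80");
  }
  return reasons;
}
```

A non-empty result is a prompt to adjust the shape or dtype, or to check the verbose log line before trusting the test's coverage claim.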

## 8. Cross-EP Consistency

CPU is the spec reference implementation. CUDA outputs should match CPU for all valid inputs.

- CPU uses `mask_filter_value = std::numeric_limits<T>::lowest()` (finite, not `-inf`)
- CPU softmax: subtract-max-first → works correctly with extreme finite values
- CPU handles fully-masked batches naturally (uniform softmax → mean(V))

Run tests with `disable_cpu=false` to always validate against CPU. The C++ test framework (`RunTest4D`) supports `disable_cpu`, `disable_cuda`, `disable_dml` flags.

## 9. File Locations

### ONNX Domain (this op's code)

| File | Purpose |
|------|---------|
| `core/providers/cuda/llm/attention.cc` | ONNX Attention CUDA dispatch: Flash/MEA/Unfused cascade, `ConvertAttnMaskToBias`, parameter setup |
| `core/providers/cuda/llm/attention_mask_impl.cu` | ONNX-specific mask/bias CUDA kernels: bool→bias, nonpad→seqlens_k, ZeroOutput, bias composition |
| `core/providers/cuda/llm/attention_mask_impl.h` | Declarations for ONNX mask/bias kernels |
| `core/providers/cpu/llm/attention.cc` | CPU reference implementation (ONNX domain) |
| `core/providers/cpu/llm/attention_helper.h` | ONNX parameter validation and shape computation |
| `test/providers/cpu/llm/attention_op_test.cc` | C++ attention tests (all EPs) |
| `test/python/transformers/test_onnx_attention/test_mha.py` | Python parity tests |
| `test/python/transformers/test_onnx_attention/common.py` | Python test utilities and reference `attention_ref()` |

### Shared Infrastructure (used by both ONNX and contrib ops)

| File | Purpose |
|------|---------|
| `contrib_ops/cuda/bert/unfused_attention.cu` | Unified unfused attention: QK GEMM (FP32), fused softmax kernel (scale+softcap+bias+causal), V GEMM. Handles MHA and GQA. |
| `contrib_ops/cuda/bert/unfused_attention.h` | `UnfusedAttentionParams`, `LaunchUnfusedAttention`, workspace size |
| `contrib_ops/cuda/bert/attention_impl.cu` | Legacy unfused `QkvToContext` (contrib MHA only). Also `ApplySoftcap`, `ConcatPastToPresent` |
| `contrib_ops/cuda/bert/attention_softmax.h` | CUDA softmax kernels (`ComputeSoftmax`, `ComputeSoftmaxWithRawMask`) — used by legacy contrib path |
| `contrib_ops/cuda/bert/cutlass_fmha/` | CUTLASS FMHA (Memory Efficient Attention) kernels |
| `contrib_ops/cuda/bert/flash_attention/` | Flash Attention kernels |

### Contrib Domain (separate ops, NOT covered by this skill)

| File | Purpose |
|------|---------|
| `contrib_ops/cuda/bert/multihead_attention.cu` | Contrib `MultiHeadAttention` — own dispatch, uses `contrib::AttentionParameters` directly |
| `contrib_ops/cuda/bert/group_query_attention.cu` | Contrib `GroupQueryAttention` — has XQA kernel, past-present buffer sharing |

## 10. Parameter Bridge (ONNX → Contrib)

The ONNX Attention op uses `attention_helper::AttentionParameters` (in `core/providers/cpu/llm/attention_parameters.h`). The unified unfused kernel (`LaunchUnfusedAttention`) uses its own `UnfusedAttentionParams` struct populated directly from ONNX parameters in `RunUnfusedAttention`.

The contrib `QkvToContext` function (used by contrib MHA, NOT by ONNX Attention) uses `contrib::AttentionParameters`. ONNX Attention does **not** bridge to `contrib::AttentionParameters` — it routes through the unified unfused kernel instead.

## 11. Causal Alignment

The ONNX spec defines two causal alignment modes based on where query positions sit in the full attention matrix:

- **Upper-left**: `q_i` attends to `kv[0..i]`. Query positions start at 0 in the full matrix.
- **Lower-right**: `q_i` attends to `kv[0..kv_len - q_len + i]`. Query positions sit at the end of the full matrix, so the last query sees every key.

**ONNX spec rule**: `is_causal=1` always means upper-left in the full matrix. When `past_key` provides context, `past_sequence_length` shifts the query start position forward — the resulting `[S_q × total_kv]` sub-matrix effectively has lower-right alignment.
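
Both alignments reduce to a single offset on the query index. The sketch below is an illustration under that reading, with hypothetical names (`CausalAlignment`, `IsVisible`, `VisibleCount`); the lower-right offset `kv_len - q_len` plays the role of `past_sequence_length`.

```cpp
#include <cassert>
#include <cstdint>

enum class CausalAlignment { kUpperLeft, kLowerRight };

// True if query row i may attend to key column j under the given alignment.
inline bool IsVisible(int64_t i, int64_t j, int64_t q_len, int64_t kv_len,
                      CausalAlignment alignment) {
  assert(0 <= i && i < q_len && 0 <= j && j < kv_len);
  // Upper-left: query i sits at position i. Lower-right: at i + (kv_len - q_len).
  const int64_t offset =
      (alignment == CausalAlignment::kLowerRight) ? kv_len - q_len : 0;
  return j <= i + offset;
}

// Number of keys that query row i can see.
inline int64_t VisibleCount(int64_t i, int64_t q_len, int64_t kv_len,
                            CausalAlignment alignment) {
  int64_t count = 0;
  for (int64_t j = 0; j < kv_len; ++j) {
    if (IsVisible(i, j, q_len, kv_len, alignment)) ++count;
  }
  return count;
}
```

For a decode-shaped call with `q_len = 1` and `kv_len = 8`, upper-left lets the single query see one key and lower-right lets it see all eight; when `q_len == kv_len` the two modes agree.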

### Per-kernel behavior

| Kernel | Alignment | Mechanism |
|--------|-----------|-----------|
| **Flash** | Lower-right only | `is_causal` flag → `seqlen_k - seqlen_q` offset in kernel. No upper-left option. |
| **MEA (CUTLASS)** | Both | `causal_from_top_left` flag in `MemoryEfficientAttentionParams`. `true` → `CausalFromTopLeft` (offset=0). `false` → `CausalFromBottomRight` (offset = num_keys - num_queries). |
| **Unfused** | Both | `past_kv_length` param. `0` → upper-left. `total_kv - S_q` → lower-right. |

### Dispatch logic in attention.cc

```cpp
// Flash cannot do upper-left → guarded by causal_cross_no_past
bool causal_cross_no_past = parameters.is_causal &&
                            parameters.q_sequence_length != parameters.total_sequence_length &&
                            parameters.past_sequence_length == 0;

// Flash: skip when causal_cross_no_past (no top-left support)
// MEA: NOT skipped — handles it via causal_from_top_left = (past_sequence_length == 0)
// Unfused: always correct via past_kv_length = parameters.past_sequence_length
```

### When S_q == S_kv

Upper-left and lower-right produce **identical** results when `S_q == S_kv` (the offset is 0 either way). The alignment distinction only matters for cross-attention shapes (`S_q != S_kv`).

### TensorScatter decode (opset 24 external KV cache)

TensorScatter manages KV cache externally — `past_key` is nullptr but K/V already contain the full sequence. Per the ONNX spec, `is_causal` with `S_q != S_kv` and no `past_key` means upper-left (q[0] sees only kv[0]), which is **not meaningful for decode**.

**Correct pattern**: TensorScatter decode must use `is_causal=0` and rely on `nonpad_kv_seqlen` to bound the active KV range. Models using `is_causal=1` with TensorScatter decode have a spec-invalid combination.
85 changes: 85 additions & 0 deletions .agents/skills/ort-build/SKILL.md
@@ -0,0 +1,85 @@
---
name: ort-build
description: Build ONNX Runtime from source. Use this skill when asked to build, compile, or generate CMake files for ONNX Runtime.
---

# Building ONNX Runtime

The build scripts `build.sh` (Linux/macOS) and `build.bat` (Windows) delegate to `tools/ci_build/build.py`.

## Build phases

Three phases, controlled by flags:

- `--update` — generate CMake build files
- `--build` — compile (add `--parallel` to speed this up)
- `--test` — run tests

For native builds, if none are specified (and `--skip_tests` is not passed), **all three run by default**. For cross-compiled builds, the default is `--update` + `--build` only.

### When to use `--update`

You need `--update` when:
- First build in a new build directory
- New source files are added (some CMake targets use glob patterns, others use explicit file lists — re-run to pick up new files either way)
- CMake configuration changes (new flags, updated CMakeLists.txt)

You do **not** need `--update` when only modifying existing `.cc`/`.h` files — just use `--build`. Skipping it saves time.

## Examples

```bash
# Full build (update + build + test)
./build.sh --config Release --parallel
.\build.bat --config Release --parallel # Windows

# Just regenerate CMake files
./build.sh --config Release --update

# Just compile (skip CMake regeneration and tests)
./build.sh --config Release --build --parallel

# Just run tests (after a prior build)
./build.sh --config Release --test

# Build with CUDA execution provider
./build.sh --config Release --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/local/cuda

# Build Python wheel
./build.sh --config Release --parallel --build_wheel

# Build a specific CMake target (much faster than a full build)
./build.sh --config Release --build --parallel --target onnxruntime_common

# Load flags from an option file (one flag per line)
./build.sh "@./custom_options.opt" --build --parallel
```

## Key flags

| Flag | Description |
|------|-------------|
| `--config` | `Debug`, `MinSizeRel`, `Release`, or `RelWithDebInfo` |
| `--parallel` | Enable parallel compilation (recommended) |
| `--skip_tests` | Skip running tests after build |
| `--build_wheel` | Build the Python wheel package |
| `--use_cuda` | Enable CUDA EP. Requires `--cuda_home`/`--cudnn_home` or `CUDA_HOME`/`CUDNN_HOME` env vars. On Windows, only `cuda_home`/`CUDA_HOME` is validated. |
| `--target T` | Build a specific CMake target (requires `--build`; e.g., `onnxruntime_common`, `onnxruntime_test_all`) |
| `--build_dir` | Build output directory |

## Build output path

Default: `build/<Platform>/<Config>/` where Platform is `Linux`, `MacOS`, or `Windows`.

With Visual Studio multi-config generators, the config name appears twice (e.g., `build/Windows/Release/Release/`).

It may be customized with `--build_dir`.

## Agent tips

- **Activate a Python virtual environment** before building. See "Python > Virtual environment" in `AGENTS.md`.
- **Prefer `python tools/ci_build/build.py` directly** over `build.bat`/`build.sh` when redirecting output. The `.bat` wrapper runs in `cmd.exe`, which breaks PowerShell redirection.
- **Redirect output to a file** (e.g., `> build_log.txt 2>&1`). Build output is large and will overflow terminal buffers.
- **Run builds in the background** — a full build can take tens of minutes to over an hour. Poll the log for `"Build complete"` or errors.
- **Use `--parallel`** by default unless the user says otherwise.
- Ask the user what they want to build (config, execution providers, wheel, etc.) if not clear from their prompt.