Description
System
| Component | Version |
|---|---|
| OS | Windows 11 ARM64 Build 26200 |
| CPU | Qualcomm Snapdragon X Elite |
| VS | 2022 Community 17.14, ClangCL 19.1.5 |
| CMake | 4.2.3 |
| Python | 3.13 ARM64 |
Summary
Native Windows ARM64 build is possible on Snapdragon X Elite with three fixes. After applying them, llama-cli.exe builds and runs at 28.59 tok/s (i2_s kernel, 8 threads, 2.41B model). MATMUL_INT8 = 1 confirms i8mm is active.
Blocker 1: Wrong -march flags (i8mm falsely believed unsupported)
The existing documentation and community issues mark i8mm support on Snapdragon X Elite as "unknown." It is not unknown: the CPU fully supports it. WSL2 `/proc/cpuinfo` confirms:
```
Features: ... i8mm bf16 dotprod asimddp ...
```
The error when building with -march=armv8.2-a+fp16:
```
error: always_inline function 'vmmlaq_s32' requires target feature 'i8mm',
but would be inlined into function 'ggml_vec_dot_q4_0_q8_0' that is compiled
without support for 'i8mm'
```
ClangCL detects the ARM64 target and conditionally enables the i8mm code path, but the `-march` flag does not authorise the instructions. Fix: use `-march=armv8.6-a+fp16`; the Snapdragon X Elite is ARMv8.6-A capable.
Blocker 2: C++ exceptions disabled by default in ClangCL
```
error: cannot use 'throw' with exceptions disabled
```
ClangCL defaults to /EHs-. llama.cpp uses throw throughout. Fix: add /EHsc to CMAKE_CXX_FLAGS.
Blocker 3: Missing <chrono> include in common/common.cpp and common/log.cpp
```
error: no type named 'system_clock' in namespace 'std::chrono'
error: 'clock' is not a class, namespace, or enumeration
```
On Linux/macOS, `<chrono>` is pulled in transitively through `<thread>`. On Windows with ClangCL it is not. Both common.cpp and log.cpp use `std::chrono` without explicitly including `<chrono>`. Fix: add `#include <chrono>` to both files.
Working Build Command
From a VS 2022 Developer PowerShell with ClangCL on PATH:
```powershell
# Kernel generation (required first)
python utils/codegen_tl1.py --model bitnet_b1_58-3B --BM 160,320,320 --BK 64,128,64 --bm 32,64,32

# Configure
cmake -B build `
  -T ClangCL `
  -DBITNET_ARM_TL1=OFF `
  -DCMAKE_BUILD_TYPE=Release `
  -DCMAKE_C_FLAGS="-march=armv8.6-a+fp16" `
  -DCMAKE_CXX_FLAGS="-march=armv8.6-a+fp16 /EHsc"

# Build
cmake --build build --config Release --target llama-cli
```
Confirmed Performance — Snapdragon X Elite, i2_s kernel
| Metric | Value |
|---|---|
| Model | BitNet-b1.58-2B-4T (i2_s) |
| Load time | 844 ms |
| Prompt eval | 210.27 tok/s |
| Generation | 28.59 tok/s |
| Threads | 8 |
| MATMUL_INT8 | active |
Suggested Repo Changes
- CMakeLists.txt — detect ClangCL on ARM64 Windows and add `/EHsc` and `-march=armv8.6-a+fp16` automatically
- common/common.cpp and common/log.cpp — add `#include <chrono>` explicitly (the transitive include is not portable)
- Documentation — note that Snapdragon X Elite supports i8mm, bf16, and dotprod; `-march=armv8.6-a` is the correct target
- CI — consider adding a Windows ARM64 build check (GitHub Actions now supports ARM64 runners)
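For the CMakeLists.txt item, one possible shape (a sketch only; variable checks and placement would need to be adapted to how the project's CMake is actually organized):

```cmake
# Sketch: apply the ClangCL-on-Windows-ARM64 fixes automatically.
if (MSVC AND CMAKE_CXX_COMPILER_ID MATCHES "Clang"
         AND CMAKE_SYSTEM_PROCESSOR MATCHES "ARM64|aarch64")
    # ClangCL accepts GCC-style -march alongside MSVC-style flags
    add_compile_options(-march=armv8.6-a+fp16)
    # re-enable C++ exceptions (ClangCL defaults to /EHs-)
    add_compile_options($<$<COMPILE_LANGUAGE:CXX>:/EHsc>)
endif()
```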
Additional Note
gguf-py install fails on Windows ARM64 due to a CMake 4.x incompatibility in the sentencepiece submodule and a hardcoded -A x64 arch flag. This blocks setup_env.py but is not needed for inference if GGUF models are downloaded directly from HuggingFace.