Windows ARM64 (Snapdragon X Elite) Build: Three Blockers and Fixes #440

@caprion

System

OS      Windows 11 ARM64, Build 26200
CPU     Qualcomm Snapdragon X Elite
VS      2022 Community 17.14, ClangCL 19.1.5
CMake   4.2.3
Python  3.13 ARM64

Summary

A native Windows ARM64 build works on Snapdragon X Elite once three fixes are applied. With them, llama-cli.exe builds and runs at 28.59 tok/s (i2_s kernel, 8 threads, 2.41B model). MATMUL_INT8 = 1 confirms i8mm is active.


Blocker 1: Wrong -march flags (i8mm falsely believed unsupported)

The existing documentation and community issues mark i8mm support on Snapdragon X Elite as "unknown." It is not unknown: the CPU supports i8mm. /proc/cpuinfo under WSL2 confirms:

Features: ... i8mm bf16 dotprod asimddp ...

The error when building with -march=armv8.2-a+fp16:

error: always_inline function 'vmmlaq_s32' requires target feature 'i8mm',
but would be inlined into function 'ggml_vec_dot_q4_0_q8_0' that is compiled
without support for 'i8mm'

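ClangCL detects the ARM64 target and conditionally enables the i8mm code path, but -march=armv8.2-a+fp16 does not enable the i8mm target feature, so the vmmlaq_s32 intrinsic cannot be inlined. Fix: use -march=armv8.6-a+fp16. The Snapdragon X Elite is ARMv8.6-A capable, and i8mm is a mandatory part of ARMv8.6-A, so this -march value enables it.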
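
A quick way to check whether a given set of compiler flags actually turns i8mm on is to test the ACLE feature macro __ARM_FEATURE_MATMUL_INT8, which the compiler defines only when the target feature is enabled. The following is a minimal sketch, not code from the repository; the helper names i8mm_enabled_at_compile_time, i8mm_summary and assert_i8mm_if_required are hypothetical.

```cpp
#include <cassert>
#include <string>

// True only if the compiler flags (for example -march=armv8.6-a) enabled the
// int8 matrix-multiply extension for this translation unit.
constexpr bool i8mm_enabled_at_compile_time() {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: one-line summary suitable for a startup log.
std::string i8mm_summary() {
    return std::string("i8mm (compile-time): ") +
           (i8mm_enabled_at_compile_time() ? "enabled" : "not enabled");
}

// Hypothetical guard for builds that must not silently fall back to a
// non-i8mm path. The check is compiled out when NDEBUG is defined.
inline void assert_i8mm_if_required(bool required) {
    assert((!required || i8mm_enabled_at_compile_time()) && "built without i8mm");
}
```

Printing i8mm_summary() from a small test program built with the same flags as the project should report "enabled" with -march=armv8.6-a+fp16 and "not enabled" with -march=armv8.2-a+fp16. This only reflects what the compiler was told, not what the CPU supports at runtime.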


Blocker 2: C++ exceptions disabled by default in ClangCL

error: cannot use 'throw' with exceptions disabled

ClangCL defaults to /EHs-. llama.cpp uses throw throughout. Fix: add /EHsc to CMAKE_CXX_FLAGS.
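
To confirm that /EHsc actually reached the compiler, a small probe can check the exception-related predefined macros. This is a sketch under the assumption that __cpp_exceptions (the standard feature-test macro) or _CPPUNWIND (the MSVC-style macro) is defined when exceptions are on; the function names are hypothetical and not part of llama.cpp.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// True if this translation unit was compiled with C++ exceptions enabled.
constexpr bool exceptions_enabled() {
#if defined(__cpp_exceptions) || defined(_CPPUNWIND)
    return true;
#else
    return false;
#endif
}

// Throws and catches a probe exception when exceptions are enabled.
// The throw is guarded by the preprocessor so the file still compiles
// when exceptions are disabled.
std::string probe_exceptions() {
#if defined(__cpp_exceptions) || defined(_CPPUNWIND)
    try {
        throw std::runtime_error("probe");
    } catch (const std::runtime_error& e) {
        return std::string("caught: ") + e.what();
    }
#endif
    return "exceptions disabled";
}
```

Built with /EHsc, probe_exceptions() should return "caught: probe". Built without it, the guarded branch is skipped and the function returns "exceptions disabled" instead of failing to compile.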


Blocker 3: Missing <chrono> include in common/common.cpp and common/log.cpp

error: no type named 'system_clock' in namespace 'std::chrono'
error: 'clock' is not a class, namespace, or enumeration

On Linux and macOS, <chrono> is pulled in transitively through <thread>. On Windows with ClangCL it is not. Both common.cpp and log.cpp use std::chrono without explicitly including <chrono>. Fix: add #include <chrono> to both files.
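
The pattern behind the fix is simply to include every header whose names a file uses. The sketch below is illustrative and not taken from common.cpp or log.cpp; unix_time_ms and timed_sleep_ms are hypothetical helpers that use both std::chrono clocks and <thread>.

```cpp
#include <cassert>
#include <chrono>   // explicit include; do not rely on <thread> providing it
#include <cstdint>
#include <thread>

// Wall-clock time in milliseconds, using std::chrono::system_clock.
std::int64_t unix_time_ms() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(
               system_clock::now().time_since_epoch()).count();
}

// Sleeps for the requested time and returns the measured elapsed
// milliseconds, using the monotonic steady_clock.
std::int64_t timed_sleep_ms(int requested_ms) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(requested_ms));
    const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
                             clock::now() - start).count();
    assert(elapsed >= 0);  // steady_clock never goes backwards
    return elapsed;
}
```

With the explicit include, the file compiles the same way regardless of which standard library headers happen to include each other on a given platform.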


Working Build Command

From a VS 2022 Developer PowerShell with ClangCL on PATH:

# Kernel generation (required first)
python utils/codegen_tl1.py --model bitnet_b1_58-3B --BM 160,320,320 --BK 64,128,64 --bm 32,64,32

# Configure
cmake -B build `
  -T ClangCL `
  -DBITNET_ARM_TL1=OFF `
  -DCMAKE_BUILD_TYPE=Release `
  -DCMAKE_C_FLAGS="-march=armv8.6-a+fp16" `
  -DCMAKE_CXX_FLAGS="-march=armv8.6-a+fp16 /EHsc"

# Build
cmake --build build --config Release --target llama-cli

Confirmed Performance — Snapdragon X Elite, i2_s kernel

Metric       Value
Model        BitNet-b1.58-2B-4T (i2_s)
Load time    844 ms
Prompt eval  210.27 tok/s
Generation   28.59 tok/s
Threads      8
MATMUL_INT8  active

Suggested Repo Changes

  1. CMakeLists.txt — detect ClangCL on ARM64 Windows and add /EHsc and -march=armv8.6-a+fp16 automatically
  2. common/common.cpp and common/log.cpp — add #include <chrono> explicitly (transitive include is not portable)
  3. Documentation — note that Snapdragon X Elite supports i8mm, bf16, dotprod; -march=armv8.6-a is the correct target
  4. CI — consider adding a Windows ARM64 build check (GitHub Actions now supports ARM64 runners)

Additional Note

gguf-py install fails on Windows ARM64 due to a CMake 4.x incompatibility in the sentencepiece submodule and a hardcoded -A x64 arch flag. This blocks setup_env.py but is not needed for inference if GGUF models are downloaded directly from HuggingFace.
