## Problem

Installing `llama-cpp-python` with a GPU backend requires setting `CMAKE_ARGS` as an environment variable at build time: `CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python`. This creates pain across the ecosystem:

- **Not declarable in `pyproject.toml`** — every downstream project needs custom Makefiles or install scripts with GPU auto-detection logic (macOS → Metal, `nvidia-smi` → CUDA, `rocminfo` → ROCm, fallback → OpenBLAS). This logic is duplicated across hundreds of projects.
- **Cache invalidation is broken** — pip and uv cache wheels by package version, not by `CMAKE_ARGS`. A cached OpenBLAS wheel is silently reused when Metal or CUDA is requested. The only workaround is `--no-cache`, which defeats caching entirely.
- **GPU prebuilt wheels stop at Python 3.12** — the Metal wheel CI (`build-wheels-metal.yaml`) is hardcoded to `CIBW_BUILD: "cp39-* cp310-* cp311-* cp312-*"`, and the CUDA wheel CI (`build-wheels-cuda.yaml`) has its matrix pinned to Python 3.9–3.12. CPU-only wheels include 3.13 (via the default cibuildwheel config in `build-and-release.yaml`), but the arm64 job there also pins to cp38–cp312. No workflow produces 3.14 or free-threaded (3.13t/3.14t) wheels. Python 3.13 has been stable since October 2024 and 3.14 since October 2025. Free-threaded builds are increasingly important — vLLM, llguidance, and the broader no-GIL ecosystem depend on them.

Current state of published wheel indexes: `/whl/cpu/`, `/whl/metal/`, `/whl/cu1xx/`.
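To make the first point concrete, the auto-detection logic that each downstream project currently reimplements looks roughly like this (a minimal sketch; `detect_cmake_args` is a hypothetical helper name, and the exact CMake flag names vary across llama.cpp versions):

```python
import platform
import shutil

def detect_cmake_args() -> str:
    """Pick CMAKE_ARGS the way downstream install scripts typically do."""
    if platform.system() == "Darwin":
        return "-DGGML_METAL=on"            # macOS -> Metal
    if shutil.which("nvidia-smi"):
        return "-DGGML_CUDA=on"             # NVIDIA driver present -> CUDA
    if shutil.which("rocminfo"):
        return "-DGGML_HIPBLAS=on"          # ROCm present -> HIP
    # fallback: CPU build with OpenBLAS
    return "-DGGML_BLAS=on -DGGML_BLAS_VENDOR=OpenBLAS"

if __name__ == "__main__":
    print(detect_cmake_args())
```

None of this can live in `pyproject.toml`, which is why it ends up copy-pasted into Makefiles and install scripts.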
## Proposed changes

### 1. Expand prebuilt wheel matrix (highest impact, smallest change)

Update `CIBW_BUILD` in the Metal/CUDA workflows and add free-threaded support. This is the single highest-impact change — it eliminates source builds for most users.
- `build-wheels-metal.yaml` — upgrade cibuildwheel from v2.22.0 to v3.x (3.0 added cp314/cp314t support). In cibuildwheel 3.0, cp314t is built by default (free-threading is no longer experimental in 3.14), while cp313t still requires `CIBW_ENABLE: cpython-freethreading`.
- `build-and-release.yaml` — the same cibuildwheel upgrade, plus updating the `build_wheels_arm64` job.
- `build-wheels-cuda.yaml` — uses a different build system (`python -m build --wheel` driven by a PowerShell matrix). Its `pyver` matrix would need `"3.13"` and `"3.14"` added.
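As a sketch of the Metal workflow change (the surrounding job layout is omitted and may differ; the `CIBW_ENABLE` value follows cibuildwheel 3.x conventions):

```yaml
# build-wheels-metal.yaml (sketch)
env:
  # extend the hardcoded list through 3.13/3.14
  CIBW_BUILD: "cp39-* cp310-* cp311-* cp312-* cp313-* cp314-*"
  # opt in to cp313t; cp314t builds by default under cibuildwheel 3.x
  CIBW_ENABLE: "cpython-freethreading"
```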
With prebuilt wheels, any downstream project can use uv's declarative index support:
```toml
# pyproject.toml — zero Makefile, zero CMAKE_ARGS
[project]
dependencies = ["llama-cpp-python~=0.3"]

[tool.uv.sources]
llama-cpp-python = [
    { index = "llama-metal", marker = "sys_platform == 'darwin'" },
    { index = "llama-cpu", marker = "sys_platform == 'linux'" },
]

[[tool.uv.index]]
name = "llama-metal"
url = "https://abetlen.github.io/llama-cpp-python/whl/metal"
explicit = true

[[tool.uv.index]]
name = "llama-cpu"
url = "https://abetlen.github.io/llama-cpp-python/whl/cpu"
explicit = true
```
### 2. Document `--config-settings` as the source-build path

Since the build backend is scikit-build-core, CMake args can be passed via the standard PEP 517 config-settings interface:
```shell
pip install llama-cpp-python -C cmake.args="-DGGML_METAL=on"
# or with uv:
uv pip install llama-cpp-python -C cmake.args="-DGGML_METAL=on"
```
This is cleaner than the `CMAKE_ARGS` env var — it's the standard PEP 517 mechanism, more explicit, and discoverable. It's already supported via scikit-build-core but not documented in the README or install docs.
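A project that always wants a source build with a given backend could also persist this declaratively. This is a sketch assuming a recent uv; the `tool.uv.config-settings` key is uv's setting for PEP 517 config settings, and the key name should be verified against the uv docs:

```toml
# pyproject.toml — persist the CMake flag for source builds (hypothetical project)
[tool.uv]
config-settings = { "cmake.args" = "-DGGML_METAL=on" }
```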
### 3. (Future) Adopt PEP 817 Wheel Variants

PEP 817 (draft, December 2025) introduces a standard mechanism for GPU/accelerator wheel variants. PyTorch 2.9 already ships experimental variant-enabled wheels. Once PEP 817 is accepted and tool support lands, llama-cpp-python could publish variant wheels that are auto-selected by the installer — a plain `pip install llama-cpp-python` would just work, with the installer picking Metal/CUDA/CPU automatically. This is mentioned for context only; the actionable items are (1) and (2) above.
## Ecosystem context

- ~470K monthly PyPI downloads (per pypistats) — every project that uses this package beyond toy scripts hits this install wall.
- How others solved it: PyTorch uses per-backend index URLs plus PEP 817 variants; ONNX Runtime publishes separate PyPI packages per backend (`onnxruntime-gpu`, `onnxruntime-silicon`).
## Related
Wheel matrix gaps (same root cause):
Wheel variants / long-term packaging:
Downstream impact of missing wheels:
Happy to submit a PR for (1) and (2).