continuum: explicit Metal device selection — avoid hang in AMD Polaris lazy init (card 346b356f)#1

Open
joelteply wants to merge 1 commit into continuum/qwen35-metal-controls from 346b356f/metal-amd-intel-mac

continuum: explicit Metal device selection — avoid hang in AMD Polaris lazy init

ggml_metal_device_init calls MTLCreateSystemDefaultDevice() which on
Intel Macs returns the discrete GPU. On MacBookPro15,1 (Radeon Pro 560X,
Polaris, macOS 15.7.7), the first newCommandQueue call enters
amdMtlBronzeLazyInit → amdMtlAllocateBuffer → IOAccelResourceCreate
→ IOConnectCallMethod → mach_msg2_trap and never returns. Because
GGML's Metal backend registers via a global static initializer
(ggml_backend_registry::ggml_backend_registry), this hangs at process
startup — every llama.cpp consumer on this hardware deadlocks before
arg parsing.

Captured via sample(1) of the live hung process:

  main → common_params_parser_init
    → ggml_backend_registry::ggml_backend_registry()
      → ggml_backend_metal_reg (ggml-metal.cpp:924)
        → ggml_metal_device_init (ggml-metal-device.m:641)
          → -[BronzeMtlDevice newCommandQueue]
            → -[BronzeMtlDevice amdMtlBronzeLazyInit]
              → IOAccelResourceCreate
                → mach_msg2_trap  ← HANG
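
To illustrate why a single blocking call inside registry construction stalls every consumer, here is a minimal C++ sketch. It assumes the registry is a static object constructed once on first access, which is one reading of the trace above. All names (`fake_registry`, `get_registry`, `init_device`, `g_init_calls`) are hypothetical and are not ggml's actual code.

```cpp
#include <string>
#include <vector>

// Hypothetical counter so the sketch can show that init runs exactly once.
static int g_init_calls = 0;

// Hypothetical stand-in for a backend's device init. In the hang described
// above, the equivalent step is the call that never returns.
std::string init_device() {
    ++g_init_calls;
    return "fake-device";
}

// Hypothetical registry whose constructor eagerly initializes its backend.
struct fake_registry {
    std::vector<std::string> devices;

    fake_registry() {
        devices.push_back(init_device());
    }
};

// Function-local static: constructed once, on the first call. If the
// constructor blocks, the first caller blocks with it, and nothing that
// needs the registry can proceed.
fake_registry & get_registry() {
    static fake_registry reg;
    return reg;
}
```

Under this assumption, nothing runs until the first `get_registry()` call, and there is no opportunity for the caller to pick a different device before the constructor has already committed to one.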

Fix: enumerate via MTLCopyAllDevices() and select with priority:

  1. Apple Silicon (always best)
  2. Integrated low-power (Intel UHD/Iris on Intel Macs; the substrate's
    lowest-common-denominator floor, since hundreds of millions of Intel Macs ship with one)
  3. External / eGPU
  4. Discrete (last; may hang on Polaris-era AMD)
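
The priority order above can be sketched as a small ranking function. This is an illustration in plain C++, not the patch itself: `device_info`, `device_rank`, and `pick_device` are hypothetical names, the struct is a plain-data stand-in for the Metal device properties, and how Apple Silicon is detected is left to the caller as an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical plain-data mirror of the device properties; not a Metal type.
struct device_info {
    std::string name;
    bool is_apple_silicon; // assumption: determined by the caller
    bool low_power;        // stands in for the device's lowPower property
    bool removable;        // stands in for the device's removable property
};

// Lower rank means higher priority, following the list above.
int device_rank(const device_info & d) {
    if (d.is_apple_silicon) return 0; // 1. Apple Silicon
    if (d.low_power)        return 1; // 2. integrated low-power
    if (d.removable)        return 2; // 3. external / eGPU
    return 3;                         // 4. discrete
}

// Returns the index of the preferred device, or -1 if the list is empty.
// Ties keep the device that was enumerated first.
int pick_device(const std::vector<device_info> & devs) {
    int best = -1;
    for (std::size_t i = 0; i < devs.size(); ++i) {
        if (best < 0 || device_rank(devs[i]) < device_rank(devs[best])) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

With a discrete GPU and an integrated low-power GPU in the list, `pick_device` returns the integrated one regardless of enumeration order, so the discrete device is only chosen when nothing else is available.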

An operator can override the selection via the GGML_METAL_DEVICE_NAME=<substring>
environment variable. The backend logs every enumerated device with its
lowPower/removable/location properties so operators can see what was selected and why.
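
A substring override of this kind might look like the following C++ sketch. Only the variable name `GGML_METAL_DEVICE_NAME` comes from the description. The helper names (`match_device_name`, `device_override`), the case-sensitive comparison, and the first-match-wins rule are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Returns the index of the first device whose name contains `filter`,
// or -1 when the filter is unset, empty, or matches nothing.
// Assumption: the match is case-sensitive and the first match wins.
int match_device_name(const std::vector<std::string> & names, const char * filter) {
    if (filter == nullptr || filter[0] == '\0') {
        return -1;
    }
    for (std::size_t i = 0; i < names.size(); ++i) {
        if (names[i].find(filter) != std::string::npos) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Reads the override from the environment. A result of -1 means
// "no override", and the caller falls back to its default ordering.
int device_override(const std::vector<std::string> & names) {
    return match_device_name(names, std::getenv("GGML_METAL_DEVICE_NAME"));
}
```

Keeping the matching logic separate from the `std::getenv` call makes the rule easy to check without touching the process environment. Returning -1 for a filter that matches nothing lets the caller log the miss and continue with the priority order instead of failing.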

Tested on MacBookPro15,1 / macOS 15.7.7:

  • Before: --help hung forever (sample captured the AMD mach_msg2_trap)
  • After: --help returns immediately; llama-cli loads Qwen 2.5 0.5B
    on Intel UHD 630, runs inference at 22.5 tok/s

Known follow-up: Metal q4_k matmul kernel produces incorrect output on
Intel UHD 630 (output: "antity@@@@@@@" instead of "Hello! How can..."),
likely the matrix-vector fallback that runs when has_simdgroup_mm=false.
Carded separately; not blocking the hang fix.

Tested: macOS 15.7.7, Xcode 26.3 (macOS 26.2 SDK), MacBookPro15,1
(Coffee Lake + Intel UHD 630 + AMD Radeon Pro 560X).

Card: 346b356f.
