kleidiai : update to v1.24.0 and use release archive by chaxu01 · Pull Request #22549 · ggml-org/llama.cpp

chaxu01 · 2026-04-30T11:41:37Z

Overview

This PR updates the KleidiAI dependency to v1.24.0 and switches the integration to use the official release archive instead of the Arm source Git repository.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO

max-krasnyansky · 2026-05-02T03:25:51Z

@chaxu01 you should update the PR description and keep the required disclosures.
Otherwise looks good. I've been waiting for the release and was going to enable new SME kernels on the Snapdragon builds.

chaxu01 · 2026-05-04T07:32:22Z

@max-krasnyansky Thanks for the review! I’ve updated the PR description and kept the required disclosures.

max-krasnyansky · 2026-05-04T20:40:53Z

@chaxu01
Do you guys have any plans on adding "MatMul chunking" like we did in #16833?
I did a quick test on a Galaxy S26+ with this PR, and overall performance with KleidiAI is noticeably worse than the default CPU backend when running on a mix of CPU cores with different performance profiles (roughly 13-15% lower prompt-eval throughput at t=6 and t=4 in the numbers below).
The S26+ has a Snapdragon 8 Elite Gen5, where the 2 large perf cores are considerably faster than the rest of the cores.
Splitting MatMul work into equal chunks (i.e. n_rows / n_threads per thread) does not scale on such a mix, because every thread ends up waiting for the slowest core to finish its share.

I wanted to enable the new SME1 kernels that went into 1.24.0, but we'll need to start with the chunking first to get a decent performance baseline with KleidiAI enabled.

GGML_CPU_ARM_ARCH armv8.7a+fp16+dotprod+i8mm

$ llama-completion --no-mmap -m gemma-4-E2B-it-Q4_0.gguf --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 \
   --ctx-size 8192 --ubatch-size 256 -fa on --device none --jinja -st -f ../sample_prompt_1024.txt -n 64

cpu-backend (default repack):
  load_tensors:          CPU model buffer size =  1910.87 MiB
  load_tensors:   CPU_REPACK model buffer size =  1190.53 MiB

  t=6
    prompt eval time =    4209.53 ms /   740 tokens (    5.69 ms per token,   175.79 tokens per second)
           eval time =    1811.00 ms /    63 runs   (   28.75 ms per token,    34.79 tokens per second)
  t=4
    prompt eval time =    4955.01 ms /   740 tokens (    6.70 ms per token,   149.34 tokens per second)
           eval time =    1883.13 ms /    63 runs   (   29.89 ms per token,    33.45 tokens per second)
  t=2
    prompt eval time =    6597.36 ms /   740 tokens (    8.92 ms per token,   112.17 tokens per second)
           eval time =    1927.11 ms /    63 runs   (   30.59 ms per token,    32.69 tokens per second)

cpu-backend (kleidi repack):
  load_tensors:          CPU model buffer size =  1910.87 MiB
  load_tensors: CPU_KLEIDIAI model buffer size =   974.55 MiB
  load_tensors:   CPU_REPACK model buffer size =   216.00 MiB

  t=6
    prompt eval time =    4864.67 ms /   740 tokens (    6.57 ms per token,   152.12 tokens per second)
           eval time =    1908.59 ms /    63 runs   (   30.30 ms per token,    33.01 tokens per second)
  t=4
    prompt eval time =    5831.66 ms /   740 tokens (    7.88 ms per token,   126.89 tokens per second)
           eval time =    1990.70 ms /    63 runs   (   31.60 ms per token,    31.65 tokens per second)
  t=2
    prompt eval time =    6462.82 ms /   740 tokens (    8.73 ms per token,   114.50 tokens per second)
           eval time =    1927.63 ms /    63 runs   (   30.60 ms per token,    32.68 tokens per second)
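
To illustrate why an equal n_rows / n_threads split stalls on mixed cores, here is a minimal timing model, not code from llama.cpp or KleidiAI. The function names, the relative core speeds, and the chunk size are all assumptions made up for the example; the model ignores memory bandwidth, cache effects, and scheduling overhead. It compares the finish time of a static equal split against a scheme where each idle core claims the next fixed-size chunk of rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: speed[i] is the number of rows core i processes per time unit.

// Static split: thread i gets rows [i*n_rows/n, (i+1)*n_rows/n).
// The MatMul is done only when the slowest thread finishes its share.
double static_split_makespan(int64_t n_rows, const std::vector<double> & speed) {
    assert(n_rows >= 0 && !speed.empty());
    const int64_t n = (int64_t) speed.size();
    double worst = 0.0;
    for (int64_t i = 0; i < n; ++i) {
        const int64_t begin = i * n_rows / n;
        const int64_t end   = (i + 1) * n_rows / n;
        worst = std::max(worst, (double) (end - begin) / speed[(size_t) i]);
    }
    return worst;
}

// Chunked split: whichever core becomes idle first claims the next chunk,
// so faster cores naturally end up processing more chunks.
double dynamic_chunk_makespan(int64_t n_rows, int64_t chunk_rows, const std::vector<double> & speed) {
    assert(n_rows >= 0 && chunk_rows > 0 && !speed.empty());
    std::vector<double> free_at(speed.size(), 0.0); // time at which each core is idle again
    for (int64_t begin = 0; begin < n_rows; begin += chunk_rows) {
        const int64_t rows = std::min(chunk_rows, n_rows - begin);
        const size_t  i    = (size_t) (std::min_element(free_at.begin(), free_at.end()) - free_at.begin());
        free_at[i] += (double) rows / speed[i];
    }
    return *std::max_element(free_at.begin(), free_at.end());
}
```

With six cores at assumed speeds {2, 2, 1, 1, 1, 1} and 1200 rows, the static split finishes at 200 time units (the four slow cores each grind through 200 rows), while 10-row chunks finish at 150, which is the ideal 1200 / 8 for this mix. In a real backend the "next chunk" counter would typically be a shared atomic that worker threads increment.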

chaxu01 · 2026-05-05T06:53:08Z

@max-krasnyansky Thanks for testing and sharing the numbers.
Agreed: the current KleidiAI path still uses a mostly static MatMul split, so it does not handle heterogeneous core mixes well. On parts like the Snapdragon 8 Elite Gen5, giving every thread an equal share of the work can easily underutilize the faster cores, which would explain the regression you’re seeing. We are working on weighted load distribution between fast (SME) and slow (NEON) cores. We’ll take a look at #16833 and see how we can align the KleidiAI path with that approach. Thanks again for the detailed benchmark data.
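
As a rough illustration of what a weighted static split could look like, the sketch below assigns each thread a contiguous row range proportional to a per-thread weight. This is not the KleidiAI implementation; the function name and the weights are hypothetical, and a real kernel would likely also need to align range boundaries to its block size.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: returns weight.size() + 1 row boundaries, so that
// thread i processes rows [bounds[i], bounds[i + 1]).
// weight[i] is an assumed relative throughput for thread i (e.g. 2.0 for a fast core).
std::vector<int64_t> weighted_row_bounds(int64_t n_rows, const std::vector<double> & weight) {
    assert(n_rows >= 0 && !weight.empty());
    const double total = std::accumulate(weight.begin(), weight.end(), 0.0);
    assert(total > 0.0);

    std::vector<int64_t> bounds(weight.size() + 1, 0);
    double cum = 0.0;
    for (size_t i = 0; i < weight.size(); ++i) {
        cum += weight[i];
        // Round the cumulative share so the ranges stay contiguous and ordered.
        bounds[i + 1] = (int64_t) std::llround((double) n_rows * (cum / total));
    }
    bounds.back() = n_rows; // guard against floating-point drift in the last boundary
    return bounds;
}
```

For weights {2, 2, 1, 1, 1, 1} and 1200 rows this yields boundaries 0, 300, 600, 750, 900, 1050, 1200, so the two fast threads get 300 rows each and the slow ones 150. The trade-off against chunking is that the weights have to be known or measured up front, whereas chunking adapts at run time.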

max-krasnyansky · 2026-05-05T16:46:21Z

Sounds good.
Please tag me when you guys have updates.


chaxu01 · 2026-05-28T13:35:00Z

FYI, @max-krasnyansky just submitted a PR related to MatMul chunking.

kleidiai : update to v1.24.0 and use release archive

34cfd6f

chaxu01 requested a review from ggerganov as a code owner April 30, 2026 11:41

github-actions bot added the "ggml" label (changes relating to the ggml tensor library for machine learning) Apr 30, 2026

max-krasnyansky approved these changes May 4, 2026


ggerganov merged commit eff0670 into ggml-org:master May 4, 2026
46 checks passed

samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026

kleidiai : update to v1.24.0 and use release archive (ggml-org#22549)

8e7ac50

cetarthoriphros pushed a commit to cetarthoriphros/llama.cpp that referenced this pull request May 9, 2026

kleidiai : update to v1.24.0 and use release archive (ggml-org#22549)

23249a5

meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026

kleidiai : update to v1.24.0 and use release archive (ggml-org#22549)

8658a40

baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026

kleidiai : update to v1.24.0 and use release archive (ggml-org#22549)

52b6519

carlosfundora pushed a commit to carlosfundora/llama.cpp-1-bit-turbo that referenced this pull request May 24, 2026

kleidiai : update to v1.24.0 and use release archive (ggml-org#22549)

95b70ed

(cherry picked from commit eff0670)

winstonma pushed a commit to winstonma/llama.cpp that referenced this pull request May 27, 2026

kleidiai : update to v1.24.0 and use release archive (ggml-org#22549)

3817b24

ggerganov merged 1 commit into ggml-org:master from chaxu01:feature/kleidiai-release-src
