kleidiai : update to v1.24.0 and use release archive #22549

Merged
ggerganov merged 1 commit into ggml-org:master from chaxu01:feature/kleidiai-release-src
May 4, 2026
Conversation

@chaxu01
Collaborator

@chaxu01 chaxu01 commented Apr 30, 2026

Overview

This PR updates the KleidiAI dependency to v1.24.0 and switches the integration to use the official release archive instead of the Arm source Git repository.

Requirements

@chaxu01 chaxu01 requested a review from ggerganov as a code owner April 30, 2026 11:41
@github-actions github-actions Bot added the ggml changes relating to the ggml tensor library for machine learning label Apr 30, 2026
@max-krasnyansky
Member

max-krasnyansky commented May 2, 2026

@chaxu01 you should update the PR description and keep the required disclosures.
Otherwise looks good. I've been waiting for the release and was going to enable new SME kernels on the Snapdragon builds.

@chaxu01
Collaborator Author

chaxu01 commented May 4, 2026

@max-krasnyansky Thanks for the review! I’ve updated the PR description and kept the required disclosures.

@ggerganov ggerganov merged commit eff0670 into ggml-org:master May 4, 2026
46 checks passed
@max-krasnyansky
Member

@chaxu01
Do you guys have any plans to add "MatMul chunking" like we did in #16833?
I did a quick test on a Galaxy S26+ with this PR, and the overall performance with KleidiAI is quite a bit worse than the default CPU backend when running on a mix of CPU cores with different performance profiles.
The S26+ has a Snapdragon 8 Elite Gen5, where the 2 large perf cores are quite a bit faster than the rest of the cores.
Splitting MatMul work into equal chunks (i.e., n_rows / n_threads) does not scale.

I wanted to enable the new SME1 kernels that went into 1.24.0 but we'll need to start with the chunking first to get a decent performance baseline with KleidiAI enabled.
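
For readers unfamiliar with the idea, here is a minimal sketch of dynamic chunking. It is not the implementation from #16833 or the KleidiAI path. Instead of giving each thread a fixed `n_rows / n_threads` slice, workers claim small chunks from a shared atomic counter, so faster cores naturally end up processing more rows. The names `run_chunked`, `chunk_rows`, and `process_rows` are hypothetical and stand in for the real scheduling code and kernel call.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper (not a ggml or KleidiAI API): each worker repeatedly
// claims the next chunk of rows from a shared counter until none remain.
// `process_rows(begin, end)` stands in for the real MatMul kernel call.
template <typename F>
void run_chunked(int64_t n_rows, int64_t chunk_rows, int n_threads, F process_rows) {
    assert(chunk_rows > 0);
    std::atomic<int64_t> next_row{0};

    auto worker = [&]() {
        for (;;) {
            // Atomically reserve [begin, begin + chunk_rows) for this thread.
            const int64_t begin = next_row.fetch_add(chunk_rows, std::memory_order_relaxed);
            if (begin >= n_rows) {
                break;
            }
            const int64_t end = std::min(begin + chunk_rows, n_rows);
            process_rows(begin, end);
        }
    };

    std::vector<std::thread> pool;
    for (int i = 1; i < n_threads; ++i) {
        pool.emplace_back(worker);
    }
    worker(); // the calling thread participates as well
    for (auto & t : pool) {
        t.join();
    }
}
```

Each row is claimed exactly once, so no locking is needed around the output. The chunk size is a trade-off: smaller chunks balance load better across uneven cores but add more contention on the counter.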

GGML_CPU_ARM_ARCH armv8.7a+fp16+dotprod+i8mm

$ llama-completion --no-mmap -m gemma-4-E2B-it-Q4_0.gguf --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 \
   --ctx-size 8192 --ubatch-size 256 -fa on --device none --jinja -st -f ../sample_prompt_1024.txt -n 64

cpu-backend (default repack):
  load_tensors:          CPU model buffer size =  1910.87 MiB
  load_tensors:   CPU_REPACK model buffer size =  1190.53 MiB

  t=6
    prompt eval time =    4209.53 ms /   740 tokens (    5.69 ms per token,   175.79 tokens per second)
           eval time =    1811.00 ms /    63 runs   (   28.75 ms per token,    34.79 tokens per second)
  t=4
    prompt eval time =    4955.01 ms /   740 tokens (    6.70 ms per token,   149.34 tokens per second)
           eval time =    1883.13 ms /    63 runs   (   29.89 ms per token,    33.45 tokens per second)
  t=2
    prompt eval time =    6597.36 ms /   740 tokens (    8.92 ms per token,   112.17 tokens per second)
           eval time =    1927.11 ms /    63 runs   (   30.59 ms per token,    32.69 tokens per second)

cpu-backend (kleidi repack):
  load_tensors:          CPU model buffer size =  1910.87 MiB
  load_tensors: CPU_KLEIDIAI model buffer size =   974.55 MiB
  load_tensors:   CPU_REPACK model buffer size =   216.00 MiB

  t=6
    prompt eval time =    4864.67 ms /   740 tokens (    6.57 ms per token,   152.12 tokens per second)
           eval time =    1908.59 ms /    63 runs   (   30.30 ms per token,    33.01 tokens per second)
  t=4
    prompt eval time =    5831.66 ms /   740 tokens (    7.88 ms per token,   126.89 tokens per second)
           eval time =    1990.70 ms /    63 runs   (   31.60 ms per token,    31.65 tokens per second)
  t=2
    prompt eval time =    6462.82 ms /   740 tokens (    8.73 ms per token,   114.50 tokens per second)
           eval time =    1927.63 ms /    63 runs   (   30.60 ms per token,    32.68 tokens per second)
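
To make the gap easier to read, the small sketch below recomputes throughput from the logged token counts and times and expresses the KleidiAI result relative to the default repack. The helper names are hypothetical. Applied to the prompt eval figures above, it gives roughly -13.5% at t=6, -15.0% at t=4, and +2.1% at t=2.

```cpp
#include <cassert>
#include <cstdio>

// Tokens per second from a token count and an elapsed time in milliseconds.
double tokens_per_second(int tokens, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return tokens / (elapsed_ms / 1000.0);
}

// Relative change of `candidate` versus `baseline`, in percent.
// A negative value means the candidate is slower than the baseline.
double relative_change_pct(double baseline, double candidate) {
    assert(baseline > 0.0);
    return (candidate - baseline) / baseline * 100.0;
}

// Print one comparison row for a given thread count.
void print_comparison(int n_threads, double baseline_tps, double candidate_tps) {
    std::printf("t=%d: %+.1f%%\n", n_threads, relative_change_pct(baseline_tps, candidate_tps));
}
```

For example, `print_comparison(6, 175.79, 152.12)` reports the t=6 prompt eval difference between the two runs.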

@chaxu01
Collaborator Author

chaxu01 commented May 5, 2026

@max-krasnyansky Thanks for testing and sharing the numbers.
Agreed. The current KleidiAI path still uses a mostly static MatMul split, so it does not handle heterogeneous core mixes well. On parts like the Snapdragon 8 Elite Gen5, assigning equal work to every thread can easily underutilize the faster cores, which would explain the regression you’re seeing. We are working on weighted load distribution across fast (SME) and slow (NEON) cores. We’ll take a look at #16833 and see how we can align the KleidiAI path with that approach. Thanks again for the detailed benchmark data.
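
As an illustration only, a weighted static split could look like the sketch below. This is an assumption about the general technique, not the planned KleidiAI change. Each thread gets a contiguous row range sized in proportion to a per-thread weight, such as a relative throughput estimate for an SME or NEON core. `RowRange`, `split_weighted`, and the weights themselves are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

struct RowRange {
    int64_t begin; // first row (inclusive)
    int64_t end;   // last row (exclusive)
};

// Hypothetical helper: split n_rows into one contiguous range per thread,
// sized in proportion to weights[i]. Weights are assumed to be non-negative.
std::vector<RowRange> split_weighted(int64_t n_rows, const std::vector<double> & weights) {
    assert(!weights.empty());
    const double total = std::accumulate(weights.begin(), weights.end(), 0.0);
    assert(total > 0.0);

    std::vector<RowRange> ranges;
    ranges.reserve(weights.size());

    double  acc   = 0.0;
    int64_t begin = 0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        acc += weights[i];
        // Boundaries come from the cumulative weight, so ranges stay contiguous;
        // the last range always ends exactly at n_rows.
        const int64_t end = (i + 1 == weights.size())
            ? n_rows
            : static_cast<int64_t>(n_rows * (acc / total));
        ranges.push_back({begin, end});
        begin = end;
    }
    return ranges;
}
```

With weights `{2, 2, 1, 1, 1, 1}` and 1000 rows, the two heavier threads each get 250 rows and the other four get 125 each. A real kernel would likely also need range boundaries aligned to its block size, which this sketch ignores.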

@max-krasnyansky
Member

Sounds good.
Please tag me when you guys have updates.

@chaxu01
Collaborator Author

chaxu01 commented May 28, 2026

FYI, @max-krasnyansky just submitted a PR related to MatMul chunking.
