TP: quantized KV cache support by JohannesGaessler · Pull Request #23792 · ggml-org/llama.cpp

JohannesGaessler · 2026-05-27T20:20:08Z

This PR implements support for combining -sm tensor with a quantized KV cache. This does not work on master because flattening tensors for the KV cache rotation loses shape information, which the meta backend cannot handle. Previous PRs resolved the issue by changing the shapes used in the KV cache rotation, but that is an undesirable solution because batched matrix multiplications may not be as well supported in ggml backends as a single large matrix multiplication. Also, it is generally better to extend the meta backend so that it can handle a given compute graph than to require compute graphs to conform to the meta backend's requirements.

The approach in this PR is to extend the ggml_backend_meta_split_state specification with a value that specifies how often a given segment repeats. When a tensor is flattened, the meta backend uses segments to describe the data layout within the flattened dimension, so that the correct data layout can be restored by a later reshape. No changes to the llama.cpp compute graphs are required.
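To make the idea of repeating segments concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: `FlatSplit`, `flatten_split`, `owner_of`, and `restore_split` are hypothetical names invented for illustration and do not mirror the actual fields or logic of ggml_backend_meta_split_state. The sketch assumes a tensor whose dimension 0 is the fastest-varying one and which is split across devices along a single axis.

```python
from dataclasses import dataclass
from math import prod
from typing import List


@dataclass
class FlatSplit:
    """Hypothetical stand-in for a split state; not the real ggml struct."""
    segments: List[int]  # elements owned by each device within one period
    repeats: int         # how often that period repeats in the flat dimension


def flatten_split(ne: List[int], axis: int, per_device: List[int]) -> FlatSplit:
    """Describe the flat layout of a tensor split along `axis` (ne[0] is fastest)."""
    assert sum(per_device) == ne[axis]
    inner = prod(ne[:axis])       # elements per step along the split axis
    outer = prod(ne[axis + 1:])   # number of times the split pattern recurs
    return FlatSplit([inner * n for n in per_device], outer)


def owner_of(flat: FlatSplit, index: int) -> int:
    """Return the device that owns element `index` of the flattened tensor."""
    offset = index % sum(flat.segments)
    for device, size in enumerate(flat.segments):
        if offset < size:
            return device
        offset -= size
    raise AssertionError("unreachable: offset is always inside one period")


def restore_split(flat: FlatSplit, new_ne: List[int], axis: int) -> List[int]:
    """Recover per-device sizes along `axis` after reshaping to `new_ne`."""
    inner = prod(new_ne[:axis])
    if prod(new_ne[axis + 1:]) != flat.repeats:
        raise ValueError("repeat count does not match the outer dimensions")
    if any(size % inner for size in flat.segments):
        raise ValueError("segment is not a whole number of inner slices")
    per_device = [size // inner for size in flat.segments]
    if sum(per_device) != new_ne[axis]:
        raise ValueError("segments do not cover the requested axis")
    return per_device


# A [128, 8, 4] tensor split 4 + 4 along axis 1, flattened, then reshaped.
flat = flatten_split([128, 8, 4], axis=1, per_device=[4, 4])
print(flat)
print(restore_split(flat, [64, 16, 4], axis=1))
```

In this toy model the flattened tensor is described by two segments of 512 elements that repeat 4 times, and reshaping to [64, 16, 4] recovers an 8 + 8 split along axis 1. Without the repeat count, the flattened description would have no way to say that the per-device pattern recurs once per outer index.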

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: No

nifgraup · 2026-05-27T21:42:45Z

I get an assertion failure:

llama-bench --model Qwen3.6-27B-IQ4_NL.gguf --flash-attn 1 --cache-type-k q8_0 --cache-type-v q8_0 --split-mode tensor
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 21661 MiB):
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes, VRAM: 10830 MiB
  Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes, VRAM: 10830 MiB
| model                          |       size |     params | backend    | ngl | type_k | type_v |     sm | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -----: | -: | --------------: | -------------------: |
llama.cpp/ggml/src/ggml-backend-meta.cpp:1026: GGML_ASSERT(split_state.ne[j] % div == 0) failed

The same command succeeds with the code from #23225.

JohannesGaessler · 2026-05-27T22:25:08Z

Should be fixed now; the refactored logic did not consider one of the pre-existing edge cases.

CISC · 2026-05-28T07:07:06Z

Seems to break dream:
https://github.com/ggml-org/llama.cpp/actions/runs/26542307453/job/78186392441?pr=23792#step:3:3166

JohannesGaessler · 2026-05-28T09:24:15Z

I added a new assert for reshapes that seems to have been too strict. For dream, a permutation in dimensions 0 and 1 is unproblematic because the split is in dimension 2. I don't think there is a simple way to write an assert that triggers only on the problematic cases, so I removed it.
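One way to read the remark about dimensions 0 and 1 is sketched below as a toy model in Python. This is an illustrative interpretation, not the logic in the PR: `flat_layout` and `permuted` are hypothetical helpers, the shapes are made up, and the permutation convention (source dimension i moves to position perm[i]) is an assumption. The sketch shows that swapping the two dimensions below the split axis leaves the flattened per-device segment sizes unchanged, whereas a permutation that moves the split axis changes them.

```python
from math import prod


def flat_layout(ne, split_axis, per_device):
    """Return (segment sizes, repeat count) of the flattened tensor."""
    assert sum(per_device) == ne[split_axis]
    inner = prod(ne[:split_axis])  # elements per step along the split axis
    return [inner * n for n in per_device], prod(ne[split_axis + 1:])


def permuted(ne, perm):
    """Shape after moving source dimension i to position perm[i] (assumed convention)."""
    out = [0] * len(ne)
    for i, p in enumerate(perm):
        out[p] = ne[i]
    return out


ne = [64, 16, 8, 1]   # made-up shape, split 4 + 4 along axis 2
split = [4, 4]

# Swapping dimensions 0 and 1 keeps the split axis at 2 and the layout intact.
same = flat_layout(permuted(ne, [1, 0, 2, 3]), 2, split) == flat_layout(ne, 2, split)

# Swapping dimensions 1 and 2 moves the split axis to 1 and changes the layout.
moved = flat_layout(permuted(ne, [0, 2, 1, 3]), 1, split)

print(same, moved)
```

Under these assumptions the first comparison is true, while the second permutation turns one period of two 4096-element segments into 16 repeats of two 256-element segments, which is the kind of case a reshape check would need to distinguish.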

krampenschiesser · 2026-05-28T13:58:04Z

Same here with a Q6_K model:

llama-bench -fa 1 -dio 0 -t 16 -mmp 0 -ngl 999 -m Qwen3.6-27B-Q6_K.gguf -sm tensor -ctk q8_0 -ctv q8_0
/home/scar-ai/projects/llama.cpp/ggml/src/ggml-backend-meta.cpp:1042: GGML_ASSERT(split_state.ne[j]*split_state.nr[0] * tensor->src[i]->ne[src_ss[i].axis] == sum * tensor->ne[split_state.axis]) failed
#4  0x000060b118c59e33 in ggml_print_backtrace ()
#5  0x000060b118c59fe6 in ggml_abort ()
#6  0x000060b118c840d8 in ggml_backend_meta_get_split_state(ggml_backend_meta_simple_tensor_container&, ggml_tensor const*, bool)::{lambda()#1}::operator()() const ()
#7  0x000060b118c7b9e2 in ggml_backend_meta_get_split_state(ggml_backend_meta_simple_tensor_container&, ggml_tensor const*, bool) ()
#8  0x000060b118c87947 in ggml_backend_meta_buffer_init_tensor_impl(ggml_backend_meta_simple_tensor_container&, ggml_tensor*) [clone .isra.0] ()
#9  0x000060b118c89f3a in ggml_backend_meta_buffer_init_tensor(ggml_backend_buffer*, ggml_tensor*) ()
#10 0x000060b118c703d5 in ggml_gallocr_alloc_graph ()
#11 0x000060b118c76951 in ggml_backend_sched_alloc_graph ()
#12 0x000060b117bae627 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) ()
#13 0x000060b117bb54b5 in llama_context::decode(llama_batch const&) ()
#14 0x000060b117bb7262 in llama_decode ()
#15 0x000060b117a77ecb in test_prompt(llama_context*, int, int, int) ()
#16 0x000060b117a88d85 in llama_bench(int, char**) ()
Download failed: Invalid argument.  Continuing without source file ./csu/../sysdeps/nptl/libc_start_call_main.h.
#17 0x000079b7e962a601 in __libc_start_call_main (main=main@entry=0x60b117a03c20 <main>, argc=argc@entry=19, argv=argv@entry=0x7ffefe0b5018) at ../sysdeps/nptl/libc_start_call_main.h:59
⚠️ warning: 59	../sysdeps/nptl/libc_start_call_main.h: No such file or directory
Download failed: Invalid argument.  Continuing without source file ./csu/../csu/libc-start.c.
#18 0x000079b7e962a718 in __libc_start_main_impl (main=0x60b117a03c20 <main>, argc=19, argv=0x7ffefe0b5018, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffefe0b5008) at ../csu/libc-start.c:360
⚠️ warning: 360	../csu/libc-start.c: No such file or directory
#19 0x000060b117a76715 in _start ()
[Inferior 1 (process 2213992) detached]

However, a Q8_0 model works!

llama-bench -fa 1 -dio 0 -t 16 -mmp 0 -ngl 999 -m Qwen3.6-27B-Q8_0.gguf -sm tensor -ctk q8_0 -ctv q8_0
| model                          |       size |     params | backend    | ngl | type_k | type_v |     sm | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -----: | -: | ---: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  27.04 GiB |    27.32 B | CUDA       | 999 |   q8_0 |   q8_0 | tensor |  1 |    0 |           pp512 |      1276.58 ± 31.37 |
| qwen35 27B Q8_0                |  27.04 GiB |    27.32 B | CUDA       | 999 |   q8_0 |   q8_0 | tensor |  1 |    0 |           tg128 |         47.33 ± 1.34 |

A q4_0 K/V cache behaves the same way: it works with the q8_0 model quant but fails with k-quant models (q6, q4).
Tested at commit 2b2d0e2c18f54c4647d18cdda4a363fbad063c3d.

Stoney49th · 2026-05-28T16:15:54Z

I wanted to test this, but it still fails:

[47593] 0.00.043.121 I srv    load_model: loading model '/root/.cache/huggingface/hub/models--unsloth--Qwen3.6-27B-MTP-GGUF/snapshots/b3a58239d8d40b953e34936c9afeb28baa518230/Qwen3.6-27B-UD-Q4_K_XL.gguf'
[47593] 0.00.421.224 I srv    load_model: [mtmd] estimated worst-case memory usage of mmproj is 1161.02 MiB
[47593] 0.00.775.647 I srv    load_model: [spec] estimated memory usage of MTP context is 1529.00 MiB
[47593] 0.00.775.672 I common_init_result: fitting params to device memory ...
[47593] 0.00.775.673 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
[47593] 0.00.775.732 W common_fit_params: failed to fit params to free device memory: llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort
0.16.115.032 I srv  ensure_model: waiting until model name=qwen3-6-27b-MTP is fully loaded...
[47593] 0.09.360.733 W llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
[47593] 0.09.529.267 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[47593] /app/ggml/src/ggml-backend-meta.cpp:1042: GGML_ASSERT(split_state.ne[j]*split_state.nr[0] * tensor->src[i]->ne[src_ss[i].axis] == sum * tensor->ne[split_state.axis]) failed
[47593] libggml-base.so.0(+0x19305) [0x7f886511d305]
[47593] libggml-base.so.0(ggml_print_backtrace+0x200) [0x7f886511d800]
[47593] libggml-base.so.0(ggml_abort+0x12e) [0x7f886511d9ce]
[47593] libggml-base.so.0(+0x44ad0) [0x7f8865148ad0]
[47593] libggml-base.so.0(+0x3b7bb) [0x7f886513f7bb]
[47593] libggml-base.so.0(+0x4837b) [0x7f886514c37b]
[47593] libggml-base.so.0(+0x4a85a) [0x7f886514e85a]
[47593] libggml-base.so.0(ggml_gallocr_alloc_graph+0x505) [0x7f88651338f5]
[47593] libggml-base.so.0(ggml_backend_sched_alloc_graph+0x101) [0x7f886513a091]
[47593] libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xe2) [0x7f88652b91a2]
[47593] libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x398) [0x7f88652bd408]
[47593] libllama.so.0(llama_decode+0xe) [0x7f88652bf6ae]
[47593] libllama-common.so.0(_Z23common_init_from_paramsR13common_paramsb+0x3a6) [0x7f8865836376]
[47593] libllama-server-impl.so(_ZN19server_context_impl10load_modelER13common_params+0x526) [0x7f8865c2e1e6]
[47593] libllama-server-impl.so(_Z12llama_serveriPPc+0x294a) [0x7f8865b7251a]
[47593] /usr/lib/libc.so.6(+0x27741) [0x7f886446d741]
[47593] /usr/lib/libc.so.6(__libc_start_main+0x89) [0x7f886446d879]
[47593] /app/llama-server(+0x1075) [0x5608fa4ce075]

The error was already present before this PR and has been reported in #22817.

TP: quantized KV cache support

93f40fb

JohannesGaessler requested review from CISC and ggerganov as code owners May 27, 2026 20:20

github-actions bot added the "ggml" label (changes relating to the ggml tensor library for machine learning) May 27, 2026

fix partial view

576f3b9

remove overly strict assert

2b2d0e2

JohannesGaessler wants to merge 3 commits into ggml-org:master from JohannesGaessler:tp-quant-kv-3


5 participants
