TP: quantized KV cache support #23792

Open
JohannesGaessler wants to merge 3 commits into ggml-org:master from JohannesGaessler:tp-quant-kv-3

@JohannesGaessler
Contributor

This PR implements support for combining -sm tensor with a quantized KV cache. This does not work on master because flattening tensors for the KV cache rotation discards shape information, which the meta backend cannot handle. Previous PRs resolved the issue by changing the shapes used in the KV cache rotation, but that is undesirable because batched matrix multiplications may not be as well supported in ggml backends as a single large matrix multiplication. It is also generally better to extend the meta backend so that it can handle a given compute graph than to require compute graphs to conform to the meta backend's requirements.

The approach in this PR is to extend ggml_backend_meta_split_state with a value that specifies how often a given segment repeats. When a tensor is flattened, the meta backend uses segments to describe the data layout within the flattened dimension, so that a later reshape can restore the correct data layout. No changes to the llama.cpp compute graphs are required.
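
To make the segment idea concrete, here is a minimal C++ sketch, not the actual implementation. The struct `split_layout` and the helpers `period`, `owner`, `flatten_12` and `unflatten_12` are hypothetical names invented for illustration; the real fields of ggml_backend_meta_split_state are not shown in this description. The sketch assumes a tensor of shape [ne0, ne1, ne2] that is split across devices along axis 1 and then flattened to [ne0, ne1*ne2], so the per-device pattern along axis 1 repeats ne2 times inside the merged axis.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a per-tensor split description.
// This is NOT the real ggml_backend_meta_split_state.
struct split_layout {
    int                  axis;   // axis along which the tensor is split across devices
    std::vector<int64_t> seg;    // elements owned by each device within one period
    int64_t              repeat; // how often the segment pattern repeats along `axis`
};

// Length of one period of the segment pattern.
static int64_t period(const split_layout & s) {
    int64_t sum = 0;
    for (int64_t n : s.seg) {
        sum += n;
    }
    return sum;
}

// Index of the device that owns element i along the split axis.
static int owner(const split_layout & s, int64_t i) {
    int64_t r = i % period(s);
    for (size_t d = 0; d < s.seg.size(); ++d) {
        if (r < s.seg[d]) {
            return (int) d;
        }
        r -= s.seg[d];
    }
    return -1; // not reached for a valid layout
}

// Merge axes 1 and 2 of [ne0, ne1, ne2] into [ne0, ne1*ne2]:
// the segments stay the same, but the pattern now repeats ne2 times as often.
static split_layout flatten_12(const split_layout & s, int64_t ne2) {
    assert(s.axis == 1);
    return { 1, s.seg, s.repeat * ne2 };
}

// Reshape [ne0, ne1*ne2] back to [ne0, ne1, ne2]; only possible if the
// repeat count is divisible by ne2, otherwise the layout cannot be restored.
static split_layout unflatten_12(const split_layout & s, int64_t ne2) {
    assert(s.axis == 1 && s.repeat % ne2 == 0);
    return { 1, s.seg, s.repeat / ne2 };
}
```

With two devices each holding 4 of 8 elements along axis 1 and ne2 = 3, `flatten_12` yields the same segments with a repeat count of 3, and `owner` maps index 13 of the merged axis to the second device because 13 % 8 = 5. Keeping the repeat count is what allows `unflatten_12` to recover the original split without changing the compute graph.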

Requirements

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on May 27, 2026
@nifgraup

I get an assert crash:

llama-bench --model Qwen3.6-27B-IQ4_NL.gguf --flash-attn 1 --cache-type-k q8_0 --cache-type-v q8_0 --split-mode tensor
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 21661 MiB):
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes, VRAM: 10830 MiB
  Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes, VRAM: 10830 MiB
| model                          |       size |     params | backend    | ngl | type_k | type_v |     sm | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -----: | -: | --------------: | -------------------: |
llama.cpp/ggml/src/ggml-backend-meta.cpp:1026: GGML_ASSERT(split_state.ne[j] % div == 0) failed

Same command succeeds with code from #23225

@JohannesGaessler
Contributor Author

Should be fixed now; the refactored logic did not consider one of the pre-existing edge cases.

@CISC
Member

CISC commented May 28, 2026

@JohannesGaessler
Contributor Author

I added a new assert for reshapes that seems to have been too strict. For dream, a permutation in dimensions 0 and 1 is unproblematic because the split is in dimension 2. I don't think there is a simple way to write an assert that triggers only on the problematic cases, so I removed it.
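
As a rough illustration of why such a permutation is harmless, the following C++ sketch tracks where a split axis ends up after a permutation. The helpers `permuted_split_axis` and `split_axis_untouched` are hypothetical and not part of the meta backend, and the convention that source dimension i moves to position perm[i] is an assumption of this sketch.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper: position of the split axis after a permutation.
// Assumed convention: source dimension i moves to position perm[i].
static int permuted_split_axis(const std::array<int, 4> & perm, int split_axis) {
    return perm[split_axis];
}

// True if the permutation leaves the split axis where it was.
static bool split_axis_untouched(const std::array<int, 4> & perm, int split_axis) {
    return permuted_split_axis(perm, split_axis) == split_axis;
}
```

Swapping dimensions 0 and 1, i.e. perm = {1, 0, 2, 3}, leaves a split in dimension 2 in place, whereas perm = {0, 2, 1, 3} moves it to dimension 1. A check like this only covers the trivial case; it does not decide whether a following reshape is problematic, which is the part that is hard to assert on.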

@krampenschiesser

krampenschiesser commented May 28, 2026

same here with q6k:

llama-bench -fa 1 -dio 0 -t 16 -mmp 0 -ngl 999 -m Qwen3.6-27B-Q6_K.gguf -sm tensor -ctk q8_0 -ctv q8_0
/home/scar-ai/projects/llama.cpp/ggml/src/ggml-backend-meta.cpp:1042: GGML_ASSERT(split_state.ne[j]*split_state.nr[0] * tensor->src[i]->ne[src_ss[i].axis] == sum * tensor->ne[split_state.axis]) failed
#4  0x000060b118c59e33 in ggml_print_backtrace ()
#5  0x000060b118c59fe6 in ggml_abort ()
#6  0x000060b118c840d8 in ggml_backend_meta_get_split_state(ggml_backend_meta_simple_tensor_container&, ggml_tensor const*, bool)::{lambda()#1}::operator()() const ()
#7  0x000060b118c7b9e2 in ggml_backend_meta_get_split_state(ggml_backend_meta_simple_tensor_container&, ggml_tensor const*, bool) ()
#8  0x000060b118c87947 in ggml_backend_meta_buffer_init_tensor_impl(ggml_backend_meta_simple_tensor_container&, ggml_tensor*) [clone .isra.0] ()
#9  0x000060b118c89f3a in ggml_backend_meta_buffer_init_tensor(ggml_backend_buffer*, ggml_tensor*) ()
#10 0x000060b118c703d5 in ggml_gallocr_alloc_graph ()
#11 0x000060b118c76951 in ggml_backend_sched_alloc_graph ()
#12 0x000060b117bae627 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) ()
#13 0x000060b117bb54b5 in llama_context::decode(llama_batch const&) ()
#14 0x000060b117bb7262 in llama_decode ()
#15 0x000060b117a77ecb in test_prompt(llama_context*, int, int, int) ()
#16 0x000060b117a88d85 in llama_bench(int, char**) ()
Download failed: Invalid argument.  Continuing without source file ./csu/../sysdeps/nptl/libc_start_call_main.h.
#17 0x000079b7e962a601 in __libc_start_call_main (main=main@entry=0x60b117a03c20 <main>, argc=argc@entry=19, argv=argv@entry=0x7ffefe0b5018) at ../sysdeps/nptl/libc_start_call_main.h:59
⚠️ warning: 59	../sysdeps/nptl/libc_start_call_main.h: No such file or directory
Download failed: Invalid argument.  Continuing without source file ./csu/../csu/libc-start.c.
#18 0x000079b7e962a718 in __libc_start_main_impl (main=0x60b117a03c20 <main>, argc=19, argv=0x7ffefe0b5018, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffefe0b5008) at ../csu/libc-start.c:360
⚠️ warning: 360	../csu/libc-start.c: No such file or directory
#19 0x000060b117a76715 in _start ()
[Inferior 1 (process 2213992) detached]

However q8_0 works!

llama-bench -fa 1 -dio 0 -t 16 -mmp 0 -ngl 999 -m Qwen3.6-27B-Q8_0.gguf -sm tensor -ctk q8_0 -ctv q8_0
| model                          |       size |     params | backend    | ngl | type_k | type_v |     sm | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -----: | -: | ---: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  27.04 GiB |    27.32 B | CUDA       | 999 |   q8_0 |   q8_0 | tensor |  1 |    0 |           pp512 |      1276.58 ± 31.37 |
| qwen35 27B Q8_0                |  27.04 GiB |    27.32 B | CUDA       | 999 |   q8_0 |   q8_0 | tensor |  1 |    0 |           tg128 |         47.33 ± 1.34 |

k/v in q4_0 shows the same behavior: it works with the q8_0 model quant but fails with k-quant models (q6, q4)
using commit 2b2d0e2c18f54c4647d18cdda4a363fbad063c3d

@Stoney49th

Stoney49th commented May 28, 2026

wanted to test, but still:

[47593] 0.00.043.121 I srv    load_model: loading model '/root/.cache/huggingface/hub/models--unsloth--Qwen3.6-27B-MTP-GGUF/snapshots/b3a58239d8d40b953e34936c9afeb28baa518230/Qwen3.6-27B-UD-Q4_K_XL.gguf'
[47593] 0.00.421.224 I srv    load_model: [mtmd] estimated worst-case memory usage of mmproj is 1161.02 MiB
[47593] 0.00.775.647 I srv    load_model: [spec] estimated memory usage of MTP context is 1529.00 MiB
[47593] 0.00.775.672 I common_init_result: fitting params to device memory ...
[47593] 0.00.775.673 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
[47593] 0.00.775.732 W common_fit_params: failed to fit params to free device memory: llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort
0.16.115.032 I srv  ensure_model: waiting until model name=qwen3-6-27b-MTP is fully loaded...
[47593] 0.09.360.733 W llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
[47593] 0.09.529.267 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[47593] /app/ggml/src/ggml-backend-meta.cpp:1042: GGML_ASSERT(split_state.ne[j]*split_state.nr[0] * tensor->src[i]->ne[src_ss[i].axis] == sum * tensor->ne[split_state.axis]) failed
[47593] libggml-base.so.0(+0x19305) [0x7f886511d305]
[47593] libggml-base.so.0(ggml_print_backtrace+0x200) [0x7f886511d800]
[47593] libggml-base.so.0(ggml_abort+0x12e) [0x7f886511d9ce]
[47593] libggml-base.so.0(+0x44ad0) [0x7f8865148ad0]
[47593] libggml-base.so.0(+0x3b7bb) [0x7f886513f7bb]
[47593] libggml-base.so.0(+0x4837b) [0x7f886514c37b]
[47593] libggml-base.so.0(+0x4a85a) [0x7f886514e85a]
[47593] libggml-base.so.0(ggml_gallocr_alloc_graph+0x505) [0x7f88651338f5]
[47593] libggml-base.so.0(ggml_backend_sched_alloc_graph+0x101) [0x7f886513a091]
[47593] libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xe2) [0x7f88652b91a2]
[47593] libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x398) [0x7f88652bd408]
[47593] libllama.so.0(llama_decode+0xe) [0x7f88652bf6ae]
[47593] libllama-common.so.0(_Z23common_init_from_paramsR13common_paramsb+0x3a6) [0x7f8865836376]
[47593] libllama-server-impl.so(_ZN19server_context_impl10load_modelER13common_params+0x526) [0x7f8865c2e1e6]
[47593] libllama-server-impl.so(_Z12llama_serveriPPc+0x294a) [0x7f8865b7251a]
[47593] /usr/lib/libc.so.6(+0x27741) [0x7f886446d741]
[47593] /usr/lib/libc.so.6(__libc_start_main+0x89) [0x7f886446d879]
[47593] /app/llama-server(+0x1075) [0x5608fa4ce075]

error was there before and has already been reported in #22817

5 participants