RFC: Planar3 KV-cache quantization for compressed attention (experimental) #265
Open
hexxyan wants to merge 10 commits into
Pure C reference implementation of PlanarQuant (2D Givens rotation + Lloyd-Max 3-bit) for ds4's head_dim=512, adapted from experolk/planar-llama. Standalone, with no ggml dependency.
Block layout: 50 bytes per 128-dim block (norm FP16 + 2-bit indices + 1-bit QJL signs). Four blocks per 512-dim row = 200 bytes (5.12x vs FP16).
d=512 roundtrip quality baseline (random vectors): cosine avg=0.985, MSE avg=0.010, norm preservation <0.03% error.
This is an RFC prototype for offline quality evaluation. Not yet wired into ds4's hot path or Metal/CUDA attention kernels.
Reference: ParaMind2025 PlanarQuant, RotorQuant paper
- ds4_planar3_dequantize: use n_per_row for output stride instead of hardcoded 512; add assert(n_per_row == 512) in both batch functions
- Rotation parameters: clarify "64 pairs per block, reused across 4 blocks" instead of misleading "256 pairs"
- Block signs field: correct comment from "QJL signs" to "high bit of 3-bit centroid index"
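The 50-byte block described in these commits (FP16 norm, a 2-bit plane, and a 1-bit plane holding the high bit of each 3-bit centroid index) can be sketched as follows. This is an illustrative layout only: the struct name, field names, and bit ordering within each byte are assumptions, not the definitions in ds4_planar_quant.h.

```c
#include <assert.h>
#include <stdint.h>

#define PLANAR3_BLOCK_DIM 128

/* Hypothetical layout; names and bit order are illustrative, not ds4's. */
typedef struct {
    uint16_t norm_fp16;                   /* block norm as raw FP16 bits: 2 bytes  */
    uint8_t  lo2[PLANAR3_BLOCK_DIM / 4];  /* low 2 bits of each index:   32 bytes */
    uint8_t  hi1[PLANAR3_BLOCK_DIM / 8];  /* high bit of each index:     16 bytes */
} planar3_block_sketch;

_Static_assert(sizeof(planar3_block_sketch) == 50, "2 + 32 + 16 bytes per block");

/* Store the 3-bit centroid index (0..7) of element i, split across both planes. */
static void planar3_set_index(planar3_block_sketch *b, int i, unsigned idx) {
    assert(i >= 0 && i < PLANAR3_BLOCK_DIM && idx < 8);
    unsigned shift = (unsigned)(i % 4) * 2;
    unsigned bit = (unsigned)(i % 8);
    b->lo2[i / 4] = (uint8_t)((b->lo2[i / 4] & ~(3u << shift)) | ((idx & 3u) << shift));
    b->hi1[i / 8] = (uint8_t)((b->hi1[i / 8] & ~(1u << bit)) | (((idx >> 2) & 1u) << bit));
}

/* Reassemble the 3-bit centroid index of element i from the two planes. */
static unsigned planar3_get_index(const planar3_block_sketch *b, int i) {
    assert(i >= 0 && i < PLANAR3_BLOCK_DIM);
    unsigned lo = (b->lo2[i / 4] >> ((i % 4) * 2)) & 3u;
    unsigned hi = (b->hi1[i / 8] >> (i % 8)) & 1u;
    return lo | (hi << 2);
}
```

Four such blocks cover one 512-dim row, which gives the 200 bytes per row quoted above.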
Standalone CLI tool that evaluates Planar3 quantization quality on synthetic KV-cache-like data. Supports four distribution modes: random_normal, random_uniform, sparse, ds4_realistic. Reports cosine similarity, MSE, max element error, relative norm error, and attention score drift (Pearson correlation + top-1 agreement).
d=512 quality baseline (10K rows, ds4_realistic mode):
- cosine: mean=0.981 p99=0.986 min=0.967
- MSE: mean=4.57e-02
- norm error: mean=3.17e-04 (< 0.04%)
- attention score corr: 0.981, top-1 preserved
Makefile targets: planar-eval, planar-quant-test.
- ds4.h: add planar_kv_cache and dump_comp_kv to engine options
- ds4.c: dual-cache strategy (FP32 + Planar3), comp_kv_for_attn() dequant helper, conditional planar staging in decode scratch, checkpoint load quantizes FP32→Planar3, dump-comp-kv tool
- ds4_cli.c: --planar-kv-cache and --dump-comp-kv flags
- ds4_planar_quant.c/h: soft asserts for dim contract, attribution
- metal/dsv4_misc.metal: Planar3 centroids/tables and dequant helper for Phase 2 inline attention dequant
- Makefile: ds4_planar_quant.o in all build variants
- tests/planar_quant_test.c: dim-mismatch edge case test
- tools/planar_eval.c: ds4_like mode, multi-query eval improvements
…tize
- metal/dsv4_misc.metal: rename pad0->comp_kv_planar in args struct, add cooperative Planar3 dequant path in both indexed attention kernels (heads8 and rb16), add kernel_planar3_quantize_row for GPU-side FP32->Planar3 conversion with midpoint-based fast quantization
- ds4_metal.m: thread comp_kv_planar through indexed mixed attention bridge, add ds4_gpu_planar3_quantize_tensor dispatch function, register planar3 quantize pipeline
- ds4_gpu.h: add comp_kv_planar param and planar3 quantize declaration
- ds4.c: allocate Planar3 GPU cache tensors per layer (layer_attn_comp_planar), add planar_kv_cache param through metal_graph_alloc_raw_cap, wire new parameter through all call sites
- Add metal_graph_quantize_attn_comp_planar() helper that reads from FP32 staging (F16 path) or FP32 cache, quantizes to Planar3 on GPU
- Call after all 4 compressed-row commit sites (single-row decode, batch prefill, chunked prefill, batch single-row)
- Add metal_graph_attn_comp_for_attention() / metal_graph_attn_comp_is_planar() helpers to select Planar3 cache when enabled
- Pass Planar3 cache and flag to all 4 indexed attention dispatch sites
- Rebuild Planar3 GPU cache after checkpoint restore (read-back via CPU, quantize, upload) for both F16 and FP32 comp cache paths
- Add Planar3 tensor to layer allocation validation check
P0: Move planar_staging from ds4_cpu_decode_scratch to ds4_layer_cache,
eliminating per-token xmalloc/free in CPU decode and prefill hot-paths.
P0: Optimize _rb16 Planar3 dequant to use all 256 threads (2 rows/iteration).
P1: Add staging capacity guard in metal_graph_quantize_attn_comp_planar.
P1: Fix fp16_to_fp32 subnormal and infinity handling.
P1: Document ds4_session_dump_comp_kv single-layer limitation.
P2: Remove unused kv_cache_uses_planar (now per-layer staging).
P2: Add zero-norm and single-element edge-case tests.
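The FP16 subnormal and infinity cases named in the P1 item above are the usual trouble spots in a hand-written half-to-float conversion. The following is a minimal sketch of a conversion that handles them, written from the IEEE 754 binary16 layout; the function name is hypothetical and this is not the ds4 implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Convert raw IEEE 754 binary16 bits to a float, covering zero,
 * subnormals, normals, infinities, and NaNs. */
static float fp16_bits_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0) {
        if (man == 0) {
            bits = sign;                               /* signed zero */
        } else {
            /* Subnormal: value is man * 2^-24. Shift until the implicit
             * leading 1 appears, counting the extra shifts in e. */
            int e = -1;
            do { man <<= 1; e++; } while ((man & 0x400u) == 0);
            man &= 0x3FFu;
            bits = sign | ((uint32_t)(127 - 15 - e) << 23) | (man << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);       /* infinity or NaN */
    } else {
        bits = sign | ((exp + (127 - 15)) << 23) | (man << 13);  /* normal */
    }

    float f;
    memcpy(&f, &bits, sizeof f);  /* well-defined type pun */
    return f;
}
```

Because the block norm is stored as FP16, an error in these branches would scale every element of the affected block, which is why rows with very small norms are worth testing.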
Add comp_kv_planar parameter to all three FA wrapper functions (decode_heads, decode_mixed_batch, prefill_static_mixed) and pass Planar3→F16 dequant path through the internal dispatch chain. FA encoder functions now branch: if comp_kv_planar, use the new kernel_planar3_dequant_to_f16_rows to decompress directly into g_flash_attn_kv_buffer; otherwise use existing copy_to_f16 path. ds4.c dispatch sites use metal_graph_attn_comp_for_attention() and metal_graph_attn_comp_is_planar() to select the correct tensor.
When --planar-kv-cache-only is enabled (implies --planar-kv-cache):
- GPU: skips layer_attn_comp_cache allocation, only allocates Planar3
- CPU: skips attn_comp_kv allocation, quantizes directly to Planar3
- Checkpoint save: writes Planar3 bytes directly instead of FP16/FP32
- Checkpoint restore: reads Planar3 directly, no CPU quantize step
- metal_graph_store_attn_comp_stage: returns true when comp cache is NULL
- kv_cache_push_comp: handles NULL rows pointer (planar-only path)
This eliminates the dual-cache memory overhead. Combined with the FA Planar3→F16 pre-dequant path, all attention paths work without the persistent FP16/FP32 compressed cache.
Thanks @antirez for ds4 — a remarkably clean and well-structured inference engine.
Summary
This PR adds an opt-in, experimental Planar3 KV-cache quantization based on the PlanarQuant approach. 2D Givens rotations decorrelate 128-dim blocks of KV cache rows, then Lloyd-Max 3-bit centroids quantize the rotated coefficients. For ds4's 512-dim heads, each compressed row is 200 bytes (4 blocks × 50 bytes) — a 5.12× density ratio vs FP16.
What's Included
Core Codec (ds4_planar_quant.c/h)
- ds4_row_planar3 struct: 200 bytes per 512-dim row (4 × ds4_block_planar3 at 50 bytes each)
Metal GPU Path
- GPU-side quantize (kernel_planar3_quantize_row): cooperative, 128 threads per row
- Dequant to F16 (kernel_planar3_dequant_to_f16_rows): pre-stage for FlashAttention
- Planar3 dequant path in kernel_dsv4_indexed_mixed_attention_heads8 and _rb16
CPU Hot-Path
- planar_staging buffer for dequant (no per-token malloc)
- comp_kv_for_attn(): transparent dequant from Planar3
Planar-Only Mode (--planar-kv-cache-only)
- Implies --planar-kv-cache
Checkpoint Compatibility
Offline Quality Evaluator (tools/planar_eval)
Tests (tests/planar_quant_test.c)
Status
Experimental / RFC. The implementation passes its static checks and unit tests. The codec round-trips random vectors at an average cosine similarity of about 0.985. End-to-end quality validation on a real 80GB+ DS4 model with compressed KV dumps is still needed before this can be recommended for production use.
Memory Savings
With --planar-kv-cache-only, each compressed KV row is 200 bytes instead of 1024 bytes (FP16). The dual-cache overhead is eliminated: only Planar3 bytes are stored persistently. FA paths use a transient F16 scratch buffer for dequant.
Usage
References & Related Work
Planar3 sits within a family of rotation-based KV-cache compression methods that reduce the d×d random orthogonal projection (originally from TurboQuant) to progressively lighter structures:
Core Methods
* RotorQuant PPL from production llama.cpp benchmarks on RTX 5090.
Evolution: TurboQuant (dense WHT) → RotorQuant (3D Clifford, by John D. Pope) → IsoQuant (4D quaternion) / PlanarQuant (2D Givens, both by Zhongping Ji / ParaMind2025). RotorQuant's README credits ParaMind2025 for designing PlanarQuant and IsoQuant.
PlanarQuant is the lightest variant — 128 params, 256 FMAs — making it the best fit for fused GPU kernels in production inference engines. Its quality is competitive with TurboQuant at a fraction of the compute.
Community Projects