speed-bench: add M5 Max 128GB q2-q4-imatrix curve (addresses #226) #255
kenahrens wants to merge 1 commit into
Bench data for the q2-q4-imatrix mixed Flash quant (last 6 expert layers Q4K, rest IQ2XXS) on M5 Max 128GB, macOS 26.4.1. Fills the unanswered request in antirez#226 for q2-q4-imatrix benchmark numbers, and extends published M5 Max coverage past the 65K point from antirez#97 into the 100K-200K range.

Command:
ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt --ctx-start 2048 --ctx-max 200000 --step-incr 16384 --gen-tokens 128

Build: ad0209f (Metal 4 tensor API + decode-indexer top-k path from antirez#169 enabled).

Highlights vs M5 Max q2-imatrix from antirez#97 (same hardware tier):
- 2K decode: 34.4 t/s (vs 31.5 t/s, +9%)
- 2K prefill: 413.9 t/s (vs 372.2 t/s, +11%)
- 32K decode: 27.8 t/s (vs 28.9 t/s, -4%)
- 65K decode: 25.8 t/s (vs 27.0 t/s, -4%)

q2-q4 is faster than q2 at low ctx (Q4 layers + Metal 4 win) and ~4% slower above 32K (more bandwidth-bound).

Closes antirez#226 with data.
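As a quick sanity check on the percentages quoted in the highlights, the relative deltas can be recomputed from the throughput pairs. This is only a sketch: the numbers are copied from the description above, and the dictionary layout and helper name are illustrative, not part of the repo.

```python
# Throughput figures (tokens/s) quoted in the PR description.
# Each entry: label -> (q2-q4-imatrix from this PR, q2-imatrix from #97).
pairs = {
    "2K decode": (34.4, 31.5),
    "2K prefill": (413.9, 372.2),
    "32K decode": (27.8, 28.9),
    "65K decode": (25.8, 27.0),
}


def relative_delta_pct(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


# Hypothetical helper structure: one rounded-off delta per quoted pair.
deltas = {label: relative_delta_pct(new, old) for label, (new, old) in pairs.items()}

for label, delta in deltas.items():
    print(f"{label}: {delta:+.1f}%")
```

Rounded to whole percent, these agree with the +9%, +11%, -4%, and -4% figures listed above.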
|
I tested it on an M5 Max with 128GB RAM and got about the same results, so I think it is worth adding. If needed, I can run more tests and share benchmarks from my side.
|
Quality eval to pair with the speed numbers

Ran the repo's existing

q2-q4-imatrix is the better average fit on every aggregate metric and wins 69 of 100 individual prompts. The −7.1% NLL is on the same target token sequence, so it's a strict fit improvement, not a sampling artifact.

Caveat: largest single-case deltas favor q2 (cases 35, 7, 42 by 1.1–2.9 NLL). Per-case variance is higher with n=100 than aggregate stats suggest.

Combined with the prefill+decode numbers already in this PR, q2-q4-imatrix shows tighter agreement with the official model on next-token distributions at this sample size. Worth re-running at n≥500 before drawing strong conclusions, but the direction is consistent with the speed result.

Reproduce:
make -C gguf-tools quality-score
python3 gguf-tools/quality-testing/collect_official.py \
--prompts gguf-tools/quality-testing/prompts.jsonl \
--out gguf-tools/quality-testing/data/flash --count 100 --max-tokens 24
./gguf-tools/quality-testing/score_official ./q2-imatrix.gguf data/flash/manifest.tsv /tmp/q2.tsv 4096
./gguf-tools/quality-testing/score_official ./q2-q4-imatrix.gguf data/flash/manifest.tsv /tmp/q2q4.tsv 4096
python3 gguf-tools/quality-testing/compare_scores.py /tmp/q2.tsv /tmp/q2q4.tsv
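For readers who want to see what a "wins N of M prompts" and "relative NLL change" comparison involves, here is a minimal sketch. It is not the repo's compare_scores.py, whose exact metrics and TSV layout are not shown in this thread; the function name and the four toy NLL values are made up purely for illustration and are not benchmark results.

```python
from statistics import mean


def compare_nll(baseline, candidate):
    """Compare per-case NLL lists scored on the same target tokens (lower is better).

    Returns (cases won by candidate, relative change of mean NLL in percent,
    largest single-case regression of candidate versus baseline).
    """
    if len(baseline) != len(candidate):
        raise ValueError("both runs must score the same cases")
    wins = sum(1 for b, c in zip(baseline, candidate) if c < b)
    rel_change_pct = (mean(candidate) - mean(baseline)) / mean(baseline) * 100.0
    worst_regression = max(c - b for b, c in zip(baseline, candidate))
    return wins, rel_change_pct, worst_regression


# Toy values only (hypothetical, NOT measured data).
toy_q2 = [2.0, 3.0, 1.0, 4.0]
toy_q2q4 = [1.8, 2.7, 1.5, 3.5]

wins, rel_change_pct, worst_regression = compare_nll(toy_q2, toy_q2q4)
print(f"wins={wins}/{len(toy_q2)}, mean NLL change={rel_change_pct:+.1f}%, "
      f"worst single-case regression={worst_regression:+.2f}")
```

The toy data mirrors the pattern described above: the candidate can win most cases and improve the mean while still losing an individual case by a large margin, which is why a bigger sample is worth running.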
|
cc @antirez: adding an independent M5 Max 128GB reproduction plus the practical q2 → q2-q4 speed-gain summary for review.

Value summary

From this PR's q2 vs q2-q4 comparison on M5 Max 128GB:

Prefill / prompt ingestion

This is the main win, especially for coding-agent and long-context use where ds4 needs to ingest repo context, tool history, or large prompts.

Decode / generation

Decode is mixed: faster at short context, slightly slower at longer context.

So the practical result is: q2-q4 looks like a net win for interactive use on M5 Max 128GB, with much faster prefill, faster short-context decode, and only a small long-context decode penalty.

Independent reproduction

Setup:
./ds4-bench \
-m ds4flash.gguf \
--prompt-file speed-bench/promessi_sposi.txt \
--ctx-start 2048 \
--ctx-max 200000 \
--step-incr 16384 \
--gen-tokens 128 \
--warm-weights \
--csv /tmp/m5max_q2q4_imatrix_current_main.csv

Result: I can reproduce the PR's curve. Decode is very close across the sweep, and KV bytes match exactly. Prefill differs more in the early/mid rows, likely thermal/run-to-run variance, but converges closely at long context.

My final row:
200000,1344,157.40,128,19.97,2776775308

So from a second M5 Max 128GB run: this PR's q2-q4-imatrix speed curve looks reproducible and worth merging.
Bench data for the q2-q4-imatrix mixed Flash quant on M5 Max 128GB (macOS 26.4.1), 14 frontiers from 2048 to 200000 tokens. Adds speed-bench/m5_max_q2q4_imatrix.csv + the auto-generated _ts.svg.

Addresses #226 (q2 vs q2-q4-imatrix benchmark request, previously unanswered) and fills the M5 Max coverage gap between the 65K point in #97 and the 256K point in #143.
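The "14 frontiers" count follows from the sweep flags used in this PR (start 2048, step 16384, max 200000) if one assumes the bench steps linearly and clamps the last frontier to the maximum. That stepping rule is an assumption, not something confirmed from the ds4-bench source, and the `frontiers` function below is hypothetical. It is, however, consistent with the 1344 in the second column of the reproduction's final row.

```python
def frontiers(ctx_start: int, ctx_max: int, step_incr: int) -> list[int]:
    """Assumed sweep: linear steps from ctx_start, final point clamped to ctx_max."""
    points = []
    ctx = ctx_start
    while ctx < ctx_max:
        points.append(ctx)
        ctx += step_incr
    points.append(ctx_max)  # last frontier is the maximum itself
    return points


pts = frontiers(ctx_start=2048, ctx_max=200000, step_incr=16384)
last_step = pts[-1] - pts[-2]  # tokens added between the last two frontiers

print(f"{len(pts)} frontiers, last full step at {pts[-2]}, final increment {last_step}")
```

Under this assumption the sweep has 13 regular points ending at 198656 plus the clamped 200000 point, and the final increment is 1344 tokens.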
Run on build ad0209f with the Metal 4 tensor API path (#15) and the decode-indexer top-k optimization (#169) enabled:

Comparison vs M5 Max q2-imatrix from #97 (same hardware tier):
q2-q4 is faster than q2 across the board for prefill (Q4 last-6-layers + Metal 4 tensor path), and faster for decode below ~16K, with a small drop (~4%) above 32K where the extra Q4 bandwidth cost shows up. The headline finding: the mixed quant is a net win for interactive use and the cost above 32K is much smaller than the README's suggestion would imply.
KV cache observed at ~13.4 KB/token marginal, 2.78 GB at the 200K frontier. (Worth noting: this is the bench's reported kvcache_bytes column, which is much higher than what #164 saw via RSS on M4 Max. Different measurement window.)
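The 2.78 GB figure can be cross-checked against the final row posted in the independent reproduction. The sketch below assumes the first CSV field is the context size in tokens and the last is kvcache_bytes; the column order is inferred from the values, not from a header shown in this thread.

```python
# Final row quoted in the reproduction comment.
final_row = "200000,1344,157.40,128,19.97,2776775308"

fields = final_row.split(",")
ctx_tokens = int(fields[0])      # assumed: context size at this frontier
kvcache_bytes = int(fields[-1])  # assumed: the bench's kvcache_bytes column

kv_gb = kvcache_bytes / 1e9  # decimal gigabytes
avg_bytes_per_token = kvcache_bytes / ctx_tokens

print(f"{kv_gb:.2f} GB total, {avg_bytes_per_token:.0f} bytes/token on average")
```

Note that this is an average over the whole context, so it includes any fixed allocation and need not equal the marginal per-token figure quoted above.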