Labels: documentation (Improvements or additions to documentation)
Description
📚 The doc issue
https://docs.vllm.ai/en/v0.8.5.post1/features/quantization/quantized_kvcache.html
Question:
In the FP8 KV Cache implementation, after computing attention scores and softmax at higher precision (FP16/BF16), is the resulting attention weight matrix:
1. Quantized to FP8 and multiplied directly with the FP8 V cache, or
2. Multiplied with the V cache after dequantizing V to higher precision?
The documentation mentions "no fused dequantization and attention operations yet" but doesn't specify the precision of this final multiplication. Clarifying this detail would help readers understand the accuracy-performance tradeoff.
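For concreteness, here is a minimal PyTorch sketch of the two interpretations being asked about. This is not vLLM's actual attention kernel: the shapes, per-tensor scales, and the `fp8_matmul` placeholder are illustrative assumptions, and FP8 tensors require PyTorch >= 2.1.

```python
import torch

# Illustrative only: single attention head, per-tensor scales,
# FP8 E4M3 storage for the V cache.
num_tokens, head_dim = 16, 64
probs_fp16 = torch.softmax(
    torch.randn(num_tokens, num_tokens, dtype=torch.float16), dim=-1
)
v_scale = torch.tensor(0.05, dtype=torch.float16)
v_fp8 = (torch.randn(num_tokens, head_dim, dtype=torch.float16) / v_scale).to(
    torch.float8_e4m3fn
)

# Interpretation 1: quantize the attention probabilities to FP8 and do the
# P @ V product in FP8 (would need an FP8 GEMM, e.g. torch._scaled_mm on
# supported hardware; left as pseudocode with a hypothetical helper).
p_scale = probs_fp16.abs().max().float() / torch.finfo(torch.float8_e4m3fn).max
probs_fp8 = (probs_fp16.float() / p_scale).to(torch.float8_e4m3fn)
# out_1 = fp8_matmul(probs_fp8, v_fp8) * (p_scale * v_scale)

# Interpretation 2: dequantize V back to FP16 and do the P @ V product at
# higher precision (FP16 matmul may require a GPU or a recent PyTorch build).
v_fp16 = v_fp8.to(torch.float16) * v_scale
out_2 = probs_fp16 @ v_fp16
```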
Thanks!
Suggest a potential alternative/fix
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.