feat(inference): add DeepSeek V4 Pro model architecture #432
Open
Oseltamivir wants to merge 9 commits into
Add a MODEL_ARCHITECTURES entry for DeepSeek V4 Pro (1.6T/49B MoE, 61 layers, 1M context) following the existing per-model pattern. Attention is modeled as a Hybrid stack of two interleaved compressed variants, Heavily Compressed Attention (31 layers) and Compressed Sparse Attention (30 layers), each carrying a 128-token sliding-window branch and a learnable attention sink.

Surface sliding-window attention on both alternating blocks: the diagram's window note now derives from a per-spec slidingWindow field instead of the block index, so hybrid models show window=128 on every attention variant (gpt-oss behavior preserved). The specs-bar attention cell now derives from attentionType, so it reads "Hybrid" instead of the hardcoded "Sink/Full GQA".

Sourced from deepseek-ai/DeepSeek-V4-Pro (config.json, inference/model.py, DeepSeek_V4.pdf).
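The per-spec window derivation described above can be sketched roughly as follows. This is an illustration, not the repository's code: only slidingWindow and attentionType are named in the commit message, while the other field names, the windowNote helper, and the gpt-oss-like values are assumptions made for the example.

```typescript
// Hypothetical shapes: only `slidingWindow` and `attentionType` appear in the PR text.
interface AlternatingLayerSpec {
  label: string; // assumed field, e.g. "HCA" or "CSA"
  count: number; // assumed field: number of layers using this variant
  slidingWindow?: number; // window size in tokens; absent for full attention
}

interface ModelArchitecture {
  name: string;
  totalLayers: number;
  attentionType: string;
  alternating: AlternatingLayerSpec[];
}

// Counts as stated in this commit message (31 HCA + 30 CSA = 61).
const deepseekV4Pro: ModelArchitecture = {
  name: "DeepSeek V4 Pro",
  totalLayers: 61,
  attentionType: "Hybrid",
  alternating: [
    { label: "HCA", count: 31, slidingWindow: 128 },
    { label: "CSA", count: 30, slidingWindow: 128 },
  ],
};

// Derive the window note from the spec itself rather than from the block index,
// so every variant that carries a window gets a note.
function windowNote(spec: AlternatingLayerSpec): string | null {
  return spec.slidingWindow !== undefined ? `window=${spec.slidingWindow}` : null;
}

// Illustrative gpt-oss-like pair: the sliding spec carries a window, the full one does not.
const gptOssLike: AlternatingLayerSpec[] = [
  { label: "Sliding", count: 1, slidingWindow: 128 },
  { label: "Full", count: 1 },
];

const v4Notes = deepseekV4Pro.alternating.map(windowNote);
const gptOssNotes = gptOssLike.map(windowNote);
```

With an index-based rule (note only on block 0), the second hybrid block would lose its window label; keying on the spec avoids that without special-casing any model.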
…ph centering

DeepSeek V4 hybrid attention now drills down: expanding the CSA/HCA attention block reveals the sliding-window branch as its own block alongside the compressed branch (lightning indexer for CSA, heavy compression for HCA), converging at shared-KV MQA + sink + output projection. Gated to attentionType === 'Hybrid' so gpt-oss is unchanged.

Also fixes two diagram issues affecting all models:
- Residual bypass tapped at the RMSNorm's top edge, so its horizontal connector ran across the norm block. Tap from the arrow gap above the norm instead.
- Circle glyphs (+, ×, −) rendered off-center because dominant-baseline: central is unreliable (Safari falls back to the alphabetic baseline). Use dy=0.35em, which centers consistently across browsers and matches central where it already worked.
…yphs

Addresses three rendering issues in the model architecture diagram:
- Hybrid (CSA/HCA) attention now drills down into symmetric 2x2 columns: Local (Sliding Window + Attention Sink) beside the two-stage Compressed branch (compression + selector). This removes the lonely long connector that made the expanded box look unbalanced and promotes the attention sink to an explicit block; the merge block is now plain "Shared-KV MQA".
- The +, -, and x symbols inside merge/expand circles are drawn as geometric strokes instead of <text>. Font baseline drift (even with dy tuning) left the glyph sitting slightly low, which also made the residual bypass line read as misaligned with the "+". The strokes are centered on the circle's center, so the residual line is now co-linear with the arm.
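A minimal sketch of the stroke-based glyph idea, assuming each symbol is described by line segments computed from the circle's center. The function name glyphStrokes, the Segment type, and the arm parameter are hypothetical; the actual renderer may structure this differently.

```typescript
// Hypothetical helper: line segments for a glyph centered exactly on (cx, cy).
type Segment = { x1: number; y1: number; x2: number; y2: number };
type Glyph = "+" | "-" | "x";

function glyphStrokes(glyph: Glyph, cx: number, cy: number, arm: number): Segment[] {
  const horizontal: Segment = { x1: cx - arm, y1: cy, x2: cx + arm, y2: cy };
  const vertical: Segment = { x1: cx, y1: cy - arm, x2: cx, y2: cy + arm };
  if (glyph === "-") return [horizontal];
  if (glyph === "+") return [horizontal, vertical];
  // Diagonals: scale by 1/sqrt(2) so each arm has the same length as the straight arms.
  const d = arm * Math.SQRT1_2;
  return [
    { x1: cx - d, y1: cy - d, x2: cx + d, y2: cy + d },
    { x1: cx - d, y1: cy + d, x2: cx + d, y2: cy - d },
  ];
}

// The horizontal arm of "+" lies on y = cy, so a residual line entering the
// circle at height cy is co-linear with it regardless of font metrics.
const plusStrokes = glyphStrokes("+", 10, 20, 4);
```

Each segment could then be emitted as an SVG line element; because no text layout is involved, there is no baseline to drift.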
The sliding-window and compressed branches are two KV *sources* whose selected indices are unioned into a single shared-KV MQA softmax, not two attentions merged after the fact. The attention sink is a per-head learnable softmax-denominator bias on that MQA (model.py attn_sink / kernel.py sum_exp += exp(attn_sink - max)), not literal "first tokens" in the local branch.

- Local branch is just the sliding-window source (one block); the sink moves back onto the merge block as "Shared-KV MQA + Sink".
- CSA compressed branch = Token Compression -> Lightning Indexer (2 stages); HCA = a single Heavy Compression source. This makes CSA a 1-vs-2 split again.
- Center each column within the shared column area in drawParallelFlow so an unequal split reads as an intentional branch merge instead of leaving the shorter column's connector dangling as a long unattached line. Also improves the 2-vs-1 SwiGLU expert merge.
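To make the "single softmax with a denominator-only sink" point concrete, here is a small numeric sketch. It is not the model's kernel: the function name hybridAttentionWeights is hypothetical, the two sources are assumed to contribute disjoint key sets whose scores are simply concatenated, and the stabilizing max here includes the sink, which changes nothing mathematically because softmax is shift-invariant.

```typescript
// Hypothetical sketch of one softmax over keys from two sources, plus a sink term.
function hybridAttentionWeights(
  windowScores: number[], // scores for keys from the sliding-window source
  compressedScores: number[], // scores for the selected compressed keys
  sink: number, // per-head learnable sink logit
): number[] {
  // One score list: both sources feed the same softmax.
  const scores = [...windowScores, ...compressedScores];
  const max = Math.max(sink, ...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  // The sink only enlarges the denominator (cf. sum_exp += exp(attn_sink - max));
  // it has no value vector and receives no output weight of its own.
  const sumExp = exps.reduce((a, b) => a + b, 0) + Math.exp(sink - max);
  return exps.map((e) => e / sumExp);
}

// Three keys with equal scores and a sink at the same logit:
// each key gets 1/4, and the remaining 1/4 is absorbed by the sink.
const weights = hybridAttentionWeights([0, 0], [0], 0);
const total = weights.reduce((a, b) => a + b, 0);
```

Because the weights are normalized jointly, the result differs from running two separate attentions and summing their outputs, which is the misreading the diagram is meant to avoid.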
…tmax

When the DeepSeek V4 hybrid attention drill-down is expanded, show a short note clarifying that the Local (sliding-window) and Compressed (CSA/HCA) columns are two KV *sources* unioned into a single shared-KV MQA softmax (not two separate attentions that get summed), with the attention sink being a learnable per-head softmax-denominator bias. This prevents the parallel-column schematic from being read as two independent attention paths. The note is shown only while a hybrid attention block is expanded and is covered by an e2e assertion.
…te attention type

The 128-token sliding window is the shared local base of every hybrid layer (both HCA and CSA extend it; the final layer runs it alone), not a third attention type alongside CSA/HCA. Rename the features badge from "Sliding Window Attention (128 tokens)" to "Sliding window (128 tokens)" so it reads as a windowing mechanism rather than a standalone attention. The drill-down's "Sliding Window" KV-source block is unchanged.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 3c55577.
  <span className="font-medium text-foreground">single softmax</span> to the union of
  sliding-window + selected compressed keys, with a learnable per-head attention sink.
</p>
)}
Hybrid note ignores block collapse
Low Severity
The hybrid attention caption is shown whenever altAttention0 or altAttention1 stays in expandedBlocks, but collapsing an alternating block does not clear those ids. Users can see the softmax explanation with no matching drill-down in the SVG.
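Two possible ways to address this report, sketched under assumptions: the ids altAttention0 and altAttention1 come from the review comment, but the parent ids (altBlock0, altBlock1), the pairing table, and both function names are hypothetical and may not match how expandedBlocks is actually keyed.

```typescript
// Hypothetical pairing of each attention drill-down id with its parent block id.
const HYBRID_DRILLDOWNS: ReadonlyArray<{ child: string; parent: string }> = [
  { child: "altAttention0", parent: "altBlock0" },
  { child: "altAttention1", parent: "altBlock1" },
];

// Option A: derive caption visibility from both the drill-down and its parent,
// so a stale child id left behind after a collapse no longer shows the note.
function showHybridNote(expanded: ReadonlySet<string>): boolean {
  return HYBRID_DRILLDOWNS.some(
    ({ child, parent }) => expanded.has(child) && expanded.has(parent),
  );
}

// Option B: when a parent collapses, also drop the ids of its drill-downs.
function collapseBlock(expanded: ReadonlySet<string>, parent: string): Set<string> {
  const next = new Set<string>(expanded);
  next.delete(parent);
  for (const pair of HYBRID_DRILLDOWNS) {
    if (pair.parent === parent) next.delete(pair.child);
  }
  return next;
}

const open = new Set<string>(["altBlock0", "altAttention0"]);
const stale = new Set<string>(["altAttention0"]); // parent collapsed, child id left behind
```

Option A keeps the state shape untouched and only changes the render condition; Option B keeps the state itself consistent, which also helps any other code that reads expandedBlocks.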
…ons for V4

Two DeepSeek V4 architectural facts were only feature badges; surface them in the diagram structure.

Hash-routed layers (num_hash_layers=3): the first 3 MoE layers route by token id, not a learned gate. They now render as a separate stacked prefix block (between embedding and the alternating blocks) with a "Hash Router" instead of "MoE Router". Alternating HCA/CSA counts drop 31/30 → 29/29 so they describe the learned-router layers (3 + 29 + 29 = 61); drawExpertGrid gains optional routerLabel / routerSub params.

mHC (hc_mult=4): residuals are replaced by 4 parallel hyper-connection streams with learned, Sinkhorn-normalized A/B/C mixing. Residual merges now render as an "mHC ×N" mixer node instead of a plain "+" when arch.hyperConnections > 1, plus a caption shown while a block exposing the nodes is expanded. Models without hyper-connections keep the "+" residual.

Adds arch fields hashRoutedLayers and hyperConnections; unit + e2e coverage.
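The layer bookkeeping and label selection in this commit can be sketched as below. The fields hashRoutedLayers and hyperConnections are named in the commit message; the ArchCounts shape, alternatingCounts, and the three helper functions are assumptions made for illustration.

```typescript
// Hypothetical subset of the architecture metadata.
interface ArchCounts {
  totalLayers: number;
  alternatingCounts: number[]; // assumed field: layer count per alternating block
  hashRoutedLayers?: number;
  hyperConnections?: number;
}

// Check that prefix layers plus alternating layers account for every layer.
function layerCountsAddUp(arch: ArchCounts): boolean {
  const alternating = arch.alternatingCounts.reduce((sum, n) => sum + n, 0);
  return (arch.hashRoutedLayers ?? 0) + alternating === arch.totalLayers;
}

// Residual merge label: an mHC mixer only when there is more than one stream.
function residualLabel(arch: ArchCounts): string {
  const n = arch.hyperConnections ?? 1;
  return n > 1 ? `mHC ×${n}` : "+";
}

// Router label for a block: hash-routed prefix layers get their own name.
function routerLabel(isHashRouted: boolean): string {
  return isHashRouted ? "Hash Router" : "MoE Router";
}

const v4: ArchCounts = {
  totalLayers: 61,
  alternatingCounts: [29, 29],
  hashRoutedLayers: 3,
  hyperConnections: 4,
};
const plain: ArchCounts = { totalLayers: 2, alternatingCounts: [1, 1] };
```

A check like layerCountsAddUp is the kind of invariant a unit test can pin down, so that moving layers into the prefix block (31/30 to 29/29 plus 3) cannot silently change the total.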
Two diagram glitches in expanded transformer blocks:
- The incoming arrow stopped at the dashed container border, leaving no line above the first RMSNorm. Route each block's incoming arrow to its first RMSNorm when the block is expanded (through the border), so there is a continuous connector; collapsed blocks still target the block top.
- The attention drill-down rect sat flush against the attention block's bottom border, reading as an overlap. Add a small gap (drillGap) between an attention block and its expansion flow.
The specs bar showed "6/385", but the always-on shared expert is active too, so 7 experts run per token (6 routed + 1 shared). Show "6+1/385" for shared-expert MoE models (e.g. R1 → "8+1/257") so the active count isn't undersold; the title's "N active" params and the router subtitle already account for the shared expert.
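A minimal sketch of the label rule, assuming the total already includes the shared expert (as the "6+1/385" and "8+1/257" strings suggest). The function name and parameter names are hypothetical, and the last call uses made-up values for a model without a shared expert.

```typescript
// Hypothetical formatter for the specs-bar expert cell.
// `totalExperts` is assumed to already count the shared expert(s).
function formatExpertCount(
  activeRouted: number,
  totalExperts: number,
  sharedExperts: number = 0,
): string {
  return sharedExperts > 0
    ? `${activeRouted}+${sharedExperts}/${totalExperts}`
    : `${activeRouted}/${totalExperts}`;
}

const v4Label = formatExpertCount(6, 385, 1); // shared-expert model
const r1Label = formatExpertCount(8, 257, 1); // shared-expert model
const noSharedLabel = formatExpertCount(4, 128); // illustrative values, no shared expert
```

Keeping the routed and shared counts separate in the label preserves the router's top-k while still making the extra always-on expert visible.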


Summary
Adds a model-architecture entry for DeepSeek V4 Pro, so the inference tab's "Model Architecture" diagram renders for it like the other models. Sourced from deepseek-ai/DeepSeek-V4-Pro (config.json, inference/model.py, DeepSeek_V4.pdf).

Architecture
Hybrid attention rendered as two interleaved blocks that sum to 61:
- Heavily Compressed Attention (HCA): 31 layers
- Compressed Sparse Attention (CSA): 30 layers

Shared-renderer changes (small, justified)
- The diagram's window note now derives from a per-spec field (AlternatingLayerSpec.slidingWindow) instead of bi === 0, so hybrid models show window=128 on every attention variant. gpt-oss behavior is preserved (its sliding spec carries the window; its full-attention block does not).
- The specs-bar attention cell now derives from attentionType → shows Hybrid (gpt-oss still reads Sink/Full GQA).

Overlay (?unofficialrun=) support
N/A in the chart-data sense: the diagram is static model metadata keyed by selectedModel, not benchmark/overlay data. It renders for whatever model is selected, including when an unofficial run is loaded; there is no overlay code path to handle.

Tests
- Unit (model-architectures.test.ts): field assertions, CSA/HCA specs, counts sum to 61, SWA surfaced in features, MoE summary; also documents the per-spec window on gpt-oss. → 53 passed
- E2E (model-architecture.cy.ts): describe block mirroring gpt-oss (Hybrid badge, two alternating blocks, alternating indicator, features incl. sliding window). dsv4 is in the availability fixture, so it's selectable. (Written but not executed locally: no dev server/browser here.)

Verification
pnpm typecheck ✅ · unit 53 passed ✅ · pnpm lint ✅ · pnpm fmt ✅ (pre-commit hook re-ran lint/format/typecheck, all green)

Note
releaseDate is set to the HF snapshot date (2026-06-08) as a proxy. Please correct it if the actual public release date differs.

Note
Medium Risk
Large, shared changes to the D3 architecture renderer and expand/collapse layout could affect diagrams for other models; behavior is mostly additive with targeted gpt-oss compatibility fixes.
Overview
Adds DeepSeek V4 Pro to the inference Model Architecture diagram via new static metadata (1.6T MoE, Hybrid CSA/HCA, 3 hash-routed prefix layers, mHC ×4, 1M context) and extends the shared SVG renderer to match.

The diagram now stacks a hash-routed MoE prefix block (Hash Router, token-id routing), then two alternating CSA/HCA blocks with an expandable hybrid attention drill-down (getHybridAttentionSubBlocks: local sliding window + compressed branch → single softmax). Residual adds can render as mHC ×N pills; helper copy explains union-softmax hybrid attention and hyper-connections when relevant blocks are open.

Cross-model polish: sliding-window labels come from AlternatingLayerSpec.slidingWindow (gpt-oss unchanged), specs show Hybrid and 6+1/385 experts when a shared expert exists, plus minor SVG layout (drill gap, centered parallel columns, stroke-based +/- glyphs). Unit and Cypress coverage added for V4 Pro.

Reviewed by Cursor Bugbot for commit 9a60028. Bugbot is set up for automated code reviews on this repo.