feat(inference): add DeepSeek V4 Pro model architecture #432

Open
Oseltamivir wants to merge 9 commits into master from feat/dsv4-pro-model-architecture

Oseltamivir commented Jun 8, 2026

Contributor

Summary

Adds a model-architecture entry for DeepSeek V4 Pro, so the inference tab's "Model Architecture" diagram renders for it like the other models. Sourced from deepseek-ai/DeepSeek-V4-Pro (config.json, inference/model.py, DeepSeek_V4.pdf).

Architecture

  • MoE 1.6T total / 49B active, 61 layers, hidden 7168, vocab 129280, 1M context
  • 128 query heads / 1 KV head (MLA-lineage shared-KV MQA), head_dim 512
  • 384 routed + 1 shared experts, top-6; expert FFN 3072
  • Hybrid attention rendered as two interleaved blocks that sum to 61:
    • HCA — Heavily Compressed Attention (31 layers)
    • CSA — Compressed Sparse Attention, lightning indexer top-1024 (30 layers)
  • Sliding Window Attention (128 tokens) + learnable attention sink on both variants
  • Features: mHC, sqrt-softplus routing, aux-loss-free balancing, hash routing (first 3 layers), MTP, YaRN 1M, FP4 experts + FP8, Muon

Shared-renderer changes (small, justified)

  • Sliding-window note is now driven per spec (AlternatingLayerSpec.slidingWindow) instead of by the block-index check bi === 0, so hybrid models show window=128 on every attention variant. gpt-oss behavior is preserved (its sliding spec carries the window; its full-attention block does not).
  • Specs-bar "Attention" cell derives from attentionType → shows Hybrid (gpt-oss still reads Sink/Full GQA).
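A minimal sketch of the per-spec window note described above. Only the field name `slidingWindow` comes from this PR; the interface shape, the `label` field, the helper name `slidingWindowNote`, and the sample specs are assumptions made for illustration.

```typescript
// Hypothetical shape: only `slidingWindow` is named in this PR.
interface AlternatingLayerSpec {
  label: string;
  slidingWindow?: number; // window size in tokens; omitted for full attention
}

// The note depends on the spec itself, not on the block's position (bi) in the stack.
function slidingWindowNote(spec: AlternatingLayerSpec): string | null {
  return spec.slidingWindow === undefined ? null : `window=${spec.slidingWindow}`;
}

// Illustrative data: a hybrid model where both variants carry the window...
const hybridSpecs: AlternatingLayerSpec[] = [
  { label: 'HCA', slidingWindow: 128 },
  { label: 'CSA', slidingWindow: 128 },
];

// ...and a gpt-oss-like model where only the sliding spec does.
const slidingAndFullSpecs: AlternatingLayerSpec[] = [
  { label: 'Sliding', slidingWindow: 128 },
  { label: 'Full' },
];

const hybridNotes = hybridSpecs.map(slidingWindowNote);
const slidingAndFullNotes = slidingAndFullSpecs.map(slidingWindowNote);
```

Keying the note on the spec keeps the gpt-oss-style case working without a special branch: a spec without a window simply produces no note.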

Overlay (?unofficialrun=) support

N/A in the chart-data sense — the diagram is static model metadata keyed by selectedModel, not benchmark/overlay data. It renders for whatever model is selected, including when an unofficial run is loaded; there is no overlay code path to handle.

Tests

  • Unit (model-architectures.test.ts): field assertions, CSA/HCA specs, counts sum to 61, SWA surfaced in features, MoE summary; also documents the per-spec window on gpt-oss. → 53 passed
  • E2E (model-architecture.cy.ts): describe block mirroring gpt-oss (Hybrid badge, two alternating blocks, alternating indicator, features incl. sliding window). dsv4 is in the availability fixture, so it's selectable. (written but not executed locally — no dev server/browser here)

Verification

pnpm typecheck ✅ · unit 53 passed ✅ · pnpm lint ✅ · pnpm fmt ✅ (pre-commit hook re-ran lint/format/typecheck — all green)

Note

releaseDate is set to the HF snapshot date (2026-06-08) as a proxy — please correct if the actual public release date differs.


Note

Medium Risk
Large, shared changes to the D3 architecture renderer and expand/collapse layout could affect diagrams for other models; behavior is mostly additive with targeted gpt-oss compatibility fixes.

Overview
Adds DeepSeek V4 Pro to the inference Model Architecture diagram via new static metadata (1.6T MoE, Hybrid CSA/HCA, 3 hash-routed prefix layers, mHC ×4, 1M context) and extends the shared SVG renderer to match.

The diagram now stacks a hash-routed MoE prefix block (Hash Router, token-id routing), then two alternating CSA/HCA blocks with an expandable hybrid attention drill-down (getHybridAttentionSubBlocks: local sliding window + compressed branch → single softmax). Residual adds can render as mHC ×N pills; helper copy explains union-softmax hybrid attention and hyper-connections when relevant blocks are open.

Cross-model polish: sliding-window labels come from AlternatingLayerSpec.slidingWindow (gpt-oss unchanged), specs show Hybrid and 6+1/385 experts when a shared expert exists, plus minor SVG layout (drill gap, centered parallel columns, stroke-based +/- glyphs). Unit and Cypress coverage added for V4 Pro.

Reviewed by Cursor Bugbot for commit 9a60028. Bugbot is set up for automated code reviews on this repo.

Add a MODEL_ARCHITECTURES entry for DeepSeek V4 Pro (1.6T/49B MoE, 61
layers, 1M context) following the existing per-model pattern. Attention
is modeled as a Hybrid stack of two interleaved compressed variants —
Heavily Compressed Attention (31 layers) and Compressed Sparse Attention
(30 layers) — each carrying a 128-token sliding-window branch and a
learnable attention sink.

Surface sliding-window attention on both alternating blocks: the diagram's
window note now derives from a per-spec slidingWindow field instead of the
block index, so hybrid models show window=128 on every attention variant
(gpt-oss behavior preserved). The specs-bar attention cell now derives from
attentionType so it reads "Hybrid" instead of the hardcoded "Sink/Full GQA".

Sourced from deepseek-ai/DeepSeek-V4-Pro (config.json, inference/model.py,
DeepSeek_V4.pdf).

Oseltamivir requested a review from adibarra as a code owner June 8, 2026 07:57
vercel Bot commented Jun 8, 2026

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| inferencemax-app | Ready | Preview, Comment | Jun 8, 2026 10:26pm |

…ph centering

DeepSeek V4 hybrid attention now drills down: expanding the CSA/HCA
attention block reveals the sliding-window branch as its own block
alongside the compressed branch (lightning indexer for CSA, heavy
compression for HCA), converging at shared-KV MQA + sink + output
projection. Gated to attentionType === 'Hybrid' so gpt-oss is unchanged.

Also fixes two diagram issues affecting all models:
- Residual bypass tapped at the RMSNorm's top edge, so its horizontal
  connector ran across the norm block. Tap from the arrow gap above the
  norm instead.
- Circle glyphs (+, ×, −) rendered off-center because
  dominant-baseline: central is unreliable (Safari falls back to the
  alphabetic baseline). Use dy=0.35em, which centers consistently across
  browsers and matches central where it already worked.
…yphs

Addresses three rendering issues in the model architecture diagram:

- Hybrid (CSA/HCA) attention now drills down into symmetric 2x2 columns:
  Local (Sliding Window + Attention Sink) beside the two-stage Compressed
  branch (compression + selector). This removes the lonely long connector
  that made the expanded box look unbalanced and promotes the attention
  sink to an explicit block; the merge block is now plain "Shared-KV MQA".

- The +, -, and x symbols inside merge/expand circles are drawn as
  geometric strokes instead of <text>. Font baseline drift (even with dy
  tuning) left the glyph sitting slightly low, which also made the residual
  bypass line read as misaligned with the "+". The strokes are centered on
  the circle's center, so the residual line is now co-linear with the arm.
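The stroke-based glyphs described above could look roughly like the following. This is a sketch, not the renderer's code: the names `Glyph`, `Segment`, and `glyphSegments`, and the `arm` parameter, are hypothetical. Each returned segment would map onto one SVG `<line>` element.

```typescript
type Glyph = '+' | '-' | 'x';

interface Segment {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

// Build line segments centered exactly on (cx, cy), so no font baseline is involved.
function glyphSegments(glyph: Glyph, cx: number, cy: number, arm: number): Segment[] {
  const horizontal: Segment = { x1: cx - arm, y1: cy, x2: cx + arm, y2: cy };
  const vertical: Segment = { x1: cx, y1: cy - arm, x2: cx, y2: cy + arm };
  switch (glyph) {
    case '-':
      return [horizontal];
    case '+':
      return [horizontal, vertical];
    case 'x': {
      // Scale by 1/sqrt(2) so diagonal arms have the same length as straight ones.
      const d = arm * Math.SQRT1_2;
      return [
        { x1: cx - d, y1: cy - d, x2: cx + d, y2: cy + d },
        { x1: cx - d, y1: cy + d, x2: cx + d, y2: cy - d },
      ];
    }
  }
}

const plusSegments = glyphSegments('+', 50, 20, 4);
```

Because the horizontal arm of "+" sits at exactly `cy`, a residual line drawn at the circle's center height is co-linear with it by construction.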

The sliding-window and compressed branches are two KV *sources* whose
selected indices are unioned into a single shared-KV MQA softmax — not two
attentions merged after the fact. The attention sink is a per-head learnable
softmax-denominator bias on that MQA (model.py attn_sink / kernel.py
sum_exp += exp(attn_sink - max)), not literal "first tokens" in the local
branch.
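To make the "softmax-denominator bias" concrete, here is a small numeric sketch. It assumes the union of the two KV sources can be treated as one concatenated list of scores for a single head, and it folds the sink into the running max for numerical stability; both are assumptions, and `softmaxWithSink` is a hypothetical name rather than a function from model.py or kernel.py.

```typescript
// scores: logits for one query over the union of sliding-window keys and
// selected compressed keys. sink: the learnable per-head sink logit.
function softmaxWithSink(scores: number[], sink: number): number[] {
  // Assumption: include the sink in the max so every exponent is <= 0.
  const max = Math.max(sink, ...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  let sumExp = exps.reduce((a, b) => a + b, 0);
  // Mirrors the quoted update: sum_exp += exp(attn_sink - max).
  // The sink enlarges the denominator but contributes no value vector.
  sumExp += Math.exp(sink - max);
  return exps.map((e) => e / sumExp);
}

const localScores = [0]; // illustrative sliding-window logit
const compressedScores = [0]; // illustrative selected compressed logit
const weights = softmaxWithSink([...localScores, ...compressedScores], 0);
const totalWeight = weights.reduce((a, b) => a + b, 0);
```

With equal logits and a sink of 0, each key receives weight 1/3 and the weights sum to 2/3: the sink absorbs the remaining mass without attending to any token.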

- Local branch is just the sliding-window source (one block); the sink moves
  back onto the merge block as "Shared-KV MQA + Sink".
- CSA compressed branch = Token Compression -> Lightning Indexer (2 stages);
  HCA = a single Heavy Compression source. This makes CSA a 1-vs-2 split
  again.
- Center each column within the shared column area in drawParallelFlow so an
  unequal split reads as an intentional branch merge instead of leaving the
  shorter column's connector dangling as a long unattached line. Also
  improves the 2-vs-1 SwiGLU expert merge.
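The centering rule for unequal columns can be sketched as below. `drawParallelFlow` is the function named in the commit; the helper `centeredColumnOffsets` and its pixel-height inputs are hypothetical and only illustrate the arithmetic.

```typescript
// Given each column's content height, return the vertical offset that centers
// it within the shared column area (whose height is the tallest column).
function centeredColumnOffsets(columnHeights: number[]): number[] {
  const areaHeight = Math.max(0, ...columnHeights);
  return columnHeights.map((h) => (areaHeight - h) / 2);
}

// A 1-vs-2 split: a single 40px block beside a 100px two-stage column.
const splitOffsets = centeredColumnOffsets([40, 100]);
```

Here the shorter column is pushed down by 30px, so its connectors above and below are equally long instead of one long unattached line.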
…tmax

When the DeepSeek V4 hybrid attention drill-down is expanded, show a short
note clarifying that the Local (sliding-window) and Compressed (CSA/HCA)
columns are two KV *sources* unioned into a single shared-KV MQA softmax —
not two separate attentions that get summed — with the attention sink being
a learnable per-head softmax-denominator bias. Prevents the parallel-column
schematic from being read as two independent attention paths.

Shown only while a hybrid attention block is expanded; covered by an e2e
assertion.
…te attention type

The 128-token sliding window is the shared local base of every hybrid layer
(both HCA and CSA extend it; the final layer runs it alone) — not a third
attention type alongside CSA/HCA. Rename the features badge from
"Sliding Window Attention (128 tokens)" to "Sliding window (128 tokens)" so it
reads as a windowing mechanism rather than a standalone attention. The
drill-down's "Sliding Window" KV-source block is unchanged.

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 3c55577.

<span className="font-medium text-foreground">single softmax</span> to the union of
sliding-window + selected compressed keys, with a learnable per-head attention sink.
</p>
)}


Hybrid note ignores block collapse

Low Severity

The hybrid attention caption is shown whenever altAttention0 or altAttention1 stays in expandedBlocks, but collapsing an alternating block does not clear those ids. Users can see the softmax explanation with no matching drill-down in the SVG.
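One possible way to address this, sketched under assumptions: show the caption only when an attention drill-down id and its parent alternating block are both expanded. The ids `altAttention0` and `altAttention1` come from the report above; the parent ids `alt0` and `alt1` and the helper name are hypothetical.

```typescript
// Maps each attention drill-down id to the alternating block that contains it.
// Parent ids are hypothetical placeholders.
const HYBRID_ATTENTION_PARENTS: Record<string, string> = {
  altAttention0: 'alt0',
  altAttention1: 'alt1',
};

// The caption is relevant only if a drill-down is actually visible in the SVG,
// which requires its parent block to be expanded as well.
function shouldShowHybridNote(expandedBlocks: ReadonlySet<string>): boolean {
  return Object.entries(HYBRID_ATTENTION_PARENTS).some(
    ([attentionId, parentId]) =>
      expandedBlocks.has(attentionId) && expandedBlocks.has(parentId),
  );
}

// Stale state from the report: the parent was collapsed, the child id remains.
const staleState = new Set(['altAttention0']);
const visibleState = new Set(['alt0', 'altAttention0']);
```

An alternative would be to remove the child ids from `expandedBlocks` whenever the parent collapses; the check above has the advantage of restoring the drill-down when the parent is re-expanded.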


…ons for V4

Two DeepSeek V4 architectural facts were only feature badges; surface them in
the diagram structure.

Hash-routed layers (num_hash_layers=3): the first 3 MoE layers route by token
id, not a learned gate. They now render as a separate stacked prefix block
(between embedding and the alternating blocks) with a "Hash Router" instead of
"MoE Router". Alternating HCA/CSA counts drop 31/30 → 29/29 so they describe the
learned-router layers (3 + 29 + 29 = 61); drawExpertGrid gains optional
routerLabel / routerSub params.
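The revised layer accounting can be checked with a few lines. The field name `hashRoutedLayers` and the counts are from the commit message; the `LayerPlan` shape and `totalLayers` helper are a simplified, hypothetical stand-in for the real metadata.

```typescript
interface LayerPlan {
  hashRoutedLayers: number; // field name from this PR
  alternating: { label: string; count: number }[]; // simplified, hypothetical shape
}

// Prefix block plus every alternating block should account for all layers.
function totalLayers(plan: LayerPlan): number {
  return plan.hashRoutedLayers + plan.alternating.reduce((sum, spec) => sum + spec.count, 0);
}

const dsv4Plan: LayerPlan = {
  hashRoutedLayers: 3,
  alternating: [
    { label: 'HCA', count: 29 },
    { label: 'CSA', count: 29 },
  ],
};

const dsv4Total = totalLayers(dsv4Plan); // 3 + 29 + 29
```

A unit test asserting this sum against the model's layer count guards against the prefix block and the alternating counts drifting apart.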

mHC (hc_mult=4): residuals are replaced by 4 parallel hyper-connection streams
with learned, Sinkhorn-normalized A/B/C mixing. Residual merges now render as an
"mHC ×N" mixer node instead of a plain "+" when arch.hyperConnections > 1, plus
a caption shown while a block exposing the nodes is expanded. Models without
hyper-connections keep the "+" residual.
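The label switch for residual merges might be expressed as follows. `hyperConnections` is the arch field added in this PR; the helper name `residualNodeLabel` and the minimal argument type are assumptions.

```typescript
// Returns the text for a residual merge node: an mHC mixer label when the
// model uses more than one hyper-connection stream, a plain "+" otherwise.
function residualNodeLabel(arch: { hyperConnections?: number }): string {
  const streams = arch.hyperConnections ?? 1;
  return streams > 1 ? `mHC ×${streams}` : '+';
}

const dsv4ResidualLabel = residualNodeLabel({ hyperConnections: 4 });
const plainResidualLabel = residualNodeLabel({});
```

Treating a missing field as a single stream means existing model entries need no changes to keep the "+" residual.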

Adds arch fields hashRoutedLayers and hyperConnections; unit + e2e coverage.

Two diagram glitches in expanded transformer blocks:

- The incoming arrow stopped at the dashed container border, leaving no line
  above the first RMSNorm. Route each block's incoming arrow to its first
  RMSNorm when the block is expanded (through the border), so there is a
  continuous connector; collapsed blocks still target the block top.

- The attention drill-down rect sat flush against the attention block's bottom
  border, reading as an overlap. Add a small gap (drillGap) between an
  attention block and its expansion flow.

The specs bar showed "6/385", but the always-on shared expert is active too,
so 7 experts run per token (6 routed + 1 shared). Show "6+1/385" for
shared-expert MoE models (e.g. R1 → "8+1/257") so the active count isn't
undersold; the title's "N active" params and the router subtitle already
account for the shared expert.
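The specs-bar format described above reduces to a small formatter. This is a sketch with a hypothetical helper name and parameters; the expected strings are the ones quoted in the commit message.

```typescript
// topK: routed experts selected per token; routed: number of routed experts;
// shared: always-on shared experts (0 for models without one).
function expertsLabel(topK: number, routed: number, shared = 0): string {
  const total = routed + shared;
  return shared > 0 ? `${topK}+${shared}/${total}` : `${topK}/${total}`;
}

const dsv4Experts = expertsLabel(6, 384, 1); // "6+1/385"
const r1Experts = expertsLabel(8, 256, 1); // "8+1/257"
```

Models without a shared expert fall through to the plain "active/total" form, so their specs bar is unchanged.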