feat(inference): add DeepSeek V4 Pro model architecture by Oseltamivir · Pull Request #432 · SemiAnalysisAI/InferenceX-app

Oseltamivir · 2026-06-08T07:57:57Z

Summary

Adds a model-architecture entry for DeepSeek V4 Pro, so the inference tab's "Model Architecture" diagram renders for it like the other models. Sourced from deepseek-ai/DeepSeek-V4-Pro (config.json, inference/model.py, DeepSeek_V4.pdf).

Architecture

MoE 1.6T total / 49B active, 61 layers, hidden 7168, vocab 129280, 1M context
128 query heads / 1 KV head (MLA-lineage shared-KV MQA), head_dim 512
384 routed + 1 shared experts, top-6; expert FFN 3072
Hybrid attention rendered as two interleaved blocks that sum to 61:
- HCA — Heavily Compressed Attention (31 layers)
- CSA — Compressed Sparse Attention, lightning indexer top-1024 (30 layers)
Sliding Window Attention (128 tokens) + learnable attention sink on both variants
Features: mHC, sqrt-softplus routing, aux-loss-free balancing, hash routing (first 3 layers), MTP, YaRN 1M, FP4 experts + FP8, Muon
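As a rough illustration of how the figures above might sit in a static metadata entry, here is a minimal sketch. Only `attentionType` and `slidingWindow` are names that appear in this PR; the interface names and every other field are hypothetical and do not reflect the repo's actual `MODEL_ARCHITECTURES` type.

```typescript
// Hypothetical shapes for illustration only. `attentionType` and `slidingWindow`
// are named in this PR; all other identifiers are assumptions.
interface AlternatingLayerSpecSketch {
  label: string;
  count: number;
  slidingWindow?: number; // window size in tokens, when the variant uses one
}

interface ArchSketch {
  name: string;
  totalLayers: number;
  attentionType: string;
  routedExperts: number;
  sharedExperts: number;
  activeRoutedExperts: number;
  alternating: AlternatingLayerSpecSketch[];
}

// Values copied from the Architecture list in this description.
const deepseekV4ProSketch: ArchSketch = {
  name: "DeepSeek V4 Pro",
  totalLayers: 61,
  attentionType: "Hybrid",
  routedExperts: 384,
  sharedExperts: 1,
  activeRoutedExperts: 6,
  alternating: [
    { label: "HCA", count: 31, slidingWindow: 128 },
    { label: "CSA", count: 30, slidingWindow: 128 },
  ],
};

// Sum of the alternating block counts; the description states these sum to 61.
function alternatingLayerTotal(arch: ArchSketch): number {
  return arch.alternating.reduce((sum, spec) => sum + spec.count, 0);
}
```

A unit test along the lines of "counts sum to 61" would then compare `alternatingLayerTotal(deepseekV4ProSketch)` with `deepseekV4ProSketch.totalLayers`.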

Shared-renderer changes (small, justified)

Sliding-window note is now per-spec (AlternatingLayerSpec.slidingWindow) instead of bi === 0, so hybrid models show window=128 on every attention variant. gpt-oss behavior is preserved (its sliding spec carries the window; its full-attention block does not).
Specs-bar "Attention" cell derives from attentionType → shows Hybrid (gpt-oss still reads Sink/Full GQA).
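The per-spec window note can be pictured with a small sketch. It assumes a spec object with an optional `slidingWindow` field (the field name is from this PR); the function name, the `SpecSketch` type, and the example spec lists are hypothetical, and the gpt-oss window value is only an illustrative assumption.

```typescript
// Hypothetical minimal spec; only `slidingWindow` is a name used in this PR.
interface SpecSketch {
  label: string;
  slidingWindow?: number;
}

// Derive the note from the spec itself rather than from the block index,
// so any variant that declares a window gets a note.
function windowNote(spec: SpecSketch): string | null {
  return spec.slidingWindow !== undefined ? `window=${spec.slidingWindow}` : null;
}

// Illustrative inputs (values for gpt-oss are assumed for the example).
const hybridSpecs: SpecSketch[] = [
  { label: "HCA", slidingWindow: 128 },
  { label: "CSA", slidingWindow: 128 },
];
const gptOssSpecs: SpecSketch[] = [
  { label: "Sliding", slidingWindow: 128 },
  { label: "Full" },
];

const hybridNotes = hybridSpecs.map(windowNote);
const gptOssNotes = gptOssSpecs.map(windowNote);
```

With this shape, a full-attention spec that omits `slidingWindow` simply yields no note, which is how the gpt-oss behavior described above would be preserved.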

Overlay (`?unofficialrun=`) support

N/A in the chart-data sense — the diagram is static model metadata keyed by selectedModel, not benchmark/overlay data. It renders for whatever model is selected, including when an unofficial run is loaded; there is no overlay code path to handle.

Tests

Unit (model-architectures.test.ts): field assertions, CSA/HCA specs, counts sum to 61, SWA surfaced in features, MoE summary; also documents the per-spec window on gpt-oss. → 53 passed
E2E (model-architecture.cy.ts): describe block mirroring gpt-oss (Hybrid badge, two alternating blocks, alternating indicator, features incl. sliding window). dsv4 is in the availability fixture, so it's selectable. (written but not executed locally — no dev server/browser here)

Verification

pnpm typecheck ✅ · unit 53 passed ✅ · pnpm lint ✅ · pnpm fmt ✅ (pre-commit hook re-ran lint/format/typecheck — all green)

Note

releaseDate is set to the HF snapshot date (2026-06-08) as a proxy — please correct if the actual public release date differs.

Note

Medium Risk
Large, shared changes to the D3 architecture renderer and expand/collapse layout could affect diagrams for other models; behavior is mostly additive with targeted gpt-oss compatibility fixes.

Overview
Adds DeepSeek V4 Pro to the inference Model Architecture diagram via new static metadata (1.6T MoE, Hybrid CSA/HCA, 3 hash-routed prefix layers, mHC ×4, 1M context) and extends the shared SVG renderer to match.

The diagram now stacks a hash-routed MoE prefix block (Hash Router, token-id routing), then two alternating CSA/HCA blocks with an expandable hybrid attention drill-down (getHybridAttentionSubBlocks: local sliding window + compressed branch → single softmax). Residual adds can render as mHC ×N pills; helper copy explains union-softmax hybrid attention and hyper-connections when relevant blocks are open.

Cross-model polish: sliding-window labels come from AlternatingLayerSpec.slidingWindow (gpt-oss unchanged), specs show Hybrid and 6+1/385 experts when a shared expert exists, plus minor SVG layout (drill gap, centered parallel columns, stroke-based +/- glyphs). Unit and Cypress coverage added for V4 Pro.

Reviewed by Cursor Bugbot for commit 9a60028. Bugbot is set up for automated code reviews on this repo.

Add a MODEL_ARCHITECTURES entry for DeepSeek V4 Pro (1.6T/49B MoE, 61 layers, 1M context) following the existing per-model pattern. Attention is modeled as a Hybrid stack of two interleaved compressed variants, Heavily Compressed Attention (31 layers) and Compressed Sparse Attention (30 layers), each carrying a 128-token sliding-window branch and a learnable attention sink.

Surface sliding-window attention on both alternating blocks: the diagram's window note now derives from a per-spec slidingWindow field instead of the block index, so hybrid models show window=128 on every attention variant (gpt-oss behavior preserved). The specs-bar attention cell now derives from attentionType so it reads "Hybrid" instead of the hardcoded "Sink/Full GQA".

Sourced from deepseek-ai/DeepSeek-V4-Pro (config.json, inference/model.py, DeepSeek_V4.pdf).

vercel · 2026-06-08T07:58:03Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
inferencemax-app	Ready	Preview, Comment	Jun 8, 2026 10:26pm

…ph centering

DeepSeek V4 hybrid attention now drills down: expanding the CSA/HCA attention block reveals the sliding-window branch as its own block alongside the compressed branch (lightning indexer for CSA, heavy compression for HCA), converging at shared-KV MQA + sink + output projection. Gated to attentionType === 'Hybrid' so gpt-oss is unchanged.

Also fixes two diagram issues affecting all models:
- Residual bypass tapped at the RMSNorm's top edge, so its horizontal connector ran across the norm block. Tap from the arrow gap above the norm instead.
- Circle glyphs (+, ×, −) rendered off-center because dominant-baseline: central is unreliable (Safari falls back to the alphabetic baseline). Use dy=0.35em, which centers consistently across browsers and matches central where it already worked.

…yphs

Addresses three rendering issues in the model architecture diagram:
- Hybrid (CSA/HCA) attention now drills down into symmetric 2x2 columns: Local (Sliding Window + Attention Sink) beside the two-stage Compressed branch (compression + selector). This removes the lonely long connector that made the expanded box look unbalanced and promotes the attention sink to an explicit block; the merge block is now plain "Shared-KV MQA".
- The +, -, and x symbols inside merge/expand circles are drawn as geometric strokes instead of <text>. Font baseline drift (even with dy tuning) left the glyph sitting slightly low, which also made the residual bypass line read as misaligned with the "+". The strokes are centered on the circle's center, so the residual line is now co-linear with the arm.
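The stroke-based glyph idea can be sketched as plain geometry: each symbol becomes one or two line segments centered on the circle's center, so no font baseline is involved. The type and function names below are hypothetical and are not the renderer's actual helpers; the sketch only shows why a "+" drawn this way has its horizontal arm exactly on the center line.

```typescript
// Hypothetical helper: compute line segments for a glyph centered at (cx, cy).
type Glyph = "plus" | "minus" | "times";

interface Segment {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

function glyphSegments(glyph: Glyph, cx: number, cy: number, arm: number): Segment[] {
  const horizontal: Segment = { x1: cx - arm, y1: cy, x2: cx + arm, y2: cy };
  const vertical: Segment = { x1: cx, y1: cy - arm, x2: cx, y2: cy + arm };
  if (glyph === "minus") {
    return [horizontal];
  }
  if (glyph === "plus") {
    return [horizontal, vertical];
  }
  // "times": two diagonals with the same stroke length as the plus arms.
  const d = arm * Math.SQRT1_2;
  return [
    { x1: cx - d, y1: cy - d, x2: cx + d, y2: cy + d },
    { x1: cx - d, y1: cy + d, x2: cx + d, y2: cy - d },
  ];
}
```

Each segment could then be emitted as an SVG `<line>` with a round stroke cap; a residual connector arriving at `y = cy` lines up with the horizontal arm by construction.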

The sliding-window and compressed branches are two KV *sources* whose selected indices are unioned into a single shared-KV MQA softmax, not two attentions merged after the fact. The attention sink is a per-head learnable softmax-denominator bias on that MQA (model.py attn_sink / kernel.py sum_exp += exp(attn_sink - max)), not literal "first tokens" in the local branch.

- Local branch is just the sliding-window source (one block); the sink moves back onto the merge block as "Shared-KV MQA + Sink".
- CSA compressed branch = Token Compression -> Lightning Indexer (2 stages); HCA = a single Heavy Compression source. This makes CSA a 1-vs-2 split again.
- Center each column within the shared column area in drawParallelFlow so an unequal split reads as an intentional branch merge instead of leaving the shorter column's connector dangling as a long unattached line. Also improves the 2-vs-1 SwiGLU expert merge.
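To make the "single softmax over a union, with a sink in the denominator" point concrete, here is a small numeric sketch for one query and one head. It is not the model's kernel: the function name and the array-of-scores representation are assumptions, the two score lists are assumed to refer to disjoint key indices, and at least one key is assumed to be present. Only the sink term `exp(attn_sink - max)` added to the denominator follows the commit message.

```typescript
// Sketch: attention weights for one query over the union of two KV sources,
// with a learnable sink that contributes only to the softmax denominator.
function unionSoftmaxWithSink(
  windowScores: number[],
  compressedScores: number[],
  attnSink: number,
): number[] {
  // One score list: the sources are unioned before the softmax, not after.
  const scores = [...windowScores, ...compressedScores];
  const max = Math.max(...scores); // subtract the max for numerical stability
  const exps = scores.map((s) => Math.exp(s - max));
  let sumExp = exps.reduce((acc, e) => acc + e, 0);
  sumExp += Math.exp(attnSink - max); // sink adds mass to the denominator only
  return exps.map((e) => e / sumExp);
}

// Two window keys and one compressed key, all with score 0, sink logit 0.
const weights = unionSoftmaxWithSink([0, 0], [0], 0);
const totalWeight = weights.reduce((acc, w) => acc + w, 0);
```

Because the sink has no value vector, the returned weights sum to less than 1; in this example each key gets 0.25 and the remaining 0.25 is absorbed by the sink.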

…tmax

When the DeepSeek V4 hybrid attention drill-down is expanded, show a short note clarifying that the Local (sliding-window) and Compressed (CSA/HCA) columns are two KV *sources* unioned into a single shared-KV MQA softmax, not two separate attentions that get summed, with the attention sink being a learnable per-head softmax-denominator bias. Prevents the parallel-column schematic from being read as two independent attention paths. Shown only while a hybrid attention block is expanded; covered by an e2e assertion.

…te attention type

The 128-token sliding window is the shared local base of every hybrid layer (both HCA and CSA extend it; the final layer runs it alone), not a third attention type alongside CSA/HCA. Rename the features badge from "Sliding Window Attention (128 tokens)" to "Sliding window (128 tokens)" so it reads as a windowing mechanism rather than a standalone attention. The drill-down's "Sliding Window" KV-source block is unchanged.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 3c55577.

cursor · 2026-06-08T18:45:43Z

+                <span className="font-medium text-foreground">single softmax</span> to the union of
+                sliding-window + selected compressed keys, with a learnable per-head attention sink.
+              </p>
+            )}


Hybrid note ignores block collapse

Low Severity

The hybrid attention caption is shown whenever altAttention0 or altAttention1 stays in expandedBlocks, but collapsing an alternating block does not clear those ids. Users can see the softmax explanation with no matching drill-down in the SVG.
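One possible way to address this, sketched under assumptions: show the caption only when an attention drill-down id is expanded and its enclosing alternating block is expanded too. The ids `altAttention0` and `altAttention1` come from the report above; the parent ids, the `Set`-based state, and the function name are hypothetical and may not match the component's actual state shape.

```typescript
// Hypothetical mapping from attention drill-down id to its parent block id.
// `altAttention0` / `altAttention1` are from the review; parent ids are assumed.
const HYBRID_ATTENTION_PARENTS: Record<string, string> = {
  altAttention0: "altBlock0",
  altAttention1: "altBlock1",
};

// The note is visible only if some attention drill-down is open AND the
// alternating block that contains it is still open.
function shouldShowHybridNote(expandedBlocks: ReadonlySet<string>): boolean {
  return Object.entries(HYBRID_ATTENTION_PARENTS).some(
    ([attentionId, parentId]) =>
      expandedBlocks.has(attentionId) && expandedBlocks.has(parentId),
  );
}
```

An alternative with the same effect would be to clear the child attention ids from `expandedBlocks` whenever the parent block is collapsed; which option fits better depends on whether the drill-down state should be remembered on re-expand.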

…ons for V4

Two DeepSeek V4 architectural facts were only feature badges; surface them in the diagram structure.

Hash-routed layers (num_hash_layers=3): the first 3 MoE layers route by token id, not a learned gate. They now render as a separate stacked prefix block (between embedding and the alternating blocks) with a "Hash Router" instead of "MoE Router". Alternating HCA/CSA counts drop 31/30 → 29/29 so they describe the learned-router layers (3 + 29 + 29 = 61); drawExpertGrid gains optional routerLabel / routerSub params.

mHC (hc_mult=4): residuals are replaced by 4 parallel hyper-connection streams with learned, Sinkhorn-normalized A/B/C mixing. Residual merges now render as an "mHC ×N" mixer node instead of a plain "+" when arch.hyperConnections > 1, plus a caption shown while a block exposing the nodes is expanded. Models without hyper-connections keep the "+" residual.

Adds arch fields hashRoutedLayers and hyperConnections; unit + e2e coverage.
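The two rules in this commit can be illustrated with a short sketch. The field names `hashRoutedLayers` and `hyperConnections` are from the commit message; the function names and the idea of passing the values as plain arguments are assumptions made for the example.

```typescript
// Label for a residual merge node: "mHC ×N" when more than one
// hyper-connection stream is configured, otherwise a plain "+".
function residualMergeLabel(hyperConnections: number | undefined): string {
  return hyperConnections !== undefined && hyperConnections > 1
    ? `mHC ×${hyperConnections}`
    : "+";
}

// Total layer count as the hash-routed prefix plus the alternating blocks.
function totalLayerCount(hashRoutedLayers: number, alternatingCounts: number[]): number {
  return alternatingCounts.reduce((sum, count) => sum + count, hashRoutedLayers);
}

// DeepSeek V4 Pro values from the commit message: 3 + 29 + 29.
const v4ProLayers = totalLayerCount(3, [29, 29]);
const v4ProResidual = residualMergeLabel(4);
const plainResidual = residualMergeLabel(undefined);
```

Treating a missing or single-stream value as the plain "+" case keeps models without hyper-connections unchanged, matching the behavior described above.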

Two diagram glitches in expanded transformer blocks:
- The incoming arrow stopped at the dashed container border, leaving no line above the first RMSNorm. Route each block's incoming arrow to its first RMSNorm when the block is expanded (through the border), so there is a continuous connector; collapsed blocks still target the block top.
- The attention drill-down rect sat flush against the attention block's bottom border, reading as an overlap. Add a small gap (drillGap) between an attention block and its expansion flow.

The specs bar showed "6/385", but the always-on shared expert is active too, so 7 experts run per token (6 routed + 1 shared). Show "6+1/385" for shared-expert MoE models (e.g. R1 → "8+1/257") so the active count isn't undersold; the title's "N active" params and the router subtitle already account for the shared expert.
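The label format described here is simple enough to sketch directly. The function and parameter names are hypothetical; the two expected strings are the ones quoted in the commit message, and the R1 inputs (8 active of 256 routed, plus 1 shared) are inferred from its "8+1/257" example.

```typescript
// Format the specs-bar experts cell. When a model has shared (always-on)
// experts, show them explicitly as "active+shared/total".
function expertsLabel(activeRouted: number, routedExperts: number, sharedExperts: number): string {
  const total = routedExperts + sharedExperts;
  return sharedExperts > 0
    ? `${activeRouted}+${sharedExperts}/${total}`
    : `${activeRouted}/${total}`;
}

const v4ProExperts = expertsLabel(6, 384, 1); // "6+1/385"
const r1Experts = expertsLabel(8, 256, 1); // "8+1/257"
```

Models without a shared expert fall through to the plain "active/total" form, so their cell text does not change.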

Oseltamivir requested a review from adibarra as a code owner June 8, 2026 07:57

vercel Bot deployed to Preview June 8, 2026 07:58 View deployment

vercel Bot deployed to Preview June 8, 2026 08:21 View deployment

vercel Bot deployed to Preview June 8, 2026 08:55 View deployment

vercel Bot deployed to Preview June 8, 2026 18:16 View deployment

vercel Bot deployed to Preview June 8, 2026 18:27 View deployment

vercel Bot deployed to Preview June 8, 2026 18:44 View deployment

cursor Bot reviewed Jun 8, 2026

vercel Bot deployed to Preview June 8, 2026 21:27 View deployment

vercel Bot deployed to Preview June 8, 2026 21:41 View deployment

vercel Bot deployed to Preview June 8, 2026 22:26 View deployment

Oseltamivir wants to merge 9 commits into master from feat/dsv4-pro-model-architecture
