2 changes: 1 addition & 1 deletion .github/configs/nvidia-master.yaml
@@ -3000,7 +3000,7 @@ dsv4-fp8-h200-sglang-mtp:
 # layouts on 4 allocated GPUs.
 dsv4-fp4-b300-vllm:
 image: vllm/vllm-openai:v0.22.0
-model: deepseek-ai/DeepSeek-V4-Pro
+model: nvidia/DeepSeek-V4-Pro-NVFP4
 model-prefix: dsv4
 runner: b300
 precision: fp4
7 changes: 4 additions & 3 deletions benchmarks/single_node/fixed_seq_len/dsv4_fp4_b300_vllm.sh
@@ -53,9 +53,9 @@ if [ "${EP_SIZE:-1}" -gt 1 ]; then
 fi

 MOE_ARGS=()
-if [ "${DP_ATTENTION}" = "true" ]; then
-MOE_ARGS=(--moe-backend deep_gemm_mega_moe)
-fi
+# if [ "${DP_ATTENTION}" = "true" ]; then
+# MOE_ARGS=(--moe-backend deep_gemm_mega_moe)
+# fi

DP-attn megamoe backend disabled

Medium Severity

With DP_ATTENTION=true, the script no longer passes --moe-backend deep_gemm_mega_moe, but dsv4-fp4-b300-vllm still schedules high-concurrency dp-attn/ep points. That diverges from the prior B300 pareto recipe and from dsv4_fp4_b300_vllm_mtp.sh / B200 vLLM siblings, so those runs may not match the intended serving path.

Reviewed by Cursor Bugbot for commit 41630f2.
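
One way to address the concern above without commenting the block out is to gate the backend behind an explicit switch, so dp-attn runs can still opt in. The following is a minimal sketch, not the PR's code: `USE_MEGA_MOE` and `build_moe_args` are hypothetical names introduced for illustration, and only `--moe-backend deep_gemm_mega_moe` and `DP_ATTENTION` come from the script under review.

```shell
#!/usr/bin/env bash
# Sketch only: USE_MEGA_MOE and build_moe_args are hypothetical, not part of the PR.

# build_moe_args DP_ATTENTION USE_MEGA_MOE
# Sets the global MOE_ARGS array. The backend flag is added only when
# DP attention is on AND the caller explicitly opts in.
build_moe_args() {
    local dp_attention="${1:-false}"
    local use_mega_moe="${2:-false}"
    MOE_ARGS=()
    if [ "$dp_attention" = "true" ] && [ "$use_mega_moe" = "true" ]; then
        MOE_ARGS=(--moe-backend deep_gemm_mega_moe)
    fi
}

# Default to the PR's new behavior (backend off) unless the environment opts in.
build_moe_args "${DP_ATTENTION:-false}" "${USE_MEGA_MOE:-false}"

# Log the result so the chosen serving path is visible in the benchmark output.
echo "MOE_ARGS: ${MOE_ARGS[*]:-<none>}"
```

With a switch like this, the default matches the PR (no MoE backend flag), while a sweep that needs the prior B300 recipe could export `USE_MEGA_MOE=true` alongside `DP_ATTENTION=true`.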


 if [ "${DP_ATTENTION}" = "true" ]; then
 MAX_NUM_BATCHED_TOKENS=2048
@@ -92,6 +92,7 @@ vllm serve "$MODEL_PATH" --served-model-name "$MODEL" --host 0.0.0.0 --port "$PO
 --tool-call-parser deepseek_v4 \
 --enable-auto-tool-choice \
 --reasoning-parser deepseek_v4 \
+--gpu-memory-utilization 0.97 \
 --max-cudagraph-capture-size 2048 \
 --max-model-len "$SERVE_MAX_MODEL_LEN" \
 --max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS" > "$SERVER_LOG" 2>&1 &
6 changes: 6 additions & 0 deletions perf-changelog.yaml
@@ -3531,3 +3531,9 @@
 - "The Rust frontend replaces only the Python serving/API layer (HTTP, tokenization, scheduling glue, detokenization) and spawns the same Python EngineCore, so GPU kernels/attention/MoE GEMM/KV cache are untouched"
 - "A/B sweep (28 single-node points, 1k1k + 8k1k, TP 1/2/4) vs the Python-frontend baseline (run 26696260751): throughput Pareto-neutral (peak tok/s/GPU within <1.5%, frontiers coincident) and TPOT flat (+-0.5%); TTFT improves ~8% at 1k1k and ~22% at 8k1k (every point), the expected signature of lower frontend CPU latency before first token, scaling with input length"
 pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1634
+
+- config-keys:
+- dsv4-fp4-b300-vllm
+description:
+- "Update B300 dsv4 image to nvfp4"

🟡 WARNING: Misleading description — says "image" but only the model field changed

Why it matters: The perf-changelog is the canonical record for tracing config changes. "image" in this file consistently refers to the Docker container image (e.g., vllm/vllm-openai:v0.22.0), which is unchanged here. The actual change is the model checkpoint swap to nvidia/DeepSeek-V4-Pro-NVFP4, plus the --gpu-memory-utilization 0.97 addition and MoE backend change. This was flagged in the earlier review (the PR-link was fixed but the description was not).

Fix:

Suggested change
- - "Update B300 dsv4 image to nvfp4"
+ - "Switch B300 dsv4 model to nvidia/DeepSeek-V4-Pro-NVFP4 checkpoint; bump gpu-memory-utilization to 0.97; disable deep_gemm_mega_moe backend for dp-attn"

+pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1656
1 change: 1 addition & 0 deletions runners/launch_b300-nv.sh
@@ -314,6 +314,7 @@ else
 DeepSeek-R1-0528-NVFP4-v2
 DeepSeek-V4-Flash
 DeepSeek-V4-Pro
+DeepSeek-V4-Pro-NVFP4
 GLM-5-FP8
 GLM-5-NVFP4
 GLM-5.1