16 commits
4bcefb7
Update DSv4 TRT image for B200/B300 (non-MTP) to feat-deepseek_v4-2dd…
Oseltamivir Jun 1, 2026
6b7558c
Backfill PR number in changelog pr-link
Oseltamivir Jun 1, 2026
bd3c94c
Merge branch 'main' into update-dsv4-trt-image-2dd03e6
Oseltamivir Jun 1, 2026
f441f9f
Try official TRT-LLM release image 1.3.0rc15.post1 for DSv4 B200/B300…
Oseltamivir Jun 1, 2026
4bc5592
Revert to custom feat/deepseek_v4-2dd03e6 image for DSv4 B200/B300 (n…
Oseltamivir Jun 2, 2026
1b0afeb
Merge branch 'main' into update-dsv4-trt-image-2dd03e6
Oseltamivir Jun 2, 2026
14a1bb3
Point DSv4 B200/B300 TRT (non-MTP) at the SWA-scratch-fix image
Oseltamivir Jun 2, 2026
242ab88
Merge remote-tracking branch 'origin/main' into update-dsv4-trt-image…
Oseltamivir Jun 2, 2026
e23a541
Revert DSv4 B200/B300 TRT (non-MTP) to 2dd03e6 + disable SWA scratch …
Oseltamivir Jun 2, 2026
6118a76
Merge remote-tracking branch 'origin/main' into update-dsv4-trt-image…
Oseltamivir Jun 3, 2026
5adfeb3
Scope DSv4 TRT non-MTP change to B300 only
Oseltamivir Jun 3, 2026
ad529fb
Use official TRT-LLM image (1.3.0rc15.post1) for DSv4 B300 TRT (non-M…
Oseltamivir Jun 3, 2026
c2381b7
Re-add TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 on rc15.post1 to test D…
Oseltamivir Jun 5, 2026
b09619e
Merge branch 'main' into update-dsv4-trt-image-2dd03e6
Oseltamivir Jun 5, 2026
65bddbb
Re-enable SWA scratch reuse (confirmed not the cause of rc15.post1 DP…
Oseltamivir Jun 5, 2026
285b79a
Merge branch 'main' into update-dsv4-trt-image-2dd03e6
Oseltamivir Jun 5, 2026
4 changes: 2 additions & 2 deletions .github/configs/nvidia-master.yaml
@@ -3049,7 +3049,7 @@ dsv4-fp4-b300-vllm-agentic:
- { tp: 8, ep: 8, dp-attn: true, offloading: cpu, conc-list: [128, 256, 512] }

dsv4-fp4-b300-trt:
-  image: ghcr.io#semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-9aa3715
+  image: nvcr.io#nvidia/tensorrt-llm/release:1.3.0rc15.post1
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: b300
@@ -3072,7 +3072,7 @@ dsv4-fp4-b300-trt:
- { tp: 8, ep: 8, dp-attn: true, conc-start: 256, conc-end: 1024 }

dsv4-fp4-b300-trt-mtp:
-  image: ghcr.io#semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-9aa3715
+  image: nvcr.io#nvidia/tensorrt-llm/release:1.3.0rc15.post1
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: b300
5 changes: 5 additions & 0 deletions benchmarks/single_node/fixed_seq_len/dsv4_fp4_b300_trt.sh
@@ -59,6 +59,11 @@ sanitize_slurm_mpi_env_for_trtllm
export NCCL_NVLS_ENABLE="${NCCL_NVLS_ENABLE:-0}"
echo "NCCL_NVLS_ENABLE: $NCCL_NVLS_ENABLE"

# Disable DSv4 SWA scratch reuse to test whether the rc15.post1 DPA crash
# is the same SWA-scratch bug or a separate FMHA kernel issue.
export TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE="${TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE:-1}"

SWA default contradicts disable comment

Medium Severity

The new launcher comments say DSv4 SWA scratch reuse is disabled to isolate the rc15.post1 DPA crash, but TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE defaults to 1 when unset, which keeps scratch reuse enabled. Sweep runs without an override therefore do not match the stated experiment.


Reviewed by Cursor Bugbot for commit 65bddbb.
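
The mismatch described in this comment comes down to how the shell's `:-` expansion picks a fallback. A minimal sketch, assuming a POSIX-compatible shell; `resolve_swa_reuse` is a hypothetical helper that is not part of the launcher and only mirrors the export line from the diff:

```shell
# Hypothetical helper (not in the launcher): report the value the export
# line from the diff would produce for the current environment.
resolve_swa_reuse() {
  # Subshell so the caller's environment is left untouched.
  (
    export TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE="${TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE:-1}"
    echo "$TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE"
  )
}

unset TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE

# Variable unset: the ":-1" fallback applies.
unset_result="$(resolve_swa_reuse)"

# Variable explicitly set to 0: the override wins.
override_result="$(TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 resolve_swa_reuse)"

# Variable set but empty: ":-" treats empty like unset, so the fallback applies.
empty_result="$(TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE= resolve_swa_reuse)"

echo "unset=$unset_result override=$override_result empty=$empty_result"
```

With the `:-1` fallback shown in the diff, only an explicit `TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0` in the environment turns reuse off; an unset or empty variable resolves to `1`.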

echo "TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE: $TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE"

nvidia-smi

SERVER_LOG="$PWD/server.log"
5 changes: 5 additions & 0 deletions benchmarks/single_node/fixed_seq_len/dsv4_fp4_b300_trt_mtp.sh
@@ -58,6 +58,11 @@ sanitize_slurm_mpi_env_for_trtllm
export NCCL_NVLS_ENABLE="${NCCL_NVLS_ENABLE:-0}"
echo "NCCL_NVLS_ENABLE: $NCCL_NVLS_ENABLE"

# Disable DSv4 SWA scratch reuse to test whether the rc15.post1 DPA crash
# is the same SWA-scratch bug or a separate FMHA kernel issue.
export TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE="${TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE:-1}"
echo "TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE: $TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE"

nvidia-smi

SERVER_LOG="$PWD/server.log"
14 changes: 14 additions & 0 deletions perf-changelog.yaml
@@ -3387,6 +3387,13 @@
- "Add MTP speculative-decoding sibling for dsv4-fp4-mi355x-vllm (model: deepseek-ai/DeepSeek-V4-Pro) on vllm/vllm-openai-rocm:v0.22.0, per vllm-project/vllm#43385"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1630

- config-keys:
- dsv4-fp4-b300-trt
- dsv4-fp4-b300-trt-mtp
description:
- "Point the B300 TensorRT-LLM DeepSeek-V4-Pro configs (non-MTP dsv4-fp4-b300-trt and MTP dsv4-fp4-b300-trt-mtp) at the official NVIDIA release image nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15.post1, replacing the custom ghcr.io semianalysis feat/deepseek_v4 builds (2dd03e6 and 9aa3715 respectively), to evaluate the official RC for DeepSeek-V4-Pro. Also drops the TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 launcher workaround (specific to the custom build) so the official image runs with its native behavior. B200 TRT is unchanged (stays on feat-deepseek_v4-9aa3715)."
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1636

Duplicate PR changelog entries

Low Severity

This commit adds two separate perf-changelog.yaml blocks for the same PR link and the same dsv4-fp4-b300-trt / dsv4-fp4-b300-trt-mtp config keys, with conflicting descriptions. That duplicates maintenance and leaves readers unsure which entry reflects the shipped change.


Reviewed by Cursor Bugbot for commit c2381b7.

cursor[bot] marked this conversation as resolved.
Comment on lines +3390 to +3395

🟡 Duplicate and contradictory perf-changelog entries for PR #1636: two entries cover the same config-keys (dsv4-fp4-b300-trt and dsv4-fp4-b300-trt-mtp) with contradictory descriptions. The first entry (lines 3390-3395) claims the PR "drops the TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 launcher workaround ... so the official image runs with its native behavior", but the same PR actually ADDS that export with default 0 to both launchers (dsv4_fp4_b300_trt.sh L62-65 and dsv4_fp4_b300_trt_mtp.sh L61-64). The second entry (lines 3455-3460) is correct; please remove the first (stale) entry so the changelog has a single, accurate source of truth.

Extended reasoning...

What the bug is

perf-changelog.yaml ends up with two distinct entries for the same PR (#1636) and the same config-keys (dsv4-fp4-b300-trt, dsv4-fp4-b300-trt-mtp), and the two entries say opposite things about whether the TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 launcher workaround is shipped.

  • Entry A (perf-changelog.yaml:3390-3395): "Point the B300 TensorRT-LLM DeepSeek-V4-Pro configs ... at the official NVIDIA release image ... Also drops the TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 launcher workaround (specific to the custom build) so the official image runs with its native behavior. B200 TRT is unchanged."
  • Entry B (perf-changelog.yaml:3455-3460): "Switch B300 DSv4 TRT (non-MTP + MTP) to official rc15.post1 image with TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 to test whether DPA crash is same SWA-scratch bug".

Why entry A is wrong

The same PR diff adds the export to both launcher scripts with default 0:

# benchmarks/single_node/fixed_seq_len/dsv4_fp4_b300_trt.sh:62-65
export TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE="${TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE:-0}"
echo "TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE: $TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE"

# benchmarks/single_node/fixed_seq_len/dsv4_fp4_b300_trt_mtp.sh:61-64 (identical)

Because the parameter expansion ${VAR:-0} defaults to 0 when unset, sweeps using the official rc15.post1 image still disable SWA scratch reuse — the workaround is kept, not dropped. The recent commit c2381b7 ("Re-add TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 on rc15.post1 to test DPA crash") confirms this is the intended shipped behavior, matching Entry B and contradicting Entry A.
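
Since the argument above rests on the literal fallback in the launcher, one way to confirm what a given checkout ships is to read it out of the file. A rough sketch, not tooling from this repository: `launcher_default` is a hypothetical helper, and the temporary file merely stands in for a launcher script. Its fallback is `1`, as in the diff shown on this page, whereas the snippet quoted in this comment has `0`.

```shell
# Stand-in for a launcher script (sample content, not the real file).
launcher="$(mktemp)"
cat > "$launcher" <<'EOF'
export NCCL_NVLS_ENABLE="${NCCL_NVLS_ENABLE:-0}"
export TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE="${TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE:-1}"
EOF

# Hypothetical helper: print the fallback written after ":-" for variable $2
# in file $1, assuming the 'export VAR="${VAR:-default}"' form used above.
launcher_default() {
  awk -v var="$2" '
    index($0, "export " var "=") == 1 {
      marker = "${" var ":-"
      start = index($0, marker)
      if (start == 0) next
      rest = substr($0, start + length(marker))
      print substr(rest, 1, index(rest, "}") - 1)
    }' "$1"
}

swa_default="$(launcher_default "$launcher" TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE)"
nvls_default="$(launcher_default "$launcher" NCCL_NVLS_ENABLE)"
echo "SWA scratch reuse default: $swa_default"
echo "NVLS default: $nvls_default"

rm -f "$launcher"
```

Comparing that value against the changelog wording is what decides whether an entry describes the workaround as kept or dropped.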

Impact

Doc-only — the runtime behavior is unambiguously the one described in Entry B (the workaround is on). But the changelog is the project's audit trail: any human or tool that consumes it to ask "what did PR #1636 ship for dsv4-fp4-b300-trt?" now gets two contradictory answers for the same config-keys. Future debugging of the rc15.post1 DPA crash will be harder because the changelog suggests the workaround was removed when it actually was retained — exactly the opposite of the diagnostic intent.

Fix

Delete Entry A (perf-changelog.yaml:3389-3395). Entry B already covers the same configs with the accurate description. Independently flagged by Cursor Bugbot in two comments on this PR: Duplicate PR changelog entries (Low) and Changelog contradicts launcher workaround (Medium).

Step-by-step proof

  1. Open perf-changelog.yaml, scroll to line 3389. Entry A lists pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1636 with config-keys: [dsv4-fp4-b300-trt, dsv4-fp4-b300-trt-mtp].
  2. Scroll to line 3454. Entry B lists the same pr-link and the same config-keys.
  3. Read Entry A's description — it says the SWA workaround is dropped.
  4. Read Entry B's description — it says the SWA workaround is set to 0 to test DPA crash.
  5. Open benchmarks/single_node/fixed_seq_len/dsv4_fp4_b300_trt.sh at line 62. Observe export TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE="${TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE:-0}" — defaults to 0.
  6. Same in dsv4_fp4_b300_trt_mtp.sh at line 61.
  7. Therefore the shipped behavior matches Entry B; Entry A is stale and should be removed.
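
The first two steps of the proof can be approximated mechanically. The sketch below is an illustration under stated assumptions rather than project tooling: `find_duplicate_pr_links` is a hypothetical helper, the sample file uses made-up keys and URLs, and it assumes each entry carries exactly one `pr-link:` line, as in the entries quoted above.

```shell
# Made-up sample that mimics the changelog layout (keys and URLs are placeholders).
changelog="$(mktemp)"
cat > "$changelog" <<'EOF'
- config-keys:
    - config-a
    - config-b
  description:
    - "First description"
  pr-link: https://example.invalid/pull/1

- config-keys:
    - config-c
  description:
    - "Unrelated entry"
  pr-link: https://example.invalid/pull/2

- config-keys:
    - config-a
    - config-b
  description:
    - "Second, conflicting description"
  pr-link: https://example.invalid/pull/1
EOF

# Hypothetical helper: print "<count> <url>" for every pr-link that appears
# more than once in the file given as $1.
find_duplicate_pr_links() {
  awk '$1 == "pr-link:" { count[$2]++ }
       END { for (link in count) if (count[link] > 1) print count[link], link }' "$1" | sort
}

duplicates="$(find_duplicate_pr_links "$changelog")"
echo "$duplicates"

rm -f "$changelog"
```

A repeated link only marks a candidate for review, since one PR may legitimately touch unrelated config keys in separate entries; the descriptions still have to be compared by hand.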


- config-keys:
- dsv4-fp4-mi355x-sglang-mtp
description:
@@ -3445,6 +3452,13 @@
- "Add 1k1k/8k1k minimax recipe set under benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5/"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1641

- config-keys:
- dsv4-fp4-b300-trt
- dsv4-fp4-b300-trt-mtp
description:
- "Switch B300 DSv4 TRT (non-MTP + MTP) to official rc15.post1 image with TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 to test whether DPA crash is same SWA-scratch bug"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1636

- config-keys:
- dsv4-fp4-b200-vllm
description: