-
Notifications
You must be signed in to change notification settings - Fork 189
fix(power): classify zero-decode-GPU multinode runs as aggregated #1646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
arygupt
wants to merge
1
commit into
main
Choose a base branch
from
feat/multinode-agg-detect
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔴 In the new
is_aggbranch, all eight detail env vars are read tolerantly withos.environ.get(..., '0'/''), includingPREFILL_NUM_WORKERS/TP/EP/DP_ATTN. This contradicts both the PR description ("the DECODE_* detail vars are read tolerantly") and the inline comment one line above ("the DECODE_* detail vars may be absent"). An agg multinode run still has a colocated prefill pool (line 105 enforcesprefill_gpus > 0), so the PREFILL_* topology vars should remain strictly required — otherwise a launcher that forgets to exportPREFILL_TPwill silently emitprefill_tp=0into the agg JSON consumed by downstream ETL, masking a real misconfiguration. Fix: in theis_aggbranch, callget_required_env_vars(['PREFILL_NUM_WORKERS','PREFILL_TP','PREFILL_EP','PREFILL_DP_ATTN'])and only default the DECODE_* vars tolerantly.Extended reasoning...
What the bug is
The new
is_aggbranch inutils/process_result.py(lines 74-83) reads all eight detail env vars tolerantly:This is broader than what the PR description and the inline comment justify. Both state the intent is only to tolerate the DECODE_* vars:
Why the existing code does not prevent it
The agg branch only enforces topology (
PREFILL_GPUS,DECODE_GPUS) viaget_required_env_vars. Lines 105-106 then enforceprefill_gpus > 0— so by the time we useprefill_tp/prefill_ep/etc., we know there is a prefill pool whose parallelism is topologically meaningful. Yet the PREFILL_* detail vars describing that pool are silently defaulted to 0/empty if missing.The new test
test_multinode_agg_detected_when_decode_gpus_zeroonlypop()s the DECODE_* keys (lines 432-433) and keeps all PREFILL_* keys set, so it neither verifies the strict-PREFILL contract nor exercises the over-broad tolerance.Impact
For an agg run that forgets to export
PREFILL_TP(or any other PREFILL_* detail var), the script produces an agg JSON withprefill_tp=0(or emptyprefill_dp_attention) instead of failing fast like the disagg branch would. Since the agg JSON is consumed by downstream ETL, a 0 silently passes through as valid data — a data-quality regression introduced by this PR.Step-by-step proof
Launcher misconfiguration scenario for an aggregated multinode run:
IS_MULTINODE=true,PREFILL_GPUS=16,DECODE_GPUS=0→ exported correctly.PREFILL_TPis forgotten by the launcher (not exported).PREFILL_EP,PREFILL_NUM_WORKERS,PREFILL_DP_ATTNare exported correctly.is_agg = (decode_gpus == 0)→ True.is_aggdict at lines 74-83 reads'PREFILL_TP': os.environ.get('PREFILL_TP', '0')→'0'.prefill_tp = int(detail_env['PREFILL_TP'])→0.multi_node_data['prefill_tp'] = 0written toagg_*.json.prefill_tp=0for a 16-GPU prefill pool — silently wrong; no error.Compare to the disagg branch (lines 87-90): the same forgotten PREFILL_TP would have raised
EnvironmentError: Missing required environment variables: PREFILL_TP, failing the run cleanly.Fix
In the
is_aggbranch, keep PREFILL_* strict and only relax DECODE_*:This matches the PR's stated contract, matches the inline comment, and matches what the existing agg test already exercises (it sets all PREFILL_* and only omits DECODE_*).