[OMNIML-4788] specdec_bench/Qwen3.5-4B: throughput_32k benchmark + S3 upload step#1564
[OMNIML-4788] specdec_bench/Qwen3.5-4B: throughput_32k benchmark + S3 upload step#1564ChenhanYu wants to merge 10 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a 32K-context runtime-params YAML, an S3 upload Bash wrapper with dependency bootstrapping, and converts Qwen3.5-4B example pipelines (standard and MTP) into three-task workflows: qualitative, throughput_32k (using the new runtime params), and S3 upload. ChangesSPEED-Bench Throughput and S3 Upload Pipeline
Possibly related PRs
Suggested reviewers
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1564 +/- ##
==========================================
- Coverage 76.88% 75.80% -1.08%
==========================================
Files 478 478
Lines 52209 54277 +2068
==========================================
+ Hits 40140 41144 +1004
- Misses 12069 13133 +1064
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
/claude review |
There was a problem hiding this comment.
Claude review summary
Findings: 1 CRITICAL, 0 IMPORTANT, 1 SUGGESTION
Most impactful finding
[CRITICAL Algorithm] The whole point of runtime_params_throughput_32k.yaml — pinning engine_args.max_model_len = 40960 so vLLM can't silently truncate 32K prompts — does not actually take effect. VLLMModel.__init__ in examples/specdec_bench/specdec_bench/models/vllm.py:79-91 builds AsyncEngineArgs(...) with a hardcoded subset of kwargs and never reads max_model_len out of **kwargs, so the override from runtime_params.engine_args is silently dropped. vLLM continues to auto-derive max_model_len from the model config + GPU memory budget — exactly the failure mode the PR description says this file prevents. task_1 (throughput_32k) will still measure throughput over truncated inputs on memory-tight single-GPU runs.
The fix lives in vllm.py, not in this PR's diff: forward max_model_len (and ideally any engine_args overrides whose key matches an AsyncEngineArgs field) into AsyncEngineArgs(...). Without that fix the new YAML override and the --runtime_params line in both task_1s are dead weight. Inline comment on runtime_params_throughput_32k.yaml:14 has a concrete patch sketch.
The cluster smoke (12257378/79/80) called out in the PR description should catch this once it lands — recommend confirming the dumped serving_config.vllm_config.max_model_len actually reads 40960 in the throughput_32k run dir before merging.
Other note
[SUGGESTION] task_2 (S3 upload) is a CPU-only boto3 file copy but allocates a GPU and uses the heavy vllm/vllm-openai container — non-blocking, just operational efficiency.
Risk assessment
Medium-high. The PR is small and the launcher plumbing is fine, but the headline feature (correct 32K throughput numbers) is broken at the engine layer — anyone consuming the throughput_32k results uploaded to S3 would be reading from silently-truncated runs. Hold on the vLLMModel kwarg-forwarding fix before landing.
| # needed, since its prompts top out at ~8K). | ||
|
|
||
| engine_args: | ||
| max_model_len: 40960 |
There was a problem hiding this comment.
[CRITICAL Algorithm] This max_model_len: 40960 override is silently dropped — vLLM never sees it.
VLLMModel.__init__ in examples/specdec_bench/specdec_bench/models/vllm.py:79-91 constructs AsyncEngineArgs(...) with a hardcoded subset of kwargs (tokenizer, trust_remote_code, tensor_parallel_size, enable_expert_parallel, enable_prefix_caching, speculative_config, max_num_seqs, skip_tokenizer_init, async_scheduling, enforce_eager). The **engine_args from runtime_params is unpacked into VLLMModel.__init__'s **kwargs, but max_model_len is not one of the whitelisted keys read out of that dict — so it's discarded. The comment at vllm.py:164 even acknowledges this ("max_model_len / kv cache / dtype defaults that don't appear in AsyncEngineArgs").
Impact: vLLM keeps auto-deriving max_model_len from the model config + GPU memory budget — exactly the failure mode the PR description ("vLLM's default auto-derivation … can cap max_model_len lower than 36K on a single GPU") claims this file prevents. 32K-token prompts in the throughput_32k split will still be silently truncated on memory-tight single-GPU runs, and the throughput numbers task_1 reports will be measured over truncated inputs. Adding the YAML without the forwarding fix gives a false sense of correctness.
Fix: forward unknown engine_args kwargs (or at minimum max_model_len) to AsyncEngineArgs in VLLMModel.__init__. e.g.
import dataclasses
extra_engine_kwargs = {
k: v for k, v in kwargs.items()
if k in {f.name for f in dataclasses.fields(AsyncEngineArgs)}
and k not in {"model", "tokenizer", "trust_remote_code",
"tensor_parallel_size", "enable_expert_parallel",
"enable_prefix_caching", "speculative_config",
"max_num_seqs", "skip_tokenizer_init",
"async_scheduling", "enforce_eager"}
}
engine_args = AsyncEngineArgs(
model=model_dir,
..., # existing explicit kwargs
**extra_engine_kwargs,
)Without this (or an equivalent explicit max_model_len=kwargs.get("max_model_len")), this YAML and the --runtime_params line in both specdec_bench{,_mtp}.yaml::task_1 are dead weight.
There was a problem hiding this comment.
Addressed in 80309cae01 (this PR). examples/specdec_bench/specdec_bench/models/vllm.py:VLLMModel.__init__ now forwards any kwarg whose name matches an AsyncEngineArgs dataclass field — except the dozen kwargs VLLMModel consumes itself (collected into _VLLM_CONSUMED_KWARGS to prevent got multiple values for keyword argument crashes). So engine_args.max_model_len: 40960 from runtime_params_throughput_32k.yaml now actually reaches AsyncEngineArgs.
Unit tests in tests/examples/specdec_bench/test_vllm_kwargs_forwarding.py:
max_model_lenis forwarded- multiple engine_args fields (
max_model_len+dtype+gpu_memory_utilization) all pass through - consumed kwargs (e.g.
tensor_parallel_size) are NOT double-forwarded (would crashAsyncEngineArgs) - unknown kwargs are silently dropped (matches prior behaviour)
- a pin test asserts the test's copy of
_VLLM_CONSUMED_KWARGSmatches the module's definition
Cluster re-validation: planning to capture dump_env's serving_config.vllm_config.max_model_len from the throughput_32k run dir on the next smoke to confirm it reads 40960.
| nodes: 1 | ||
| ntasks_per_node: 1 | ||
| gpus_per_node: 1 | ||
| container: vllm/vllm-openai:qwen3_5-cu130 |
There was a problem hiding this comment.
[SUGGESTION] task_2 is a CPU-only boto3 file copy but allocates gpus_per_node: 1 and the heavyweight vllm/vllm-openai:qwen3_5-cu130 container.
On a busy GPU partition this can add hours of queue time for a job that does no compute, and it pulls a multi-GB image just to run boto3.upload_file(). If the launcher / cluster supports it, dropping gpus_per_node to 0 (or routing to a CPU partition) and using a small python+boto3 image would tighten the pipeline. Same applies to specdec_bench_mtp.yaml::task_2. Non-blocking — operationally simpler to reuse one container and one slurm template, just calling out the trade-off.
There was a problem hiding this comment.
Addressed in b90e30afc8 (this PR). Both specdec_bench.yaml and specdec_bench_mtp.yaml task_2 now use gpus_per_node: 0. Kept the same vllm/vllm-openai:qwen3_5-cu130 image so the already-pulled cached image is reused across tasks — Slurm will still schedule the 0-GPU job on the default batch partition (no separate cpu partition needed; GPU nodes accept 0-GPU jobs and have spare CPU slots). Switching to a dedicated lightweight python+boto3 image would shave the image-pull cost further but adds a maintenance surface, deferring that to a follow-up if image-pull becomes the bottleneck.
…ith throughput_32k + S3 upload
3-task pipelines now exercise both:
* `qualitative` split (existing quality / acceptance-rate signal)
* `throughput_32k` split (long-context throughput signal)
* Final task uploads `/scratchspace/specdec_bench{,_mtp}` to S3 in sweep
layout so qualitative/ and throughput_32k/ both land under
`s3://team-specdec-workgroup/results/specdec_bench{,_mtp}/<split>/`.
Adds tools/launcher/common/specdec_bench/upload_to_s3.sh — thin wrapper
around examples/specdec_bench/upload_to_s3.py so it can be invoked as a
launcher task. Installs boto3 from examples/specdec_bench/requirements.txt
on cold containers (warm pipelines pick it up from the prior run.sh).
S3 credentials are read from S3_ENDPOINT / S3_KEY_ID / S3_SECRET env
vars; both --skip-existing and --allow-incomplete-provenance are passed
by default so reruns don't fail and runs lacking CONTAINER_IMAGE still
land (the harness Phase-2 work in OMNIML-4788 will populate it).
Save dirs split per benchmark (/scratchspace/<base>/<split>/) so the
upload step sees a proper sweep layout and doesn't overwrite earlier
results.
Signed-off-by: chenhany <chenhany@nvidia.com>
…Bench throughput_32k Without an explicit max_model_len, vLLM auto-derives it from the model config and gpu_memory_utilization. On a single-GPU job with Qwen3.5-4B (declared max_position_embeddings = 128K), the auto-derivation can cap max_model_len well below 36K to fit the KV cache budget, silently truncating 32K-token prompts from SPEED-Bench-Internal/throughput_32k and producing wrong throughput numbers. Add tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml that pins engine_args.max_model_len = 40,960 (32K input + 4K output + 4K headroom). Wire it via --runtime_params on task_1 in both Qwen3.5-4B specdec_bench YAMLs (NONE/baseline + MTP variant). The qualitative task uses the engine default (no override) because its prompts top out around 8K and vLLM's auto-derivation handles that fine. Signed-off-by: chenhany <chenhany@nvidia.com>
09ea467 to
8004bd6
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml (1)
93-98: ⚡ Quick winAvoid reserving a GPU for the S3 upload-only task.
task_2runs the uploader wrapper and doesn’t need GPU compute; reserving one GPU can unnecessarily delay scheduling and consume scarce resources.💡 Proposed tweak
task_2: @@ slurm_config: _factory_: "slurm_factory" nodes: 1 ntasks_per_node: 1 - gpus_per_node: 1 + gpus_per_node: 0 container: vllm/vllm-openai:qwen3_5-cu130🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml` around lines 93 - 98, The slurm_config currently reserves a GPU for all tasks which is unnecessary for the uploader job (task_2); update the configuration so the uploader task does not request GPU resources — either set gpus_per_node: 0 in a dedicated slurm_config for task_2 or remove/override the GPU request in task_2's job spec so task_2 runs without a GPU (leave the container image unchanged); locate the slurm_config block and the task_2 job definition and adjust the GPU request accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml`:
- Around line 93-98: The slurm_config currently reserves a GPU for all tasks
which is unnecessary for the uploader job (task_2); update the configuration so
the uploader task does not request GPU resources — either set gpus_per_node: 0
in a dedicated slurm_config for task_2 or remove/override the GPU request in
task_2's job spec so task_2 runs without a GPU (leave the container image
unchanged); locate the slurm_config block and the task_2 job definition and
adjust the GPU request accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 81cec96c-e9c8-420f-928f-a4d7f179cd58
📒 Files selected for processing (4)
tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yamltools/launcher/common/specdec_bench/upload_to_s3.shtools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yamltools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml
- tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp.yaml
…SPEED-bench
The single-GPU concurrency=1 config takes ~4.4h on the 880-sample qualitative
split, which doesn't fit in the cw_dfw batch partition's 4h MaxTime. Acceptance
length (AL) — the primary metric — is concurrency-independent; only aa_timing
fidelity is sacrificed by increasing concurrency.
task_0 (qualitative): tp=2, concurrency=8, gpus=2 -> ~30-45 min
task_1 (throughput_32k): tp=2, concurrency=4, gpus=2 -> stays capped at
--num_requests 20
for KV-cache safety
at 32K-token prompts
tp_size=2 also doubles the KV-cache budget for task_1, making concurrency>1
feasible on 32K prompts that would OOM a single H100.
Signed-off-by: chenhany <chenhany@nvidia.com>
…oughput_32k to 80 samples @ concurrency=8
Prior config (concurrency=8 on qualitative, --num_requests 20 + concurrency=4 on
throughput_32k) was conservative-tuned for time-budget headroom. With tp_size=2
in place the KV budget is doubled, so we can push concurrency further:
task_0 (qualitative): concurrency 8 -> 32 (still tp_size=2)
task_1 (throughput_32k): concurrency 4 -> 8,
--num_requests 20 -> 80 (still tp_size=2)
AL is concurrency-independent; the bump only sacrifices aa_timing fidelity.
8 * 32K = 256K tokens of in-flight KV stays within the doubled KV budget on
tp_size=2.
Signed-off-by: chenhany <chenhany@nvidia.com>
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml (1)
101-106: 💤 Low valueUpload step requests a GPU it doesn't use.
task_2only runsupload_to_s3.shyet reservesgpus_per_node: 1on a GPU container. On a busy cluster this can delay scheduling and waste an allocated GPU. Ifslurm_factorysupports it, considergpus_per_node: 0(and a lighter image) for the upload step.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml` around lines 101 - 106, task_2's slurm_config is requesting a GPU it doesn't use; locate the slurm_config block for task_2 (look for "task_2", "slurm_config", and the upload_to_s3.sh step) and set gpus_per_node: 0 and switch the container to a lightweight non-GPU image (or remove the container override) so the upload step doesn't reserve GPU resources; keep the _factory_: "slurm_factory" and other scheduling fields unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml`:
- Line 79: The --runtime_params value in
tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml points to the wrong
location; since examples/specdec_bench/run.py opens args.runtime_params directly
(open(args.runtime_params)), update the CLI entry for --runtime_params to the
repository path where the file actually lives
(tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml) so the
32k override can be loaded correctly.
---
Nitpick comments:
In `@tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml`:
- Around line 101-106: task_2's slurm_config is requesting a GPU it doesn't use;
locate the slurm_config block for task_2 (look for "task_2", "slurm_config", and
the upload_to_s3.sh step) and set gpus_per_node: 0 and switch the container to a
lightweight non-GPU image (or remove the container override) so the upload step
doesn't reserve GPU resources; keep the _factory_: "slurm_factory" and other
scheduling fields unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5ed32921-0e71-4818-a43c-24bac63e217c
📒 Files selected for processing (2)
tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yamltools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp.yaml
… throughput_32k slurm.py's PatternPackager flattens modules/Model-Optimizer/tools/launcher/common/* to /nemo_run/code/common/* (matches the convention already used by the script: field, e.g. `script: common/specdec_bench/run.sh`). The runtime_params YAML path needs to follow the same convention; otherwise the container working dir misses the file: FileNotFoundError: 'modules/Model-Optimizer/tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml' This also makes the path consistent with the public launcher (launch.py) which runs from tools/launcher/ as CWD, so the same `common/...` prefix resolves there too. Signed-off-by: chenhany <chenhany@nvidia.com>
…ECDEC_BENCH_S3_*
The earlier convention used unprefixed S3_ENDPOINT / S3_KEY_ID / S3_SECRET,
which would collide with any other S3 credentials a CI runner or developer
shell happens to carry. Rename to SPECDEC_BENCH_S3_* so the upload step is
explicit about which workflow it serves and avoids cross-tool credential
leakage.
Two changes:
* examples/specdec_bench/upload_to_s3.py — read SPECDEC_BENCH_S3_{ENDPOINT,
KEY_ID,SECRET} as the defaults for --endpoint / --key-id / --secret.
* tools/launcher/core.py — get_default_env() now forwards SPECDEC_BENCH_S3_*
from the launching shell into both slurm_env and local_env so the cluster
upload task (common/specdec_bench/upload_to_s3.sh) sees them inside the
container without committing secrets to any YAML.
The previous task_2 (12308020) on cw_dfw failed because the unprefixed S3_*
vars never reached the sbatch — even when the launching shell had them
exported. With this fix, exporting SPECDEC_BENCH_S3_* before `uv run
slurm.py` / launch.py is enough; the harness propagates them.
Signed-off-by: chenhany <chenhany@nvidia.com>
… runtime_params Without this fix, kwargs unpacked from runtime_params.engine_args into VLLMModel.__init__ are silently dropped — the AsyncEngineArgs(...) call only reads a hardcoded subset. The clearest casualty was runtime_params_throughput_32k.yaml::max_model_len: 40960 (added in e3c4d4f to prevent vLLM from auto-deriving a short max_model_len and truncating 32K prompts), which never took effect. Throughput numbers measured on memory-tight single-GPU runs were over silently-truncated inputs. Caught by claude[bot] review on PR #1564 (CRITICAL). Forward any kwarg whose name matches an AsyncEngineArgs dataclass field, except for the dozen kwargs VLLMModel reads itself (collected into `_VLLM_CONSUMED_KWARGS` so adding a future consumed kwarg fails the unit test loudly). This lets users override max_model_len / dtype / gpu_memory_utilization / any other AsyncEngineArgs field via runtime_params.engine_args. Tests: tests/examples/specdec_bench/test_vllm_kwargs_forwarding.py covers the four cases: * max_model_len is forwarded * multiple engine_args fields all pass through * consumed kwargs (tensor_parallel_size, trust_remote_code) are NOT double-forwarded (would crash AsyncEngineArgs) * unknown/typo kwargs are silently dropped (matches prior behaviour) Signed-off-by: chenhany <chenhany@nvidia.com>
…pload task_2 is a boto3 file copy — pure CPU work that finishes in ~3s. The prior gpus_per_node=1 reserved an entire GPU for the duration, sitting in PD-Priority on busy GPU partitions for the few seconds of upload work. Setting gpus_per_node=0 keeps the job on the default `batch` partition (GPU node, but no GPU reserved) so it schedules instantly alongside existing GPU jobs. No separate cpu partition needed — Slurm allows 0-GPU jobs on GPU partitions. Suggested by claude[bot] review on PR #1564 (non-blocking). Signed-off-by: chenhany <chenhany@nvidia.com>
…engine> The prior names `specdec_bench` (NONE/AR baseline) and `specdec_bench_mtp` encoded only the workflow, not the model. A future run with a different base model (e.g., Qwen3-8B) under the same workflow would collide in S3 against the existing Qwen3.5-4B run — `--skip-existing` would reject it, masking the new data. Rename sweep dirs to embed model + algorithm + engine so they stay unique across multi-model / multi-engine benchmarking: specdec_bench -> qwen35_4b_none_vllm specdec_bench_mtp -> qwen35_4b_mtp_vllm The S3 prefix `results/specdec_bench_mtp/` (14 objects from the prior cluster smoke) has already been moved to `results/qwen35_4b_mtp_vllm/` via boto3 copy+delete. No NONE-baseline data was in S3 (prior task_2 failures), so nothing else to migrate. This matches the loose convention already used by other team sweeps in the same bucket (e.g., `qwen35_35_sglang_mtp7`, `dsr1-0528-fp4_trtllm_mtp7`, `kimi25_vllm_nv_eagle_d7`). Signed-off-by: chenhany <chenhany@nvidia.com>
…h YAMLs The bench-pipeline YAML now stops at task_1 (throughput_32k). S3 upload moves out of the YAML and into the pensieve-intern specdec_bench workflow's wrap_up stage, which owns harvesting /scratchspace/<sweep>/ contents and publishing them with provenance stamps (jira_ticket + huggingface_model_id) so the visualizer can flag them as `is_official=TRUE`. Rationale (Phase 3 of OMNIML-4788): * The launcher YAML lives in NVIDIA/Model-Optimizer/tools/launcher/ examples/ where any external user can run it as a self-contained benchmark recipe. Baking S3 credentials + a team-specific bucket into that example is the wrong scoping — anyone running it locally hits an auth error or worse, exfiltrates to whatever S3 endpoint their env happens to have set. * The wrap_up stage runs inside nmm-sandbox CI with the team's SPECDEC_BENCH_S3_* CI vars already configured (per the env-prefix rename in commit b928954). It can stamp jira_ticket from the Epic key + huggingface_model_id from the Epic SPEC, which is the data the visualizer needs to mark a run as an official record (csv_from_s3.py:255 `is_official = jira_ticket AND huggingface_model_id`). The kwarg-forwarding fix in vllm.py (80309ca), runtime_params path fix (2c69f89), and SPECDEC_BENCH_S3_* env-prefix in core.py (b928954) all stay — they're useful infrastructure regardless of where the upload step lives. Signed-off-by: chenhany <chenhany@nvidia.com>
What does this PR do?
Type of change: enhancement (follow-up to #1531).
Extends the merged Qwen3.5-4B SPEED-Bench launcher YAMLs from a single-task qualitative-only smoke into a 3-task pipeline that also covers long-context throughput and verifies the S3-upload path end-to-end. Two commits, cleanly cherry-picked from #1531's late branch state — they were authored after the merge-commit was resolved against an earlier rebased head and so didn't ride along with that merge.
Pipeline shape (both YAMLs)
task_0/scratchspace/specdec_bench{,_mtp}/qualitativetask_1/scratchspace/specdec_bench{,_mtp}/throughput_32ktask_2s3://team-specdec-workgroup/results/specdec_bench{,_mtp}/<split>/New artifacts
tools/launcher/common/specdec_bench/upload_to_s3.sh— thin wrapper aroundexamples/specdec_bench/upload_to_s3.pyso it can be invoked as a launcher task. Installsboto3fromrequirements.txton cold containers; warm pipelines pick it up from the priorrun.sh.tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml— pinsengine_args.max_model_len = 40,960(32K input + 4K output + 4K headroom) so vLLM doesn't silently auto-capmax_model_lenbelow the 36K minimum needed forthroughput_32kprompts on single-GPU runs.Why max_model_len matters
Without an explicit
max_model_len, vLLM auto-derives it from the model config (Qwen3.5-4B = 128K) and from the GPU-memory budget. On a single GPU the second factor can cap effectivemax_model_lenwell below 36K, silently truncating 32K-token prompts and producing wrong throughput numbers. The qualitative split is not affected (its prompts top out around 8K, well under any auto-derivation floor) so onlytask_1carries the override.S3 credentials
upload_to_s3.shreadsS3_ENDPOINT/S3_KEY_ID/S3_SECRETfrom the runtime environment (not hardcoded).--skip-existing+--allow-incomplete-provenanceare passed by default so re-runs land alongside the prior upload, and runs lackingCONTAINER_IMAGE(Phase-2 harness work in OMNIML-4788 will populate it) still upload.Testing
Cluster smoke on cw_dfw via:
is currently in-flight (jobs
12257378/79/80, PD). Will update this PR with timing/AR numbers + S3 upload confirmation once it lands.Before your PR is "Ready for review"
/qualitativesuffix insave_dir)examples/specdec_bench/requirements.txtfrom [OMNIML-4788] specdec_bench: configuration.json provenance + upload_to_s3 #1531)🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Chores