Batch invariant support PART1 #4666
Open
grimoire wants to merge 9 commits into
…ence # Conflicts: # lmdeploy/pytorch/backends/cuda/graph_runner.py # lmdeploy/pytorch/models/utils/cudagraph.py # lmdeploy/pytorch/strategies/ar_spec/cudagraph.py
Pull request overview
Adds first-stage “batch-invariant” greedy inference support for the PyTorch CUDA backend by introducing an opt-in runtime policy, enforcing FA3-based decoding paths, and gating unsupported configurations (EP/EPLB) early.
Changes:
- Introduces an opt-in CUDA batch-invariant runtime policy (env + engine config) and applies it in executors/backends.
- Forces/threads FA3 decoding metadata options (incl. split policy) through CUDA graph + FA3 attention paths.
- Adds deterministic MoE routing primitives for Qwen3.5 MoE (SoftmaxTopK via FlashInfer; padded shared-expert gate).
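As general background for the split-policy change above (the PR text itself does not spell this out): floating-point addition is not associative, so a reduction that is split into a different number of partial sums can produce a slightly different result. The sketch below is pure Python with a hypothetical `split_sum` helper; it is not code from this PR, only an illustration of why a fixed split policy matters for batch-invariant output.

```python
def split_sum(values, num_splits):
    """Sum `values` as `num_splits` contiguous partial sums, then combine them.

    Hypothetical helper for illustration; explicit loops are used so the
    accumulation order is exactly what is written here.
    """
    chunk = -(-len(values) // num_splits)  # ceiling division
    partials = []
    for start in range(0, len(values), chunk):
        acc = 0.0
        for v in values[start:start + chunk]:
            acc += v
        partials.append(acc)
    total = 0.0
    for p in partials:
        total += p
    return total


vals = [0.0, 0.1, 0.2, 0.3]
one_split = split_sum(vals, 1)   # ((0.0 + 0.1) + 0.2) + 0.3
two_splits = split_sum(vals, 2)  # (0.0 + 0.1) + (0.2 + 0.3)
print(one_split == two_splits)   # the two groupings round differently
```

If the number of splits depends on batch size or sequence layout, the same request can therefore round differently from run to run, which is what a pinned split policy avoids.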
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/pytorch/spec_decode/test_cudagraph_strategy.py | Updates dummy cudagraph metadata hook signature to match new FA3 metadata arguments. |
| lmdeploy/pytorch/models/utils/cudagraph.py | Threads FA3 num_splits into flash-attn metadata generation and stores it in cudagraph meta. |
| lmdeploy/pytorch/models/qwen3_5_moe.py | Switches MoE routing to SoftmaxTopK and adds a batch-invariant-friendly shared-expert gate weight layout. |
| lmdeploy/pytorch/model_inputs.py | Adds enable_batch_invariant to BuildModelContext for model-build-time decisions. |
| lmdeploy/pytorch/kernels/cuda/pagedattention.py | Caps split-K by KV page table size to reduce waste in split-K decode. |
| lmdeploy/pytorch/envs.py | Adds LMDEPLOY_ENABLE_BATCH_INVARIANT env parsing. |
| lmdeploy/pytorch/engine/model_agent/agent.py | Propagates enable_batch_invariant into model build context. |
| lmdeploy/pytorch/engine/executor/uni_executor.py | Applies backend policy early when batch-invariant mode is enabled. |
| lmdeploy/pytorch/engine/executor/ray_executor.py | Applies backend policy early in Ray workers when batch-invariant mode is enabled. |
| lmdeploy/pytorch/engine/config_builder.py | Enables batch-invariant via env, rejects unsupported EP/EPLB, and applies backend policy during config build. |
| lmdeploy/pytorch/config.py | Adds enable_batch_invariant to BackendConfig. |
| lmdeploy/pytorch/backends/selector.py | Exposes apply_backend_policy(...) helper to call backend-specific policy hooks. |
| lmdeploy/pytorch/backends/cuda/op_backend.py | Adds SoftmaxTopK builder wiring; updates FA3 metadata build rules and speculative-decoding FA3 requirement handling; applies CUDA batch-invariant policy. |
| lmdeploy/pytorch/backends/cuda/moe/topk.py | Adds CUDA SoftmaxTopK implementation using FlashInfer top_k for deterministic routing under batch-invariant policy. |
| lmdeploy/pytorch/backends/cuda/moe/__init__.py | Exports the CUDA SoftmaxTopK builder. |
| lmdeploy/pytorch/backends/cuda/graph_runner.py | Stores FA3 split policy and FA3 eligibility in cudagraph meta. |
| lmdeploy/pytorch/backends/cuda/batch_invariant.py | Implements the CUDA batch-invariant policy (env + torch precision knobs + optional device validation). |
| lmdeploy/pytorch/backends/cuda/attention/fa3.py | Routes decode to FA3 flash_attn_with_kvcache paths and enforces batch-invariant constraints (e.g., rejects quantized KV decode). |
| lmdeploy/pytorch/backends/cuda/attention/__init__.py | Enforces “batch-invariant requires FA3 attention” and reports disable reasons. |
| lmdeploy/pytorch/backends/base.py | Adds a default backend policy hook that rejects batch-invariant on unsupported backends. |
| lmdeploy/messages.py | Adds user-facing config docs + validation for enable_batch_invariant. |
| lmdeploy/cli/utils.py | Adds --enable-batch-invariant CLI flag and help text. |
| lmdeploy/cli/serve.py | Wires CLI flag into PytorchEngineConfig. |
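The SoftmaxTopK routing mentioned in the table can be illustrated with a small pure-Python sketch. This is not the FlashInfer-backed implementation from the PR: the function name `softmax_topk`, the tie-breaking rule (lower expert index wins), and the absence of weight renormalization are all assumptions made for illustration.

```python
import math


def softmax_topk(logits, k):
    """Return (weights, indices) of the k largest softmax probabilities.

    Illustrative only: ties are broken by the lower index so the selection
    does not depend on any unordered reduction.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for stability
    total = 0.0
    for e in exps:
        total += e
    probs = [e / total for e in exps]
    # Sort by descending probability, then ascending index for ties.
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    top = order[:k]
    return [probs[i] for i in top], top


weights, experts = softmax_topk([1.0, 3.0, 3.0, 0.5], k=2)
print(experts)  # experts 1 and 2 tie; the lower index is listed first
```

The point of the sketch is that each token's routing depends only on its own logits and a fixed tie-break, not on which other tokens share the batch.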
Comment on lines +681 to +683

    max_pages = max(int(page_table.size(1)), 1)
    max_useful_split = max(4, triton.next_power_of_2(max_pages))
    return min(split_k, max_useful_split)
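The cap in the snippet above can be tried without a GPU. The sketch below replaces `triton.next_power_of_2` with a pure-Python equivalent for positive integers and takes the page count as a plain integer instead of reading it from `page_table`; the function name `cap_split_k` is hypothetical.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n, for n >= 1 (stand-in for triton.next_power_of_2)."""
    return 1 << (n - 1).bit_length()


def cap_split_k(split_k: int, num_pages: int) -> int:
    """Limit split_k to what the KV page table can usefully feed.

    Mirrors the three reviewed lines: at least one page is assumed, and the
    cap never drops below 4.
    """
    max_pages = max(int(num_pages), 1)
    max_useful_split = max(4, next_power_of_2(max_pages))
    return min(split_k, max_useful_split)


print(cap_split_k(64, 3))   # 4: three pages round up to 4 useful splits
print(cap_split_k(8, 100))  # 8: the cap (128) is above the requested split
```

With few pages, a large requested `split_k` is reduced to the floor of 4; with many pages, the requested value passes through unchanged.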
Comment on lines +41 to +45

    if current is not None and _cuda_initialized():
        raise RuntimeError(
            f'enable_batch_invariant requires {name}={value}, but found {name}={current} '
            'after CUDA was already initialized. Set the environment before importing/running LMDeploy, '
            'or create the engine before any CUDA work in this process.')
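Only the raising branch is visible in the snippet above, so the surrounding control flow is an assumption here. A minimal sketch of one plausible shape: accept a matching value, refuse a conflicting value once CUDA is initialized, and otherwise set the variable. The helper name `require_env` and the `cuda_initialized` parameter are hypothetical; in real code the flag could come from `torch.cuda.is_initialized()`.

```python
import os


def require_env(name: str, value: str, cuda_initialized: bool) -> None:
    """Ensure os.environ[name] == value, failing if it is too late to change it.

    Hypothetical sketch; the behavior outside the raising branch is assumed.
    """
    current = os.environ.get(name)
    if current == value:
        return  # already configured as required
    if current is not None and cuda_initialized:
        raise RuntimeError(
            f'enable_batch_invariant requires {name}={value}, but found '
            f'{name}={current} after CUDA was already initialized.')
    # Assumed fallback: no conflicting value is locked in, so set it now.
    os.environ[name] = value


os.environ.pop('SKETCH_KNOB', None)  # placeholder variable name
require_env('SKETCH_KNOB', '1', cuda_initialized=False)
print(os.environ['SKETCH_KNOB'])
```

Passing the initialization state as a parameter keeps the sketch runnable on machines without a GPU or PyTorch.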
Comment on lines +255 to +257

    raise RuntimeError(
        'Speculative decoding on CUDA requires FA3 attention. Please ensure FA3 is available and the '
        'attention configuration supports FA3, or disable speculative decoding.')
Summary
Add the first-stage batch-invariant greedy inference support for the PyTorch CUDA backend.
This PR focuses on the supported PART1 scope and keeps unsupported cases explicit:
Changes
- … enable_batch_invariant is enabled.
- Add enable_batch_invariant user-facing warning text to PytorchEngineConfig.
- … --enable-batch-invariant CLI help to describe the supported scope.
- Reject enable_batch_invariant with unsupported EP/EPLB configs in ConfigBuilder.
Validation
- enable_batch_invariant=True, ep=2 fails as expected.
- LMDEPLOY_ENABLE_BATCH_INVARIANT=1 with ep=2 fails as expected.
- enable_batch_invariant=True, enable_eplb=True fails as expected.
- py_compile passed.
- git diff --check passed.
- pre-commit run --files lmdeploy/messages.py lmdeploy/cli/utils.py lmdeploy/pytorch/engine/config_builder.py passed.
Notes
Large real-model validation was not rerun for this final guard/docs update. Previous validation covered the supported batch-invariant AR/MTP paths.
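The three guard cases listed under Validation can be sketched as a small standalone check. This is not the `ConfigBuilder` code from the PR: `SketchConfig`, `env_flag`, and `apply_batch_invariant_guard` are hypothetical names, the accepted truthy strings for the environment variable are an assumption, and only the field names `enable_batch_invariant`, `ep`, and `enable_eplb` plus the variable `LMDEPLOY_ENABLE_BATCH_INVARIANT` come from the text above.

```python
import os
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for the engine config fields named in this PR."""
    enable_batch_invariant: bool = False
    ep: int = 1
    enable_eplb: bool = False


def env_flag(name, environ=None):
    """Assumed parsing: treat 1/true/yes/on as enabled."""
    environ = os.environ if environ is None else environ
    return environ.get(name, '0').strip().lower() in ('1', 'true', 'yes', 'on')


def apply_batch_invariant_guard(cfg, environ=None):
    """Enable the mode from the environment, then reject unsupported configs early."""
    if env_flag('LMDEPLOY_ENABLE_BATCH_INVARIANT', environ):
        cfg.enable_batch_invariant = True
    if not cfg.enable_batch_invariant:
        return cfg
    if cfg.ep > 1:
        raise ValueError('enable_batch_invariant does not support EP (ep > 1).')
    if cfg.enable_eplb:
        raise ValueError('enable_batch_invariant does not support EPLB.')
    return cfg


try:
    apply_batch_invariant_guard(SketchConfig(enable_batch_invariant=True, ep=2), environ={})
except ValueError as err:
    print(err)
```

Checking the environment variable before the EP/EPLB tests means the env-only path (`LMDEPLOY_ENABLE_BATCH_INVARIANT=1` with `ep=2`) fails the same way as the explicit flag.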