Batch invariant support PART1 #4666

Open
grimoire wants to merge 9 commits into InternLM:main from grimoire:deterministic-inference

@grimoire

Collaborator

Summary

Add first-stage batch-invariant greedy inference support for the PyTorch CUDA backend.

This PR focuses on the supported PART1 scope and keeps unsupported cases explicit:

  • Hopper CUDA backend only
  • Qwen3.5-oriented greedy inference path
  • AR and Qwen3.5 MTP/spec decode support
  • EP/DeepEP is rejected for now
  • Active eviction/recompute triggers a warning because it can break batch-invariant behavior

Changes

  • Use FA3 as the default decoding attention backend.
  • Require FlashInfer for MoE layers when enable_batch_invariant is enabled.
  • Add enable_batch_invariant user-facing warning text to PytorchEngineConfig.
  • Update --enable-batch-invariant CLI help to describe the supported scope.
  • Reject enable_batch_invariant with unsupported EP/EPLB configs in ConfigBuilder.

Validation

  • ConfigBuilder smoke:
    • enable_batch_invariant=True, ep=2 fails as expected.
    • LMDEPLOY_ENABLE_BATCH_INVARIANT=1 with ep=2 fails as expected.
    • enable_batch_invariant=True, enable_eplb=True fails as expected.
  • py_compile passed.
  • git diff --check passed.
  • pre-commit run --files lmdeploy/messages.py lmdeploy/cli/utils.py lmdeploy/pytorch/engine/config_builder.py passed.
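
The three ConfigBuilder smoke cases above can be illustrated with a small guard. This is only a sketch under stated assumptions: `SketchEngineConfig`, `resolve_batch_invariant`, and `check_batch_invariant` are hypothetical names, the exception type is a guess, and treating the value `'1'` as "enabled" is an assumption. Only the field names `enable_batch_invariant`, `ep`, `enable_eplb` and the variable `LMDEPLOY_ENABLE_BATCH_INVARIANT` come from this PR.

```python
import os
from dataclasses import dataclass


@dataclass
class SketchEngineConfig:
    """Hypothetical stand-in for the engine config; not LMDeploy's class."""
    enable_batch_invariant: bool = False
    ep: int = 1
    enable_eplb: bool = False


def resolve_batch_invariant(cfg, environ=None):
    """Return True if the mode is requested by the config or the environment."""
    environ = os.environ if environ is None else environ
    # Assumption: the string '1' is what switches the mode on via the env.
    from_env = environ.get('LMDEPLOY_ENABLE_BATCH_INVARIANT') == '1'
    return cfg.enable_batch_invariant or from_env


def check_batch_invariant(cfg, environ=None):
    """Fail early when batch-invariant mode meets an unsupported config."""
    if not resolve_batch_invariant(cfg, environ):
        return
    if cfg.ep > 1:
        raise ValueError('enable_batch_invariant does not support ep > 1')
    if cfg.enable_eplb:
        raise ValueError('enable_batch_invariant does not support EPLB')
```

Calling `check_batch_invariant(SketchEngineConfig(enable_batch_invariant=True, ep=2), {})` raises, as does the same `ep=2` config with the flag supplied only through the environment mapping.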

Notes

Large real-model validation was not rerun for this final guard/docs update. Previous validation covered the supported batch-invariant AR/MTP paths.

grimoire added 8 commits June 9, 2026 11:30
…ence

# Conflicts:
#	lmdeploy/pytorch/backends/cuda/graph_runner.py
#	lmdeploy/pytorch/models/utils/cudagraph.py
#	lmdeploy/pytorch/strategies/ar_spec/cudagraph.py
Copilot AI review requested due to automatic review settings June 10, 2026 06:56

Copilot AI left a comment

Contributor

Pull request overview

Adds first-stage “batch-invariant” greedy inference support for the PyTorch CUDA backend by introducing an opt-in runtime policy, enforcing FA3-based decoding paths, and gating unsupported configurations (EP/EPLB) early.

Changes:

  • Introduces an opt-in CUDA batch-invariant runtime policy (env + engine config) and applies it in executors/backends.
  • Forces/threads FA3 decoding metadata options (incl. split policy) through CUDA graph + FA3 attention paths.
  • Adds deterministic MoE routing primitives for Qwen3.5 MoE (SoftmaxTopK via FlashInfer; padded shared-expert gate).
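
Why a split policy matters for batch invariance can be shown with plain floating-point arithmetic. The sketch below is not the FA3 kernel; `chunked_sum` is a hypothetical helper that reduces a list in `num_splits` contiguous chunks and then combines the partial sums. It assumes, as a general point about split reductions, that a split count chosen from the batch shape changes the order of additions and therefore the rounded result.

```python
def chunked_sum(values, num_splits):
    """Sum values by reducing contiguous chunks, then the partial sums."""
    chunk = -(-len(values) // num_splits)  # ceiling division
    partials = []
    for start in range(0, len(values), chunk):
        acc = 0.0
        for v in values[start:start + chunk]:
            acc += v
        partials.append(acc)
    total = 0.0
    for p in partials:
        total += p
    return total


# The same four numbers, reduced with two different split counts.
values = [1e16, 1.0, -1e16, 1.0]
one_split = chunked_sum(values, 1)
two_splits = chunked_sum(values, 2)
print(one_split, two_splits, one_split == two_splits)
```

The two results differ even though the inputs are identical, while repeating the call with the same `num_splits` always reproduces the same value. Pinning the split count removes that one source of batch-dependent output.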

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `tests/pytorch/spec_decode/test_cudagraph_strategy.py` | Updates dummy cudagraph metadata hook signature to match new FA3 metadata arguments. |
| `lmdeploy/pytorch/models/utils/cudagraph.py` | Threads FA3 num_splits into flash-attn metadata generation and stores it in cudagraph meta. |
| `lmdeploy/pytorch/models/qwen3_5_moe.py` | Switches MoE routing to SoftmaxTopK and adds a batch-invariant-friendly shared-expert gate weight layout. |
| `lmdeploy/pytorch/model_inputs.py` | Adds enable_batch_invariant to BuildModelContext for model-build-time decisions. |
| `lmdeploy/pytorch/kernels/cuda/pagedattention.py` | Caps split-K by KV page table size to reduce waste in split-K decode. |
| `lmdeploy/pytorch/envs.py` | Adds LMDEPLOY_ENABLE_BATCH_INVARIANT env parsing. |
| `lmdeploy/pytorch/engine/model_agent/agent.py` | Propagates enable_batch_invariant into model build context. |
| `lmdeploy/pytorch/engine/executor/uni_executor.py` | Applies backend policy early when batch-invariant mode is enabled. |
| `lmdeploy/pytorch/engine/executor/ray_executor.py` | Applies backend policy early in Ray workers when batch-invariant mode is enabled. |
| `lmdeploy/pytorch/engine/config_builder.py` | Enables batch-invariant via env, rejects unsupported EP/EPLB, and applies backend policy during config build. |
| `lmdeploy/pytorch/config.py` | Adds enable_batch_invariant to BackendConfig. |
| `lmdeploy/pytorch/backends/selector.py` | Exposes apply_backend_policy(...) helper to call backend-specific policy hooks. |
| `lmdeploy/pytorch/backends/cuda/op_backend.py` | Adds SoftmaxTopK builder wiring; updates FA3 metadata build rules and speculative-decoding FA3 requirement handling; applies CUDA batch-invariant policy. |
| `lmdeploy/pytorch/backends/cuda/moe/topk.py` | Adds CUDA SoftmaxTopK implementation using FlashInfer top_k for deterministic routing under batch-invariant policy. |
| `lmdeploy/pytorch/backends/cuda/moe/__init__.py` | Exports the CUDA SoftmaxTopK builder. |
| `lmdeploy/pytorch/backends/cuda/graph_runner.py` | Stores FA3 split policy and FA3 eligibility in cudagraph meta. |
| `lmdeploy/pytorch/backends/cuda/batch_invariant.py` | Implements the CUDA batch-invariant policy (env + torch precision knobs + optional device validation). |
| `lmdeploy/pytorch/backends/cuda/attention/fa3.py` | Routes decode to FA3 flash_attn_with_kvcache paths and enforces batch-invariant constraints (e.g., rejects quantized KV decode). |
| `lmdeploy/pytorch/backends/cuda/attention/__init__.py` | Enforces “batch-invariant requires FA3 attention” and reports disable reasons. |
| `lmdeploy/pytorch/backends/base.py` | Adds a default backend policy hook that rejects batch-invariant on unsupported backends. |
| `lmdeploy/messages.py` | Adds user-facing config docs + validation for enable_batch_invariant. |
| `lmdeploy/cli/utils.py` | Adds --enable-batch-invariant CLI flag and help text. |
| `lmdeploy/cli/serve.py` | Wires CLI flag into PytorchEngineConfig. |
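
The deterministic routing idea behind the SoftmaxTopK entries above can be sketched in pure Python. This is not the FlashInfer-backed implementation from the PR; `softmax_topk` and `route_batch` are hypothetical names, and breaking ties by the lower expert index is an assumption chosen for illustration. The point is that each token's routing is computed from its own row only, with a fixed tie-breaking rule, so the result cannot depend on which other tokens share the batch.

```python
import math


def softmax_topk(logits_row, k):
    """Return the k (expert_index, probability) pairs with highest softmax."""
    m = max(logits_row)
    exps = [math.exp(x - m) for x in logits_row]  # shift for stability
    denom = sum(exps)
    probs = [e / denom for e in exps]
    # Descending probability; equal probabilities fall back to the lower
    # expert index (assumed rule), so the choice is fully determined.
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]
    return [(i, probs[i]) for i in order]


def route_batch(logits, k):
    """Route every row independently of the other rows in the batch."""
    return [softmax_topk(row, k) for row in logits]


token = [0.5, 2.0, 2.0, -1.0]
alone = route_batch([token], 2)[0]
batched = route_batch([[3.0, 0.0, 0.0, 0.0], token, [0.0, 0.0, 0.0, 0.0]], 2)[1]
print(alone == batched)
```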


Comment on lines +681 to +683
max_pages = max(int(page_table.size(1)), 1)
max_useful_split = max(4, triton.next_power_of_2(max_pages))
return min(split_k, max_useful_split)
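
The effect of this cap can be checked without Triton. The sketch below reimplements the three lines with a local `next_power_of_2`, which is assumed to match `triton.next_power_of_2` for positive integers; `cap_split_k` is a hypothetical wrapper name, and it takes the page count directly instead of a `page_table` tensor.

```python
def next_power_of_2(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def cap_split_k(split_k, num_pages):
    """Limit split_k so short page tables are not split more than is useful."""
    max_pages = max(int(num_pages), 1)
    # Never cap below 4; otherwise round the page count up to a power of two.
    max_useful_split = max(4, next_power_of_2(max_pages))
    return min(split_k, max_useful_split)


for pages in (0, 3, 20, 100):
    print(pages, cap_split_k(64, pages))
```

A requested `split_k` smaller than the cap is returned unchanged, so the function only ever lowers the split count.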
Comment on lines +41 to +45
if current is not None and _cuda_initialized():
    raise RuntimeError(
        f'enable_batch_invariant requires {name}={value}, but found {name}={current} '
        'after CUDA was already initialized. Set the environment before importing/running LMDeploy, '
        'or create the engine before any CUDA work in this process.')
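
One plausible shape for the surrounding helper is sketched below. Only the quoted `if`/`raise` comes from the PR; everything else is an assumption: the name `require_env`, the early return when the value already matches, and setting the variable in the remaining cases. The CUDA state is passed in as a boolean so the sketch runs without a GPU; a real caller might use `torch.cuda.is_initialized()`.

```python
import os


def require_env(name, value, cuda_initialized, environ=None):
    """Ensure environ[name] == value, refusing a late conflicting override."""
    environ = os.environ if environ is None else environ
    current = environ.get(name)
    if current == value:
        return  # already set as required (assumed early exit)
    if current is not None and cuda_initialized:
        # A different value may already have been read by CUDA libraries.
        raise RuntimeError(
            f'enable_batch_invariant requires {name}={value}, but found '
            f'{name}={current} after CUDA was already initialized.')
    # Assumption: an unset or not-yet-consumed variable is simply set.
    environ[name] = value


env = {'SKETCH_KNOB': '0'}  # hypothetical variable name
require_env('SKETCH_KNOB', '1', cuda_initialized=False, environ=env)
print(env)
```

Passing the same conflicting mapping with `cuda_initialized=True` raises instead of overwriting, which mirrors the error message's advice to set the environment before any CUDA work.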
Comment on lines +255 to +257
raise RuntimeError(
    'Speculative decoding on CUDA requires FA3 attention. Please ensure FA3 is available and the '
    'attention configuration supports FA3, or disable speculative decoding.')