Batch invariant support PART1 by grimoire · Pull Request #4666 · InternLM/lmdeploy

grimoire · 2026-06-10T06:56:10Z

Summary

Add first-stage batch-invariant greedy inference support for the PyTorch CUDA backend.

This PR covers the supported PART1 scope and makes unsupported cases explicit:

- Hopper CUDA backend only
- Qwen3.5-oriented greedy inference path
- AR and Qwen3.5 MTP/spec decode support
- EP/DeepEP is rejected for now
- Active eviction/recompute triggers a warning, since it may break batch-invariant behavior

Changes

- Use FA3 as the default decoding attention backend.
- Require FlashInfer for MoE layers when enable_batch_invariant is enabled.
- Add enable_batch_invariant user-facing warning text to PytorchEngineConfig.
- Update the --enable-batch-invariant CLI help to describe the supported scope.
- Reject enable_batch_invariant with unsupported EP/EPLB configs in ConfigBuilder.

Validation

- ConfigBuilder smoke:
  - enable_batch_invariant=True, ep=2 fails as expected.
  - LMDEPLOY_ENABLE_BATCH_INVARIANT=1 with ep=2 fails as expected.
  - enable_batch_invariant=True, enable_eplb=True fails as expected.
- py_compile passed.
- git diff --check passed.
- pre-commit run --files lmdeploy/messages.py lmdeploy/cli/utils.py lmdeploy/pytorch/engine/config_builder.py passed.
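
The three ConfigBuilder smoke cases above can be illustrated with a small guard. This is a minimal sketch, not the code in `config_builder.py`: `SketchEngineConfig` and `check_batch_invariant` are hypothetical names, and treating the environment value `'1'` as "enabled" is an assumption about how `LMDEPLOY_ENABLE_BATCH_INVARIANT` is parsed.

```python
import os
from dataclasses import dataclass


@dataclass
class SketchEngineConfig:
    """Hypothetical stand-in for the engine config; not LMDeploy's real class."""
    enable_batch_invariant: bool = False
    ep: int = 1
    enable_eplb: bool = False


def check_batch_invariant(config, env=None):
    """Return True if batch-invariant mode is on; reject EP/EPLB combinations."""
    env = os.environ if env is None else env
    # Assumption: either the config flag or the env var set to '1' enables the mode.
    enabled = config.enable_batch_invariant or env.get('LMDEPLOY_ENABLE_BATCH_INVARIANT', '0') == '1'
    if not enabled:
        return False
    if config.ep > 1:
        raise ValueError('enable_batch_invariant does not support expert parallelism (ep > 1) yet.')
    if config.enable_eplb:
        raise ValueError('enable_batch_invariant does not support EPLB yet.')
    return True
```

Checking at config-build time means an unsupported combination fails before any model weights are loaded, which matches the "fails as expected" results listed above.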

Notes

Large real-model validation was not rerun for this final guard/docs update. Previous validation covered the supported batch-invariant AR/MTP paths.

Copilot

Pull request overview

Adds first-stage “batch-invariant” greedy inference support for the PyTorch CUDA backend by introducing an opt-in runtime policy, enforcing FA3-based decoding paths, and gating unsupported configurations (EP/EPLB) early.

Changes:

- Introduces an opt-in CUDA batch-invariant runtime policy (env + engine config) and applies it in executors/backends.
- Forces/threads FA3 decoding metadata options (incl. split policy) through CUDA graph + FA3 attention paths.
- Adds deterministic MoE routing primitives for Qwen3.5 MoE (SoftmaxTopK via FlashInfer; padded shared-expert gate).
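
To make the "deterministic MoE routing" item concrete, the sketch below shows what a softmax top-k router computes for one token, and why a per-row computation gives the same answer regardless of what else is in the batch. It is plain Python for illustration only: it does not use FlashInfer, `softmax_topk` and `route_batch` are hypothetical names, and the lowest-index tie-break is an assumption rather than the PR's documented behavior.

```python
import math


def softmax_topk(logits, k):
    """Softmax over one token's router logits, then select the k largest experts.

    Ties are broken by the lower expert index (an assumed rule) so the
    selection never depends on iteration or scheduling order.
    """
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    top = order[:k]
    return [probs[i] for i in top], top


def route_batch(batch_logits, k):
    """Route each row independently, so a row's result ignores its neighbours."""
    return [softmax_topk(row, k) for row in batch_logits]
```

Because each row is reduced in a fixed order on its own, routing a token alone or alongside other tokens yields identical weights and expert ids, which is the property batch-invariant inference needs from the router.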

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/pytorch/spec_decode/test_cudagraph_strategy.py | Updates dummy cudagraph metadata hook signature to match new FA3 metadata arguments. |
| lmdeploy/pytorch/models/utils/cudagraph.py | Threads FA3 `num_splits` into flash-attn metadata generation and stores it in cudagraph meta. |
| lmdeploy/pytorch/models/qwen3_5_moe.py | Switches MoE routing to `SoftmaxTopK` and adds a batch-invariant-friendly shared-expert gate weight layout. |
| lmdeploy/pytorch/model_inputs.py | Adds `enable_batch_invariant` to `BuildModelContext` for model-build-time decisions. |
| lmdeploy/pytorch/kernels/cuda/pagedattention.py | Caps split-K by KV page table size to reduce waste in split-K decode. |
| lmdeploy/pytorch/envs.py | Adds `LMDEPLOY_ENABLE_BATCH_INVARIANT` env parsing. |
| lmdeploy/pytorch/engine/model_agent/agent.py | Propagates `enable_batch_invariant` into model build context. |
| lmdeploy/pytorch/engine/executor/uni_executor.py | Applies backend policy early when batch-invariant mode is enabled. |
| lmdeploy/pytorch/engine/executor/ray_executor.py | Applies backend policy early in Ray workers when batch-invariant mode is enabled. |
| lmdeploy/pytorch/engine/config_builder.py | Enables batch-invariant via env, rejects unsupported EP/EPLB, and applies backend policy during config build. |
| lmdeploy/pytorch/config.py | Adds `enable_batch_invariant` to `BackendConfig`. |
| lmdeploy/pytorch/backends/selector.py | Exposes `apply_backend_policy(...)` helper to call backend-specific policy hooks. |
| lmdeploy/pytorch/backends/cuda/op_backend.py | Adds SoftmaxTopK builder wiring; updates FA3 metadata build rules and speculative-decoding FA3 requirement handling; applies CUDA batch-invariant policy. |
| lmdeploy/pytorch/backends/cuda/moe/topk.py | Adds CUDA SoftmaxTopK implementation using FlashInfer top_k for deterministic routing under batch-invariant policy. |
| lmdeploy/pytorch/backends/cuda/moe/__init__.py | Exports the CUDA SoftmaxTopK builder. |
| lmdeploy/pytorch/backends/cuda/graph_runner.py | Stores FA3 split policy and FA3 eligibility in cudagraph meta. |
| lmdeploy/pytorch/backends/cuda/batch_invariant.py | Implements the CUDA batch-invariant policy (env + torch precision knobs + optional device validation). |
| lmdeploy/pytorch/backends/cuda/attention/fa3.py | Routes decode to FA3 `flash_attn_with_kvcache` paths and enforces batch-invariant constraints (e.g., rejects quantized KV decode). |
| lmdeploy/pytorch/backends/cuda/attention/__init__.py | Enforces “batch-invariant requires FA3 attention” and reports disable reasons. |
| lmdeploy/pytorch/backends/base.py | Adds a default backend policy hook that rejects batch-invariant on unsupported backends. |
| lmdeploy/messages.py | Adds user-facing config docs + validation for `enable_batch_invariant`. |
| lmdeploy/cli/utils.py | Adds `--enable-batch-invariant` CLI flag and help text. |
| lmdeploy/cli/serve.py | Wires CLI flag into `PytorchEngineConfig`. |

+    max_pages = max(int(page_table.size(1)), 1)
+    max_useful_split = max(4, triton.next_power_of_2(max_pages))
+    return min(split_k, max_useful_split)
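
The split-K cap in the snippet above can be exercised without a GPU. The following is a minimal sketch that mirrors the three added lines, assuming the page-table width is passed in as a plain integer; `cap_split_k` is a hypothetical name, and `next_power_of_2` is a local stand-in for `triton.next_power_of_2`.

```python
def next_power_of_2(n):
    """Smallest power of two >= n, for n >= 1 (stand-in for triton.next_power_of_2)."""
    return 1 << (n - 1).bit_length()


def cap_split_k(split_k, num_pages):
    """Limit split_k so no more splits are launched than the KV pages can use."""
    max_pages = max(int(num_pages), 1)
    # Keep a floor of 4 splits, otherwise round the page count up to a power of two.
    max_useful_split = max(4, next_power_of_2(max_pages))
    return min(split_k, max_useful_split)
```

For example, with a requested `split_k` of 16, a page table 5 pages wide is capped to 8 splits, while a table of 3 pages or fewer falls back to the floor of 4. A requested value already below the cap is returned unchanged.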


+    if current is not None and _cuda_initialized():
+        raise RuntimeError(
+            f'enable_batch_invariant requires {name}={value}, but found {name}={current} '
+            'after CUDA was already initialized. Set the environment before importing/running LMDeploy, '
+            'or create the engine before any CUDA work in this process.')
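
The check above only shows the failing branch. A possible surrounding flow is sketched below under stated assumptions: `require_env` is a hypothetical helper, the early return on a matching value and the final assignment are guesses about the unseen code, and the CUDA state is passed in as a boolean instead of calling the PR's `_cuda_initialized()`.

```python
import os


def require_env(name, value, cuda_initialized, env=None):
    """Ensure env[name] == value, refusing to change it once CUDA is initialized."""
    env = os.environ if env is None else env
    current = env.get(name)
    if current == value:
        return  # already set as required, nothing to do
    if current is not None and cuda_initialized:
        # A conflicting value may already have been read by CUDA libraries.
        raise RuntimeError(
            f'enable_batch_invariant requires {name}={value}, but found {name}={current} '
            'after CUDA was already initialized.')
    # Assumption: otherwise it is still safe to set the value for this process.
    env[name] = value
```

The point of failing loudly is that such settings are typically read once at initialization, so silently overwriting a conflicting value after that would give a false sense of determinism.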


+                raise RuntimeError(
+                    'Speculative decoding on CUDA requires FA3 attention. Please ensure FA3 is available and the '
+                    'attention configuration supports FA3, or disable speculative decoding.')


grimoire added 8 commits June 9, 2026 11:30

add batch invariant flag

b83dda6

topk

cc350a1

update fa3

33094b3

fix fa3

6af97c9

fix mtp

eacaa9d

fix gdr

66804ef

Merge remote-tracking branch 'upstream/main' into deterministic-infer…

e1fd938

…ence # Conflicts: # lmdeploy/pytorch/backends/cuda/graph_runner.py # lmdeploy/pytorch/models/utils/cudagraph.py # lmdeploy/pytorch/strategies/ar_spec/cudagraph.py

update check

63e2e4c

Copilot AI review requested due to automatic review settings June 10, 2026 06:56

Copilot AI reviewed Jun 10, 2026

fix copilot

29666fe

grimoire wants to merge 9 commits into InternLM:main from grimoire:deterministic-inference
