Conversation
…ening

- fix min/max bug at visualserver/manager.py:87 — was silently capping per-DP visual batch size to 1 regardless of --visual_infer_batch_size
- tighten visual and audio runtime semaphores from *8 to *1 so runtime concurrency never exceeds the stress-tested peak
- add per-model _check_max_len_infer for Qwen2_VL, Qwen2_5_VL, Qwen3_VL, Qwen3_omni_moe (absorbed from PR #1253)
- qwen_vl_check_max_len_infer deliberately omits torch.cuda.empty_cache so the stress peak stays pinned in the caching allocator at the driver level for the rest of the process lifetime — this is the reserve-then-yield contract that lets the LLM's later profile_size see peer reservations
- wire the _check_max_len_infer call site into visual model_rpc.py::exposed_init_model with a hasattr gate and a warning log for uncovered model types
- absorb PR #1253's config-driven worst-case derivation for InternVL
- port the _check_decode_infer helper from origin/qw35_stable into basemodel.py (not yet called from __init__ — Commit 2 wires it in)

Part of the multimodal OOM fix. See docs/superpowers/specs/2026-04-16-multimodal-oom-fix-design.md for rationale.
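The hasattr-gated wiring described above can be sketched roughly as follows. This is a minimal sketch, not the shipped code: `run_startup_stress` and the logger shape are hypothetical stand-ins for the call site in model_rpc.py::exposed_init_model.

```python
def run_startup_stress(model, logger):
    """Run the per-model startup stress check if this model type provides one."""
    if hasattr(model, "_check_max_len_infer"):
        # The per-model check deliberately omits torch.cuda.empty_cache, so the
        # stress peak stays pinned in the caching allocator (reserve-then-yield).
        model._check_max_len_infer()
    else:
        logger.warning(
            "model type %s has no _check_max_len_infer; startup stress skipped",
            type(model).__name__,
        )
```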
- profile_size now picks a probe KV size (max(gmbs*(bmt+256), 8192)) instead of the old (1-mem_fraction)*total fudge subtraction. The mem_fraction argument is saved for Phase 2 to use as an optional safety multiplier.
- add profile_size_target() method that computes the budget from measured peak_reserved, total memory, peer footprint, and canary. TP-safe via all_reduce(MIN) across ranks.
- wire _check_decode_infer into basemodel __init__ as a call site immediately after _check_max_len_infer.
- add _auto_profile_log_only_phase2 transitional helper that computes what auto-profile would pick and logs the delta versus the probe size, without acting on the result. The server still ships with probe-sized KV; Commit 3 replaces this helper with the real Phase 3 rebuild.
- add pure-Python unit tests under unit_tests/common/basemodel/ covering the probe formula, target arithmetic, peer footprint flooring, and mem_fraction multiplier semantics. These tests mock torch.cuda and run without a GPU.
- apply carry-forward Fix A: add a defensive mem_manager.free_all() on the actual_batch < 2 early-return path in _check_decode_infer.
- apply carry-forward Fix B: add a clarifying comment on mem_indexes[:actual_batch] in _build_decode_model_input explaining the 1-slot-per-request semantics.

Part of the multimodal OOM fix. See docs/superpowers/specs/2026-04-16-multimodal-oom-fix-design.md §6.
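The probe formula and the target arithmetic the unit tests cover can be sketched in pure Python. This is a sketch under assumed names (`gmbs` = graph_max_batch_size, `bmt` = batch_max_tokens); the byte-level bookkeeping is simplified and the all_reduce(MIN) step across TP ranks is omitted.

```python
def probe_kv_tokens(gmbs: int, bmt: int) -> int:
    # Probe-sized KV for the stress forwards, floored at 8192 slots.
    return max(gmbs * (bmt + 256), 8192)


def profile_size_target(total_bytes: int, peak_reserved: int, peer_bytes: int,
                        canary_bytes: int, kv_bytes_per_token: int,
                        probe_tokens: int, mem_fraction: float = 1.0) -> int:
    # Everything the probe run reserved beyond its own KV is non-KV overhead
    # (weights, activations, graphs). Clamp at 0 per the Commit 3 guard.
    non_kv = max(peak_reserved - probe_tokens * kv_bytes_per_token, 0)
    budget = total_bytes - non_kv - peer_bytes - canary_bytes
    # mem_fraction is an optional safety multiplier, not a blind subtraction.
    return int(budget * mem_fraction / kv_bytes_per_token)
```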
- add _teardown_graphs_and_kv helper that drops all references to the probe's kv_buffer and CUDA graphs in the order required for empty_cache() to actually return blocks to the driver.
- replace the log-only Commit 2 helper with _auto_profile_rebuild_and_validate, which measures peak, computes target, tears down, empty_caches, re-inits mem_manager + req_manager + cudagraphs at target size, allocates the 256 MB canary, and runs a validation stress forward. On OOM, shrinks target by 5% and retries (cap 3 retries, 4 total attempts).
- on exhausted retries, raises with the multi-knob diagnostic from spec §7.2.
- post-empty_cache sanity check: assert torch.cuda.memory_reserved() dropped by at least probe_kv_bytes to catch teardown reference leaks early.
- explicit --max_total_token_num path short-circuits the whole loop.
- add negative non_kv_overhead guard in profile_size_target (clamps to 0 with warning if peak < probe, per Commit 2 quality review Fix D).
- add unit tests for the retry cap (4 attempts max, all diagnostic knobs named in the error) and the escape-hatch short-circuit.

Part of the multimodal OOM fix. See docs/superpowers/specs/2026-04-16-multimodal-oom-fix-design.md §6.5 and §7.
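The shrink-and-retry control flow can be sketched as below. The helper names are hypothetical; in the real loop, `try_rebuild` would stand in for the teardown + realloc + graph re-capture + canary + validation sequence, and the diagnostic text is only in the spirit of spec §7.2.

```python
def rebuild_with_retry(target_tokens: int, try_rebuild, max_retries: int = 3) -> int:
    # Cap 3 retries = 4 total attempts; shrink target by 5% after each OOM.
    for _ in range(max_retries + 1):
        try:
            try_rebuild(target_tokens)
            return target_tokens
        except MemoryError:
            target_tokens = int(target_tokens * 0.95)
    raise RuntimeError(
        "auto-profile rebuild OOMed on all 4 attempts; lower --mem_fraction, "
        "lower --graph_max_batch_size, or pin --max_total_token_num"
    )
```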
…MAX_LEN_INFER

- --mem_fraction default changes from 0.9 to 1.0. Under the new auto-profile, mem_fraction is an optional safety multiplier on top of the measured budget, not a blind subtraction. Default 1.0 means "use what auto-profile measured."
- remove all DISABLE_CHECK_MAX_LEN_INFER env var checks (four call sites): basemodel._check_max_len_infer, basemodel._check_decode_infer, vit/model.py._check_max_len_infer, qwen2_vl/vision_process.qwen_vl_check_max_len_infer. Operators who need to skip stress should use --max_total_token_num to pin a known-good size instead, which implicitly skips the auto-profile loop but preserves the single-pass stress as a safety net.
- this is the user-visible behavior change commit. Commits 1-3 preserved the env var specifically so that rollout could emergency-disable stress if something broke; commit 4 is when that escape hatch goes away.
- also remove the now-dead os.environ setter in base_backend.init_mtp_draft_model

Part of the multimodal OOM fix. See docs/superpowers/specs/2026-04-16-multimodal-oom-fix-design.md §8.3 and §11.
Code Review
This pull request implements a two-pass auto-profiling mechanism and a "reserve-then-yield" memory contract to prevent multimodal CUDA OOM errors. It updates subprocess initialization, adds realistic stress tests for vision models, and tightens concurrency limits. Feedback indicates that the Phase 3 rebuild loop is missing necessary re-initialization calls for the attention backend and padded requests, and the teardown process fails to clear references to the attention backend and OOM canary, which could prevent memory from being released.
```python
# Phase 3: re-init everything at target_tokens
self.max_total_token_num = target_tokens
self._init_mem_manager()
self._init_kv_move_buffer()
self._check_mem_size()
self._init_req_manager()
self._init_cudagraph()
self._init_prefill_cuda_graph()
```
The rebuild loop in Phase 3 is missing calls to self._init_att_backend() and self._init_padded_req(). Since mem_manager and req_manager are recreated, the attention backend and the padded request (which hold references to the old managers or their internal state like KV buffer pointers and request slots) must also be re-initialized. Without this, the model will continue to use stale state pointing to the released probe memory, leading to crashes or incorrect results during validation and service.
Suggested change:

```python
# Phase 3: re-init everything at target_tokens
self.max_total_token_num = target_tokens
self._init_mem_manager()
self._init_kv_move_buffer()
self._check_mem_size()
self._init_req_manager()
self._init_att_backend()
if hasattr(self, "_init_att_backend1"):
    self._init_att_backend1()
self._init_padded_req()
self._init_cudagraph()
self._init_prefill_cuda_graph()
```
```python
def _teardown_graphs_and_kv(self):
    """Phase 3 teardown: drop references to the probe's kv_buffer and
    all captured CUDA graphs so torch.cuda.empty_cache() can actually
    return the blocks to the driver.

    Order matters — graphs hold tensor pointers into kv_buffer and must
    go first, then req_manager (which references mem_manager), then
    mem_manager itself. See spec §6.5 for the rationale.
    """
    if hasattr(self, "mem_manager") and self.mem_manager is not None:
        try:
            self.mem_manager.free_all()
        except Exception:
            pass

    for attr in ("graph", "prefill_graph", "prefill_cuda_graph"):
        if hasattr(self, attr):
            setattr(self, attr, None)
    # MTP variants (if present)
    for attr in ("graph1", "prefill_graph1"):
        if hasattr(self, attr):
            setattr(self, attr, None)

    if hasattr(self, "req_manager"):
        self.req_manager = None
    if hasattr(self, "mem_manager"):
        self.mem_manager = None
```
The _teardown_graphs_and_kv method is missing the cleanup of self.att_backend (and self.att_backend1 if present) and self._oom_canary. If these references are not dropped, torch.cuda.empty_cache() will not be able to release the memory occupied by the probe KV buffer and the canary tensor, which will cause the teardown leak check to fail or lead to OOM during the rebuild phase.
Suggested change:

```python
def _teardown_graphs_and_kv(self):
    """Phase 3 teardown: drop references to the probe's kv_buffer and
    all captured CUDA graphs so torch.cuda.empty_cache() can actually
    return the blocks to the driver.

    Order matters — graphs hold tensor pointers into kv_buffer and must
    go first, then req_manager (which references mem_manager), then
    mem_manager itself. See spec §6.5 for the rationale.
    """
    if hasattr(self, "mem_manager") and self.mem_manager is not None:
        try:
            self.mem_manager.free_all()
        except Exception:
            pass
    for attr in ("graph", "prefill_graph", "prefill_cuda_graph", "graph1", "prefill_graph1"):
        if hasattr(self, attr):
            setattr(self, attr, None)
    if hasattr(self, "att_backend"):
        self.att_backend = None
    if hasattr(self, "att_backend1"):
        self.att_backend1 = None
    if hasattr(self, "_oom_canary"):
        self._oom_canary = None
    if hasattr(self, "req_manager"):
        self.req_manager = None
    if hasattr(self, "mem_manager"):
        self.mem_manager = None
```

- change --mem_fraction default from 1.0 to 0.95: reserves 5% extra headroom on top of the auto-profiled budget for per-request spikes and allocator fragmentation the stress test cannot cover. Users can still set 1.0 explicitly for aggressive mode.
- fix api_start.py assertion from strict `< 1` to `<= 1` so that mem_fraction=1.0 no longer crashes on startup.
The probe size formula was max(gmbs * (bmt + 256), 8192), which produced a huge KV allocation when both graph_max_batch_size and batch_max_tokens were large. On hybrid models like Qwen3.5 (attention + mamba), the oversized probe KV starved the mamba cache manager's allocation, causing a startup error asking the operator to raise --mem_fraction or reduce --running_max_req_size.

The probe only needs enough KV slots for:

- one full-length prefill stress (bmt slots)
- one decode batch stress (gmbs slots, 1 new token per request)

So the correct formula is max(bmt + gmbs, 8192), not max(gmbs * bmt, 8192). This is a 100x+ reduction in probe memory for typical configs.

Update the unit test to match the new formula; rename from _graph_heavy to _large_bmt to reflect that the exceeding-floor case is now driven by bmt.
The probe size formula was max(bmt + gmbs, 8192), which was smaller than max_req_total_len in chunked-prefill configurations (e.g. bmt=4096 but max_req_total_len=262144). This broke _check_mem_size's assertion max_seq_length <= max_total_token_num, because a single request's full KV must fit in the (probe-sized) manager.

New formula: max(max_req_total_len + gmbs, bmt + gmbs, 8192). This ensures the probe can hold one full-length request plus a decode batch, which is the minimum to pass _check_mem_size AND run both stress forwards.

Update unit tests: add max_req_total_len to the fixture and add a new _long_context test for the chunked-prefill case.
The max_seq_length <= max_total_token_num assertion in _check_mem_size is a correctness constraint for production KV sizing — a single request's full KV must fit. But during the auto-profile probe phase, the KV is intentionally small (just enough for the stress forwards), so the assertion is a false positive.

- _check_mem_size now skips the assertion when mem_manager._probe_tokens is set (probe phase) AND the probe size is smaller than max_seq_length. After the Phase 3 rebuild, _probe_tokens is unset and the assertion fires as before.
- Revert the probe formula to max(bmt + gmbs, 8192). We no longer need max_req_total_len in the formula because the probe doesn't have to hold a full-length request — it only needs enough slots for the stress forwards. This reduces probe KV by ~60x on chunked-prefill configs (e.g. max_req_total_len=262144, bmt=4096 → probe was 262272, now 4224), freeing that memory for the mamba cache on hybrid models like Qwen3.5.
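The reverted formula plus the probe-phase guard can be sketched as below. This is a sketch: `check_mem_size` stands in for `_check_mem_size`, with the `_probe_tokens` flag passed explicitly instead of read off mem_manager.

```python
def probe_kv_tokens(bmt: int, gmbs: int) -> int:
    # One full-length prefill stress (bmt) plus one decode batch (gmbs),
    # floored at 8192 slots.
    return max(bmt + gmbs, 8192)


def check_mem_size(max_seq_length: int, max_total_token_num: int,
                   probe_tokens=None) -> None:
    # During the probe phase the KV is deliberately tiny, so the
    # single-request-must-fit assertion would be a false positive.
    # Once probe_tokens is unset (post Phase 3 rebuild), it re-arms.
    if probe_tokens is not None and probe_tokens < max_seq_length:
        return
    assert max_seq_length <= max_total_token_num, "a full request's KV must fit"
```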
…size

mamba_cache_mem_manager.profile_size was subtracting `total * (1 - mem_fraction)` from available memory, the same safety-margin logic the KV cache used to apply. Under the auto-profile design, the mem_fraction multiplier applies only to the KV cache budget in Phase 2, and the 256 MB canary absorbs allocator jitter. Keeping mamba's subtraction double-counts the safety margin.

On memory-tight configurations (e.g. Qwen3.5-122B MoE on 80 GB cards, where per-rank weights alone consume ~60 GB), this caused mamba's available_memory to go negative, producing nonsense diagnostic output like "Reduce --running_max_req_size to -82 or lower".

mamba now takes mamba_cache_ratio * get_available_gpu_memory directly. The safety margin is entirely the KV cache's responsibility.
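The double-count can be seen with illustrative numbers in the spirit of the commit's example. All figures below are hypothetical placeholders, not measurements; only the 80 GB card and ~60 GB weights come from the commit message.

```python
GIB = 1 << 30
total = 80 * GIB            # 80 GB card, per the commit's Qwen3.5-122B example
available = 6 * GIB         # hypothetical leftover after ~60 GB weights + KV etc.
mem_fraction = 0.9          # the old default safety margin
mamba_cache_ratio = 0.1     # hypothetical ratio

# Old: mamba re-applied the KV cache's (1 - mem_fraction) subtraction,
# double-counting the safety margin; on tight configs this goes negative,
# producing diagnostics like "Reduce --running_max_req_size to -82 or lower".
old_budget = available - total * (1 - mem_fraction)

# New: the ratio applies to measured availability directly; the safety
# margin is entirely the KV cache's responsibility.
new_budget = mamba_cache_ratio * available
```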
Summary
Replaces PR #1253 with a complete multimodal OOM fix covering:
- `min → max` bugfix at `visualserver/manager.py:87` that was silently capping per-DP visual batch size to 1.
- Per-model visual stress checks that deliberately skip `torch.cuda.empty_cache()` so the stress peak stays pinned (reserve-then-yield). Visual and audio runtime semaphores tightened from `×8` to `×1` so concurrency cannot exceed the stress-tested peak.
- Auto-profiled KV sizing replacing blind `--mem_fraction` tuning. Phase 1 runs a probe-sized KV + real prefill + real decode stress. Phase 2 computes the target KV size from measured peak. Phase 3 tears down + empty_caches + reallocs at target + re-captures graphs + allocates a 256 MB canary. Phase 4 validates. On OOM, shrinks 5% and retries (cap 3 retries, 4 total attempts) then fails fast with a multi-knob diagnostic.
- `--mem_fraction` default flipped from 0.9 to 1.0 (now an optional paranoia multiplier).
- `DISABLE_CHECK_MAX_LEN_INFER` env var removed — use `--max_total_token_num` to pin a size instead.

See `docs/superpowers/specs/2026-04-16-multimodal-oom-fix-design.md` for the full design.

Test plan

- `unit_tests/common/basemodel/test_auto_profile.py` pass (8/8)
- `--max_total_token_num` escape hatch

Notes
Do not merge to `origin/main`. This PR is for review only, pending maintainer sign-off after the staging replay validates the four-commit stack.