
fix(oom): multimodal OOM fix with auto-profile#1273

Open
sufubao wants to merge 9 commits into main from fix_oom

Conversation


@sufubao sufubao commented Apr 16, 2026

Summary

Replaces PR #1253 with a complete multimodal OOM fix covering:

  • Layer 0: min → max bugfix at visualserver/manager.py:87 that was silently capping per-DP visual batch size to 1.
  • Layer 1: per-model stress helpers for Qwen2_VL, Qwen2_5_VL, Qwen3_VL, Qwen3_omni_moe that hold the stress peak at the driver level by deliberately omitting torch.cuda.empty_cache(). Visual and audio runtime semaphores tightened from × 8 to × 1 so concurrency cannot exceed the stress-tested peak.
  • Layer 2: an LLM-side two-pass auto-profile that eliminates manual --mem_fraction tuning. Phase 1 runs a probe-sized KV allocation plus a real prefill and a real decode stress. Phase 2 computes the target KV size from the measured peak. Phase 3 tears down the probe, calls empty_cache, reallocates at the target size, re-captures CUDA graphs, and allocates a 256 MB canary. Phase 4 validates. On OOM, the target shrinks by 5% and the loop retries (cap 3 retries, 4 total attempts), then fails fast with a multi-knob diagnostic.
  • --mem_fraction default flipped from 0.9 to 1.0 (now an optional paranoia multiplier).
  • DISABLE_CHECK_MAX_LEN_INFER env var removed — use --max_total_token_num to pin a size instead.

See docs/superpowers/specs/2026-04-16-multimodal-oom-fix-design.md for the full design.
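
The shrink-and-retry policy described above can be sketched as follows. The function and exception names are illustrative, not the PR's actual symbols (real code would catch torch.cuda.OutOfMemoryError, and the diagnostic text here paraphrases the description):

```python
# Hedged sketch of the Phase 3/4 retry policy: shrink the target by 5%
# on OOM, cap at 3 retries (4 total attempts), then fail fast.
MAX_RETRIES = 3

def rebuild_with_retries(target_tokens, rebuild_and_validate):
    for _ in range(MAX_RETRIES + 1):  # 4 total attempts
        try:
            rebuild_and_validate(target_tokens)
            return target_tokens
        except MemoryError:
            target_tokens = int(target_tokens * 0.95)  # shrink 5% and retry
    raise RuntimeError(
        "auto-profile failed after 4 attempts; consider pinning a size with "
        "--max_total_token_num or lowering --mem_fraction"
    )
```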

Test plan

  • pure-Python unit tests in unit_tests/common/basemodel/test_auto_profile.py pass (8/8)
  • local GPU smoke test on Qwen2.5-VL-7B — auto-profile happy path
  • local GPU smoke test — explicit --max_total_token_num escape hatch
  • local GPU smoke test — forced failure path raises with diagnostic
  • 1-hour replay on staging replica of the production Qwen3.5-VL workload, all acceptance criteria from spec §10.3 green

Notes

Do not merge to origin/main. This PR is for review only pending maintainer sign-off after the staging replay validates the four-commit stack.

sufubao added 4 commits April 16, 2026 10:15
…ening

- fix min/max bug at visualserver/manager.py:87 — was silently capping
  per-DP visual batch size to 1 regardless of --visual_infer_batch_size
- tighten visual and audio runtime semaphores from *8 to *1 so runtime
  concurrency never exceeds the stress-tested peak
- add per-model _check_max_len_infer for Qwen2_VL, Qwen2_5_VL, Qwen3_VL,
  Qwen3_omni_moe (absorbed from PR #1253)
- qwen_vl_check_max_len_infer deliberately omits torch.cuda.empty_cache
  so the stress peak stays pinned in the caching allocator at the driver
  level for the rest of process lifetime — this is the reserve-then-yield
  contract that lets the LLM's later profile_size see peer reservations
- wire _check_max_len_infer call site into visual model_rpc.py::exposed_init_model
  with hasattr gate and warning log for uncovered model types
- absorb PR #1253's config-driven worst-case derivation for InternVL
- port _check_decode_infer helper from origin/qw35_stable into basemodel.py
  (not yet called from __init__ — Commit 2 wires it in)

Part of the multimodal OOM fix. See
docs/superpowers/specs/2026-04-16-multimodal-oom-fix-design.md for rationale.
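
The reserve-then-yield contract can be illustrated with a toy allocator (pure Python, no GPU; the real mechanism is torch's caching allocator, where skipping empty_cache keeps reserved blocks pinned at the driver level):

```python
# Toy model of a caching allocator: "stress" drives reservation to a
# peak, "empty_cache" would hand the blocks back to the driver. The
# visual stress deliberately never calls empty_cache, so any later
# measurement of free driver memory (e.g. the LLM's profile_size)
# sees the visual side's true worst-case footprint.
class ToyAllocator:
    def __init__(self, total):
        self.total = total
        self.reserved = 0

    def stress(self, peak):
        # reservation is monotone: the allocator keeps its high-water mark
        self.reserved = max(self.reserved, peak)

    def empty_cache(self):
        self.reserved = 0

    def driver_free(self):
        return self.total - self.reserved
```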
- profile_size now picks a probe KV size (max(gmbs*(bmt+256), 8192))
  instead of the old (1-mem_fraction)*total fudge subtraction. The
  mem_fraction argument is saved for Phase 2 to use as an optional
  safety multiplier.
- add profile_size_target() method that computes the budget from
  measured peak_reserved, total memory, peer footprint, and canary.
  TP-safe via all_reduce(MIN) across ranks.
- wire _check_decode_infer into basemodel __init__ as a call site
  immediately after _check_max_len_infer.
- add _auto_profile_log_only_phase2 transitional helper that computes
  what auto-profile would pick and logs the delta versus the probe size,
  without acting on the result. The server still ships with probe-sized
  KV; Commit 3 replaces this helper with the real Phase 3 rebuild.
- add pure-Python unit tests under unit_tests/common/basemodel/
  covering the probe formula, target arithmetic, peer footprint flooring,
  and mem_fraction multiplier semantics. These tests mock torch.cuda
  and run without a GPU.
- apply carry-forward Fix A: add defensive mem_manager.free_all() on
  the actual_batch < 2 early-return path in _check_decode_infer.
- apply carry-forward Fix B: add clarifying comment on mem_indexes[:actual_batch]
  in _build_decode_model_input explaining the 1-slot-per-request semantics.

Part of the multimodal OOM fix. See
docs/superpowers/specs/2026-04-16-multimodal-oom-fix-design.md §6.
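
The Commit 2 probe formula quoted above can be written out as a one-liner (gmbs = graph_max_batch_size, bmt = batch_max_tokens; note that later commits in this PR revise this formula):

```python
# Sketch of the Commit 2 probe-size formula with its 8192-token floor.
def probe_kv_tokens_v1(gmbs: int, bmt: int, floor: int = 8192) -> int:
    return max(gmbs * (bmt + 256), floor)
```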
- add _teardown_graphs_and_kv helper that drops all references to the
  probe's kv_buffer and CUDA graphs in the order required for
  empty_cache() to actually return blocks to the driver.
- replace the log-only Commit 2 helper with _auto_profile_rebuild_and_validate,
  which measures peak, computes target, tears down, empty_caches, re-inits
  mem_manager + req_manager + cudagraphs at target size, allocates the 256 MB
  canary, and runs a validation stress forward. On OOM, shrinks target by 5%
  and retries (cap 3 retries, 4 total attempts).
- on exhausted retries, raises with the multi-knob diagnostic from spec §7.2.
- post-empty_cache sanity check: assert torch.cuda.memory_reserved() dropped
  by at least probe_kv_bytes to catch teardown reference leaks early.
- explicit --max_total_token_num path short-circuits the whole loop.
- add negative non_kv_overhead guard in profile_size_target (clamps to 0
  with warning if peak < probe, per Commit 2 quality review Fix D).
- add unit tests for the retry cap (4 attempts max, all diagnostic knobs
  named in the error) and the escape-hatch short-circuit.

Part of the multimodal OOM fix. See
docs/superpowers/specs/2026-04-16-multimodal-oom-fix-design.md §6.5 and §7.
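
The post-empty_cache sanity check described above amounts to comparing two reservation measurements. A GPU-free sketch (in the real code both numbers would come from torch.cuda.memory_reserved(), taken before teardown and after empty_cache):

```python
# Leak check: teardown plus empty_cache must return at least the probe
# KV's bytes to the driver, otherwise a reference is still pinning it.
def check_teardown_released(reserved_before: int, reserved_after: int,
                            probe_kv_bytes: int) -> None:
    released = reserved_before - reserved_after
    if released < probe_kv_bytes:
        raise AssertionError(
            f"teardown leak: only {released} bytes returned to the driver, "
            f"expected at least {probe_kv_bytes}"
        )
```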
…MAX_LEN_INFER

- --mem_fraction default changes from 0.9 to 1.0. Under the new auto-profile,
  mem_fraction is an optional safety multiplier on top of the measured budget,
  not a blind subtraction. Default 1.0 means "use what auto-profile measured."
- remove all DISABLE_CHECK_MAX_LEN_INFER env var checks (three call sites):
  basemodel._check_max_len_infer, basemodel._check_decode_infer,
  vit/model.py._check_max_len_infer, qwen2_vl/vision_process.qwen_vl_check_max_len_infer.
  Operators who need to skip stress should use --max_total_token_num to pin
  a known-good size instead, which implicitly skips the auto-profile loop
  but preserves the single-pass stress as a safety net.
- this is the user-visible behavior change commit. Commits 1-3 preserved
  the env var specifically so that rollout could emergency-disable stress
  if something broke; commit 4 is when that escape hatch goes away.
- also removes the now-dead os.environ setter in base_backend.init_mtp_draft_model

Part of the multimodal OOM fix. See
docs/superpowers/specs/2026-04-16-multimodal-oom-fix-design.md §8.3 and §11.
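
The multiplier-versus-subtraction distinction described above can be made concrete (hypothetical byte counts; helper names are illustrative, not the PR's code):

```python
# Old semantics: blindly discard (1 - mem_fraction) of *total* memory.
def old_kv_budget(total_bytes: int, mem_fraction: float) -> int:
    return int(total_bytes - total_bytes * (1.0 - mem_fraction))

# New semantics: optional multiplier on the *measured* budget;
# the default 1.0 means "use what auto-profile measured".
def new_kv_budget(measured_budget_bytes: int, mem_fraction: float = 1.0) -> int:
    return int(measured_budget_bytes * mem_fraction)
```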

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements a two-pass auto-profiling mechanism and a "reserve-then-yield" memory contract to prevent multimodal CUDA OOM errors. It updates subprocess initialization, adds realistic stress tests for vision models, and tightens concurrency limits. Feedback indicates that the Phase 3 rebuild loop is missing necessary re-initialization calls for the attention backend and padded requests, and the teardown process fails to clear references to the attention backend and OOM canary, which could prevent memory from being released.

Comment on lines +1212 to +1219
        # Phase 3: re-init everything at target_tokens
        self.max_total_token_num = target_tokens
        self._init_mem_manager()
        self._init_kv_move_buffer()
        self._check_mem_size()
        self._init_req_manager()
        self._init_cudagraph()
        self._init_prefill_cuda_graph()

critical

The rebuild loop in Phase 3 is missing calls to self._init_att_backend() and self._init_padded_req(). Since mem_manager and req_manager are recreated, the attention backend and the padded request (which hold references to the old managers or their internal state like KV buffer pointers and request slots) must also be re-initialized. Without this, the model will continue to use stale state pointing to the released probe memory, leading to crashes or incorrect results during validation and service.

Suggested change

        # Phase 3: re-init everything at target_tokens
        self.max_total_token_num = target_tokens
        self._init_mem_manager()
        self._init_kv_move_buffer()
        self._check_mem_size()
        self._init_req_manager()
        self._init_cudagraph()
        self._init_prefill_cuda_graph()

        # Phase 3: re-init everything at target_tokens
        self.max_total_token_num = target_tokens
        self._init_mem_manager()
        self._init_kv_move_buffer()
        self._check_mem_size()
        self._init_req_manager()
        self._init_att_backend()
        if hasattr(self, "_init_att_backend1"):
            self._init_att_backend1()
        self._init_padded_req()
        self._init_cudagraph()
        self._init_prefill_cuda_graph()

Comment on lines +1125 to +1151
    def _teardown_graphs_and_kv(self):
        """Phase 3 teardown: drop references to the probe's kv_buffer and
        all captured CUDA graphs so torch.cuda.empty_cache() can actually
        return the blocks to the driver.

        Order matters — graphs hold tensor pointers into kv_buffer and must
        go first, then req_manager (which references mem_manager), then
        mem_manager itself. See spec §6.5 for the rationale.
        """
        if hasattr(self, "mem_manager") and self.mem_manager is not None:
            try:
                self.mem_manager.free_all()
            except Exception:
                pass

        for attr in ("graph", "prefill_graph", "prefill_cuda_graph"):
            if hasattr(self, attr):
                setattr(self, attr, None)
        # MTP variants (if present)
        for attr in ("graph1", "prefill_graph1"):
            if hasattr(self, attr):
                setattr(self, attr, None)

        if hasattr(self, "req_manager"):
            self.req_manager = None
        if hasattr(self, "mem_manager"):
            self.mem_manager = None
high

The _teardown_graphs_and_kv method is missing the cleanup of self.att_backend (and self.att_backend1 if present) and self._oom_canary. If these references are not dropped, torch.cuda.empty_cache() will not be able to release the memory occupied by the probe KV buffer and the canary tensor, which will cause the teardown leak check to fail or lead to OOM during the rebuild phase.

    def _teardown_graphs_and_kv(self):
        """Phase 3 teardown: drop references to the probe's kv_buffer and
        all captured CUDA graphs so torch.cuda.empty_cache() can actually
        return the blocks to the driver.

        Order matters — graphs hold tensor pointers into kv_buffer and must
        go first, then req_manager (which references mem_manager), then
        mem_manager itself. See spec §6.5 for the rationale.
        """
        if hasattr(self, "mem_manager") and self.mem_manager is not None:
            try:
                self.mem_manager.free_all()
            except Exception:
                pass

        for attr in ("graph", "prefill_graph", "prefill_cuda_graph", "graph1", "prefill_graph1"):
            if hasattr(self, attr):
                setattr(self, attr, None)

        if hasattr(self, "att_backend"):
            self.att_backend = None
        if hasattr(self, "att_backend1"):
            self.att_backend1 = None
        if hasattr(self, "_oom_canary"):
            self._oom_canary = None

        if hasattr(self, "req_manager"):
            self.req_manager = None
        if hasattr(self, "mem_manager"):
            self.mem_manager = None

sufubao added 5 commits April 16, 2026 10:32
- change --mem_fraction default from 1.0 to 0.95: reserves 5% extra
  headroom on top of the auto-profiled budget for per-request spikes
  and allocator fragmentation the stress test cannot cover. Users can
  still set 1.0 explicitly for aggressive mode.
- fix api_start.py assertion from strict `< 1` to `<= 1` so that
  mem_fraction=1.0 no longer crashes on startup.
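
The corrected startup check amounts to relaxing the upper bound from strict to inclusive (a sketch; the actual assertion lives in api_start.py and may be phrased differently):

```python
# Accept mem_fraction up to and including 1.0 so the new default works.
def validate_mem_fraction(mem_fraction: float) -> None:
    assert 0 < mem_fraction <= 1, "--mem_fraction must be in (0, 1]"
```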
The probe size formula was max(gmbs * (bmt + 256), 8192), which produced
a huge KV allocation when both graph_max_batch_size and batch_max_tokens
were large. On hybrid models like Qwen3.5 (attention + mamba), the
oversized probe KV starved the mamba cache manager's allocation, causing
a startup error asking the operator to raise --mem_fraction or reduce
--running_max_req_size.

The probe only needs enough KV slots for:
- one full-length prefill stress (bmt slots)
- one decode batch stress (gmbs slots, 1 new token per request)

So the correct formula is max(bmt + gmbs, 8192), not max(gmbs * (bmt + 256), 8192).
This is a 100x+ reduction in probe memory for typical configs.

Update unit test to match new formula; rename from _graph_heavy to
_large_bmt to reflect that the exceeding-floor case is now driven by bmt.
The probe size formula was max(bmt + gmbs, 8192), which was smaller than
max_req_total_len in chunked-prefill configurations (e.g. bmt=4096 but
max_req_total_len=262144). This broke _check_mem_size's assertion
max_seq_length <= max_total_token_num because a single request's full
KV must fit in the (probe-sized) manager.

New formula: max(max_req_total_len + gmbs, bmt + gmbs, 8192). This
ensures the probe can hold one full-length request plus a decode batch,
which is the minimum to pass _check_mem_size AND run both stress
forwards.

Update unit tests: add max_req_total_len to the fixture and add a new
_long_context test for the chunked-prefill case.
The max_seq_length <= max_total_token_num assertion in _check_mem_size is
a correctness constraint for production KV sizing — a single request's
full KV must fit. But during the auto-profile probe phase, the KV is
intentionally small (just enough for the stress forwards), so the
assertion is a false positive.

- _check_mem_size now skips the assertion when mem_manager._probe_tokens
  is set (probe phase) AND the probe size is smaller than max_seq_length.
  After Phase 3 rebuild, _probe_tokens is unset and the assertion fires
  as before.
- Revert probe formula to max(bmt + gmbs, 8192). We no longer need
  max_req_total_len in the formula because the probe doesn't have to
  hold a full-length request — it only needs enough slots for the stress
  forwards.

This reduces probe KV by ~60x on chunked-prefill configs (e.g.
max_req_total_len=262144, bmt=4096 → probe was 262272, now 4224),
freeing that memory for the mamba cache on hybrid models like Qwen3.5.
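
The reverted formula, including the 8192-token floor carried over from the earlier commits, can be sketched as (bmt = batch_max_tokens, gmbs = graph_max_batch_size):

```python
# Final probe sizing: enough KV slots for one full prefill stress plus
# one decode-batch stress (one new token per request). The probe no
# longer has to hold a full max_req_total_len request, since
# _check_mem_size now skips that assertion during the probe phase.
def probe_kv_tokens(bmt: int, gmbs: int, floor: int = 8192) -> int:
    return max(bmt + gmbs, floor)
```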
…size

mamba_cache_mem_manager.profile_size was subtracting `total * (1 - mem_fraction)`
from available memory, the same safety-margin logic the KV cache used to
apply. Under the auto-profile design, the mem_fraction multiplier applies
only to the KV cache budget in Phase 2, and the 256 MB canary absorbs
allocator jitter. Keeping mamba's subtraction double-counts the safety
margin.

On memory-tight configurations (e.g. Qwen3.5-122B MoE on 80 GB cards
where per-rank weights alone consume ~60 GB), this caused mamba's
available_memory to go negative, producing nonsense diagnostic output
like "Reduce --running_max_req_size to -82 or lower".

mamba now takes mamba_cache_ratio * get_available_gpu_memory directly.
The safety margin is entirely the KV cache's responsibility.
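
The double-counting arithmetic described above, with hypothetical numbers (real code reads free memory from the GPU and uses the project's actual helper names):

```python
# Old: mamba re-applied the KV cache's safety subtraction on top of an
# already-reduced available figure, which can go negative on tight configs.
def mamba_budget_old(available_gb: float, total_gb: float,
                     mem_fraction: float, ratio: float) -> float:
    return (available_gb - total_gb * (1.0 - mem_fraction)) * ratio

# New: mamba takes its ratio of actually-available memory directly; the
# safety margin is entirely the KV cache's responsibility.
def mamba_budget_new(available_gb: float, ratio: float) -> float:
    return available_gb * ratio
```

For example, with 3 GB actually free on an 80 GB card and mem_fraction 0.95, the old formula subtracts a further 4 GB and produces a negative budget, matching the nonsense diagnostics described above.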