feat: add video input support for Gemma 4 Unified (gemma4_unified)#400

Merged
inureyes merged 2 commits into main from feature/issue-164-gemma4-unified-video on Jun 22, 2026

@inureyes

Member

Summary

Add video input support to the encoder-free Gemma 4 Unified (gemma4_unified) runtime, replacing the previous "video input is not yet supported" gate. Video is handled as images-per-frame: frames are extracted with ffmpeg (uniform sampling, default 2.0 fps via the shared multimodal::video module), patchified through the existing patch-projection vision embedder, and the per-frame soft tokens scatter into video_token_id placeholder spans, mirroring the image scatter. This works on both the CLI (--video) and the server (video_url content blocks).

What changed

  • Processor (src/vision/processors/gemma4_unified.rs): factored the image patch loop into a shared patchify(image, soft_token_cap) and added preprocess_video_frames, which caps each frame at vision_soft_tokens_per_video_frame (70 by default) instead of the larger per-image budget (280). resize_dims now takes the cap as a parameter so image and video share one loop. New DEFAULT_VIDEO_SOFT_TOKENS_PER_FRAME constant and set_video_soft_tokens_per_frame setter.
  • Config (src/vision/gemma4_unified_config.rs): new vision_soft_tokens_per_video_frame field (default 70), wired into the processor by the loader (src/loading/vlm_gemma_unified.rs).
  • Model (src/vision/gemma4_unified.rs): factored the per-input project/trim/concat into project_vision_features; added get_video_features and get_input_embeddings_with_video; the image/video/audio scatters now run on the running embeddings inside a shared merge_multimodal core. The image/audio/text paths run the identical op sequence as before, so their output is unchanged.
  • Token expansion (src/multimodal/vlm_runtime.rs): new expand_gemma4_unified_video_tokens wraps each frame as <boi> video_token*N <eoi> and either replaces a <|video|> placeholder in prompt order or splices the spans after BOS when no placeholder is present. Per-frame soft tokens use video_token_id (not image_token_id), matching the encoder-free scatter target.
  • CLI (src/commands/generate.rs, generate_vlm.rs): route --video for Gemma4Unified to a new compute_gemma4_unified_video_embeddings; render a <|video|> content part inside the user turn, gated to gemma4_unified (via cli_video_content_part_count) so the ViT-backed gemma4 VLM path keeps its splice-after-BOS behavior byte-for-byte. New ChatTemplateProcessor::supports_video_content.
  • Server (src/server/model_worker.rs, startup.rs): removed the Gemma4Unified "not yet supported" gate and added prepare_gemma4_unified_video_embeddings; detect_model_media_support now advertises video for gemma4_unified so the route guard admits video_url requests. The server gets the <|video|> placeholder from the request content parts.
  • Mask fix (src/models/gemma4.rs): overlay_block_bidirectional panicked ("Cannot reshape array of size N into shape (1, window)") whenever a prefill exceeded the 1024-token sliding window, because the windowed base mask caps its key axis to the window while the block-id vector spans the full sequence. It now aligns the key-side ids to the trailing window positions the rotating cache retains. Each vision span is at most one frame and fits inside the window, so per-span bidirectional attention is preserved.
  • Docs (docs/supported-models.md): mark video as supported for gemma4_unified and document the frame budget / sliding-window caveat.
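
The token-expansion bullet above can be illustrated with a minimal sketch. This is not the code from src/multimodal/vlm_runtime.rs: the `SpecialIds` struct, the `expand_video_tokens` name, the numeric ids used in any example, and the "one placeholder per video" rule behind the count-mismatch error are all assumptions made for illustration.

```rust
/// Hypothetical special-token ids; real values come from the model's tokenizer.
#[derive(Clone, Copy)]
pub struct SpecialIds {
    pub bos: u32,
    pub boi: u32,
    pub eoi: u32,
    pub video_token: u32,
    pub video_placeholder: u32,
}

/// Expand one video into consecutive `<boi> video_token*N <eoi>` spans,
/// one span per frame, where N is that frame's soft-token count.
fn frame_spans(frame_tokens: &[usize], ids: SpecialIds) -> Vec<u32> {
    let mut out = Vec::new();
    for &n in frame_tokens {
        out.push(ids.boi);
        out.extend(std::iter::repeat(ids.video_token).take(n));
        out.push(ids.eoi);
    }
    out
}

/// Replace each video placeholder (in prompt order) with that video's frame
/// spans, or splice all spans after BOS when the prompt has no placeholder.
/// Assumption: one placeholder per video; any other count is an error.
pub fn expand_video_tokens(
    prompt: &[u32],
    videos: &[Vec<usize>],
    ids: SpecialIds,
) -> Result<Vec<u32>, String> {
    // No videos: leave the prompt untouched.
    if videos.is_empty() {
        return Ok(prompt.to_vec());
    }
    let placeholders = prompt
        .iter()
        .filter(|&&t| t == ids.video_placeholder)
        .count();
    if placeholders == 0 {
        // Splice after BOS if the prompt starts with it, else at the front.
        let split = usize::from(prompt.first() == Some(&ids.bos));
        let mut out = prompt[..split].to_vec();
        for video in videos {
            out.extend(frame_spans(video, ids));
        }
        out.extend_from_slice(&prompt[split..]);
        return Ok(out);
    }
    if placeholders != videos.len() {
        return Err(format!(
            "{} video placeholders but {} videos",
            placeholders,
            videos.len()
        ));
    }
    let mut next_video = videos.iter();
    let mut out = Vec::with_capacity(prompt.len());
    for &t in prompt {
        if t == ids.video_placeholder {
            let video = next_video.next().expect("counts checked above");
            out.extend(frame_spans(video, ids));
        } else {
            out.push(t);
        }
    }
    Ok(out)
}
```

The four branches correspond to the cases the unit tests name: placeholder replacement, splice after BOS, count mismatch, and the empty no-op.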

Reused infra (no reinvention)

multimodal::video (ffmpeg_available, load_videos, load_video_source, DEFAULT_FPS) for decode and FPS; the existing Gemma4UnifiedVisionEmbedder + Gemma4MultimodalEmbedder patch projector; merge::merge_llava for the scatter (now keyed on video_token_id); the existing gemma4_unified_mask blockwise bidirectional builder (video tokens were already type 2 and treated as vision); decode_request_images and the server media resolver / fd-backed VideoSource for the server path.

Validation (real model: mlx-community/gemma-4-12b-it-4bit, M1 Ultra)

CLI, grounded description (6 frames at --fps 0.6, 425 prompt tokens):

$ mlxcel generate -m models/gemma-4-12b-it-4bit --video car_video.mp4 --fps 0.6 -p "Describe this video." -n 100
The video shows a car accident in a parking lot. A black SUV is moving forward and hits a silver car parked in a space. The impact causes the front of the black SUV to crumple and the front of the silver car to be damaged. The black SUV then comes to a stop.

Server /v1/chat/completions with a video_url content part (startup log: model_type=Gemma4Unified: enabling video_url content block support):

The video shows a black minivan and a silver sedan parked in a parking lot. The black minivan is parked in a space and the silver sedan is parked in the space next to it...

Regressions on the same model are unaffected: text-only returns a correct short answer, and a single-image run returns "A solid, dark reddish-brown square." for a solid-red fixture.

Known limitation (pre-existing, not introduced here)

The acceptance command at the default 2.0 fps on the 10s car_video.mp4 yields 20 frames (1377 prompt tokens), which exceeds the model's 1024-token sliding window. Single-pass prefill of a sequence longer than the window degenerates to <pad> output because the rotating KV cache evicts the keys the earliest query rows need (their windowed mask row becomes all-masked, producing NaN that propagates). This reproduces with a text-only ~1700-token coherent prompt and is independent of the video path (text generation never touches the bidirectional overlay). It also reproduces on the existing gemma4 VLM video path. Lowering --fps (or the frame count) keeps a clip within the window and yields a grounded description, as shown above. A proper fix (chunked prefill for sliding-window models) is a separate follow-up.
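
For choosing an --fps value that stays inside the window, the two runs reported here (6 frames for 425 prompt tokens, 20 frames for 1377) are enough for a rough linear estimate. The sketch below is an approximation: the per-frame and fixed-overhead constants are fitted to those two data points rather than read from the code, the frame count is assumed to be roughly duration times fps, and all function names are hypothetical. The fixed overhead covers the chat template plus the "Describe this video." prompt, so it changes with the prompt text.

```rust
/// Fitted from the two reported runs, not taken from the implementation:
/// (1377 - 425) / (20 - 6) = 68 tokens per frame, including delimiters.
pub const TOKENS_PER_FRAME: usize = 68;
/// 425 - 6 * 68 = 17 tokens of template and prompt text in those runs.
pub const FIXED_PROMPT_TOKENS: usize = 17;
/// Sliding-window size stated in this PR.
pub const SLIDING_WINDOW: usize = 1024;

/// Assumed frame count for uniform sampling: duration times fps, rounded.
pub fn frame_count(duration_secs: f64, fps: f64) -> usize {
    (duration_secs * fps).round() as usize
}

/// Linear estimate of the prompt length for a given number of frames.
pub fn estimate_prompt_tokens(frames: usize) -> usize {
    FIXED_PROMPT_TOKENS + frames * TOKENS_PER_FRAME
}

/// Largest frame count whose estimated prompt still fits in the window.
pub fn max_frames_within_window() -> usize {
    (SLIDING_WINDOW - FIXED_PROMPT_TOKENS) / TOKENS_PER_FRAME
}
```

Under these assumptions about 14 frames fit, which for the 10s clip corresponds to roughly 1.4 fps or lower. Frames with a different aspect ratio or a longer text prompt would shift the numbers.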

Tests

Narrow unit tests added beside the code:

  • Processor: preprocess_video_frames per-frame shape + soft-token count caps at vision_soft_tokens_per_video_frame; image path unchanged by the video budget.
  • Config: vision_soft_tokens_per_video_frame defaults to 70 and honors an explicit value.
  • Expansion: expand_gemma4_unified_video_tokens placeholder-replace, splice-after-BOS, count-mismatch error, empty no-op.
  • Mask: video frame spans separated by eoi/boi get distinct bidirectional blocks; overlay_block_bidirectional aligns block ids to a windowed (capped) key axis without panicking.
  • Prompt: the VLM chat template renders a <|video|> content part inside the user turn (image-then-video order) and omits it when the template lacks video support.
  • Server: detect_model_media_support enables video for gemma4_unified.

Test plan

  • cargo build --release --features metal,accelerate -p mlxcel
  • cargo test --release -p mlxcel --lib gemma4_unified (33 passed) and the new lib tests (38 passed for the combined filter)
  • cargo test --release -p mlxcel --bin mlxcel (125 passed; the one failure tests::family_order_is_exhaustive is pre-existing on main and unrelated: it flags BitNet missing from FAMILY_ORDER)
  • cargo clippy --features metal,accelerate -p mlxcel --bins --tests -- -D warnings
  • cargo fmt --check
  • Real-model video generation (CLI) produces a grounded car-accident description
  • Server /v1/chat/completions video content part produces a grounded description
  • Text-only and single-image regressions on the same model

Closes #164

Wire `--video` (CLI) and `video_url` content blocks (server) through the encoder-free `gemma4_unified` runtime, replacing the previous "not yet supported" gate. Video is handled as images-per-frame: frames are extracted with `ffmpeg` (uniform sampling, default 2.0 fps via the shared `multimodal::video` module), each frame is patchified through the existing patch-projection vision embedder, and the per-frame soft tokens scatter into `video_token_id` placeholder spans, mirroring the image scatter.

Processor: factor the image patch loop into a shared `patchify(image, soft_token_cap)` and add `preprocess_video_frames`, which caps each frame at `vision_soft_tokens_per_video_frame` (70 by default, new config field) instead of the larger per-image budget (280). `resize_dims` now takes the cap as a parameter so image and video share one loop.

Model: factor the per-input project+trim+concat into `project_vision_features`, add `get_video_features` and `get_input_embeddings_with_video`, and run image then video then audio scatters on the running embeddings inside a shared `merge_multimodal` core. The image/audio/text paths run the identical op sequence as before, so their output is unchanged.
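
The scatter step can be sketched in plain Rust as follows. It is a stand-in for the idea only: the real `merge::merge_llava` operates on MLX arrays and its signature is not shown in this PR, so `scatter_features`, its parameters, and the nested `Vec` representation are hypothetical.

```rust
/// Overwrite the embedding rows at placeholder positions with feature rows,
/// in order. `embeddings` and `token_ids` are assumed to have the same length.
/// Errors when the placeholder count does not match the feature row count.
pub fn scatter_features(
    embeddings: &mut [Vec<f32>],
    token_ids: &[u32],
    placeholder_id: u32,
    features: &[Vec<f32>],
) -> Result<(), String> {
    let slots = token_ids.iter().filter(|&&t| t == placeholder_id).count();
    if slots != features.len() {
        return Err(format!(
            "{} placeholder positions but {} feature rows",
            slots,
            features.len()
        ));
    }
    let mut next = features.iter();
    for (row, &t) in embeddings.iter_mut().zip(token_ids) {
        if t == placeholder_id {
            // Each placeholder consumes exactly one feature row.
            *row = next.next().expect("count checked above").clone();
        }
    }
    Ok(())
}
```

Calling such a function once per modality on the same buffer (image id, then video id, then audio id) mirrors the "running embeddings" order described above: each pass touches only its own placeholder rows, so the earlier passes are left intact.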

Prompt placement: the CLI now renders a `<|video|>` content part inside the user turn (gated to `gemma4_unified` so the ViT-backed `gemma4` VLM keeps its splice-after-BOS behavior byte-for-byte). The unified expansion `expand_gemma4_unified_video_tokens` wraps each frame as `<boi> video_token*N <eoi>` and replaces the placeholder in prompt order (or splices the spans after BOS when no placeholder is present). The server gets the placeholder from the request content parts and `detect_model_media_support` now advertises video for `gemma4_unified` so the route guard admits the request.

Mask fix: `overlay_block_bidirectional` panicked ("Cannot reshape array of size N into shape (1, window)") whenever a prefill exceeded the 1024-token sliding window, because the windowed base mask caps its key axis to the window while the block-id vector spans the full sequence. It now aligns the key-side ids to the trailing `window` positions the rotating cache retains. Each vision span is at most one frame and fits inside the window, so per-span bidirectional attention is preserved.
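
The alignment idea can be shown on a small boolean mask. This is a simplified sketch, not the code in src/models/gemma4.rs: it uses nested `Vec<bool>` instead of MLX arrays, assumes prefill from an empty cache (one mask row per token, key axis no longer than the sequence), and assumes block id 0 means text while each vision span has its own positive id. The name `overlay_blocks_sketch` is hypothetical.

```rust
/// Overlay per-block bidirectional attention on a boolean base mask.
/// `base[q][k]` is true where query row `q` may attend key column `k`.
/// `block_ids` has one entry per token in the full sequence.
pub fn overlay_blocks_sketch(base: &[Vec<bool>], block_ids: &[u32]) -> Vec<Vec<bool>> {
    let seq_len = block_ids.len();
    let key_len = base.first().map_or(0, |row| row.len());
    // A windowed base mask has fewer key columns than tokens. Column `k` then
    // stands for absolute position `seq_len - key_len + k`, so the key-side
    // ids are the trailing `key_len` entries of the block-id vector.
    let offset = seq_len.saturating_sub(key_len);
    let key_ids = &block_ids[offset..];
    base.iter()
        .zip(block_ids)
        .map(|(row, &query_block)| {
            row.iter()
                .zip(key_ids)
                .map(|(&allowed, &key_block)| {
                    // Keep the base mask, and additionally open attention
                    // between tokens of the same non-text block.
                    allowed || (query_block != 0 && query_block == key_block)
                })
                .collect()
        })
        .collect()
}
```

Slicing the ids to the trailing window is what avoids the shape mismatch: the query side keeps the full sequence length while the key side matches the capped key axis.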

Validation (mlx-community/gemma-4-12b-it-4bit, M1 Ultra): `--video car_video.mp4 --fps 0.6 -p "Describe this video."` returns "The video shows a car accident in a parking lot. A black SUV is moving forward and hits a silver car parked in a space..." Server `/v1/chat/completions` with a `video_url` content part returns "The video shows a black minivan and a silver sedan parked in a parking lot...". Text-only and single-image runs on the same model are unaffected. The default 2.0 fps on a 10s clip yields 20 frames (1377 tokens) which exceeds the model's 1024-token sliding window and degenerates during single-pass prefill; this is a pre-existing limitation reproducible with a text-only 1700-token prompt, independent of the video path. Lower `--fps` keeps a clip within the window.

Closes #164
@inureyes added the status:review (Under review), type:enhancement (New features, capabilities, or significant additions), priority:medium (Medium priority), and area:models (Model architectures, weights, loading, metadata) labels on Jun 22, 2026
Two comments introduced by the video-input change used em dashes, which the project style avoids. Replace them with a period and a colon respectively. Comment-only; no behavior change.
@inureyes

Member Author

PR Finalization Complete

Summary

Tests: All 4 expand_gemma4_unified_video_tokens code paths (count-mismatch error, no-placeholder splice, placeholder replacement, empty-frames no-op) are covered by existing tests in src/multimodal/vlm_runtime_tests.rs. No gap found.

Documentation: docs/supported-models.md is accurate and complete. The gemma4_unified bullet covers the video capability, the ~70-tokens/frame budget, the 1024-token sliding-window caveat, the --fps mitigation, and both CLI (--video) and server (video_url) paths. The video env vars (MLXCEL_VIDEO_DIR_ALLOWLIST, MLXCEL_VIDEO_MAX_PIXELS, MLXCEL_VIDEO_MAX_DURATION_SEC) are already documented generically in docs/environment-variables.md under "Video and local-media variables"; no model-specific addition needed. No other VLM usage doc lists per-model --video support, so nothing else to update.

Lint/Format: cargo fmt --check -p mlxcel clean. cargo clippy --features metal,accelerate -p mlxcel --lib --tests -- -D warnings clean (15.57s, no warnings).

Em dashes: None found anywhere in the PR diff (Unicode scan confirmed).

No commits needed. Branch is up to date with origin at 5e848d740. Ready for merge.

@inureyes added the status:done (Completed) label and removed the status:review (Under review) label on Jun 22, 2026
@inureyes inureyes merged commit d35ef06 into main Jun 22, 2026
5 checks passed
@inureyes inureyes deleted the feature/issue-164-gemma4-unified-video branch June 22, 2026 05:38
Development

Successfully merging this pull request may close these issues.

feat: video input support for Gemma 4 Unified (gemma4_unified)
