feat: add video input support for Gemma 4 Unified (gemma4_unified) #400
Wire `--video` (CLI) and `video_url` content blocks (server) through the encoder-free `gemma4_unified` runtime, replacing the previous "not yet supported" gate. Video is handled as images-per-frame: frames are extracted with `ffmpeg` (uniform sampling, default 2.0 fps via the shared `multimodal::video` module), each frame is patchified through the existing patch-projection vision embedder, and the per-frame soft tokens scatter into `video_token_id` placeholder spans, mirroring the image scatter.
Processor: factor the image patch loop into a shared `patchify(image, soft_token_cap)` and add `preprocess_video_frames`, which caps each frame at `vision_soft_tokens_per_video_frame` (70 by default, new config field) instead of the larger per-image budget (280). `resize_dims` now takes the cap as a parameter so image and video share one loop.
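To make the shared-cap idea concrete, here is a minimal sketch of a cap-parameterized resize. It is not the project's `resize_dims`: the function name `resize_dims_sketch`, the 16-pixel patch size, and the assumption that one patch maps to one soft token are all hypothetical, chosen only to show how a single routine could serve both the 280-token image budget and the 70-token video-frame budget.

```rust
/// Hypothetical sketch (not the project's `resize_dims`): pick the largest
/// aspect-preserving patch grid whose patch count stays within
/// `soft_token_cap`. Returns the resized (width_px, height_px).
fn resize_dims_sketch(width: u32, height: u32, patch: u32, soft_token_cap: u32) -> (u32, u32) {
    assert!(width > 0 && height > 0 && patch > 0 && soft_token_cap > 0);
    // Scale so that (w / patch) * (h / patch) is roughly soft_token_cap.
    let area = f64::from(width) * f64::from(height);
    let budget_px = f64::from(soft_token_cap) * f64::from(patch) * f64::from(patch);
    let scale = (budget_px / area).sqrt();
    // Floor to whole patches and keep at least one patch per axis.
    let mut cols = ((f64::from(width) * scale / f64::from(patch)).floor() as u32).max(1);
    let mut rows = ((f64::from(height) * scale / f64::from(patch)).floor() as u32).max(1);
    // The max(1) clamp can overshoot on extreme aspect ratios; trim the long axis.
    if cols * rows > soft_token_cap {
        if cols >= rows {
            cols = soft_token_cap / rows;
        } else {
            rows = soft_token_cap / cols;
        }
    }
    (cols * patch, rows * patch)
}

/// Patch count of a resized image, assuming one soft token per patch.
fn soft_tokens(dims: (u32, u32), patch: u32) -> u32 {
    (dims.0 / patch) * (dims.1 / patch)
}
```

Under these assumptions, a 1920x1080 frame with a cap of 70 resizes to 176x96 (an 11 x 6 grid, 66 patches), while the same input with a cap of 280 gets a 22 x 12 grid. Only the cap argument differs between the two calls.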
Model: factor the per-input project+trim+concat into `project_vision_features`, add `get_video_features` and `get_input_embeddings_with_video`, and run image then video then audio scatters on the running embeddings inside a shared `merge_multimodal` core. The image/audio/text paths run the identical op sequence as before, so their output is unchanged.
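The scatter order described above can be illustrated with a small, self-contained sketch. It uses plain `Vec<f32>` rows instead of MLX arrays, and the names `scatter_features` and `merge_multimodal_sketch` are hypothetical stand-ins, not the project's `merge_multimodal` or `merge::merge_llava`. The sketch assumes each modality owns one placeholder token id and that feature rows fill the matching positions in order.

```rust
/// Hypothetical sketch: overwrite the embedding rows at positions whose token
/// id equals `target_id` with the given feature rows, in prompt order.
fn scatter_features(
    embeds: &mut [Vec<f32>],
    token_ids: &[u32],
    target_id: u32,
    features: &[Vec<f32>],
) -> Result<(), String> {
    let slots: Vec<usize> = token_ids
        .iter()
        .enumerate()
        .filter_map(|(i, &t)| (t == target_id).then_some(i))
        .collect();
    if slots.len() != features.len() {
        return Err(format!(
            "{} placeholder positions but {} feature rows",
            slots.len(),
            features.len()
        ));
    }
    for (&pos, row) in slots.iter().zip(features) {
        embeds[pos] = row.clone();
    }
    Ok(())
}

/// Apply one scatter per modality on the same running embeddings, in the
/// order given (for example image, then video, then audio).
fn merge_multimodal_sketch(
    embeds: &mut [Vec<f32>],
    token_ids: &[u32],
    inputs: &[(u32, Vec<Vec<f32>>)],
) -> Result<(), String> {
    for (target_id, features) in inputs {
        scatter_features(embeds, token_ids, *target_id, features)?;
    }
    Ok(())
}
```

Because each scatter only touches positions carrying its own placeholder id, a request with no video leaves the image and audio results exactly as they would be without the video step.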
Prompt placement: the CLI now renders a `<|video|>` content part inside the user turn (gated to `gemma4_unified` so the ViT-backed `gemma4` VLM keeps its splice-after-BOS behavior byte-for-byte). The unified expansion `expand_gemma4_unified_video_tokens` frames each frame as `<boi> video_token*N <eoi>` and replaces the placeholder in prompt order (or splices after BOS when absent). The server gets the placeholder from the request content parts and `detect_model_media_support` now advertises video for `gemma4_unified` so the route guard admits the request.
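The expansion rule can be sketched on plain token ids. This is an illustration under stated assumptions, not the project's `expand_gemma4_unified_video_tokens`: the token id constants are made up, one placeholder is assumed to stand for one whole video, and a mismatch between placeholders and videos is treated as an error.

```rust
// Hypothetical token ids, chosen only for this sketch.
const BOS: u32 = 2;
const BOI: u32 = 10;
const EOI: u32 = 11;
const VIDEO_TOKEN: u32 = 12;
const VIDEO_PLACEHOLDER: u32 = 13;

/// One frame: <boi> video_token*n <eoi>.
fn frame_span(n: usize) -> Vec<u32> {
    let mut span = Vec::with_capacity(n + 2);
    span.push(BOI);
    span.extend(std::iter::repeat(VIDEO_TOKEN).take(n));
    span.push(EOI);
    span
}

/// `videos[i]` lists the soft-token count of each frame of video i.
fn expand_video_tokens(prompt: &[u32], videos: &[Vec<usize>]) -> Result<Vec<u32>, String> {
    if videos.is_empty() {
        return Ok(prompt.to_vec()); // nothing to expand
    }
    let expanded: Vec<Vec<u32>> = videos
        .iter()
        .map(|frames| frames.iter().flat_map(|&n| frame_span(n)).collect())
        .collect();
    let placeholders = prompt.iter().filter(|&&t| t == VIDEO_PLACEHOLDER).count();
    if placeholders == 0 {
        // No placeholder: splice every video right after BOS (or at the start).
        let split = usize::from(prompt.first() == Some(&BOS));
        let mut out = prompt[..split].to_vec();
        out.extend(expanded.into_iter().flatten());
        out.extend_from_slice(&prompt[split..]);
        return Ok(out);
    }
    if placeholders != videos.len() {
        return Err(format!("{} placeholders for {} videos", placeholders, videos.len()));
    }
    // Replace placeholders in prompt order.
    let mut next = expanded.into_iter();
    let mut out = Vec::new();
    for &t in prompt {
        if t == VIDEO_PLACEHOLDER {
            out.extend(next.next().expect("counts checked above"));
        } else {
            out.push(t);
        }
    }
    Ok(out)
}
```

Keeping the per-frame `<boi>`/`<eoi>` wrapping in one helper means the placeholder path and the splice-after-BOS path emit identical spans, so the downstream scatter sees the same number of `video_token` positions either way.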
Mask fix: `overlay_block_bidirectional` panicked ("Cannot reshape array of size N into shape (1, window)") whenever a prefill exceeded the 1024-token sliding window, because the windowed base mask caps its key axis to the window while the block-id vector spans the full sequence. It now aligns the key-side ids to the trailing `window` positions the rotating cache retains. Each vision span is at most one frame and fits inside the window, so per-span bidirectional attention is preserved.
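The alignment can be shown with a plain boolean mask. This sketch is not the MLX implementation: it assumes the mask is a `[seq_len][key_len]` grid where `true` means "may attend", that `key_len = min(seq_len, window)`, that key column `j` corresponds to absolute position `seq_len - key_len + j`, and that block id 0 marks text. The function names are hypothetical.

```rust
/// Block ids for the keys the rotating cache still holds: the trailing
/// `window` positions (or the whole sequence if it is shorter).
fn trailing_key_ids(block_ids: &[u32], window: usize) -> &[u32] {
    let key_len = block_ids.len().min(window);
    &block_ids[block_ids.len() - key_len..]
}

/// Hypothetical sketch of the overlay: let every query inside a vision block
/// attend to all retained keys of the same block, on top of the base mask.
fn overlay_block_bidirectional_sketch(base: &mut [Vec<bool>], block_ids: &[u32], window: usize) {
    let key_ids = trailing_key_ids(block_ids, window);
    for (q, row) in base.iter_mut().enumerate() {
        // The key axis of the windowed mask is capped, so compare against the
        // trailing ids rather than the full-sequence id vector.
        assert_eq!(row.len(), key_ids.len(), "mask key axis must match retained keys");
        let q_id = block_ids[q];
        if q_id == 0 {
            continue; // text token: keep the causal base mask
        }
        for (allowed, &k_id) in row.iter_mut().zip(key_ids) {
            if k_id == q_id {
                *allowed = true;
            }
        }
    }
}
```

Using the full-sequence id vector on the key side is what produced the shape mismatch: with a 6-token sequence and a window of 4, the key axis has 4 columns but the id vector has 6 entries.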
Validation (mlx-community/gemma-4-12b-it-4bit, M1 Ultra): `--video car_video.mp4 --fps 0.6 -p "Describe this video."` returns "The video shows a car accident in a parking lot. A black SUV is moving forward and hits a silver car parked in a space..." Server `/v1/chat/completions` with a `video_url` content part returns "The video shows a black minivan and a silver sedan parked in a parking lot...". Text-only and single-image runs on the same model are unaffected. The default 2.0 fps on a 10s clip yields 20 frames (1377 tokens) which exceeds the model's 1024-token sliding window and degenerates during single-pass prefill; this is a pre-existing limitation reproducible with a text-only 1700-token prompt, independent of the video path. Lower `--fps` keeps a clip within the window.
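The frame budget can be estimated from the two reported runs. The per-frame cost and the text overhead below are not stated anywhere in this PR; they are inferred by solving the two data points (20 frames giving 1377 tokens, and the 6-frame run at `--fps 0.6` reported as 425 prompt tokens in the PR summary), which gives 68 tokens per frame and 17 tokens of overhead for this clip and prompt. Other clips or prompts will differ, so treat this as a rough sketch.

```rust
// Inferred from the two reported runs, not documented values:
//   20 * per_frame + overhead = 1377
//    6 * per_frame + overhead = 425
// => per_frame = 68 (for example 66 soft tokens plus <boi> and <eoi>), overhead = 17.
const PER_FRAME_TOKENS: usize = 68;
const TEXT_OVERHEAD_TOKENS: usize = 17;
const SLIDING_WINDOW: usize = 1024;

/// Estimated prompt length for a clip sampled to `frames` frames.
fn prompt_tokens(frames: usize) -> usize {
    TEXT_OVERHEAD_TOKENS + frames * PER_FRAME_TOKENS
}

/// Largest frame count whose prompt still fits inside the sliding window.
fn max_frames_in_window() -> usize {
    SLIDING_WINDOW.saturating_sub(TEXT_OVERHEAD_TOKENS) / PER_FRAME_TOKENS
}
```

With these inferred numbers, 14 frames (969 tokens) is the most that fits in the 1024-token window. For the 10s clip that corresponds to roughly 1.4 fps or lower, assuming the frame count is about `fps * duration`.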
Closes #164
Two comments introduced by the video-input change used em dashes, which the project style avoids. Replace them with a period and a colon respectively. Comment-only; no behavior change.
PR Finalization Complete
Summary
Tests: All 4 Documentation: Lint/Format: Em dashes: None found anywhere in the PR diff (Unicode scan confirmed). No commits needed. Branch is up to date with origin at |
Summary
Add video input support to the encoder-free Gemma 4 Unified (`gemma4_unified`) runtime, replacing the previous "video input is not yet supported" gate. Video is handled as images-per-frame: frames are extracted with `ffmpeg` (uniform sampling, default 2.0 fps via the shared `multimodal::video` module), patchified through the existing patch-projection vision embedder, and the per-frame soft tokens scatter into `video_token_id` placeholder spans, mirroring the image scatter. Works on both the CLI (`--video`) and the server (`video_url` content blocks).
What changed
- `src/vision/processors/gemma4_unified.rs`: factored the image patch loop into a shared `patchify(image, soft_token_cap)` and added `preprocess_video_frames`, which caps each frame at `vision_soft_tokens_per_video_frame` (70 by default) instead of the larger per-image budget (280). `resize_dims` now takes the cap as a parameter so image and video share one loop. New `DEFAULT_VIDEO_SOFT_TOKENS_PER_FRAME` constant and `set_video_soft_tokens_per_frame` setter.
- `src/vision/gemma4_unified_config.rs`: new `vision_soft_tokens_per_video_frame` field (default 70), wired into the processor by the loader (`src/loading/vlm_gemma_unified.rs`).
- `src/vision/gemma4_unified.rs`: factored the per-input project/trim/concat into `project_vision_features`; added `get_video_features` and `get_input_embeddings_with_video`; the image/video/audio scatters now run on the running embeddings inside a shared `merge_multimodal` core. The image/audio/text paths run the identical op sequence as before, so their output is unchanged.
- `src/multimodal/vlm_runtime.rs`: new `expand_gemma4_unified_video_tokens` frames each frame as `<boi> video_token*N <eoi>` and either replaces a `<|video|>` placeholder in prompt order or splices after BOS when absent. Per-frame soft tokens use `video_token_id` (not `image_token_id`), matching the encoder-free scatter target.
- `src/commands/generate.rs`, `generate_vlm.rs`: route `--video` for `Gemma4Unified` to a new `compute_gemma4_unified_video_embeddings`; render a `<|video|>` content part inside the user turn, gated to `gemma4_unified` (via `cli_video_content_part_count`) so the ViT-backed `gemma4` VLM path keeps its splice-after-BOS behavior byte-for-byte. New `ChatTemplateProcessor::supports_video_content`.
- `src/server/model_worker.rs`, `startup.rs`: removed the `Gemma4Unified` "not yet supported" gate and added `prepare_gemma4_unified_video_embeddings`; `detect_model_media_support` now advertises video for `gemma4_unified` so the route guard admits `video_url` requests. The server gets the `<|video|>` placeholder from the request content parts.
- `src/models/gemma4.rs`: `overlay_block_bidirectional` panicked ("Cannot reshape array of size N into shape (1, window)") whenever a prefill exceeded the 1024-token sliding window, because the windowed base mask caps its key axis to the window while the block-id vector spans the full sequence. It now aligns the key-side ids to the trailing `window` positions the rotating cache retains. Each vision span is at most one frame and fits inside the window, so per-span bidirectional attention is preserved.
- `docs/supported-models.md`: mark video as supported for `gemma4_unified` and document the frame budget / sliding-window caveat.
Reused infra (no reinvention)
- `multimodal::video` (`ffmpeg_available`, `load_videos`, `load_video_source`, `DEFAULT_FPS`) for decode and FPS.
- The existing `Gemma4UnifiedVisionEmbedder` + `Gemma4MultimodalEmbedder` patch projector.
- `merge::merge_llava` for the scatter (now keyed on `video_token_id`).
- The existing `gemma4_unified_mask` blockwise bidirectional builder (video tokens were already type 2 and treated as vision).
- `decode_request_images` and the server media resolver / fd-backed `VideoSource` for the server path.
Validation (real model: `mlx-community/gemma-4-12b-it-4bit`, M1 Ultra)
CLI, grounded description (6 frames at `--fps 0.6`, 425 prompt tokens): "The video shows a car accident in a parking lot. A black SUV is moving forward and hits a silver car parked in a space..."
Server `/v1/chat/completions` with a `video_url` content part (startup log: `model_type=Gemma4Unified: enabling video_url content block support`): "The video shows a black minivan and a silver sedan parked in a parking lot..."
Regression checks on the same model are unaffected: text-only returns a correct short answer, and a single-image run returns "A solid, dark reddish-brown square." for a solid-red fixture.
Known limitation (pre-existing, not introduced here)
The acceptance command at the default 2.0 fps on the 10s `car_video.mp4` yields 20 frames (1377 prompt tokens), which exceeds the model's 1024-token sliding window. Single-pass prefill of a sequence longer than the window degenerates to `<pad>` output because the rotating KV cache evicts the keys the earliest query rows need (their windowed mask row becomes all-masked, producing NaN that propagates). This reproduces with a text-only ~1700-token coherent prompt and is independent of the video path (text generation never touches the bidirectional overlay). It also reproduces on the existing `gemma4` VLM video path. Lowering `--fps` (or the frame count) keeps a clip within the window and yields a grounded description, as shown above. A proper fix (chunked prefill for sliding-window models) is a separate follow-up.
Tests
Narrow unit tests added beside the code:
- `preprocess_video_frames` per-frame shape + soft-token count caps at `vision_soft_tokens_per_video_frame`; image path unchanged by the video budget.
- `vision_soft_tokens_per_video_frame` defaults to 70 and honors an explicit value.
- `expand_gemma4_unified_video_tokens` placeholder-replace, splice-after-BOS, count-mismatch error, empty no-op.
- `eoi`/`boi` get distinct bidirectional blocks; `overlay_block_bidirectional` aligns block ids to a windowed (capped) key axis without panicking.
- The `<|video|>` content part is rendered inside the user turn (image-then-video order) and omitted when the template lacks video support.
- `detect_model_media_support` enables video for `gemma4_unified`.
Test plan
- `cargo build --release --features metal,accelerate -p mlxcel`
- `cargo test --release -p mlxcel --lib gemma4_unified` (33 passed) and the new lib tests (38 passed for the combined filter)
- `cargo test --release -p mlxcel --bin mlxcel` (125 passed; the one failure `tests::family_order_is_exhaustive` is pre-existing on `main` and unrelated: it flags `BitNet` missing from `FAMILY_ORDER`)
- `cargo clippy --features metal,accelerate -p mlxcel --bins --tests -- -D warnings`
- `cargo fmt --check`
- `/v1/chat/completions` video content part produces a grounded description
Closes #164