
Adding VLM support#25

Open
sanchitmonga22 wants to merge 15 commits into main from VLM

Conversation

Collaborator

sanchitmonga22 commented Mar 13, 2026

Summary

  • Adds MetalRT VLM backend with automatic fallback to llama.cpp VlmEngine
  • MetalRTLoader resolves 11 vision function pointers (vision_create/destroy/load/analyze/analyze_stream/generate/generate_stream/reset/model_name/device_name/free_result)
  • vlm_init_locked() tries MetalRT first (checks for Qwen3-VL safetensors model), falls back to llama.cpp GGUF transparently
  • rcli_vlm_analyze() and handle_screen_intent() route through MetalRT when active, with GPU mutex serialization
  • Stats reporting handles both backends

Depends on

  • RunanywhereAI/MetalRT#28 (Vision C API)

Test plan

  • Verify MetalRT VLM loads when safetensors model is at ~/Library/RCLI/models/metalrt/Qwen3-VL-2B-MLX-4bit/
  • Verify llama.cpp fallback when MetalRT dylib or model is missing
  • Test rcli vlm analyze <image> end-to-end with MetalRT backend
  • Test screen capture intent triggers MetalRT vision analysis
  • Verify rcli_vlm_get_stats returns correct timing for both backends

sanchitmonga22 and others added 14 commits March 13, 2026 00:42
  • Implements screencapture CLI wrapper, active window detection, terminal PID skip logic, app tracking via polling, and a visual overlay mode via a subprocess helper (rcli_overlay) communicating over stdin/stdout pipes.
  • Separate process with its own AppKit event loop that shows a draggable/resizable green-bordered transparent overlay window; communicates with the parent RCLI via a stdin/stdout pipe protocol.
  • Adds screen_capture.mm to the rcli library sources, adds rcli_overlay as a standalone executable target, and links the CoreGraphics framework.
  • Detects screen-related voice intents via keyword combinations, captures the screen (overlay or behind-terminal), analyzes it with the VLM, and speaks the response using sentence-level streaming TTS for low TTFA.
  • [S] key toggles the visual overlay on/off, the status bar shows the active state, and screen/visual text commands trigger capture + VLM analysis; switched from rcli_speak to rcli_speak_streaming for lower TTFA.
  • Adds vision function pointer resolution to MetalRTLoader and routes VLM calls through MetalRT when the dylib is loaded and the Qwen3-VL safetensors model is installed; falls back transparently to the llama.cpp VlmEngine if MetalRT is unavailable.
  • Registers Qwen3-VL-2B in the MetalRT component models, auto-downloads it from HuggingFace in vlm_init_locked() when model files are missing, and updates the setup/download UI to handle the VLM component type.
  • Always uses reset_cache=true for KV cache continuation in both rcli_process_command and rcli_process_and_speak. The incremental path (reset_cache=false) was unsafe because metalrt_kv_continuation_len tracked text length but not generated response tokens, causing duplicate content in the KV cache and corrupted attention on turn 2+.
  • Visual mode (S key) now swaps the MetalRT LLM out and the VLM in on the GPU, avoiding dual-model GPU corruption. Voice commands in visual mode route through VLM screen capture + streaming analysis with TTS. Exiting visual mode unloads the VLM and restores the LLM with a re-cached system prompt. New C API: rcli_vlm_enter(), rcli_vlm_exit(), rcli_vlm_analyze_stream().
  • Overlay styling: 6px solid rounded border with outer glow, 18px corner grab handles with white center dots, centered green label pill, 120x80 minimum size.
@AmanSwar
Collaborator

Latest changes pushed

GPU model swap for VLM visual mode

  • S key now swaps MetalRT LLM ↔ VLM on the GPU (only one model at a time, no dual-model corruption)
  • Voice commands in visual mode route to VLM screen capture + streaming analysis + TTS
  • Exiting visual mode unloads VLM and restores LLM with re-cached system prompt
  • New C API: rcli_vlm_enter(), rcli_vlm_exit(), rcli_vlm_analyze_stream()

Multi-turn KV cache fix

  • Always use reset_cache=true for continuation — fixes garbage output on turn 2+ caused by generated tokens not tracked in metalrt_kv_continuation_len

Overlay improvements

  • 6px solid rounded border with outer glow (was 4px dashed)
  • 18px corner grab handles with white center dots
  • Centered green label pill
  • 120x80 minimum size

Depends on

  • MetalRT fix: EOS check before streaming callback (RunanywhereAI/MetalRT fix/restore-threadgroup-memory branch)
