
Adding VLM support#25

Open
sanchitmonga22 wants to merge 15 commits into main from VLM

Conversation

Collaborator

sanchitmonga22 commented Mar 13, 2026

Summary

  • Adds MetalRT VLM backend with automatic fallback to llama.cpp VlmEngine
  • MetalRTLoader resolves 11 vision function pointers (vision_create/destroy/load/analyze/analyze_stream/generate/generate_stream/reset/model_name/device_name/free_result)
  • vlm_init_locked() tries MetalRT first (checks for Qwen3-VL safetensors model), falls back to llama.cpp GGUF transparently
  • rcli_vlm_analyze() and handle_screen_intent() route through MetalRT when active, with GPU mutex serialization
  • Stats reporting handles both backends

Depends on

  • RunanywhereAI/MetalRT#28 (Vision C API)

Test plan

  • Verify MetalRT VLM loads when safetensors model is at ~/Library/RCLI/models/metalrt/Qwen3-VL-2B-MLX-4bit/
  • Verify llama.cpp fallback when MetalRT dylib or model is missing
  • Test rcli vlm analyze <image> end-to-end with MetalRT backend
  • Test screen capture intent triggers MetalRT vision analysis
  • Verify rcli_vlm_get_stats returns correct timing for both backends

sanchitmonga22 and others added 14 commits March 13, 2026 00:42
  • Implements screencapture CLI wrapper, active window detection, terminal PID skip logic, app tracking via polling, and a visual overlay mode via a subprocess helper (rcli_overlay) communicating over stdin/stdout pipes.
  • Separate process with its own AppKit event loop that shows a draggable/resizable green-bordered transparent overlay window; communicates with the parent RCLI via a stdin/stdout pipe protocol.
  • Adds screen_capture.mm to the rcli library sources, adds rcli_overlay as a standalone executable target, and links the CoreGraphics framework.
  • Detects screen-related voice intents via keyword combinations, captures the screen (overlay or behind-terminal), analyzes it with the VLM, and speaks the response using sentence-level streaming TTS for low TTFA.
  • [S] key toggles the visual overlay on/off, the status bar shows the active state, and screen/visual text commands trigger capture + VLM analysis; switched from rcli_speak to rcli_speak_streaming for lower TTFA.
  • Adds vision function pointer resolution to MetalRTLoader and routes VLM calls through MetalRT when the dylib is loaded and the Qwen3-VL safetensors model is installed; falls back transparently to the llama.cpp VlmEngine if MetalRT is unavailable.
  • Registers Qwen3-VL-2B in the MetalRT component models, auto-downloads it from HuggingFace in vlm_init_locked() when model files are missing, and updates the setup/download UI to handle the VLM component type.
  • Always uses reset_cache=true for KV cache continuation in both rcli_process_command and rcli_process_and_speak. The incremental path (reset_cache=false) was unsafe because metalrt_kv_continuation_len tracked text length but not generated response tokens, causing duplicate content in the KV cache and corrupted attention on turn 2+.
  • Visual mode (S key) now swaps the MetalRT LLM out and the VLM in on the GPU, avoiding dual-model GPU corruption. Voice commands in visual mode route through VLM screen capture + streaming analysis with TTS. Exiting visual mode unloads the VLM and restores the LLM with a re-cached system prompt. New C API: rcli_vlm_enter(), rcli_vlm_exit(), rcli_vlm_analyze_stream().
  • Overlay styling: 6px solid rounded border with outer glow, 18px corner grab handles with white center dots, centered green label pill, 120x80 minimum size.
@AmanSwar
Collaborator

Latest changes pushed

GPU model swap for VLM visual mode

  • S key now swaps MetalRT LLM ↔ VLM on the GPU (only one model at a time, no dual-model corruption)
  • Voice commands in visual mode route to VLM screen capture + streaming analysis + TTS
  • Exiting visual mode unloads VLM and restores LLM with re-cached system prompt
  • New C API: rcli_vlm_enter(), rcli_vlm_exit(), rcli_vlm_analyze_stream()

Multi-turn KV cache fix

  • Always use reset_cache=true for continuation — fixes garbage output on turn 2+ caused by generated tokens not tracked in metalrt_kv_continuation_len

Overlay improvements

  • 6px solid rounded border with outer glow (was 4px dashed)
  • 18px corner grab handles with white center dots
  • Centered green label pill
  • 120x80 minimum size

Depends on

  • MetalRT fix: EOS check before streaming callback (RunanywhereAI/MetalRT fix/restore-threadgroup-memory branch)
