Implements screencapture CLI wrapper, active window detection, terminal PID skip logic, app tracking via polling, and visual overlay mode via a subprocess helper (rcli_overlay) communicating over stdin/stdout pipes.
Separate process with its own AppKit event loop that shows a draggable/resizable green-bordered transparent overlay window. Communicates with parent RCLI via stdin/stdout pipe protocol.
Add screen_capture.mm to rcli library sources, rcli_overlay as a standalone executable target, and link CoreGraphics framework.
Detect screen-related voice intents via keyword combinations, capture screen (overlay or behind-terminal), analyze with VLM, and speak response using sentence-level streaming TTS for low TTFA.
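The keyword-combination check described above could look roughly like the sketch below. The word lists and the function name `is_screen_intent` are assumptions made for illustration; the PR does not list the actual keywords, and the real matcher may be stricter than plain substring search.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical keyword sets; the real lists in the PR are not shown.
const std::vector<std::string> kActionWords = {"look", "see", "read", "what"};
const std::vector<std::string> kTargetWords = {"screen", "window", "display"};

// Lowercase a copy of the utterance so matching is case-insensitive.
static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// True if any word in `words` occurs as a substring of `text`.
static bool contains_any(const std::string& text, const std::vector<std::string>& words) {
    return std::any_of(words.begin(), words.end(), [&](const std::string& w) {
        return text.find(w) != std::string::npos;
    });
}

// An utterance counts as a screen intent only when it combines
// an action word with a target word.
bool is_screen_intent(const std::string& utterance) {
    const std::string text = to_lower(utterance);
    return contains_any(text, kActionWords) && contains_any(text, kTargetWords);
}
```

Requiring a word from both sets keeps a bare mention of "screen" or a generic "what time is it" from triggering a capture. Substring matching is deliberately naive here (for example, "see" also matches "seen").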
[S] key toggles visual overlay on/off, status bar shows active state, screen/visual text commands trigger capture + VLM analysis, switched from rcli_speak to rcli_speak_streaming for lower TTFA.
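Sentence-level streaming lowers time to first audio because TTS can start on the first complete sentence while the model is still generating. A minimal sketch of that chunking step, assuming a hypothetical `SentenceChunker` class (this is not the code behind `rcli_speak_streaming`, whose internals are not shown in the PR):

```cpp
#include <string>
#include <vector>

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const std::string::size_type b = s.find_first_not_of(" \t\n");
    if (b == std::string::npos) return "";
    const std::string::size_type e = s.find_last_not_of(" \t\n");
    return s.substr(b, e - b + 1);
}

// Hypothetical helper: buffers streamed text and emits whole sentences.
class SentenceChunker {
public:
    // Append newly generated text; returns the sentences it completed.
    std::vector<std::string> push(const std::string& piece) {
        std::vector<std::string> done;
        for (char c : piece) {
            buf_ += c;
            // Naive boundary rule: ., ! or ? ends a sentence.
            if (c == '.' || c == '!' || c == '?') {
                const std::string sentence = trim(buf_);
                buf_.clear();
                if (!sentence.empty()) done.push_back(sentence);
            }
        }
        return done;
    }

    // Call at end of stream; returns any trailing partial sentence.
    std::string finish() {
        const std::string rest = trim(buf_);
        buf_.clear();
        return rest;
    }

private:
    std::string buf_;
};
```

Each string returned by `push` would be handed to the TTS engine immediately. The boundary rule is simplistic and would split on decimals such as "3.5"; a production splitter needs more care.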
Adds vision function pointer resolution to MetalRTLoader and routes VLM calls through MetalRT when the dylib is loaded and the Qwen3-VL safetensors model is installed. Falls back to llama.cpp VlmEngine transparently if MetalRT is unavailable.
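The all-or-nothing resolution and fallback can be sketched as follows. The symbol names are assumed from the shorthand in the PR summary (the exported names in the MetalRT dylib may carry a different prefix), and `SymbolLookup`, `resolve_vision_symbols`, and `choose_backend` are hypothetical names. The lookup is injected so the sketch does not depend on a real dylib; in practice it would presumably wrap `dlsym` on the loaded handle.

```cpp
#include <array>
#include <functional>
#include <map>
#include <string>
#include <utility>

enum class VlmBackend { MetalRT, LlamaCpp };

// Stand-in for dlsym(handle, name): returns nullptr when a symbol is missing.
using SymbolLookup = std::function<void*(const std::string&)>;

// Assumed symbol names, expanded from the PR summary's shorthand list.
const std::array<const char*, 11> kVisionSymbols = {
    "vision_create", "vision_destroy", "vision_load", "vision_analyze",
    "vision_analyze_stream", "vision_generate", "vision_generate_stream",
    "vision_reset", "vision_model_name", "vision_device_name",
    "vision_free_result"};

// Resolve every vision symbol or none: `out` is only written on full success,
// so callers never see a half-populated function table.
bool resolve_vision_symbols(const SymbolLookup& lookup,
                            std::map<std::string, void*>& out) {
    std::map<std::string, void*> resolved;
    for (const char* name : kVisionSymbols) {
        void* fn = lookup(name);
        if (fn == nullptr) return false;
        resolved[name] = fn;
    }
    out = std::move(resolved);
    return true;
}

// MetalRT is used only when the symbols resolved and the model is installed;
// anything else falls back to the llama.cpp engine.
VlmBackend choose_backend(bool symbols_resolved, bool model_installed) {
    return (symbols_resolved && model_installed) ? VlmBackend::MetalRT
                                                 : VlmBackend::LlamaCpp;
}
```

Treating a single missing symbol as "MetalRT vision unavailable" keeps the fallback transparent: an older dylib without vision support simply routes to llama.cpp.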
Register Qwen3-VL-2B in MetalRT component models, auto-download from HuggingFace in vlm_init_locked() when model files are missing, and update setup/download UI to handle VLM component type.
Always use reset_cache=true for KV cache continuation in both rcli_process_command and rcli_process_and_speak. The incremental path (reset_cache=false) was unsafe because metalrt_kv_continuation_len tracked text length but not generated response tokens, causing duplicate content in the KV cache and corrupted attention on turn 2+.
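A toy model of the failure may help reviewers. It assumes that each turn re-sends the full conversation and that the tracked continuation length advances only on prefill, never during generation. `ToyKvCache` and its members are illustrative names, not the MetalRT API.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Toy stand-in for a KV cache; each string represents a span of tokens.
struct ToyKvCache {
    Tokens cached;            // what the cache actually holds
    std::size_t tracked = 0;  // what the caller believes is cached

    // Prefill the conversation so far. With reset_cache the cache is rebuilt
    // from scratch; otherwise only the part past `tracked` is appended.
    void prefill(const Tokens& full_prompt, bool reset_cache) {
        if (reset_cache) {
            cached.clear();
            tracked = 0;
        }
        cached.insert(cached.end(),
                      full_prompt.begin() + static_cast<std::ptrdiff_t>(tracked),
                      full_prompt.end());
        tracked = full_prompt.size();
    }

    // Generation writes response tokens into the cache, but (as in the bug)
    // `tracked` is not advanced to account for them.
    void generate(const std::string& response) { cached.push_back(response); }
};
```

In this model, an incremental turn 2 with prompt `{sys, u1, r1, u2}` appends `r1` a second time, leaving `{sys, u1, r1, r1, u2}` in the cache, whereas the reset path rebuilds it as `{sys, u1, r1, u2}`. Always resetting costs a full re-prefill per turn in exchange for correctness.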
Visual mode (S key) now swaps the MetalRT LLM out and the VLM in on the GPU, avoiding dual-model GPU corruption. Voice commands in visual mode route through VLM screen capture + streaming analysis with TTS. Exiting visual mode unloads the VLM and restores the LLM with a re-cached system prompt. New C API: rcli_vlm_enter(), rcli_vlm_exit(), rcli_vlm_analyze_stream().
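The swap can be pictured as a single GPU slot that holds at most one model at a time. The sketch below is a hypothetical illustration of that invariant; `ModelSwapper` and its methods are invented names and do not reflect the signatures of `rcli_vlm_enter()` or `rcli_vlm_exit()`, which are not shown here.

```cpp
#include <string>
#include <vector>

enum class Resident { None, Llm, Vlm };

// Hypothetical model of the one-model-on-GPU rule. The log records the
// order of operations so the sequence can be inspected.
class ModelSwapper {
public:
    Resident resident() const { return resident_; }
    const std::vector<std::string>& log() const { return log_; }

    // Entering visual mode: LLM out, VLM in. Idempotent.
    void enter_visual_mode() {
        if (resident_ == Resident::Vlm) return;
        unload();
        load(Resident::Vlm);
    }

    // Leaving visual mode: VLM out, LLM back in, then the system prompt
    // is cached again because the LLM's previous state was dropped.
    void exit_visual_mode() {
        if (resident_ != Resident::Vlm) return;
        unload();
        load(Resident::Llm);
        log_.push_back("recache system prompt");
    }

private:
    static const char* name(Resident m) {
        return m == Resident::Llm ? "llm" : "vlm";
    }
    void unload() {
        if (resident_ == Resident::None) return;
        log_.push_back(std::string("unload ") + name(resident_));
        resident_ = Resident::None;
    }
    void load(Resident m) {
        log_.push_back(std::string("load ") + name(m));
        resident_ = m;
    }

    Resident resident_ = Resident::Llm;  // the LLM starts out resident
    std::vector<std::string> log_;
};
```

Unloading always precedes loading, so the two models are never resident together, which is the condition the PR identifies as causing GPU corruption.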
6px solid rounded border with outer glow, 18px corner grab handles with white center dots, centered green label pill, 120x80 minimum size.
Latest changes pushed:
- GPU model swap for VLM visual mode
- Multi-turn KV cache fix
- Overlay improvements
Depends on
Summary
- MetalRTLoader resolves 11 vision function pointers (vision_create/destroy/load/analyze/analyze_stream/generate/generate_stream/reset/model_name/device_name/free_result)
- vlm_init_locked() tries MetalRT first (checks for the Qwen3-VL safetensors model) and falls back to llama.cpp GGUF transparently
- rcli_vlm_analyze() and handle_screen_intent() route through MetalRT when active, with GPU mutex serialization

Depends on
Test plan
- ~/Library/RCLI/models/metalrt/Qwen3-VL-2B-MLX-4bit/
- rcli vlm analyze <image> end-to-end with MetalRT backend
- rcli_vlm_get_stats returns correct timing for both backends