Skip to content

jd-opensource/JoyAI-VL-Interaction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

20 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

JoyAI-VL-Interaction banner

JoyAI-VL-Interaction

⚑ An Open Real-time Video-Language Interaction System

An 8B-scale, fully open vision-language interaction model with a complete deployable system β€” the model, training recipe, time-aligned interaction data, and a real-time streaming stack, all in one repository.

πŸ“„ arXiv | πŸš€ Blog | πŸ“‘ Report | πŸ’» Code | πŸ€— Model | πŸ“¦ Dataset

πŸš€ Quick Start | 🧩 Capability | πŸ“Š Evaluation | πŸ“ Citation

8B Model Python 3.12 vLLM CUDA 12.x Apache 2.0 Sub-second latency

δΈ­ζ–‡ζ–‡ζ‘£: README.zh-CN.md

πŸ”₯ News

  • [2026-06-20] πŸŽ‰ Full open-source release β€” model weights, deployable system, and technical report are now available.
  • [TODO] Release aligned interaction training data.
  • [TODO] HuggingFace model & dataset pages.
intro-joyai-vl-interaction_under10mb.mp4

✨ Introduction

JoyAI-VL-Interaction overview

The most important moments rarely wait for you to ask. A pot boils over while your hands are full. A toddler wanders toward the stove. The best moment of the game is gone before you can react. Today's AI can't help with moments like these β€” these models are turn-based by design: they sit quietly until you address them, then answer the question you asked.

We think the next step is a model that's present like a person: one that watches what's happening now, decides on its own when a moment is worth a word, speaks up when it matters and stays quiet when it doesn't, and hands off to a stronger model when a problem is hard.

JoyAI-VL-Interaction is an 8B-scale, vision-first interaction model released together with its training recipe, its data, and a complete deployable system β€” all fully open. Point a webcam or a livestream at it and it's immediately present in the scene, watching and responding in real time.

🌟 Key Features

Feature Description
⚑ Real-time Presence Watches continuously and responds in under a second when needed.
πŸ‘οΈ Vision-triggered Proactivity Speaks from what it sees, while staying quiet when nothing matters.
πŸ€– Agent Delegation Hands hard subtasks to a background model, API, or agent while continuing to watch the stream.
πŸ”“ Fully Open Stack Model, data, training recipe, and deployable system β€” all released for full reproducibility.

πŸš€ Quick Start

git clone https://github.com/jd-opensource/JoyAI-VL-Interaction.git
cd JoyAI-VL-Interaction

# Install dependencies
./install/install.sh --with-all

# Download all model weights
./install/download-models.sh --all

# Start the core services
./services/scripts/run.sh minimal

Then open https://127.0.0.1:8099 in your browser.

πŸ‘‰ For the full setup (ASR, TTS, background agent) and configuration details, see the Getting Started Guide.

πŸ› οΈ System Architecture

JoyAI-VL-Interaction system architecture

At the core of JoyAI-VL-Interaction is one decision the model makes on its own, every second: speak, stay silent, or delegate. We build it on JoyAI-VL-8B and keep speech as pluggable input/output, so the model's only job is to watch and judge the right moment to act. A predictive video codec (AdaCodec) spends only a handful of tokens on each predictable frame and saves full detail for the moments the scene actually changes, keeping the token budget manageable over long streams. The behavior is learned from more than four million time-aligned clips, and refined with reinforcement learning.

Around the model we build a complete, deployable system:

Component Summary
🧠 Model JoyAI-VL-Interaction: the first open vision-language interaction model.
πŸ“Š Data 4M time-aligned interaction samples, with clear gains from scaling further.
βš™οΈ System Five pluggable services β€” inference, WebUI, ASR, TTS, background agent β€” running on standard vLLM infrastructure.

πŸ“ For the full architecture diagram and component details, see the Architecture Guide.

🧩 Capability

Once interactivity is trained into the model itself rather than bolted on by an external harness, a whole class of capabilities comes naturally β€” being present, acting at the right moment, sensing time, and remembering across a long stream.

JoyAI-VL-Interaction capability grid

Beyond the nine capabilities above, JoyAI-VL-Interaction can call a live game as it's played, guide you through a recipe step by step while you cook, or generate danmaku-style live comments over a stream on its own. Explore more video demos in the Capability section of the blog.

πŸ“Š Evaluation

We evaluate JoyAI-VL-Interaction in 58 real, event-driven visual interaction settings, judged pairwise by human raters for both response quality and timing.

JoyAI-VL-Interaction vs Doubao

Aspect JoyAI-VL-Interaction Tie Doubao
Monitoring and alerting 100.0% 0.0% 0.0%
Real-time counting 70.0% 30.0% 0.0%
Real-time translation 80.0% 20.0% 0.0%
Time awareness 80.0% 10.0% 10.0%
Live commentary and guidance 55.6% 22.2% 22.2%
Long visual memory 77.8% 22.2% 0.0%
Overall 77.6% 17.2% 5.2%

JoyAI-VL-Interaction vs Gemini

Aspect JoyAI-VL-Interaction Tie Gemini
Monitoring and alerting 100.0% 0.0% 0.0%
Real-time counting 100.0% 0.0% 0.0%
Real-time translation 100.0% 0.0% 0.0%
Time awareness 50.0% 40.0% 10.0%
Live commentary and guidance 100.0% 0.0% 0.0%
Long visual memory 77.8% 22.2% 0.0%
Overall 87.9% 10.3% 1.7%

🚧 Limitations and Future Work

Limitations. We want to be upfront about scale. The video-call assistants we compare against, Doubao and Gemini, are backed by far larger models and polished through years of product iteration against real users; they are comprehensive, broadly knowledgeable, and hard to beat on open-ended chat, personal style, and the long tail of everyday requests. JoyAI-VL-Interaction is a compact 8B model, and we don't claim to match them everywhere. What we have done is pry open a door: in the advantage zone of a vision-language interaction model, real-time presence, vision-triggered proactivity, and a sense of time across a stream, a far smaller open model already comes out ahead. That a compact, open model can do this against large, heavily optimized products is exactly why we're excited to put this work in front of the community.

What is next. And we think this is only the beginning. The interaction data we trained on is still small, yet even this much was enough for capabilities we never explicitly taught, like guiding a shopper through changing app screens, to emerge on their own; we're convinced the headroom is large, and that scaling this kind of time-aligned data, together with the recipe and the system, will take the model much further. The moment we are reaching for is an everyday one: you come home worn out after a long day, and before you have said a word, a quiet voice notices and offers, "I can see you're tired; today must have been hard on you." Presence like that, given unasked, is what an interaction model makes possible and a turn-based one, waiting to be addressed, never can. We have released the whole stack openly, the 8B model, the time-aligned data, the training recipe, and the deployable system, to lower the barrier for everyone working in this direction. We'd love for you to explore, with us, what a model that is truly present in the world can become.

πŸ“‚ Repository Layout

.
β”œβ”€β”€ services/
β”‚   β”œβ”€β”€ scripts/           # Service orchestration entrypoints (run/stop)
β”‚   β”œβ”€β”€ webinfer/          # Real-time video inference (OpenAI-compatible API)
β”‚   β”œβ”€β”€ webui/             # Browser frontend + WebRTC streaming
β”‚   β”œβ”€β”€ asr/               # Speech recognition adapter (Qwen3-ASR)
β”‚   β”œβ”€β”€ tts/               # Speech synthesis adapter (Qwen3-TTS)
β”‚   └── background-agent/  # Background task delegation agent
β”œβ”€β”€ install/               # Install scripts, dependency setup, model downloads
β”œβ”€β”€ doc/
β”‚   β”œβ”€β”€ architecture.md          # System architecture & data flow
β”‚   β”œβ”€β”€ getting_started.md       # Full deployment guide
β”‚   β”œβ”€β”€ rtsp_streaming.md        # Local RTSP stream testing guide
β”‚   └── *.zh-CN.md               # Chinese documentation mirrors
β”œβ”€β”€ img/                   # Diagrams and figures
β”œβ”€β”€ README.md
β”œβ”€β”€ README.zh-CN.md
β”œβ”€β”€ LICENSE
└── JoyAI-VL-Interaction-Reportv1.pdf

πŸ“‹ TODO

  • Release interaction model blog
  • Release deployable system code
  • Release technical report
  • Release time-aligned interaction training data
  • HuggingFace model & dataset pages

πŸ“ Citation

If JoyAI-VL-Interaction helps your research or products, please cite:

@techreport{joyai2026vlinteraction,
  title        = {JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence},
  author       = {{Video Understanding Team of JoyAI-VL @ Joy Future Academy, JD}},
  institution  = {Joy Future Academy, JD},
  year         = {2026},
  month        = {June}
}

πŸ“„ License

This project is licensed under the Apache License 2.0. See LICENSE for details.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors