LiveKit Agent Backend - Visually-Impaired Video Voice Assistant

A production-ready LiveKit Python agent backend that provides real-time video obstacle detection and voice assistance for visually impaired users. The agent processes video frames for obstacle detection using Gemini Vision and provides bidirectional voice interaction using Voxtral STT, Gemini Live Realtime, and ElevenLabs TTS.

Architecture Overview

The agent uses LiveKit's native Live Video input with Gemini Live Realtime, enabling the model to see and understand video frames automatically. The architecture integrates:

Live Video Input: Enabled via RoomInputOptions(video_enabled=True) - Gemini Live automatically receives video frames
- 1 frame/second while user speaks
- 1 frame every 3 seconds otherwise
- Frames are fit into 1024x1024 and encoded to JPEG
Obstacle Detection: Custom VisionAgent hooks into video frames for additional processing
- Processes frames using Gemini Vision API
- Publishes results via DataChannel for frontend consumption
Audio Pipeline: Voxtral STT → Gemini Live Realtime → MCP Tools → ElevenLabs TTS
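
As a rough illustration of the sampling and resizing rules listed above, the following sketch computes the interval between sampled frames and the size a frame would have after being fit into a 1024x1024 box. The function names are hypothetical and are not part of LiveKit or this repository. The sketch also assumes that frames already smaller than the box are not upscaled, which the description above does not specify.

```python
from typing import Tuple

MAX_SIDE = 1024  # frames are fit into a 1024x1024 box, per the list above


def sample_interval_seconds(user_speaking: bool) -> float:
    """Seconds between sampled frames: 1 s while the user speaks, 3 s otherwise."""
    return 1.0 if user_speaking else 3.0


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> Tuple[int, int]:
    """Scale (width, height) to fit inside a max_side square, keeping aspect ratio.

    Assumption: frames that already fit are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0:
        raise ValueError("frame dimensions must be positive")
    scale = min(max_side / width, max_side / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Under these assumptions a 1920x1080 camera frame is reduced to 1024x576 before JPEG encoding, while a 640x480 frame keeps its size.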

Mobile (WebRTC) → LiveKit Room → LiveKit Agent (Python)
                                    ↓
                    ┌───────────────┴───────────────┐
                    ↓                               ↓
        Gemini Live (with Live Video)      VisionAgent (obstacle detection)
                    ↓                               ↓
            MCP Tools → TTS              Gemini Vision → DataChannel
                    ↓
            WebRTC Audio Out
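
The right-hand branch of the diagram ends with obstacle results being sent over a DataChannel. A data channel carries raw bytes, so the results need some serialization. The sketch below shows one possible JSON encoding; the message schema, the field names, and the Obstacle class are assumptions made for illustration and may differ from what the repository actually sends.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class Obstacle:
    """Hypothetical shape of one detected obstacle."""
    label: str      # e.g. "stairs"
    direction: str  # e.g. "left", "center", "right"
    severity: str   # e.g. "low", "high"


def encode_obstacles(obstacles: List[Obstacle], timestamp: Optional[float] = None) -> bytes:
    """Serialize detection results to UTF-8 JSON bytes for a data channel."""
    message = {
        "type": "obstacle_detection",
        "timestamp": time.time() if timestamp is None else timestamp,
        "obstacles": [asdict(o) for o in obstacles],
    }
    return json.dumps(message).encode("utf-8")


def decode_obstacles(payload: bytes) -> List[Obstacle]:
    """Inverse of encode_obstacles, as a frontend or a test might use it."""
    message = json.loads(payload.decode("utf-8"))
    return [Obstacle(**item) for item in message["obstacles"]]
```

The resulting bytes would then be handed to the LiveKit SDK's data publishing call on the agent side and parsed with JSON on the frontend.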

Features

Native Live Video Input: Uses LiveKit's built-in Live Video input with Gemini Live Realtime
- Automatic video frame sampling and processing
- Gemini Live can see and understand video in real-time
Real-time Video Obstacle Detection: Processes video frames using Gemini Vision API
- Custom VisionAgent hooks for additional obstacle detection
- Results published via DataChannel for frontend
Bidirectional Voice Interaction: Natural voice conversation using Gemini Live Realtime
MCP Tool Integration: Extensible tool system via Model Context Protocol (MCP) servers
Google Maps Navigation: Built-in Google Maps MCP server for finding places and getting directions
WhatsApp Integration: Full WhatsApp API integration via an MCP server
- Search contacts, read messages, send messages
- Manage chats and conversations
- Send files and audio messages
Production-Ready: Error handling, logging, graceful shutdown, and Docker support
LiveKit Native: All media processing uses LiveKit SDK (no custom WebRTC servers)

Prerequisites

Python 3.12+
LiveKit Server (Cloud or self-hosted)
API Keys:
- Google API Key (for Gemini Vision and Realtime API)
- Google Maps API Key (optional, for navigation features - see Google Maps Integration Guide)
- Mistral API Key (for Voxtral STT)
- ElevenLabs API Key (for TTS)
- LiveKit API Key and Secret

Installation

Clone the repository:

git clone <repository-url>
cd ParisAIHackathon

Create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Or using uv (recommended):

uv pip install -r requirements.txt

Configure environment variables:

cp .env.example .env
# Edit .env with your API keys and configuration

Configuration

Environment Variables

See .env.example for all available configuration options. Key variables:

LIVEKIT_URL: LiveKit server WebSocket URL
LIVEKIT_API_KEY / LIVEKIT_API_SECRET: LiveKit credentials
GOOGLE_API_KEY: Google API key for Gemini
MISTRAL_API_KEY: Mistral API key for Voxtral STT
ELEVENLABS_API_KEY: ElevenLabs API key for TTS
MCP_SERVER_URLS: Comma-separated list of MCP server URLs
WHATSAPP_API_URL: WhatsApp API base URL (default: https://whatsapp-mcp-794750095859.europe-west1.run.app)
WHATSAPP_API_HEADERS: JSON object with headers for WhatsApp API requests (optional)
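
Since a missing key usually only surfaces at the first API call, it can help to validate the environment at startup. The following is a minimal sketch of such a check, not the repository's app/core/config.py. It assumes that the LiveKit, Google, Mistral, and ElevenLabs variables listed above are all mandatory; the names REQUIRED_VARS, ConfigError, missing_required, and load_required are hypothetical.

```python
import os
from typing import Dict, List, Mapping

# Assumption: these variables from the list above must all be set.
REQUIRED_VARS = (
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "GOOGLE_API_KEY",
    "MISTRAL_API_KEY",
    "ELEVENLABS_API_KEY",
)


class ConfigError(RuntimeError):
    """Raised when required configuration is missing."""


def missing_required(env: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def load_required(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise ConfigError naming what is missing."""
    missing = missing_required(env)
    if missing:
        raise ConfigError("missing environment variables: " + ", ".join(missing))
    return {name: env[name].strip() for name in REQUIRED_VARS}
```

Calling load_required() once before the worker starts turns a late, confusing API failure into an immediate error that names the missing variables.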

MCP Server Configuration

MCP servers are hosted separately (e.g., on Cloud Run). Configure them in .env:

MCP_SERVER_URLS=https://maps-mcp-server.com,https://airbnb-mcp-server.com
MCP_SERVER_HEADERS={"https://maps-mcp-server.com": {"Authorization": "Bearer token"}}
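
The two variables above use different formats: a comma-separated list and a JSON object keyed by server URL. The sketch below shows one way to parse them; the function names are hypothetical, and the repository's own parsing may differ.

```python
import json
from typing import Dict, List, Mapping


def parse_mcp_urls(raw: str) -> List[str]:
    """Split a comma-separated URL list, dropping blanks and duplicates."""
    urls: List[str] = []
    for part in raw.split(","):
        url = part.strip()
        if url and url not in urls:
            urls.append(url)
    return urls


def parse_mcp_headers(raw: str) -> Dict[str, Dict[str, str]]:
    """Parse the JSON object that maps each server URL to its request headers."""
    if not raw.strip():
        return {}
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("MCP_SERVER_HEADERS must be a JSON object")
    return {url: dict(headers) for url, headers in parsed.items()}


def headers_for(url: str, headers: Mapping[str, Dict[str, str]]) -> Dict[str, str]:
    """Headers configured for one server; empty if none are configured."""
    return dict(headers.get(url, {}))
```

With the example values above, the maps server receives the Authorization header and the second server receives no extra headers.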

WhatsApp Integration

The agent includes full WhatsApp API integration via MCP (Model Context Protocol) servers. The WhatsApp API is automatically added to the MCP servers list and tools are loaded using the recommended MCP integration pattern.

Configuration:

The WhatsApp API is configured via MCP server settings:

# MCP Server URLs (WhatsApp API included by default)
MCP_SERVER_URLS=https://whatsapp-mcp-794750095859.europe-west1.run.app

# Optional: Add headers if authentication is required
MCP_SERVER_HEADERS={"https://whatsapp-mcp-794750095859.europe-west1.run.app": {"Authorization": "Bearer your_token"}}

# WhatsApp API Configuration (for reference)
WHATSAPP_API_URL=https://whatsapp-mcp-794750095859.europe-west1.run.app
WHATSAPP_API_HEADERS={}

The WhatsApp API URL is automatically included in MCP_SERVER_URLS by default. Tools are loaded automatically through LiveKit's MCP integration.
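
The default inclusion described above can be pictured as a small merge step applied to the configured server list. This is only a sketch of the idea; the helper name with_whatsapp_server is hypothetical, and the repository may implement the default differently (for example as the default value of MCP_SERVER_URLS itself).

```python
from typing import List

DEFAULT_WHATSAPP_URL = "https://whatsapp-mcp-794750095859.europe-west1.run.app"


def with_whatsapp_server(urls: List[str], whatsapp_url: str = DEFAULT_WHATSAPP_URL) -> List[str]:
    """Return the server list with the WhatsApp URL appended if it is not present.

    Trailing slashes are ignored when comparing, so the URL is not added twice.
    """
    normalized = [u.rstrip("/") for u in urls]
    if whatsapp_url.rstrip("/") in normalized:
        return list(urls)
    return list(urls) + [whatsapp_url]
```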

Available WhatsApp Tools:

search_contacts: Search WhatsApp contacts by name or phone number
list_chats: List and browse WhatsApp chats with pagination
get_chat: Get detailed information about a specific chat
get_direct_chat_by_contact: Find direct chat with a contact by phone number
get_contact_chats: Get all chats involving a specific contact
list_messages: Search and filter messages with optional context
get_last_interaction: Get the most recent message with a contact
get_message_context: Get context around a specific message
send_message: Send text messages to contacts or groups
send_file: Send files (images, videos, documents) via WhatsApp
send_audio: Send audio files as WhatsApp voice messages
download_media: Download media from WhatsApp messages

Usage Examples:

Users can interact with WhatsApp through voice commands:

"Search for contacts named John"
"Show me my recent chats"
"Read messages from Sarah"
"Send a message to John saying I'll be there in 10 minutes"
"What was the last message from Mom?"

The agent will automatically use the appropriate WhatsApp tools to fulfill these requests.

Usage

Running the Agent Worker

Development mode (with hot reload):

uv run agent.py dev

Production mode:

uv run agent.py start

Alternative: run the worker module directly from the project root with PYTHONPATH set:

# Set PYTHONPATH (Linux/Mac)
export PYTHONPATH="${PYTHONPATH}:$(pwd)"

# Or on Windows PowerShell
$env:PYTHONPATH = "$(Get-Location);$env:PYTHONPATH"

# Then run
python -m app.agent.worker dev

Running the FastAPI Server (Optional)

The FastAPI server provides token generation endpoints:

uvicorn app.main:app --host 0.0.0.0 --port 8000

API Endpoints

GET /health: Health check endpoint

POST /api/token: Generate LiveKit access token

{
  "room_name": "my-room",
  "participant_identity": "user-123",
  "participant_name": "John Doe"
}
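
A client could request a token with nothing more than the Python standard library. The sketch below assumes the JSON above is the request body for POST /api/token and that the server listens on http://localhost:8000 as in the uvicorn command; the response format is not documented here, so the raw response text is returned as-is. The helper names are hypothetical.

```python
import json
import urllib.request


def build_token_request(base_url: str, room_name: str,
                        participant_identity: str,
                        participant_name: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the token endpoint."""
    body = json.dumps({
        "room_name": room_name,
        "participant_identity": participant_identity,
        "participant_name": participant_name,
    }).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/token",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_token_response(request: urllib.request.Request, timeout: float = 10.0) -> str:
    """Send the request and return the raw response text (format not assumed)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, fetch_token_response(build_token_request("http://localhost:8000", "my-room", "user-123", "John Doe")) would return whatever the running FastAPI server responds with.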

Testing with Frontend

Option 1: JavaScript Project (Recommended for Development)

A JavaScript client built with the LiveKit JS SDK and Vite:

Install dependencies:
```
npm install
```
Start development server:
```
npm run dev
```

Start backend services:

# Terminal 1: Agent worker
python agent.py dev

# Terminal 2: FastAPI server
uvicorn app.main:app --host 0.0.0.0 --port 8000

Open http://localhost:3000 in your browser

The JavaScript project (src/index.js) provides:

✅ Proper LiveKit authentication protocol
✅ Modern ES6 modules with Vite
✅ Video and audio streaming
✅ Real-time track management
✅ Data channel support for obstacle detection
✅ Production-ready build setup

See README_JS.md for detailed documentation.

Option 2: Quick Test (Standalone HTML File)

Start agent worker: python agent.py dev
Start FastAPI server: uvicorn app.main:app --host 0.0.0.0 --port 8000
Open test-frontend.html in your browser
Configure LiveKit URL and API URL (if needed)
Click "Connect" and grant camera/microphone permissions

The test frontend (test-frontend.html) provides:

✅ One-click video streaming setup
✅ Real-time obstacle detection display
✅ Voice interaction with the agent
✅ Video/audio controls
✅ Beautiful, responsive UI

For more details, see the Frontend Testing Guide which includes:

Complete HTML/JavaScript example
React component example
Step-by-step testing instructions
Troubleshooting guide

Project Structure

ParisAIHackathon/
├── app/
│   ├── main.py                 # FastAPI application
│   ├── agent/
│   │   ├── worker.py           # LiveKit agent entrypoint
│   │   ├── vision_agent.py     # Custom Agent with video hooks for obstacle detection
│   │   ├── audio_pipeline.py   # Audio processing with Live Video input
│   │   └── session.py          # Session manager
│   ├── mcp/
│   │   ├── router.py           # MCP tool router
│   │   └── schemas.py          # MCP schemas
│   ├── services/
│   │   ├── gemini.py           # Gemini Vision service
│   │   ├── voxstral.py         # Voxtral STT wrapper
│   │   └── elevenlabs.py       # ElevenLabs TTS wrapper
│   ├── core/
│   │   ├── config.py           # Configuration management
│   │   └── security.py         # JWT token generation
│   └── api/
│       └── token.py            # Token API endpoints
├── docker/
│   └── Dockerfile              # Production Docker image
├── scripts/                     # Google Maps MCP server run and deploy scripts
├── src/                         # JavaScript client source
│   └── index.js                # LiveKit client implementation
├── tests/                       # Test suite (run with pytest tests/)
├── agent.py                     # Script that starts the agent worker (uv run agent.py dev)
├── index.html                   # JavaScript client HTML entry
├── package.json                 # JavaScript dependencies
├── vite.config.js              # Vite build configuration
├── test-frontend.html           # Standalone HTML test frontend
├── FRONTEND_TESTING.md          # Frontend testing guide
├── README_JS.md                 # JavaScript client documentation
├── .env.example                 # Environment variable template
├── pyproject.toml               # Project configuration
├── requirements.txt             # Python dependencies
└── README.md                    # This file

Docker Deployment

Build the Docker image:

docker build -f docker/Dockerfile -t livekit-agent-backend .

Run the container:

docker run -d \
  --env-file .env \
  -p 8000:8000 \
  livekit-agent-backend

Development

Running Tests

pytest tests/

Code Formatting

ruff check .
ruff format .

Type Checking

mypy app/

Key Implementation Details

Live Video Input

Native Integration: Uses RoomInputOptions(video_enabled=True) to enable LiveKit's built-in video input
Automatic Sampling: LiveKit automatically samples video frames:
- 1 frame/second while user speaks
- 1 frame every 3 seconds otherwise
Gemini Live Processing: Video frames are automatically sent to Gemini Live Realtime
- Frames are fit into 1024x1024 and encoded to JPEG
- Gemini Live can see and understand the video context
Custom Processing: VisionAgent hooks into video frames for additional obstacle detection
- Processes frames at configurable FPS (default: 2 FPS)
- Uses Gemini Vision API for detailed obstacle analysis
- Publishes results via DataChannel for frontend consumption
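
The configurable FPS for custom processing amounts to dropping frames that arrive too soon after the last processed one. A minimal sketch of such a throttle follows; the FrameThrottle class is hypothetical and only illustrates the idea, with the 2 FPS default taken from the list above. The clock is injectable so the behavior can be tested without waiting.

```python
import time
from typing import Callable, Optional


class FrameThrottle:
    """Decides whether a frame should be processed, capping the rate at `fps`."""

    def __init__(self, fps: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if fps <= 0:
            raise ValueError("fps must be positive")
        self._min_interval = 1.0 / fps
        self._clock = clock
        self._last: Optional[float] = None  # time of the last processed frame

    def should_process(self) -> bool:
        """Return True if enough time has passed since the last processed frame."""
        now = self._clock()
        if self._last is not None and now - self._last < self._min_interval:
            return False  # too soon: skip this frame
        self._last = now
        return True
```

A frame handler would call should_process() for each incoming frame and only send the frame to the Gemini Vision API when it returns True, which also helps stay under API rate limits.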

Audio Pipeline

Uses LiveKit's AgentSession with VisionAgent for integrated video/audio processing
MCP servers are configured in the AgentSession constructor
Audio output is automatically published to the room
Gemini Live receives both audio and video input simultaneously

Error Handling

Errors in the video pipeline do not affect the audio pipeline
External API calls include retry logic
The agent shuts down gracefully on SIGTERM/SIGINT
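
The first two points can be sketched with plain asyncio. The code below is an illustration under assumptions, not the repository's implementation: retry_async retries a failing call with exponential backoff (the attempt count and delays are made-up defaults), and run_isolated logs a failure in one pipeline step instead of letting it propagate to the rest of the agent.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("agent.sketch")


async def retry_async(call: Callable[[], Awaitable[T]],
                      attempts: int = 3,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], Awaitable[None]] = asyncio.sleep) -> T:
    """Await call(), retrying on failure with delays of base_delay * 2**attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await call()
        except Exception as exc:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            logger.warning("call failed (%s); retrying in %.1fs", exc, delay)
            await sleep(delay)
    raise AssertionError("unreachable")


async def run_isolated(name: str,
                       task: Callable[[], Awaitable[None]]) -> Optional[Exception]:
    """Run one pipeline step; log and return its error instead of raising it.

    asyncio.CancelledError derives from BaseException, so cancellation during
    shutdown still propagates.
    """
    try:
        await task()
        return None
    except Exception as exc:
        logger.exception("%s step failed", name)
        return exc
```

Wrapping each video-frame analysis in run_isolated("video", ...) means a failed vision request is logged and skipped while the audio session keeps running.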

Troubleshooting

Agent Not Connecting

Verify LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET are correct
Check LiveKit server is accessible
Review agent logs for connection errors

Video Processing Not Working

Ensure participant is publishing video track
Check GOOGLE_API_KEY is valid
Verify the video processing FPS is not set high enough to trigger API rate limits

MCP Tools Not Available

Verify MCP server URLs are correct and accessible
Check MCP server headers include authentication if required
Review MCP router logs for connection errors

Audio Issues

Verify all API keys (Mistral, Google, ElevenLabs) are valid
Check agent logs for STT/TTS errors
Ensure microphone permissions are granted on client

Google Maps Integration

The agent includes a built-in Google Maps MCP server for navigation assistance. This allows users to:

Find nearby places: "Find nearby restaurants", "Where is the nearest pharmacy?"
Get directions: "How do I get to the Eiffel Tower?", "Give me walking directions to 123 Main Street"
Geocode addresses: Convert addresses to coordinates and vice versa

Quick Setup

Get a Google Maps API Key:
- Create a project in Google Cloud Console
- Enable Places API, Directions API, and Geocoding API
- Create an API key

Configure Environment Variables:

GOOGLE_MAPS_API_KEY=your_google_maps_api_key
MCP_SERVER_URLS=http://localhost:8080

Run the Google Maps MCP Server (choose one):

Local Development:

python scripts/run_google_maps_mcp.py

Production (Cloud Run):

export GOOGLE_MAPS_API_KEY=your_api_key_here
./scripts/deploy-google-maps-mcp.sh

Start Your Agent (the agent will automatically connect to the MCP server)

For detailed setup instructions, see the Google Maps Integration Guide and Cloud Run Deployment Guide.

Additional Resources

LiveKit Agents Documentation
Gemini Live API
MCP Protocol
Google Maps Integration Guide - Complete guide for setting up Google Maps navigation
Worker Entrypoint Best Practices
Frontend Testing Guide - Complete guide for testing video streaming from frontend

License

Apache-2.0

Contributing

Contributions are welcome! Please ensure all code follows the project's style guidelines and includes appropriate tests.
