A production-ready LiveKit Python agent backend that provides real-time video obstacle detection and voice assistance for visually impaired users. The agent processes video frames for obstacle detection using Gemini Vision and provides bidirectional voice interaction using Voxstral STT, Gemini Live Realtime, and ElevenLabs TTS.
The agent uses LiveKit's native Live Video input with Gemini Live Realtime, enabling the model to see and understand video frames automatically. The architecture integrates:
- **Live Video Input**: Enabled via `RoomInputOptions(video_enabled=True)`
  - Gemini Live automatically receives video frames
  - 1 frame/second while the user speaks
  - 1 frame every 3 seconds otherwise
  - Frames are scaled to fit 1024x1024 and encoded as JPEG
- **Obstacle Detection**: Custom `VisionAgent` hooks into video frames for additional processing
  - Processes frames using the Gemini Vision API
  - Publishes results via DataChannel for frontend consumption
- **Audio Pipeline**: Voxstral STT → Gemini Live Realtime → MCP Tools → ElevenLabs TTS
```
Mobile (WebRTC) → LiveKit Room → LiveKit Agent (Python)
                                       ↓
                       ┌───────────────┴───────────────┐
                       ↓                               ↓
         Gemini Live (with Live Video)     VisionAgent (obstacle detection)
                       ↓                               ↓
                MCP Tools → TTS            Gemini Vision → DataChannel
                       ↓
                WebRTC Audio Out
```
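The frame-sampling policy described above (1 frame/second while the user speaks, 1 frame every 3 seconds otherwise) can be sketched as a small helper. `FrameSampler` is a hypothetical illustration of the policy, not part of this codebase — LiveKit applies the sampling internally:

```python
class FrameSampler:
    """Decides whether a video frame should be forwarded, based on speech activity."""

    SPEAKING_INTERVAL = 1.0  # seconds between frames while the user speaks
    IDLE_INTERVAL = 3.0      # seconds between frames otherwise

    def __init__(self) -> None:
        self._last_sent = float("-inf")

    def should_send(self, now: float, user_is_speaking: bool) -> bool:
        interval = self.SPEAKING_INTERVAL if user_is_speaking else self.IDLE_INTERVAL
        if now - self._last_sent >= interval:
            self._last_sent = now
            return True
        return False


sampler = FrameSampler()
# While speaking: frames at t=0.0 and t=1.0 pass, t=0.5 is dropped
decisions = [sampler.should_send(t, user_is_speaking=True) for t in (0.0, 0.5, 1.0)]
# → [True, False, True]
```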
- **Native Live Video Input**: Uses LiveKit's built-in Live Video input with Gemini Live Realtime
  - Automatic video frame sampling and processing
  - Gemini Live can see and understand video in real time
- **Real-time Video Obstacle Detection**: Processes video frames using the Gemini Vision API
  - Custom `VisionAgent` hooks for additional obstacle detection
  - Results published via DataChannel for the frontend
- **Bidirectional Voice Interaction**: Natural voice conversation using Gemini Live Realtime
- **MCP Tool Integration**: Extensible tool system via Model Context Protocol (MCP) servers
- **Google Maps Navigation**: Built-in Google Maps MCP server for finding places and getting directions
- **WhatsApp Integration**: Full WhatsApp API integration via LiveKit function tools
  - Search contacts, read messages, send messages
  - Manage chats and conversations
  - Send files and audio messages
- **Production-Ready**: Error handling, logging, graceful shutdown, and Docker support
- **LiveKit Native**: All media processing uses the LiveKit SDK (no custom WebRTC servers)
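The obstacle-detection results mentioned above are published over the LiveKit data channel as serialized messages. The exact schema is defined by the frontend contract; the field names and helper below are illustrative assumptions, not the project's actual wire format:

```python
import json


def build_obstacle_payload(obstacles: list[dict], frame_ts: float) -> bytes:
    """Serialize detection results for publishing over the data channel.

    The schema here (type/timestamp/obstacles) is an assumption for
    illustration; the real agent uses whatever the frontend expects.
    """
    message = {
        "type": "obstacle_detection",
        "timestamp": frame_ts,
        "obstacles": obstacles,  # e.g. [{"label": "chair", "distance": "near"}]
    }
    return json.dumps(message).encode("utf-8")


payload = build_obstacle_payload([{"label": "chair", "distance": "near"}], frame_ts=12.5)
```

In the agent, a payload like this would then be handed to the local participant's data-publishing API so the frontend can render it.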
- Python 3.12+
- LiveKit Server (Cloud or self-hosted)
- API Keys:
- Google API Key (for Gemini Vision and Realtime API)
- Google Maps API Key (optional, for navigation features - see Google Maps Integration Guide)
- Mistral API Key (for Voxstral STT)
- ElevenLabs API Key (for TTS)
- LiveKit API Key and Secret
1. Clone the repository:

```bash
git clone <repository-url>
cd ParisAIHackathon
```

2. Create a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

3. Install dependencies:

```bash
pip install -r requirements.txt
```

Or using uv (recommended):

```bash
uv pip install -r requirements.txt
```

4. Configure environment variables:

```bash
cp .env.example .env
# Edit .env with your API keys and configuration
```

See .env.example for all available configuration options. Key variables:
- `LIVEKIT_URL`: LiveKit server WebSocket URL
- `LIVEKIT_API_KEY` / `LIVEKIT_API_SECRET`: LiveKit credentials
- `GOOGLE_API_KEY`: Google API key for Gemini
- `MISTRAL_API_KEY`: Mistral API key for Voxstral STT
- `ELEVENLABS_API_KEY`: ElevenLabs API key for TTS
- `MCP_SERVER_URLS`: Comma-separated list of MCP server URLs
- `WHATSAPP_API_URL`: WhatsApp API base URL (default: https://whatsapp-mcp-794750095859.europe-west1.run.app)
- `WHATSAPP_API_HEADERS`: JSON object with headers for WhatsApp API requests (optional)
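Since `MCP_SERVER_URLS` is a comma-separated string and `MCP_SERVER_HEADERS` is a JSON object, the configuration layer has to parse both. A minimal sketch of that parsing (the actual implementation presumably lives in `app/core/config.py` and may differ):

```python
import json


def parse_mcp_config(env: dict[str, str]) -> tuple[list[str], dict[str, dict[str, str]]]:
    """Parse MCP server URLs and per-server headers from environment variables."""
    raw_urls = env.get("MCP_SERVER_URLS", "")
    urls = [u.strip() for u in raw_urls.split(",") if u.strip()]
    headers = json.loads(env.get("MCP_SERVER_HEADERS", "{}"))
    return urls, headers


urls, headers = parse_mcp_config({
    "MCP_SERVER_URLS": "https://maps-mcp-server.com, https://airbnb-mcp-server.com",
    "MCP_SERVER_HEADERS": '{"https://maps-mcp-server.com": {"Authorization": "Bearer token"}}',
})
# urls → ["https://maps-mcp-server.com", "https://airbnb-mcp-server.com"]
```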
MCP servers are hosted separately (e.g., on Cloud Run). Configure them in .env:

```bash
MCP_SERVER_URLS=https://maps-mcp-server.com,https://airbnb-mcp-server.com
MCP_SERVER_HEADERS={"https://maps-mcp-server.com": {"Authorization": "Bearer token"}}
```

The agent includes full WhatsApp API integration via MCP (Model Context Protocol) servers. The WhatsApp API is automatically added to the MCP servers list, and tools are loaded using the recommended MCP integration pattern.
Configuration:

The WhatsApp API is configured via MCP server settings:

```bash
# MCP Server URLs (WhatsApp API included by default)
MCP_SERVER_URLS=https://whatsapp-mcp-794750095859.europe-west1.run.app

# Optional: Add headers if authentication is required
MCP_SERVER_HEADERS={"https://whatsapp-mcp-794750095859.europe-west1.run.app": {"Authorization": "Bearer your_token"}}

# WhatsApp API Configuration (for reference)
WHATSAPP_API_URL=https://whatsapp-mcp-794750095859.europe-west1.run.app
WHATSAPP_API_HEADERS={}
```

The WhatsApp API URL is automatically included in MCP_SERVER_URLS by default. Tools are loaded automatically through LiveKit's MCP integration.
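The auto-inclusion of the WhatsApp URL in the MCP server list presumably amounts to a small merge step; the helper below is a hypothetical sketch of that behavior, shown for illustration:

```python
def effective_mcp_urls(configured: list[str], whatsapp_url: str) -> list[str]:
    """Return the configured MCP URLs with the WhatsApp API URL appended if missing."""
    urls = list(configured)
    if whatsapp_url and whatsapp_url not in urls:
        urls.append(whatsapp_url)
    return urls


urls = effective_mcp_urls(
    ["https://maps-mcp-server.com"],
    "https://whatsapp-mcp-794750095859.europe-west1.run.app",
)
# urls now contains both the maps server and the WhatsApp server, with no duplicates
```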
Available WhatsApp Tools:
- `search_contacts`: Search WhatsApp contacts by name or phone number
- `list_chats`: List and browse WhatsApp chats with pagination
- `get_chat`: Get detailed information about a specific chat
- `get_direct_chat_by_contact`: Find a direct chat with a contact by phone number
- `get_contact_chats`: Get all chats involving a specific contact
- `list_messages`: Search and filter messages with optional context
- `get_last_interaction`: Get the most recent message with a contact
- `get_message_context`: Get the context around a specific message
- `send_message`: Send text messages to contacts or groups
- `send_file`: Send files (images, videos, documents) via WhatsApp
- `send_audio`: Send audio files as WhatsApp voice messages
- `download_media`: Download media from WhatsApp messages
Usage Examples:
Users can interact with WhatsApp through voice commands:
- "Search for contacts named John"
- "Show me my recent chats"
- "Read messages from Sarah"
- "Send a message to John saying I'll be there in 10 minutes"
- "What was the last message from Mom?"
The agent will automatically use the appropriate WhatsApp tools to fulfill these requests.
Development mode (with hot reload):

```bash
uv run agent.py dev
```

Production mode:

```bash
uv run agent.py start
```

Alternative: If you're running from the project root with PYTHONPATH set:

```bash
# Set PYTHONPATH (Linux/Mac)
export PYTHONPATH="${PYTHONPATH}:$(pwd)"

# Or on Windows PowerShell
$env:PYTHONPATH = "$(Get-Location);$env:PYTHONPATH"

# Then run
python -m app.agent.worker dev
```

The FastAPI server provides token generation endpoints:
```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000
```

- GET /health: Health check endpoint
- POST /api/token: Generate a LiveKit access token

Example request body for POST /api/token:

```json
{
  "room_name": "my-room",
  "participant_identity": "user-123",
  "participant_name": "John Doe"
}
```
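A LiveKit access token is an HS256-signed JWT whose claims carry the identity and room grant. The project presumably mints tokens in `app/core/security.py` using the official livekit-api SDK; the stdlib-only sketch below just shows roughly what that token looks like, with a claim layout based on LiveKit's documented format:

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as required for JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def mint_livekit_token(api_key: str, api_secret: str, room: str,
                       identity: str, name: str, ttl_s: int = 3600) -> str:
    """Sketch of HS256 JWT minting; prefer the livekit-api AccessToken in practice."""
    header = {"alg": "HS256", "typ": "JWT"}
    now = int(time.time())
    claims = {
        "iss": api_key,          # LiveKit API key
        "sub": identity,         # participant identity
        "name": name,            # participant display name
        "nbf": now,
        "exp": now + ttl_s,
        "video": {"roomJoin": True, "room": room},  # room-join grant
    }
    signing_input = f"{_b64url(json.dumps(header).encode())}.{_b64url(json.dumps(claims).encode())}"
    sig = hmac.new(api_secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(sig)}"


token = mint_livekit_token("devkey", "devsecret", "my-room", "user-123", "John Doe")
```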
Option 1: JavaScript Project (Recommended for Development)

A proper JavaScript project using the LiveKit JS SDK with Vite:

1. Install dependencies:

```bash
npm install
```

2. Start the development server:

```bash
npm run dev
```

3. Start the backend services:

```bash
# Terminal 1: Agent worker
python agent.py dev

# Terminal 2: FastAPI server
uvicorn app.main:app --host 0.0.0.0 --port 8000
```

4. Open http://localhost:3000 in your browser
The JavaScript project (src/index.js) provides:
- ✅ Proper LiveKit authentication protocol
- ✅ Modern ES6 modules with Vite
- ✅ Video and audio streaming
- ✅ Real-time track management
- ✅ Data channel support for obstacle detection
- ✅ Production-ready build setup
See README_JS.md for detailed documentation.
Option 2: Quick Test (Standalone HTML File)

1. Start the agent worker:

```bash
python agent.py dev
```

2. Start the FastAPI server:

```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000
```

3. Open test-frontend.html in your browser
4. Configure the LiveKit URL and API URL (if needed)
5. Click "Connect" and grant camera/microphone permissions
The test frontend (test-frontend.html) provides:
- ✅ One-click video streaming setup
- ✅ Real-time obstacle detection display
- ✅ Voice interaction with the agent
- ✅ Video/audio controls
- ✅ Beautiful, responsive UI
For more details, see the Frontend Testing Guide which includes:
- Complete HTML/JavaScript example
- React component example
- Step-by-step testing instructions
- Troubleshooting guide
```
ParisAIHackathon/
├── app/
│   ├── main.py               # FastAPI application
│   ├── agent/
│   │   ├── worker.py         # LiveKit agent entrypoint
│   │   ├── vision_agent.py   # Custom Agent with video hooks for obstacle detection
│   │   ├── audio_pipeline.py # Audio processing with Live Video input
│   │   └── session.py        # Session manager
│   ├── mcp/
│   │   ├── router.py         # MCP tool router
│   │   └── schemas.py        # MCP schemas
│   ├── services/
│   │   ├── gemini.py         # Gemini Vision service
│   │   ├── voxstral.py       # Voxstral STT wrapper
│   │   └── elevenlabs.py     # ElevenLabs TTS wrapper
│   ├── core/
│   │   ├── config.py         # Configuration management
│   │   └── security.py       # JWT token generation
│   └── api/
│       └── token.py          # Token API endpoints
├── docker/
│   └── Dockerfile            # Production Docker image
├── src/                      # JavaScript client source
│   └── index.js              # LiveKit client implementation
├── index.html                # JavaScript client HTML entry
├── package.json              # JavaScript dependencies
├── vite.config.js            # Vite build configuration
├── test-frontend.html        # Standalone HTML test frontend
├── FRONTEND_TESTING.md       # Frontend testing guide
├── README_JS.md              # JavaScript client documentation
├── .env.example              # Environment variable template
├── pyproject.toml            # Project configuration
└── README.md                 # This file
```
Build the image:

```bash
docker build -f docker/Dockerfile -t livekit-agent-backend .
```

Run the container:

```bash
docker run -d \
  --env-file .env \
  -p 8000:8000 \
  livekit-agent-backend
```

Run the tests:

```bash
pytest tests/
```

Lint, format, and type-check:

```bash
ruff check .
ruff format .
mypy app/
```

- **Native Integration**: Uses `RoomInputOptions(video_enabled=True)` to enable LiveKit's built-in video input
- **Automatic Sampling**: LiveKit automatically samples video frames:
  - 1 frame/second while the user speaks
  - 1 frame every 3 seconds otherwise
- **Gemini Live Processing**: Video frames are automatically sent to Gemini Live Realtime
  - Frames are scaled to fit 1024x1024 and encoded as JPEG
  - Gemini Live can see and understand the video context
- **Custom Processing**: `VisionAgent` hooks into video frames for additional obstacle detection
  - Processes frames at a configurable FPS (default: 2 FPS)
  - Uses the Gemini Vision API for detailed obstacle analysis
  - Publishes results via DataChannel for frontend consumption
- Uses LiveKit's `AgentSession` with `VisionAgent` for integrated video/audio processing
- MCP servers are configured in the `AgentSession` constructor
- Audio output is automatically published to the room
- Gemini Live receives both audio and video input simultaneously
- Errors in the video pipeline don't affect the audio pipeline
- External API calls include retry logic
- Graceful shutdown on SIGTERM/SIGINT
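The retry logic for external API calls is not shown in this README; a generic exponential-backoff wrapper of the kind typically used for such calls might look like the following (illustrative only, not the project's actual implementation):

```python
import time


def with_retries(fn, *, attempts: int = 3, base_delay: float = 0.5, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on any exception.

    `sleep` is injectable so tests can avoid real delays.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

A real implementation would typically narrow the caught exception types (e.g. network and rate-limit errors) rather than retrying on everything.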
- Verify `LIVEKIT_URL`, `LIVEKIT_API_KEY`, and `LIVEKIT_API_SECRET` are correct
- Check that the LiveKit server is accessible
- Review agent logs for connection errors
- Ensure the participant is publishing a video track
- Check that `GOOGLE_API_KEY` is valid
- Verify the video processing FPS is not too high (which can trigger rate limits)
- Verify MCP server URLs are correct and accessible
- Check that MCP server headers include authentication if required
- Review MCP router logs for connection errors
- Verify all API keys (Mistral, Google, ElevenLabs) are valid
- Check agent logs for STT/TTS errors
- Ensure microphone permissions are granted on the client
The agent includes a built-in Google Maps MCP server for navigation assistance. This allows users to:
- Find nearby places: "Find nearby restaurants", "Where is the nearest pharmacy?"
- Get directions: "How do I get to the Eiffel Tower?", "Give me walking directions to 123 Main Street"
- Geocode addresses: Convert addresses to coordinates and vice versa
1. Get a Google Maps API Key:
   - Create a project in Google Cloud Console
   - Enable the Places API, Directions API, and Geocoding API
   - Create an API key

2. Configure Environment Variables:

```bash
GOOGLE_MAPS_API_KEY=your_google_maps_api_key
MCP_SERVER_URLS=http://localhost:8080
```

3. Run the Google Maps MCP Server (choose one):

Local development:

```bash
python scripts/run_google_maps_mcp.py
```

Production (Cloud Run):

```bash
export GOOGLE_MAPS_API_KEY=your_api_key_here
./scripts/deploy-google-maps-mcp.sh
```

4. Start your agent (the agent will automatically connect to the MCP server)
For detailed setup instructions, see the Google Maps Integration Guide and Cloud Run Deployment Guide.
- LiveKit Agents Documentation
- Gemini Live API
- MCP Protocol
- Google Maps Integration Guide - Complete guide for setting up Google Maps navigation
- Worker Entrypoint Best Practices
- Frontend Testing Guide - Complete guide for testing video streaming from frontend
Apache-2.0
Contributions are welcome! Please ensure all code follows the project's style guidelines and includes appropriate tests.