Summary
Implement WebSocket Mode support for LocalAI's OpenAI API-compatible Responses endpoint. This enables persistent WebSocket connections for long-running, tool-call-heavy agentic workflows.
Background
OpenAI recently introduced WebSocket Mode for their Responses API (https://developers.openai.com/api/docs/guides/websocket-mode). This mode enables:
- Up to 40% faster end-to-end execution for workflows with 20+ tool calls
- Persistent connections to `/v1/responses` via WebSocket
- Incremental continuation - only send new inputs plus `previous_response_id`
- Connection-local caching for low-latency continuations
- Compatibility with Zero Data Retention (ZDR) and `store=false`
Technical Specifications
Connection
- Endpoint: `wss://api.openai.com/v1/responses` (LocalAI: `ws://<host>:<port>/v1/responses`)
- Authentication: Bearer token in the `Authorization` header
- Max duration: 60 minutes per connection
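The connection parameters above boil down to a standard WebSocket upgrade request carrying a Bearer token. A minimal sketch of building that request in Go (the host, port, and key values are illustrative assumptions; a real client library would also generate the `Sec-WebSocket-Key`):

```go
package main

import (
	"fmt"
	"net/http"
)

// handshakeRequest builds the HTTP upgrade request that opens the WebSocket
// against a local LocalAI instance. Names here are assumptions for
// illustration, not LocalAI's actual code.
func handshakeRequest(host, apiKey string) (*http.Request, error) {
	req, err := http.NewRequest("GET", "ws://"+host+"/v1/responses", nil)
	if err != nil {
		return nil, err
	}
	// Bearer auth plus the standard WebSocket upgrade headers.
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Connection", "Upgrade")
	req.Header.Set("Upgrade", "websocket")
	req.Header.Set("Sec-WebSocket-Version", "13")
	return req, nil
}

func main() {
	req, err := handshakeRequest("localhost:8080", "sk-local")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL.Path, req.Header.Get("Authorization"))
}
```

Any WebSocket client library would take these same headers when dialing; the 60-minute cap applies to the resulting connection, not to this handshake.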
Message Types
1. response.create (Initial Turn)
```json
{
  "type": "response.create",
  "model": "gpt-4o",
  "store": false,
  "input": [
    {
      "type": "message",
      "role": "user",
      "content": [{"type": "input_text", "text": "..."}]
    }
  ],
  "tools": []
}
```

Note: `stream` and `background` fields are NOT used in WebSocket mode.
2. response.create with Warmup (Optional)
```json
{
  "type": "response.create",
  "model": "gpt-4o",
  "generate": false,
  "input": [...],
  "tools": [...]
}
```

Returns a `response_id` that can be chained.
3. response.create with Continuation (Subsequent Turns)
```json
{
  "type": "response.create",
  "model": "gpt-4o",
  "store": false,
  "previous_response_id": "resp_123",
  "input": [
    {
      "type": "function_call_output",
      "call_id": "call_123",
      "output": "tool result"
    },
    {
      "type": "message",
      "role": "user",
      "content": [{"type": "input_text", "text": "..."}]
    }
  ],
  "tools": []
}
```

Response Events (Server → Client)
- `response.created`

```json
{
  "type": "response.created",
  "response": {"id": "resp_abc", "model": "...", ...}
}
```

- `response.progress`

```json
{
  "type": "response.progress",
  "response_id": "resp_abc",
  "output": [...]
}
```

- `response.function_call_arguments.delta`

```json
{
  "type": "response.function_call_arguments.delta",
  "response_id": "resp_abc",
  "call_id": "call_123",
  "delta": "..."
}
```

- `response.function_call_arguments.done`

```json
{
  "type": "response.function_call_arguments.done",
  "response_id": "resp_abc",
  "call_id": "call_123",
  "arguments": "..."
}
```

- `response.done`

```json
{
  "type": "response.done",
  "response": {...}
}
```

Error Handling
- `previous_response_not_found` (400): when continuing with `store=false` and the response is not in the cache
- `websocket_connection_limit_reached` (400): when the 60-minute limit is reached
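On the client side, the events above can be demultiplexed by switching on the `type` field before decoding the full payload. A minimal sketch (the helper name is an assumption; handler bodies are placeholders):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dispatch peeks at the "type" field of a server event and routes it.
// The event names come from this issue's spec.
func dispatch(raw []byte) (string, error) {
	var head struct {
		Type string `json:"type"`
	}
	if err := json.Unmarshal(raw, &head); err != nil {
		return "", err
	}
	switch head.Type {
	case "response.created",
		"response.progress",
		"response.function_call_arguments.delta",
		"response.function_call_arguments.done",
		"response.done":
		// A real client would decode the full event struct here and
		// accumulate deltas / record the response id.
		return head.Type, nil
	default:
		return "", fmt.Errorf("unknown event type %q", head.Type)
	}
}

func main() {
	kind, err := dispatch([]byte(`{"type":"response.done","response":{}}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(kind)
}
```

Decoding only the `type` header first avoids committing to a full event struct before knowing which shape to expect.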
Implementation Requirements
- WebSocket Server: Add a WebSocket endpoint for `/v1/responses`
- Connection Management: Track active connections with a 60-min timeout
- State Caching: Implement a connection-local in-memory cache for responses
- Message Parsing: Handle all message types (`response.create`, etc.)
- Event Streaming: Send proper response events back to the client
- Error Handling: Proper error responses for invalid states
- Compaction Support: Handle `/responses/compact` integration
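The "State Caching" requirement amounts to one small map per connection, dropped wholesale when the connection closes, with a miss surfacing as the `previous_response_not_found` error. A minimal sketch (type and method names are assumptions, not LocalAI's actual code):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errPreviousResponseNotFound mirrors the 400 error the spec calls for when a
// continuation references an id the connection cache no longer holds.
var errPreviousResponseNotFound = errors.New("previous_response_not_found")

// connCache is a connection-local store of completed responses, keyed by the
// id a later turn will pass as previous_response_id.
type connCache struct {
	mu        sync.Mutex
	responses map[string][]byte // response id → serialized response
}

func newConnCache() *connCache {
	return &connCache{responses: map[string][]byte{}}
}

func (c *connCache) put(id string, resp []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.responses[id] = resp
}

func (c *connCache) get(id string) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	resp, ok := c.responses[id]
	if !ok {
		return nil, errPreviousResponseNotFound
	}
	return resp, nil
}

func main() {
	cache := newConnCache()
	cache.put("resp_123", []byte(`{"id":"resp_123"}`))
	if _, err := cache.get("resp_999"); err != nil {
		fmt.Println("miss:", err)
	}
}
```

Because the cache's lifetime is the connection's, no global eviction policy is needed; the 60-minute connection limit bounds its growth.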
Limits
- One in-flight response at a time per connection
- No multiplexing - multiple connections needed for parallel runs
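The one-in-flight limit above can be enforced per connection with a non-blocking busy flag: a second `response.create` is rejected while one is running, rather than queued. A minimal sketch (names are assumptions):

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// connState tracks whether this connection currently has a response running.
type connState struct {
	inFlight atomic.Bool
}

// begin claims the in-flight slot, failing if a response is already running.
func (s *connState) begin() error {
	if !s.inFlight.CompareAndSwap(false, true) {
		return errors.New("a response is already in flight on this connection")
	}
	return nil
}

// end releases the slot once response.done (or an error) has been sent.
func (s *connState) end() {
	s.inFlight.Store(false)
}

func main() {
	var s connState
	fmt.Println(s.begin() == nil) // first turn starts
	fmt.Println(s.begin() == nil) // second is rejected while the first runs
	s.end()
	fmt.Println(s.begin() == nil) // allowed again after completion
}
```

Since there is no multiplexing, callers needing parallel runs open additional connections rather than interleaving on one.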
Acceptance Criteria
- WebSocket endpoint accepts connections at `/v1/responses`
- `response.create` works for the initial turn with full context
- `response.create` with `previous_response_id` works for continuations
- All response events are properly streamed to the client
- Function call arguments are properly delta-streamed
- Error handling for `previous_response_not_found` implemented
- Connection timeout (60 min) is enforced
- Tool calling works correctly over WebSocket