Summary
Implement WebSocket Mode support for LocalAI's OpenAI API-compatible Responses endpoint. This enables persistent WebSocket connections for long-running, tool-call-heavy agentic workflows.
Background
OpenAI recently introduced WebSocket Mode for their Responses API (https://developers.openai.com/api/docs/guides/websocket-mode). This mode enables:
- Up to 40% faster end-to-end execution for workflows with 20+ tool calls
- Persistent connections to `/v1/responses` via WebSocket
- Incremental continuation - only send new inputs plus `previous_response_id`
- Connection-local caching for low-latency continuations
- Compatibility with Zero Data Retention (ZDR) and `store=false`
Technical Specifications
Connection
- Endpoint: `wss://api.openai.com/v1/responses` (LocalAI: `ws://<host>:<port>/v1/responses`)
- Authentication: Bearer token in the `Authorization` header
- Max duration: 60 minutes per connection
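The connection parameters above boil down to a standard WebSocket upgrade request carrying a Bearer token. A minimal sketch of building that request in Go (the host, port, and key values are illustrative assumptions; a real client library would also generate the `Sec-WebSocket-Key`):

```go
package main

import (
	"fmt"
	"net/http"
)

// handshakeRequest builds the HTTP upgrade request that opens the WebSocket
// against a local LocalAI instance. Names here are assumptions for
// illustration, not LocalAI's actual code.
func handshakeRequest(host, apiKey string) (*http.Request, error) {
	req, err := http.NewRequest("GET", "ws://"+host+"/v1/responses", nil)
	if err != nil {
		return nil, err
	}
	// Bearer auth plus the standard WebSocket upgrade headers.
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Connection", "Upgrade")
	req.Header.Set("Upgrade", "websocket")
	req.Header.Set("Sec-WebSocket-Version", "13")
	return req, nil
}

func main() {
	req, err := handshakeRequest("localhost:8080", "sk-local")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL.Path, req.Header.Get("Authorization"))
}
```

Any WebSocket client library would take these same headers when dialing; the 60-minute cap applies to the resulting connection, not to this handshake.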
Message Types
1. response.create (Initial Turn)
```json
{
  "type": "response.create",
  "model": "gpt-4o",
  "store": false,
  "input": [
    {
      "type": "message",
      "role": "user",
      "content": [{"type": "input_text", "text": "..."}]
    }
  ],
  "tools": []
}
```

Note: `stream` and `background` fields are NOT used in WebSocket mode.
2. response.create with Warmup (Optional)
```json
{
  "type": "response.create",
  "model": "gpt-4o",
  "generate": false,
  "input": [...],
  "tools": [...]
}
```

Returns a `response_id` that can be chained.
3. response.create with Continuation (Subsequent Turns)
```json
{
  "type": "response.create",
  "model": "gpt-4o",
  "store": false,
  "previous_response_id": "resp_123",
  "input": [
    {
      "type": "function_call_output",
      "call_id": "call_123",
      "output": "tool result"
    },
    {
      "type": "message",
      "role": "user",
      "content": [{"type": "input_text", "text": "..."}]
    }
  ],
  "tools": []
}
```

Response Events (Server → Client)
- `response.created`

```json
{
  "type": "response.created",
  "response": {"id": "resp_abc", "model": "...", ...}
}
```

- `response.progress`

```json
{
  "type": "response.progress",
  "response_id": "resp_abc",
  "output": [...]
}
```

- `response.function_call_arguments.delta`

```json
{
  "type": "response.function_call_arguments.delta",
  "response_id": "resp_abc",
  "call_id": "call_123",
  "delta": "..."
}
```

- `response.function_call_arguments.done`

```json
{
  "type": "response.function_call_arguments.done",
  "response_id": "resp_abc",
  "call_id": "call_123",
  "arguments": "..."
}
```

- `response.done`

```json
{
  "type": "response.done",
  "response": {...}
}
```

Error Handling
- `previous_response_not_found` (400): when continuing with `store=false` and the response is not in the cache
- `websocket_connection_limit_reached` (400): when the 60-minute limit is reached
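On the client side, the events above can be demultiplexed by switching on the `type` field before decoding the full payload. A minimal sketch (the helper name is an assumption; handler bodies are placeholders):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dispatch peeks at the "type" field of a server event and routes it.
// The event names come from this issue's spec.
func dispatch(raw []byte) (string, error) {
	var head struct {
		Type string `json:"type"`
	}
	if err := json.Unmarshal(raw, &head); err != nil {
		return "", err
	}
	switch head.Type {
	case "response.created",
		"response.progress",
		"response.function_call_arguments.delta",
		"response.function_call_arguments.done",
		"response.done":
		// A real client would decode the full event struct here and
		// accumulate deltas / record the response id.
		return head.Type, nil
	default:
		return "", fmt.Errorf("unknown event type %q", head.Type)
	}
}

func main() {
	kind, err := dispatch([]byte(`{"type":"response.done","response":{}}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(kind)
}
```

Decoding only the `type` header first avoids committing to a full event struct before knowing which shape to expect.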
Implementation Requirements
- WebSocket Server: Add a WebSocket endpoint for `/v1/responses`
- Connection Management: Track active connections with a 60-min timeout
- State Caching: Implement a connection-local in-memory cache for responses
- Message Parsing: Handle all message types (`response.create`, etc.)
- Event Streaming: Send proper response events back to the client
- Error Handling: Proper error responses for invalid states
- Compaction Support: Handle `/responses/compact` integration
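The "State Caching" requirement amounts to one small map per connection, dropped wholesale when the connection closes, with a miss surfacing as the `previous_response_not_found` error. A minimal sketch (type and method names are assumptions, not LocalAI's actual code):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errPreviousResponseNotFound mirrors the 400 error the spec calls for when a
// continuation references an id the connection cache no longer holds.
var errPreviousResponseNotFound = errors.New("previous_response_not_found")

// connCache is a connection-local store of completed responses, keyed by the
// id a later turn will pass as previous_response_id.
type connCache struct {
	mu        sync.Mutex
	responses map[string][]byte // response id → serialized response
}

func newConnCache() *connCache {
	return &connCache{responses: map[string][]byte{}}
}

func (c *connCache) put(id string, resp []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.responses[id] = resp
}

func (c *connCache) get(id string) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	resp, ok := c.responses[id]
	if !ok {
		return nil, errPreviousResponseNotFound
	}
	return resp, nil
}

func main() {
	cache := newConnCache()
	cache.put("resp_123", []byte(`{"id":"resp_123"}`))
	if _, err := cache.get("resp_999"); err != nil {
		fmt.Println("miss:", err)
	}
}
```

Because the cache's lifetime is the connection's, no global eviction policy is needed; the 60-minute connection limit bounds its growth.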
Limits
- One in-flight response at a time per connection
- No multiplexing - multiple connections needed for parallel runs
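The one-in-flight limit above can be enforced per connection with a non-blocking busy flag: a second `response.create` is rejected while one is running, rather than queued. A minimal sketch (names are assumptions):

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// connState tracks whether this connection currently has a response running.
type connState struct {
	inFlight atomic.Bool
}

// begin claims the in-flight slot, failing if a response is already running.
func (s *connState) begin() error {
	if !s.inFlight.CompareAndSwap(false, true) {
		return errors.New("a response is already in flight on this connection")
	}
	return nil
}

// end releases the slot once response.done (or an error) has been sent.
func (s *connState) end() {
	s.inFlight.Store(false)
}

func main() {
	var s connState
	fmt.Println(s.begin() == nil) // first turn starts
	fmt.Println(s.begin() == nil) // second is rejected while the first runs
	s.end()
	fmt.Println(s.begin() == nil) // allowed again after completion
}
```

Since there is no multiplexing, callers needing parallel runs open additional connections rather than interleaving on one.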
Acceptance Criteria
- WebSocket endpoint accepts connections at `/v1/responses`
- `response.create` works for the initial turn with full context
- `response.create` with `previous_response_id` works for continuations
- All response events are properly streamed to the client
- Function call arguments are properly delta-streamed
- Error handling for `previous_response_not_found` implemented
- Connection timeout (60 min) is enforced
- Tool calling works correctly over WebSocket