From 5f04ca537fa334127f0afd79f1a411f2a6bd2761 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Nathan=20=F0=9F=94=B6=20Tarbert?= <66887028+NathanTarbert@users.noreply.github.com> Date: Tue, 31 Mar 2026 21:06:19 -0400 Subject: [PATCH] docs: update README with WebSocket support and expanded API reference Add video demo, document WebSocket endpoints (OpenAI Responses, Realtime, Gemini Live), expand fixture matching/response reference, and add error injection and request journal sections. --- README.md | 207 +++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 204 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 2b3448b..231f3d9 100644 --- a/README.md +++ b/README.md @@ -1,8 +1,12 @@ # @copilotkit/llmock [![Unit Tests](https://github.com/CopilotKit/llmock/actions/workflows/test-unit.yml/badge.svg)](https://github.com/CopilotKit/llmock/actions/workflows/test-unit.yml) [![Drift Tests](https://github.com/CopilotKit/llmock/actions/workflows/test-drift.yml/badge.svg)](https://github.com/CopilotKit/llmock/actions/workflows/test-drift.yml) [![npm version](https://img.shields.io/npm/v/@copilotkit/llmock)](https://www.npmjs.com/package/@copilotkit/llmock) -Deterministic mock LLM server for testing. A real HTTP server on a real port — not an in-process interceptor — so every process in your stack (Playwright, Next.js, agent workers, microservices) can point at it via `OPENAI_BASE_URL` / `ANTHROPIC_BASE_URL` and get reproducible, instant responses. Streams SSE in real OpenAI, Claude, Gemini, Bedrock, Azure, Vertex AI, Ollama, and Cohere API formats, driven entirely by fixtures. Zero runtime dependencies. +https://github.com/user-attachments/assets/1aa9f81d-7efb-4bd2-8e81-51f466f8a8e3 -## Quick Start +Deterministic multi-provider mock LLM server for testing. Streams SSE and WebSocket responses in real OpenAI, Claude, and Gemini API formats, driven entirely by fixtures. Zero runtime dependencies — built on Node.js builtins only. 
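+
+For example, an OpenAI SDK client can be redirected with a single environment variable (the `localhost:5555` port is assumed from the examples below; official OpenAI SDKs read `OPENAI_BASE_URL`):
+
+```bash
+export OPENAI_BASE_URL="http://localhost:5555/v1"
+```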
+
+Supports streaming (SSE), non-streaming JSON, and WebSocket responses across OpenAI (Chat Completions + Responses + Realtime), Anthropic Claude (Messages), and Google Gemini (GenerateContent + Live) APIs, covering text completions, tool calls, and error injection. Point any process at it via `OPENAI_BASE_URL`, `ANTHROPIC_BASE_URL`, or the Gemini base URL and get reproducible, instant responses.
+
+## Install

```bash
npm install @copilotkit/llmock
@@ -42,7 +46,204 @@ await mock.stop();
- **[Drift detection](https://llmock.copilotkit.dev/drift-detection.html)** — Daily CI runs against real APIs to catch response format changes
- **Claude Code integration** — `/write-fixtures` skill teaches your AI assistant how to write fixtures correctly
-## CLI Quick Reference
+### Error Injection
+
+#### `nextRequestError(status, errorBody?)`
+
+Queue a one-shot error for the very next request. The error fires once, then auto-removes itself.
+
+```typescript
+mock.nextRequestError(429, {
+  message: "Rate limited",
+  type: "rate_limit_error",
+});
+
+// Next request → 429 error
+// Subsequent requests → normal fixture matching
+```
+
+### Request Journal
+
+Every request to any API endpoint (`/v1/chat/completions`, `/v1/responses`, `/v1/messages`, the Gemini endpoints, and all WebSocket endpoints) is recorded in a journal.
+
+#### Programmatic Access
+
+| Method             | Returns                | Description                           |
+| ------------------ | ---------------------- | ------------------------------------- |
+| `getRequests()`    | `JournalEntry[]`       | All recorded requests                 |
+| `getLastRequest()` | `JournalEntry \| null` | Most recent request                   |
+| `clearRequests()`  | `void`                 | Clear the journal                     |
+| `journal`          | `Journal`              | Direct access to the journal instance |
+
+```typescript
+await fetch(mock.url + "/v1/chat/completions", { ... 
});
+
+const last = mock.getLastRequest();
+expect(last?.body.messages).toContainEqual({
+  role: "user",
+  content: "hello",
+});
+```
+
+#### HTTP Endpoints
+
+The server also exposes journal data over HTTP (useful in CLI mode):
+
+- `GET /v1/_requests` — returns all journal entries as JSON. Supports `?limit=N`.
+- `DELETE /v1/_requests` — clears the journal. Returns 204.
+
+### Reset
+
+#### `reset()`
+
+Clears all fixtures **and** the journal in one call. Works before or after the server has started.
+
+```typescript
+afterEach(() => {
+  mock.reset();
+});
+```
+
+## Fixture Matching
+
+Fixtures are evaluated in registration order (first match wins). A fixture matches when **all** of its specified fields match the incoming request (AND logic).
+
+| Field         | Type               | Matches on                                    |
+| ------------- | ------------------ | --------------------------------------------- |
+| `userMessage` | `string \| RegExp` | Content of the last `role: "user"` message    |
+| `toolName`    | `string`           | Name of a tool in the request's `tools` array |
+| `toolCallId`  | `string`           | `tool_call_id` on a `role: "tool"` message    |
+| `model`       | `string \| RegExp` | The `model` field in the request              |
+| `predicate`   | `(req) => boolean` | Arbitrary matching function                   |
+
+## Fixture Responses
+
+### Text
+
+```typescript
+{
+  content: "Hello world"
+}
+```
+
+Streams as SSE chunks, splitting `content` by `chunkSize`. With `stream: false`, returns a standard `chat.completion` JSON object.
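+
+For reference, the non-streaming body follows the standard OpenAI `chat.completion` shape — an abridged sketch (field values illustrative):
+
+```jsonc
+{
+  "id": "chatcmpl-...",
+  "object": "chat.completion",
+  "model": "gpt-4o",
+  "choices": [
+    {
+      "index": 0,
+      "message": { "role": "assistant", "content": "Hello world" },
+      "finish_reason": "stop"
+    }
+  ]
+}
+```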
+
+### Tool Calls
+
+```typescript
+{
+  toolCalls: [{ name: "get_weather", arguments: '{"location":"SF"}' }]
+}
+```
+
+### Errors
+
+```typescript
+{
+  error: { message: "Rate limited", type: "rate_limit_error" },
+  status: 429
+}
+```
+
+## API Endpoints
+
+The server handles:
+
+- **POST `/v1/chat/completions`** — OpenAI Chat Completions API (streaming and non-streaming)
+- **POST `/v1/responses`** — OpenAI Responses API (streaming and non-streaming)
+- **POST `/v1/messages`** — Anthropic Claude Messages API (streaming and non-streaming)
+- **POST `/v1beta/models/{model}:generateContent`** — Google Gemini (non-streaming)
+- **POST `/v1beta/models/{model}:streamGenerateContent`** — Google Gemini (streaming)
+
+WebSocket endpoints:
+
+- **WS `/v1/responses`** — OpenAI Responses API over WebSocket
+- **WS `/v1/realtime`** — OpenAI Realtime API (text + tool calls)
+- **WS `/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent`** — Gemini Live
+
+All endpoints share the same fixture pool — the same fixtures work across all providers. Requests are translated to a common format internally for fixture matching.
+
+## WebSocket APIs
+
+The same fixtures that drive HTTP responses also work over WebSocket transport. llmock implements RFC 6455 WebSocket framing with zero external dependencies — connect, send events, and receive streaming responses in real provider formats.
+
+Only text and tool call paths are supported over WebSocket. Audio, video, and binary frames are not implemented.
+
+### OpenAI Responses API (WebSocket)
+
+Connect to `ws://localhost:5555/v1/responses` and send a `response.create` event. 
The server streams back the same events as OpenAI's real WebSocket Responses API: + +```jsonc +// → Client sends: +{ + "type": "response.create", + "response": { + "modalities": ["text"], + "instructions": "You are a helpful assistant.", + "input": [ + { "type": "message", "role": "user", "content": [{ "type": "input_text", "text": "Hello" }] }, + ], + }, +} + +// ← Server streams: +// {"type": "response.created", ...} +// {"type": "response.output_item.added", ...} +// {"type": "response.content_part.added", ...} +// {"type": "response.output_item.done", ...} +// {"type": "response.done", ...} +``` + +### OpenAI Realtime API + +Connect to `ws://localhost:5555/v1/realtime`. The Realtime API uses a session-based protocol — configure the session, add conversation items, then request a response: + +```jsonc +// → Configure session: +{ "type": "session.update", "session": { "modalities": ["text"], "model": "gpt-4o-realtime" } } + +// → Add a user message: +{ + "type": "conversation.item.create", + "item": { + "type": "message", + "role": "user", + "content": [{ "type": "input_text", "text": "What is the capital of France?" }] + } +} + +// → Request a response: +{ "type": "response.create" } + +// ← Server streams: +// {"type": "response.created", ...} +// {"type": "response.text.delta", "delta": "The"} +// {"type": "response.text.delta", "delta": " capital"} +// ... +// {"type": "response.text.done", ...} +// {"type": "response.done", ...} +``` + +### Gemini Live (BidiGenerateContent) + +Connect to `ws://localhost:5555/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent`. 
Gemini Live uses a setup/content/response flow: + +```jsonc +// → Setup message (must be first): +{ "setup": { "model": "models/gemini-2.0-flash-live", "generationConfig": { "responseModalities": ["TEXT"] } } } + +// → Send user content: +{ "clientContent": { "turns": [{ "role": "user", "parts": [{ "text": "Hello" }] }], "turnComplete": true } } + +// ← Server streams: +// {"setupComplete": {}} +// {"serverContent": {"modelTurnComplete": false, "parts": [{"text": "Hello"}]}} +// {"serverContent": {"modelTurnComplete": true}} +``` + +## CLI + +The package includes a standalone server binary: ```bash llmock [options]