diff --git a/.mintlify/skills/fish-audio-api/SKILL.md b/.mintlify/skills/fish-audio-api/SKILL.md
index 68e21c3..d62c05f 100644
--- a/.mintlify/skills/fish-audio-api/SKILL.md
+++ b/.mintlify/skills/fish-audio-api/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: fish-audio-api
-description: Write direct HTTP / WebSocket calls to the Fish Audio platform (TTS, ASR, voice models, wallet, real-time TTS streaming) without depending on the Python or JavaScript SDK. Use when the user asks to call Fish Audio from curl, a language without an official SDK, an edge/runtime environment that cannot install the SDK, or when they explicitly want raw REST / WebSocket code. Covers authentication, endpoint URLs, required headers, request / response schemas, MessagePack vs JSON vs multipart encoding rules, multi-speaker dialogue, and the WebSocket streaming protocol.
+description: Write direct HTTP / WebSocket calls to the Fish Audio platform (TTS, ASR, voice design, voice models, wallet, real-time TTS streaming) without depending on the Python or JavaScript SDK. Use when the user asks to call Fish Audio from curl, a language without an official SDK, an edge/runtime environment that cannot install the SDK, or when they explicitly want raw REST / WebSocket code. Covers authentication, endpoint URLs, required headers, request / response schemas, MessagePack vs JSON vs multipart encoding rules, multi-speaker dialogue, voice-design candidate generation, and the WebSocket streaming protocol.
 ---
 
 # Fish Audio Raw API Skill
@@ -27,6 +27,7 @@ This file condenses those into rules an agent can apply directly.
 | --- | --- | --- |
 | POST | `/v1/tts` | Text-to-Speech (streams audio bytes) |
 | POST | `/v1/asr` | Speech-to-Text (returns JSON transcript) |
+| POST | `/v1/voice-design` | Voice Design (returns generated voice candidates) |
 | GET | `/model` | List voice models |
 | POST | `/model` | Create voice model (voice cloning) |
 | GET | `/model/{id}` | Get voice model metadata |
@@ -239,6 +240,77 @@ r.raise_for_status()
 print(r.json()["text"])
 ```
 
+## Voice Design — `POST /v1/voice-design`
+
+Required headers:
+
+- `Authorization: Bearer `
+- `Content-Type: application/json`
+- `model: voice-design-1` (required; currently the only public Voice Design model)
+
+Response: JSON `{ candidates: VoiceDesignCandidate[] }`. Each candidate includes `audio_base64`; decode it to write the generated audio bytes to a file. The current candidate audio payload is WAV bytes encoded as base64.
+
+### Request body fields (VoiceDesignRequest)
+
+| Field | Type | Default | Notes |
+| ------------------------- | -------------- | ------------ | --------------------------------------------------------------------------- |
+| `instruction` | string | — (required) | Voice design prompt. 1 to 2000 characters. |
+| `reference_text` | string \| null | null | Optional preview text to read in the generated voice. Up to 300 characters. |
+| `language` | string \| null | null | Optional language hint such as `en`, `zh`, or `ja`. |
+| `n` | int | 2 | Number of candidates. Range: 1 to 4. |
+| `speed` | number | 1.0 | Speaking speed multiplier. Must be greater than 0 and at most 3. |
+| `num_step` | int | 32 | Diffusion steps. Range: 1 to 128. |
+| `guidance_scale` | number | 2.0 | Prompt guidance. Must be at least 0. |
+| `instruct_guidance_scale` | number | 0.0 | Instruction guidance. Must be at least 0. |
+| `seed` | int \| null | null | Optional deterministic seed for candidate generation. |
+
+Do **not** send MessagePack, multipart form data, inline reference audio, or service-internal fields such as `features`, `features_json_file`, or `include_audio_base64`.
+
+### curl
+
+```bash
+curl --request POST https://api.fish.audio/v1/voice-design \
+  --header "Authorization: Bearer $FISH_API_KEY" \
+  --header "Content-Type: application/json" \
+  --header "model: voice-design-1" \
+  --data '{
+    "instruction": "Warm, confident studio narrator with a natural tone",
+    "reference_text": "Welcome to Fish Audio.",
+    "language": "en",
+    "n": 2
+  }' | jq -r '.candidates[0].audio_base64' | base64 --decode > voice.wav
+```
+
+### Python
+
+```python
+import base64
+import os
+import httpx
+
+r = httpx.post(
+    "https://api.fish.audio/v1/voice-design",
+    headers={
+        "Authorization": f"Bearer {os.environ['FISH_API_KEY']}",
+        "Content-Type": "application/json",
+        "model": "voice-design-1",
+    },
+    json={
+        "instruction": "Warm, confident studio narrator with a natural tone",
+        "reference_text": "Welcome to Fish Audio.",
+        "language": "en",
+        "n": 2,
+    },
+    timeout=120,
+)
+r.raise_for_status()
+candidate = r.json()["candidates"][0]
+with open("voice.wav", "wb") as f:
+    f.write(base64.b64decode(candidate["audio_base64"]))
+```
+
+Billing: one successful generation request is charged once, even when it returns multiple candidates. Authentication, validation, balance, concurrency, and service errors are not billed.
+
 ## Voice models — `/model`
 
 ### List: `GET /model`
diff --git a/api-reference/endpoint/openapi-v1/voice-design.mdx b/api-reference/endpoint/openapi-v1/voice-design.mdx
new file mode 100644
index 0000000..099f213
--- /dev/null
+++ b/api-reference/endpoint/openapi-v1/voice-design.mdx
@@ -0,0 +1,49 @@
+---
+openapi: post /v1/voice-design
+title: "Voice Design"
+description: "Generate candidate voices from a prompt"
+icon: "wand-magic-sparkles"
+iconType: "solid"
+---
+
+
+This endpoint only accepts `application/json`.
+
+You must include the `model: voice-design-1` header. Extra request fields are rejected.
+
+
+
+
+  A successful request returns generated voice candidates with `audio_base64`
+  audio payloads. Decode the base64 value to write the candidate audio to a
+  file.
+
+
+## Example
+
+```bash
+curl --request POST https://api.fish.audio/v1/voice-design \
+  --header "Authorization: Bearer $FISH_API_KEY" \
+  --header "Content-Type: application/json" \
+  --header "model: voice-design-1" \
+  --data '{
+    "instruction": "Warm, confident studio narrator with a natural tone",
+    "reference_text": "Welcome to Fish Audio.",
+    "language": "en",
+    "n": 2,
+    "speed": 1,
+    "num_step": 32,
+    "guidance_scale": 2,
+    "instruct_guidance_scale": 0,
+    "seed": 42
+  }'
+```
+
+## Usage notes
+
+- `instruction` is required and must be 1 to 2000 characters.
+- `reference_text` is optional preview text and can be up to 300 characters.
+- `n` controls how many candidates are returned. The supported range is 1 to 4.
+- `seed` is optional and can help reproduce candidate generation.
+- The endpoint is stateless: it does not create batches, samples, voice models, or presigned URLs.
+- Billing happens once per successful generation request, not once per candidate.
diff --git a/api-reference/introduction.mdx b/api-reference/introduction.mdx
index aba1caa..fa4e842 100644
--- a/api-reference/introduction.mdx
+++ b/api-reference/introduction.mdx
@@ -30,6 +30,10 @@ Use our [/model endpoint](/api-reference/endpoint/model/create-model) to create
 
 Use our [/v1/tts endpoint](/api-reference/endpoint/openapi-v1/text-to-speech) to generate speech.
 
+## Design a Voice
+
+Use our [/v1/voice-design endpoint](/api-reference/endpoint/openapi-v1/voice-design) to generate candidate voices from a prompt.
+
 ## Real-time Streaming
 
 Use our [Python SDK](/features/realtime-streaming) or [JavaScript SDK](/features/realtime-streaming) for real-time audio streaming with WebSocket.
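The limits this change documents for `VoiceDesignRequest` (for example `instruction` at 1 to 2000 characters and `n` from 1 to 4) can also be checked before a request is sent. The following is a minimal sketch, not part of the Fish Audio API or any SDK: `build_voice_design_payload` is a hypothetical helper, and leaving unset optional fields out of the body is an assumption based on their documented `null` defaults.

```python
from typing import Any, Optional


def build_voice_design_payload(
    instruction: str,
    reference_text: Optional[str] = None,
    language: Optional[str] = None,
    n: int = 2,
    speed: float = 1.0,
    num_step: int = 32,
    guidance_scale: float = 2.0,
    instruct_guidance_scale: float = 0.0,
    seed: Optional[int] = None,
) -> dict[str, Any]:
    """Hypothetical helper: mirror the documented limits, then return the JSON body."""
    if not 1 <= len(instruction) <= 2000:
        raise ValueError("instruction must be 1 to 2000 characters")
    if reference_text is not None and len(reference_text) > 300:
        raise ValueError("reference_text must be at most 300 characters")
    if not 1 <= n <= 4:
        raise ValueError("n must be between 1 and 4")
    if not 0 < speed <= 3:
        raise ValueError("speed must be greater than 0 and at most 3")
    if not 1 <= num_step <= 128:
        raise ValueError("num_step must be between 1 and 128")
    if guidance_scale < 0 or instruct_guidance_scale < 0:
        raise ValueError("guidance scales must be at least 0")

    payload: dict[str, Any] = {
        "instruction": instruction,
        "n": n,
        "speed": speed,
        "num_step": num_step,
        "guidance_scale": guidance_scale,
        "instruct_guidance_scale": instruct_guidance_scale,
    }
    # Assumption: optional fields default to null server-side, so omit them when unset.
    optional = {"reference_text": reference_text, "language": language, "seed": seed}
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


print(build_voice_design_payload("Warm, confident studio narrator", n=3))
```

Checking locally only saves a round trip; the server-side schema remains the authority, and the docs state that extra request fields are rejected.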
diff --git a/api-reference/openapi.json b/api-reference/openapi.json
index 70beecf..09a661b 100644
--- a/api-reference/openapi.json
+++ b/api-reference/openapi.json
@@ -3773,6 +3773,182 @@
           "OpenAPI v1"
         ]
       }
+    },
+    "/v1/voice-design": {
+      "post": {
+        "summary": "Voice Design",
+        "security": [
+          {
+            "BearerAuth": []
+          }
+        ],
+        "parameters": [
+          {
+            "in": "header",
+            "name": "model",
+            "description": "Specify which voice-design model to use.",
+            "required": true,
+            "schema": {
+              "const": "voice-design-1",
+              "default": "voice-design-1",
+              "title": "Model",
+              "type": "string"
+            },
+            "deprecated": false
+          }
+        ],
+        "requestBody": {
+          "required": true,
+          "content": {
+            "application/json": {
+              "schema": {
+                "$ref": "#/components/schemas/VoiceDesignRequest"
+              }
+            }
+          }
+        },
+        "responses": {
+          "200": {
+            "description": "Request fulfilled, document follows",
+            "headers": {},
+            "content": {
+              "application/json": {
+                "schema": {
+                  "properties": {
+                    "candidates": {
+                      "description": "Generated voice candidates.",
+                      "items": {
+                        "$ref": "#/components/schemas/VoiceDesignCandidate"
+                      },
+                      "title": "Candidates",
+                      "type": "array"
+                    }
+                  },
+                  "required": [
+                    "candidates"
+                  ],
+                  "type": "object"
+                }
+              }
+            }
+          },
+          "401": {
+            "description": "No permission -- see authorization schemes",
+            "headers": {},
+            "content": {
+              "application/json": {
+                "schema": {
+                  "properties": {
+                    "status": {
+                      "title": "Status",
+                      "type": "integer"
+                    },
+                    "message": {
+                      "title": "Message",
+                      "type": "string"
+                    }
+                  },
+                  "required": [
+                    "status",
+                    "message"
+                  ],
+                  "type": "object"
+                }
+              }
+            }
+          },
+          "402": {
+            "description": "No payment -- see charging schemes",
+            "headers": {},
+            "content": {
+              "application/json": {
+                "schema": {
+                  "properties": {
+                    "status": {
+                      "title": "Status",
+                      "type": "integer"
+                    },
+                    "message": {
+                      "title": "Message",
+                      "type": "string"
+                    }
+                  },
+                  "required": [
+                    "status",
+                    "message"
+                  ],
+                  "type": "object"
+                }
+              }
+            }
+          },
+          "422": {
+            "description": "",
+            "headers": {},
+            "content": {
+              "application/json": {
+                "schema": {
+                  "type": "array",
+                  "items": {
+                    "type": "object",
+                    "properties": {
+                      "loc": {
+                        "title": "Location",
+                        "description": "error field",
+                        "type": "array",
+                        "items": {
+                          "type": "string"
+                        }
+                      },
+                      "type": {
+                        "title": "Type",
+                        "description": "error type",
+                        "type": "string"
+                      },
+                      "msg": {
+                        "title": "Message",
+                        "description": "error message",
+                        "type": "string"
+                      },
+                      "ctx": {
+                        "title": "Context",
+                        "description": "error context",
+                        "type": "string"
+                      },
+                      "in": {
+                        "title": "In",
+                        "type": "string",
+                        "enum": [
+                          "path",
+                          "query",
+                          "header",
+                          "cookie",
+                          "body"
+                        ]
+                      }
+                    },
+                    "required": [
+                      "loc",
+                      "type",
+                      "msg"
+                    ]
+                  }
+                }
+              }
+            }
+          }
+        },
+        "tags": [
+          "OpenAPI v1"
+        ],
+        "x-codeSamples": [
+          {
+            "lang": "bash",
+            "label": "Voice Design",
+            "source": "curl --request POST \\\n --url https://api.fish.audio/v1/voice-design \\\n --header 'Authorization: Bearer ' \\\n --header 'Content-Type: application/json' \\\n --header 'model: voice-design-1' \\\n --data '{\n \"instruction\": \"Warm, confident studio narrator with a natural tone\",\n \"reference_text\": \"Welcome to Fish Audio.\",\n \"language\": \"en\",\n \"n\": 2,\n \"speed\": 1,\n \"num_step\": 32,\n \"guidance_scale\": 2,\n \"instruct_guidance_scale\": 0,\n \"seed\": 42\n }'"
+          }
+        ]
+      }
+    }
   },
   "tags": [],
@@ -4603,6 +4779,195 @@
         ],
         "title": "ASRSegment",
         "type": "object"
+      },
+      "VoiceDesignRequest": {
+        "additionalProperties": false,
+        "description": "Request body for synchronous voice design generation. The endpoint returns generated voice candidates with base64-encoded audio.",
+        "examples": [
+          {
+            "guidance_scale": 2,
+            "instruct_guidance_scale": 0,
+            "instruction": "Warm, confident studio narrator with a natural tone",
+            "language": "en",
+            "n": 2,
+            "num_step": 32,
+            "reference_text": "Welcome to Fish Audio.",
+            "seed": 42,
+            "speed": 1
+          }
+        ],
+        "properties": {
+          "instruction": {
+            "description": "Voice design prompt. Must contain 1 to 2000 characters.",
+            "maxLength": 2000,
+            "minLength": 1,
+            "title": "Instruction",
+            "type": "string"
+          },
+          "reference_text": {
+            "anyOf": [
+              {
+                "maxLength": 300,
+                "type": "string"
+              },
+              {
+                "type": "null"
+              }
+            ],
+            "default": null,
+            "description": "Optional text used as reference content for the generated voice.",
+            "title": "Reference Text"
+          },
+          "language": {
+            "anyOf": [
+              {
+                "type": "string"
+              },
+              {
+                "type": "null"
+              }
+            ],
+            "default": null,
+            "description": "Optional BCP-47 language hint, such as `en`, `zh`, or `ja`.",
+            "title": "Language"
+          },
+          "n": {
+            "default": 2,
+            "description": "Number of voice candidates to generate.",
+            "maximum": 4,
+            "minimum": 1,
+            "title": "N",
+            "type": "integer"
+          },
+          "speed": {
+            "default": 1,
+            "description": "Speaking speed multiplier for candidate generation.",
+            "exclusiveMinimum": 0,
+            "maximum": 3,
+            "title": "Speed",
+            "type": "number"
+          },
+          "num_step": {
+            "default": 32,
+            "description": "Number of diffusion steps used by the voice-design model.",
+            "maximum": 128,
+            "minimum": 1,
+            "title": "Num Step",
+            "type": "integer"
+          },
+          "guidance_scale": {
+            "default": 2,
+            "description": "Classifier-free guidance scale. Higher values follow the prompt more strongly.",
+            "minimum": 0,
+            "title": "Guidance Scale",
+            "type": "number"
+          },
+          "instruct_guidance_scale": {
+            "default": 0,
+            "description": "Instruction guidance scale for prompt conditioning.",
+            "minimum": 0,
+            "title": "Instruct Guidance Scale",
+            "type": "number"
+          },
+          "seed": {
+            "anyOf": [
+              {
+                "type": "integer"
+              },
+              {
+                "type": "null"
+              }
+            ],
+            "default": null,
+            "description": "Optional deterministic seed for candidate generation.",
+            "title": "Seed"
+          }
+        },
+        "required": [
+          "instruction"
+        ],
+        "title": "VoiceDesignRequest",
+        "type": "object"
+      },
+      "VoiceDesignCandidate": {
+        "properties": {
+          "id": {
+            "description": "Stable candidate identifier.",
+            "title": "Id",
+            "type": "string"
+          },
+          "index": {
+            "description": "Candidate index in this response.",
+            "minimum": 0,
+            "title": "Index",
+            "type": "integer"
+          },
+          "audio_base64": {
+            "description": "Base64 encoded generated audio.",
+            "title": "Audio Base64",
+            "type": "string"
+          },
+          "sample_rate": {
+            "description": "Audio sample rate in Hz.",
+            "exclusiveMinimum": 0,
+            "title": "Sample Rate",
+            "type": "integer"
+          },
+          "duration_ms": {
+            "description": "Audio duration in milliseconds.",
+            "minimum": 0,
+            "title": "Duration Ms",
+            "type": "integer"
+          },
+          "text": {
+            "anyOf": [
+              {
+                "type": "string"
+              },
+              {
+                "type": "null"
+              }
+            ],
+            "default": null,
+            "description": "Preview text associated with this generated voice, when available.",
+            "title": "Text"
+          },
+          "instruct": {
+            "anyOf": [
+              {
+                "type": "string"
+              },
+              {
+                "type": "null"
+              }
+            ],
+            "default": null,
+            "description": "Instruction text associated with this candidate, when available.",
+            "title": "Instruct"
+          },
+          "language": {
+            "anyOf": [
+              {
+                "type": "string"
+              },
+              {
+                "type": "null"
+              }
+            ],
+            "default": null,
+            "description": "Detected or requested candidate language, when available.",
+            "title": "Language"
+          }
+        },
+        "required": [
+          "id",
+          "index",
+          "audio_base64",
+          "sample_rate",
+          "duration_ms"
+        ],
+        "title": "VoiceDesignCandidate",
+        "type": "object"
       }
     }
   },
diff --git a/developer-guide/models-pricing/pricing-and-rate-limits.mdx b/developer-guide/models-pricing/pricing-and-rate-limits.mdx
index 58cf144..89eae97 100644
--- a/developer-guide/models-pricing/pricing-and-rate-limits.mdx
+++ b/developer-guide/models-pricing/pricing-and-rate-limits.mdx
@@ -40,6 +40,17 @@ TTS pricing is based on the size of input text, measured in millions of UTF-8 by
 - Charges are based on the duration of audio processed
 - Duration is rounded up to the nearest second
 
+### Voice Design
+
+| Model Name | Price (USD) |
+|------------------|--------------------------------|
+| `voice-design-1` | $0.01 / successful API request |
+
+**How Voice Design billing works:**
+- Charges are based on successful `POST /v1/voice-design` requests
+- One successful request is charged once, even when it returns multiple candidates
+- Authentication, validation, balance, concurrency, and service errors are not billed
+
 ## Rate Limits
 
 These limits help us ensure fair usage and maintain service quality for all users.
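As a rough illustration of the Voice Design billing rule in the pricing hunk above (a flat charge per successful request, regardless of how many candidates come back), the arithmetic can be sketched as follows. `estimate_voice_design_cost` is a hypothetical helper, and treating "HTTP 200 with at least one candidate" as the billable case is an assumption drawn from the wording in this change, not a documented contract.

```python
from decimal import Decimal

# From the pricing table in this change: $0.01 per successful API request.
PRICE_PER_REQUEST_USD = Decimal("0.01")


def estimate_voice_design_cost(outcomes):
    """Hypothetical helper: estimate spend from (http_status, candidate_count) pairs.

    Assumption: only HTTP 200 responses that contain at least one candidate
    are billed; the number of candidates does not change the charge.
    """
    billable = sum(1 for status, count in outcomes if status == 200 and count > 0)
    return billable * PRICE_PER_REQUEST_USD


# Two successful requests (4 candidates and 1 candidate), one validation
# error (422), and one insufficient-balance error (402).
outcomes = [(200, 4), (422, 0), (402, 0), (200, 1)]
print(estimate_voice_design_cost(outcomes))  # prints 0.02
```

`Decimal` is used so that repeated additions of one cent do not pick up binary floating-point rounding error.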
diff --git a/docs.json b/docs.json
index 4254322..d9b086d 100644
--- a/docs.json
+++ b/docs.json
@@ -46,6 +46,7 @@
                 ]
               },
               "features/speech-to-text",
+              "features/voice-design",
               "features/voice-cloning",
               "features/realtime-streaming",
               "features/manage-voices"
@@ -195,6 +196,13 @@
                   "api-reference/endpoint/openapi-v1/speech-to-text",
                   "api-reference/endpoint/websocket/tts-live"
                 ]
+              },
+              {
+                "group": "Voice Design",
+                "icon": "wand-magic-sparkles",
+                "pages": [
+                  "api-reference/endpoint/openapi-v1/voice-design"
+                ]
               }
             ]
           },
diff --git a/features/voice-design.mdx b/features/voice-design.mdx
new file mode 100644
index 0000000..6bbb354
--- /dev/null
+++ b/features/voice-design.mdx
@@ -0,0 +1,184 @@
+---
+title: "Voice Design"
+description: "Generate candidate voices from a prompt"
+icon: "wand-magic-sparkles"
+---
+
+Voice Design creates short voice candidates from a natural-language prompt. Use it when you want to explore a voice direction before building a longer text-to-speech workflow or creating a persistent voice model.
+
+
+
+    Every parameter for `POST /v1/voice-design`.
+
+
+    Voice Design is billed per successful generation request.
+
+
+    Create a reusable voice model from reference audio.
+
+
+
+## When to use it
+
+
+
+    Generate several candidate voices from a short creative brief.
+
+
+    Provide preview text to hear how a generated voice reads a specific line.
+
+
+    Use generated candidates to choose a voice direction before longer TTS
+    production.
+
+
+    Get generated audio directly without creating batches, samples, or voice
+    models.
+
+
+
+## Quick start
+
+Send a JSON request with a prompt and receive generated candidates. The current candidate audio payload is WAV bytes encoded as base64.
+
+
+```bash API (curl)
+curl --request POST https://api.fish.audio/v1/voice-design \
+  --header "Authorization: Bearer $FISH_API_KEY" \
+  --header "Content-Type: application/json" \
+  --header "model: voice-design-1" \
+  --data '{
+    "instruction": "Warm, confident studio narrator with a natural tone",
+    "reference_text": "Welcome to Fish Audio.",
+    "language": "en",
+    "n": 2
+  }' | jq -r '.candidates[0].audio_base64' | base64 --decode > voice.wav
+```
+
+```python Python
+import base64
+import os
+import requests
+
+response = requests.post(
+    "https://api.fish.audio/v1/voice-design",
+    headers={
+        "Authorization": f"Bearer {os.environ['FISH_API_KEY']}",
+        "Content-Type": "application/json",
+        "model": "voice-design-1",
+    },
+    json={
+        "instruction": "Warm, confident studio narrator with a natural tone",
+        "reference_text": "Welcome to Fish Audio.",
+        "language": "en",
+        "n": 2,
+    },
+    timeout=120,
+)
+response.raise_for_status()
+
+candidate = response.json()["candidates"][0]
+with open("voice.wav", "wb") as f:
+    f.write(base64.b64decode(candidate["audio_base64"]))
+
+print(candidate["sample_rate"], candidate["duration_ms"])
+```
+
+```javascript JavaScript
+import { writeFile } from "node:fs/promises";
+
+const response = await fetch("https://api.fish.audio/v1/voice-design", {
+  method: "POST",
+  headers: {
+    Authorization: `Bearer ${process.env.FISH_API_KEY}`,
+    "Content-Type": "application/json",
+    model: "voice-design-1",
+  },
+  body: JSON.stringify({
+    instruction: "Warm, confident studio narrator with a natural tone",
+    reference_text: "Welcome to Fish Audio.",
+    language: "en",
+    n: 2,
+  }),
+});
+
+if (!response.ok)
+  throw new Error(`${response.status} ${await response.text()}`);
+
+const { candidates } = await response.json();
+await writeFile("voice.wav", Buffer.from(candidates[0].audio_base64, "base64"));
+console.log(candidates[0].sample_rate, candidates[0].duration_ms);
+```
+
+
+
+## Prompt and preview text
+
+`instruction` is the main voice design prompt. Describe the voice, age, delivery, tone, accent, pacing, and context in natural language.
+
+```json
+{
+  "instruction": "Energetic young presenter, bright tone, crisp diction, friendly but not cartoonish",
+  "reference_text": "Here is your weekly product update.",
+  "language": "en",
+  "n": 3
+}
+```
+
+`reference_text` is optional. When you provide it, candidates read that text so you can compare voices on the same line. Keep it short; the API accepts up to 300 characters.
+
+## Parameters
+
+| Field | Default | Notes |
+| ------------------------- | -------- | ------------------------------------------------------------------ |
+| `instruction` | Required | Voice design prompt. 1 to 2000 characters. |
+| `reference_text` | `null` | Optional preview text. Up to 300 characters. |
+| `language` | `null` | Optional language hint such as `en`, `zh`, or `ja`. |
+| `n` | `2` | Number of candidates to generate. Range: 1 to 4. |
+| `speed` | `1.0` | Speaking speed multiplier. Must be greater than 0 and at most 3. |
+| `num_step` | `32` | Diffusion steps. Range: 1 to 128. |
+| `guidance_scale` | `2.0` | Higher values follow the prompt more strongly. Must be at least 0. |
+| `instruct_guidance_scale` | `0.0` | Prompt conditioning guidance. Must be at least 0. |
+| `seed` | `null` | Optional deterministic seed for candidate generation. |
+
+
+  Voice Design accepts JSON only. Do not send MessagePack, multipart form data,
+  inline reference audio, or service-internal fields such as `features`,
+  `features_json_file`, or `include_audio_base64`.
+
+
+## Response
+
+The response contains one or more generated candidates:
+
+```json
+{
+  "candidates": [
+    {
+      "id": "candidate-id",
+      "index": 0,
+      "audio_base64": "UklGRg...",
+      "sample_rate": 44100,
+      "duration_ms": 3100,
+      "text": "Welcome to Fish Audio.",
+      "language": "en"
+    }
+  ]
+}
+```
+
+Use `index` to preserve the order returned by the model. `id` is a stable candidate identifier for this response. Optional fields such as `text`, `instruct`, and `language` appear only when available.
+
+## Billing and errors
+
+Voice Design is billed once per successful generation request, not once per candidate. Authentication errors, validation errors, insufficient API credit, concurrency limits, upstream service errors, and empty candidate responses are not billed.
+
+For the full error format and retry guidance, see [Errors](/api-reference/errors).
diff --git a/llms.txt b/llms.txt
index 4b88bc6..a4a8918 100644
--- a/llms.txt
+++ b/llms.txt
@@ -18,6 +18,7 @@
 - [API Introduction](https://docs.fish.audio/api-reference/introduction.md): How to use the Fish Audio API.
 - [Text to Speech Endpoint](https://docs.fish.audio/api-reference/endpoint/openapi-v1/text-to-speech.md): Convert text to speech.
 - [Speech to Text Endpoint](https://docs.fish.audio/api-reference/endpoint/openapi-v1/speech-to-text.md): Transcribe audio to text.
+- [Voice Design Endpoint](https://docs.fish.audio/api-reference/endpoint/openapi-v1/voice-design.md): Generate candidate voices from a prompt.
 - [List Models](https://docs.fish.audio/api-reference/endpoint/model/list-models.md): Get a list of all models.
 - [Create Model](https://docs.fish.audio/api-reference/endpoint/model/create-model.md): Create a new voice model.
 - [Get Model](https://docs.fish.audio/api-reference/endpoint/model/get-model.md): Get details of a specific model.
@@ -42,6 +43,7 @@
 
 - [Text to Speech Guide](https://docs.fish.audio/developer-guide/core-features/text-to-speech.md): Convert text to natural-sounding speech with Fish Audio.
 - [Speech to Text Guide](https://docs.fish.audio/developer-guide/core-features/speech-to-text.md): Convert audio recordings into accurate text transcriptions.
+- [Voice Design Guide](https://docs.fish.audio/features/voice-design.md): Generate candidate voices from natural-language prompts.
 - [Creating Voice Models](https://docs.fish.audio/developer-guide/core-features/creating-models.md): Learn how to create custom voice models with Fish Audio.
 - [Emotion Control](https://docs.fish.audio/developer-guide/core-features/emotions.md): Add natural emotions and expressions to your AI-generated speech.
 - [Fine-grained Control](https://docs.fish.audio/developer-guide/core-features/fine-grained-control.md): Advanced control over speech generation.
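The examples in this change write only `candidates[0]` to disk. As a sketch of handling the whole response described in `features/voice-design.mdx`, the helper below decodes every candidate and orders the files by `index`. `save_candidates` and the `voice_<index>.wav` naming are hypothetical, and the demo uses placeholder bytes in a stand-in dictionary rather than a real API response.

```python
import base64
import tempfile
from pathlib import Path


def save_candidates(response_json, out_dir="."):
    """Hypothetical helper: write each candidate's decoded audio to voice_<index>.wav."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    # Sort by `index` so files follow the order returned by the model.
    for candidate in sorted(response_json["candidates"], key=lambda c: c["index"]):
        path = out / f"voice_{candidate['index']}.wav"
        path.write_bytes(base64.b64decode(candidate["audio_base64"]))
        paths.append(path)
    return paths


# Stand-in for `response.json()`; the audio values are placeholder bytes, not real WAV data.
fake_response = {
    "candidates": [
        {"id": "b", "index": 1, "audio_base64": base64.b64encode(b"RIFF-one").decode(),
         "sample_rate": 44100, "duration_ms": 10},
        {"id": "a", "index": 0, "audio_base64": base64.b64encode(b"RIFF-zero").decode(),
         "sample_rate": 44100, "duration_ms": 10},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    for written in save_candidates(fake_response, tmp):
        print(written.name, written.stat().st_size)
```

Because the endpoint is described as stateless, anything not saved from the response would have to be regenerated with a new request.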