Merged
21 changes: 21 additions & 0 deletions docs/archives/chat-audio-tts-routing/plan.md
@@ -0,0 +1,21 @@
# Chat Audio TTS Routing Plan

## Implementation

- Tighten `isChatAudioTtsModel` so MiMo IDs must match the known MiMo prefixes and include a standalone `tts` segment.
- Update `executeTtsPatternB` to treat `message.content` as unknown response data.
- Extract audio parts only after checking `Array.isArray(message.content)`.
- Keep `message.audio.data` as the first-preference extraction path.
- Leave the existing missing-audio error path in place for responses that contain no audio data.
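
The extraction order in the bullets above can be sketched in isolation. This is a minimal illustration, not the runtime's code: `pickAudioData`, `isRecord`, and `UnknownMessage` are hypothetical names, and the sketch returns the first audio part that actually carries data, which is slightly more permissive than stopping at the first audio part.

```typescript
// Hypothetical standalone sketch; only the response field names come from the plan.
type UnknownMessage = { audio?: { data?: unknown }; content?: unknown }

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null
}

function pickAudioData(message: UnknownMessage | undefined): string | undefined {
  // First preference: message.audio.data, accepted only as a non-empty string.
  const direct = message?.audio?.data
  if (typeof direct === 'string' && direct) {
    return direct
  }

  // Fallback: scan content parts, but only when content is really an array.
  const content = message?.content
  if (!Array.isArray(content)) {
    return undefined
  }
  for (const part of content) {
    if (!isRecord(part) || part['type'] !== 'audio') {
      continue
    }
    const audio = part['audio']
    const data = isRecord(audio) ? audio['data'] : undefined
    if (typeof data === 'string' && data) {
      return data
    }
  }
  return undefined
}

// String content yields undefined, so the caller can raise its missing-audio error.
console.log(pickAudioData({ content: 'plain text response without audio' }))
```

Returning `undefined` for every unexpected shape keeps the decision about how to fail in one place, the caller's existing missing-audio check.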

## Test Strategy

- Add shared helper coverage for MiMo TTS and non-TTS model IDs.
- Extend `test/main/presenter/llmProviderPresenter/aiSdkRuntime.test.ts`.
- Cover `mimo-v2.5-pro` using normal chat streaming instead of direct TTS `fetch`.
- Cover a successful HTTP response with string `message.content` and no audio payload.
- Assert the runtime rejects with the expected missing-audio error, not `content.find is not a function`.
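
The failure mode that the last two bullets guard against can be reproduced outside the runtime. The sketch below is illustrative only: `findAudioUnguarded` and `findAudioGuarded` are hypothetical helpers that stand in for the pre-fix and post-fix access patterns, and the exact `TypeError` message depends on the JavaScript engine, so only the error type is checked.

```typescript
// Illustrative only: reproduces the response-shape crash with plain values.
type AudioPart = { type?: string; audio?: { data?: string } }

// Pre-fix style access: trusts that `content` is an array of parts.
function findAudioUnguarded(content: unknown): AudioPart | undefined {
  return (content as AudioPart[]).find((item) => item?.type === 'audio')
}

// Post-fix style access: checks the shape before calling array methods.
function findAudioGuarded(content: unknown): AudioPart | undefined {
  if (!Array.isArray(content)) {
    return undefined
  }
  return (content as AudioPart[]).find((item) => item?.type === 'audio')
}

const stringContent: unknown = 'plain text response without audio'

let unguardedThrewTypeError = false
try {
  // Strings have no `find` method, so this call throws at runtime.
  findAudioUnguarded(stringContent)
} catch (error) {
  unguardedThrewTypeError = error instanceof TypeError
}

const guardedResult = findAudioGuarded(stringContent)
console.log({ unguardedThrewTypeError, guardedResult })
```

The guarded variant returns `undefined`, which is what lets the regression test assert on the missing-audio error rather than on a crash.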

## Compatibility

This change is backward-compatible for actual MiMo TTS models. Non-TTS MiMo chat models stop being routed through TTS handling, while providers returning `message.audio.data` or array content audio parts keep the same behavior.
26 changes: 26 additions & 0 deletions docs/archives/chat-audio-tts-routing/spec.md
@@ -0,0 +1,26 @@
# Chat Audio TTS Routing

## User Story

When a MiMo chat model is selected, DeepChat should only enter chat-audio TTS handling for model IDs that are actually TTS variants. Regular MiMo chat models such as `MiMo-V2.5-Pro` should use the normal chat streaming runtime.
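
A rough sketch of the intended classification follows. The prefixes and the `tts` segment pattern mirror the values this change puts in `src/shared/ttsSettings.ts`; `normalizeModelId` and `looksLikeChatAudioTts` are hypothetical names, and stripping a leading `provider/` prefix is an assumption about normalization rather than the helper's confirmed behavior.

```typescript
// Values mirrored from the change; the function names are illustrative.
const PREFIXES = ['mimo-v', 'xiaomi-mimo-v'] as const
const TTS_SEGMENT = /(^|-)tts($|-)/

// Assumed normalization: trim, lower-case, and drop a leading "provider/" prefix.
function normalizeModelId(modelId: string): string {
  const lowered = modelId.trim().toLowerCase()
  const slash = lowered.lastIndexOf('/')
  return slash >= 0 ? lowered.slice(slash + 1) : lowered
}

function looksLikeChatAudioTts(modelId: string): boolean {
  const id = normalizeModelId(modelId)
  // Both conditions must hold: a known MiMo prefix and a standalone `tts` segment.
  return PREFIXES.some((prefix) => id.startsWith(prefix)) && TTS_SEGMENT.test(id)
}

const samples = [
  'mimo-v2.5-tts',
  'xiaomi-mimo-v2.5-tts-preview',
  'xiaomimimo/mimo-v2.5-tts',
  'MiMo-V2.5-Pro',
  'xiaomimimo/mimo-v2.5-pro'
]
for (const id of samples) {
  console.log(id, looksLikeChatAudioTts(id))
}
```

Because the pattern requires `tts` to be bounded by hyphens or the ends of the ID, a name that merely contains those letters inside a longer segment is not treated as a TTS model.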

## Acceptance Criteria

- `mimo-v2.5-pro` and provider-prefixed variants are not classified as TTS models.
- MiMo model IDs with a `tts` segment, such as `mimo-v2.5-tts`, continue to use chat-audio TTS Pattern B.
- Chat-audio TTS responses with `choices[0].message.audio.data` continue to emit cached audio.
- Chat-audio TTS responses with array `choices[0].message.content` can still extract an audio content part.
- Chat-audio TTS responses with string `choices[0].message.content` do not throw a `TypeError`.
- If no audio payload exists, DeepChat raises the existing missing-audio error instead of a response-shape crash.

## Non-Goals

- No changes to renderer audio display behavior.
- No changes to request body construction for chat-audio TTS models.

## Constraints

- Keep the fix localized to the AI SDK runtime.
- Keep TTS model classification in shared helpers so provider and agent runtime checks agree.
- Preserve current OpenAI-compatible chat-audio behavior.
- Add focused regression coverage for the reported MiMo Pro misrouting and response shape.
8 changes: 8 additions & 0 deletions docs/archives/chat-audio-tts-routing/tasks.md
@@ -0,0 +1,8 @@
# Chat Audio TTS Routing Tasks

- [x] Create SDD issue artifacts.
- [x] Guard chat-audio TTS content audio extraction by response shape.
- [x] Add a regression test for string `message.content`.
- [x] Tighten MiMo chat-audio TTS classification.
- [x] Add regression coverage for MiMo Pro chat routing.
- [x] Run focused test coverage and quality checks.
26 changes: 21 additions & 5 deletions src/main/presenter/llmProviderPresenter/aiSdk/runtime.ts
@@ -403,6 +403,22 @@ function extractTtsText(messages: ChatMessage[]): string {
   return ''
 }

+function extractChatAudioContentData(content: unknown): string | undefined {
+  if (!Array.isArray(content)) {
+    return undefined
+  }
+
+  const audioPart = content.find(
+    (item) => item && typeof item === 'object' && 'type' in item && item.type === 'audio'
+  )
+  const audioData =
+    audioPart && typeof audioPart === 'object' && 'audio' in audioPart
+      ? (audioPart.audio as { data?: unknown } | undefined)?.data
+      : undefined
+
+  return typeof audioData === 'string' && audioData ? audioData : undefined
+}
+
 /**
  * Pattern A: calls the standard OpenAI-compatible /audio/speech endpoint.
  */
@@ -521,15 +537,15 @@ async function executeTtsPatternB(
   const json = (await response.json()) as {
     choices?: Array<{
       message?: {
-        audio?: { data?: string }
-        content?: Array<{ type?: string; audio?: { data?: string } }>
+        audio?: { data?: unknown }
+        content?: unknown
       }
     }>
   }
   const firstMessage = json.choices?.[0]?.message
-  const audioData =
-    firstMessage?.audio?.data ??
-    firstMessage?.content?.find((item) => item?.type === 'audio')?.audio?.data
+  const directAudioData =
+    typeof firstMessage?.audio?.data === 'string' ? firstMessage.audio.data : undefined
+  const audioData = directAudioData ?? extractChatAudioContentData(firstMessage?.content)
   if (!audioData) {
     throw new Error('TTS response missing audio data in choices[0].message.audio.data')
   }
7 changes: 4 additions & 3 deletions src/shared/ttsSettings.ts
@@ -28,7 +28,8 @@ export const GEMINI_GENERATE_CONTENT_TTS_MODELS = [
  * Model ID prefixes for TTS models that use the chat completions endpoint
  * with audio output (Pattern B), e.g. xiaomimimo mimo-v2.5-tts series.
  */
-export const CHAT_AUDIO_TTS_MODEL_PREFIXES = ['mimo-v'] as const
+export const CHAT_AUDIO_TTS_MODEL_PREFIXES = ['mimo-v', 'xiaomi-mimo-v'] as const
+const CHAT_AUDIO_TTS_MODEL_MARKER_PATTERN = /(^|-)tts($|-)/

 function normalizeTtsModelId(modelId: string): string {
   const trimmed = modelId.trim().toLowerCase()
@@ -59,8 +60,8 @@ export function isGeminiGenerateContentTtsModel(modelId: string): boolean {
 export function isChatAudioTtsModel(modelId: string): boolean {
   const id = normalizeTtsModelId(modelId)
   return (
-    CHAT_AUDIO_TTS_MODEL_PREFIXES.some((prefix) => id.startsWith(prefix)) ||
-    id.startsWith('xiaomi-mimo-v')
+    CHAT_AUDIO_TTS_MODEL_PREFIXES.some((prefix) => id.startsWith(prefix)) &&
+    CHAT_AUDIO_TTS_MODEL_MARKER_PATTERN.test(id)
   )
 }
172 changes: 172 additions & 0 deletions test/main/presenter/llmProviderPresenter/aiSdkRuntime.test.ts
@@ -367,6 +367,43 @@ describe('AI SDK runtime', () => {
     expect(request).not.toHaveProperty('providerOptions')
   })

+  it('uses normal chat streaming for non-TTS MiMo Pro models', async () => {
+    const fetchMock = vi.fn()
+    vi.stubGlobal('fetch', fetchMock)
+
+    const context = {
+      providerKind: 'openai-compatible',
+      provider: {
+        id: 'xiaomimimo',
+        apiType: 'openai-compatible',
+        baseUrl: 'https://example.com/v1',
+        apiKey: 'test-key'
+      },
+      configPresenter: {},
+      defaultHeaders: {}
+    } as any
+
+    const events = []
+    for await (const event of runAiSdkCoreStream(
+      context,
+      [{ role: 'user', content: 'hello mimo' }],
+      'mimo-v2.5-pro',
+      {
+        apiEndpoint: 'chat',
+        functionCall: false
+      } as any,
+      0.7,
+      1024,
+      []
+    )) {
+      events.push(event)
+    }
+
+    expect(fetchMock).not.toHaveBeenCalled()
+    expect(mockStreamText).toHaveBeenCalledTimes(1)
+    expect(events).toEqual([])
+  })
+
   it('includes an assistant role message for chat-audio TTS requests', async () => {
     const fetchMock = vi.fn().mockResolvedValue(
       new Response(
@@ -450,6 +487,141 @@ describe('AI SDK runtime', () => {
     ])
   })

+  it('extracts chat-audio TTS data from content audio parts', async () => {
+    const fetchMock = vi.fn().mockResolvedValue(
+      new Response(
+        JSON.stringify({
+          choices: [
+            {
+              message: {
+                content: [
+                  { type: 'text', text: 'ok' },
+                  {
+                    type: 'audio',
+                    audio: {
+                      data: 'ZmFrZS1hdWRpby1wYXJ0'
+                    }
+                  }
+                ]
+              }
+            }
+          ]
+        }),
+        {
+          status: 200,
+          headers: {
+            'Content-Type': 'application/json'
+          }
+        }
+      )
+    )
+    vi.stubGlobal('fetch', fetchMock)
+
+    const context = {
+      providerKind: 'openai-compatible',
+      provider: {
+        id: 'xiaomimimo',
+        apiType: 'openai-compatible',
+        baseUrl: 'https://example.com/v1',
+        apiKey: 'test-key'
+      },
+      configPresenter: {},
+      defaultHeaders: {},
+      shouldUseTts: () => true
+    } as any
+
+    const events = []
+    for await (const event of runAiSdkCoreStream(
+      context,
+      [{ role: 'user', content: 'hello tts' }],
+      'mimo-v2.5-tts',
+      {
+        apiEndpoint: 'chat',
+        tts: {
+          responseFormat: 'wav'
+        }
+      } as any,
+      0.7,
+      1024,
+      []
+    )) {
+      events.push(event)
+    }
+
+    expect(events).toEqual([
+      {
+        type: 'image_data',
+        image_data: {
+          data: 'cached://image',
+          mimeType: 'audio/wav'
+        }
+      },
+      {
+        type: 'stop',
+        stop_reason: 'complete'
+      }
+    ])
+  })
+
+  it('fails cleanly when chat-audio TTS content is text without audio data', async () => {
+    const fetchMock = vi.fn().mockResolvedValue(
+      new Response(
+        JSON.stringify({
+          choices: [
+            {
+              message: {
+                content: 'plain text response without audio'
+              }
+            }
+          ]
+        }),
+        {
+          status: 200,
+          headers: {
+            'Content-Type': 'application/json'
+          }
+        }
+      )
+    )
+    vi.stubGlobal('fetch', fetchMock)
+
+    const context = {
+      providerKind: 'openai-compatible',
+      provider: {
+        id: 'xiaomimimo',
+        apiType: 'openai-compatible',
+        baseUrl: 'https://example.com/v1',
+        apiKey: 'test-key'
+      },
+      configPresenter: {},
+      defaultHeaders: {},
+      shouldUseTts: () => true
+    } as any
+
+    const drainStream = async () => {
+      for await (const _event of runAiSdkCoreStream(
+        context,
+        [{ role: 'user', content: 'hello tts' }],
+        'mimo-v2.5-tts',
+        {
+          apiEndpoint: 'chat',
+          tts: {
+            responseFormat: 'wav'
+          }
+        } as any,
+        0.7,
+        1024,
+        []
+      )) {
+        // Drain stream.
+      }
+    }
+
+    await expect(drainStream()).rejects.toThrow(
+      'TTS response missing audio data in choices[0].message.audio.data'
+    )
+  })
+
   it('uses Gemini generateContent compatibility mode for AIHubMix Gemini TTS models', async () => {
     const pcmBase64 = Buffer.from([0, 0, 255, 127]).toString('base64')
     const fetchMock = vi.fn().mockResolvedValue(
14 changes: 14 additions & 0 deletions test/main/shared/ttsSettings.test.ts
@@ -0,0 +1,14 @@
import { describe, expect, it } from 'vitest'
import { isChatAudioTtsModel, isTtsModelId } from '@shared/ttsSettings'

describe('TTS model helpers', () => {
  it('classifies only MiMo TTS variants as chat-audio TTS models', () => {
    expect(isChatAudioTtsModel('mimo-v2.5-tts')).toBe(true)
    expect(isChatAudioTtsModel('xiaomi-mimo-v2.5-tts-preview')).toBe(true)
    expect(isChatAudioTtsModel('xiaomimimo/mimo-v2.5-tts')).toBe(true)

    expect(isChatAudioTtsModel('mimo-v2.5-pro')).toBe(false)
    expect(isChatAudioTtsModel('xiaomimimo/mimo-v2.5-pro')).toBe(false)
    expect(isTtsModelId('mimo-v2.5-pro')).toBe(false)
  })
})