Towards controllable speech synthesis in the era of large language models: A systematic survey #52

Introduction

  • This paper is a comprehensive survey focused on controllable text-to-speech (TTS) — i.e., methods that let users steer attributes such as emotion, timbre, prosody, and style in synthesized speech.
  • The rise of large language models (LLMs) (e.g., ChatGPT) enables intuitive, natural-language-driven control of TTS and has spurred new instruction- or prompt-based approaches.
  • Technical Advances
    • Architectures: the survey categorizes autoregressive (AR), non-autoregressive (NAR), and LLM-based methods, as well as hybrids of these.
    • Modeling families: transformers, VAEs, diffusion models, normalizing flows, and neural-codec / discrete-token LLM approaches.
    • Control strategies: style tagging (predefined labels or continuous controls), reference-speech prompting (few-shot voice cloning / zero-shot), natural-language descriptions (description-to-speech), and instruction-guided synthesis/editing (LLM-interpreted free-form instructions).

Main Tasks in Controllable TTS

  • Prosody Control

    • Targets: pitch (F0), duration, energy (low-level acoustic features).
    • Goal: render emphasis, rhythm, and nuance to ensure naturalness and expressiveness.
    • Methods: explicit predictors for duration/pitch/energy (e.g., FastPitch/FastSpeech2 families), latent-based control (VAEs, diffusion).
  • Timbre Control

    • Targets: voice quality aspects that define speaker identity (gender, age, nasality).
    • Goal: voice personalization, voice conversion, speaker editing.
    • Methods: speaker embeddings, reference-speech prompts, and zero-shot systems built on speaker encoders or discrete codec tokens (e.g., YourTTS, VALL-E variants).
  • Emotion Control

    • Targets: affective state and intensity (happy, sad, angry, etc.).
    • Goal: improve human–computer interaction, storytelling, adaptive virtual agents.
    • Methods: emotion embeddings, hierarchical modeling across global/utterance/local scales, instruction or preference-based tuning.
  • Style Control

    • Targets: high-level attributes like tone, formality, discourse mode (newscast, narration).
    • Goal: adapt speaking behavior to context/audience.
    • Methods: style tokens, natural-language description prompts, LLM-based instruction control.
  • Language Control

    • Targets: multilingual synthesis, dialects, code-switching.
    • Goal: cross-lingual communication and region-specific voices.
    • Methods: multilingual training, cross-lingual embeddings, token-based generalization (e.g., VALL‑E X, XTTS).
  • Environment Control

    • Targets: background noise, reverberation, spatial cues.
    • Goal: simulate acoustic scenes for film, games, audiobooks.
    • Methods: conditioning on environment embeddings, multimodal prompts (images → environment cues), diffusion/LDM-based environment synthesis.

Methods in Controllable TTS

Model Architectures

1. Non-Autoregressive Approaches (NAR)

  • Non-autoregressive TTS models generate the entire output speech sequence $$y = (y_1, y_2, \ldots, y_T)$$ in parallel given the input sequence $$x = (x_1, x_2, \ldots, x_N)$$:

$$ \hat{\theta} = \arg\max_{\theta} P(y \mid x; \theta), $$

  • where $$\theta$$ denotes the model parameters.

  • 1. Transformer-based Methods

    • FastSpeech, FastSpeech 2, FastPitch
    • FastSpeech (Ren et al., 2019) and FastSpeech 2 (Ren et al., 2021a) improve speed via explicit duration prediction and enable pitch/energy control; FastPitch (Łańcucki, 2021) further integrates direct pitch prediction (a minimal sketch of this kind of explicit prosody control appears after this list).
  • 2. VAE-based Methods

    • Hierarchical VAE, Parallel Tacotron, CLONE, Conditional VAE
    • VAEs (e.g., Hsu et al., 2018; Zhang et al., 2019; Liu et al., 2022) leverage structured, continuous latent representations for enhanced prosody, emotion, and style control, often combined with normalizing flows and adversarial training.
  • 3. Diffusion-based Methods

    • NaturalSpeech 2, 3, DEX-TTS, E3 TTS, AudioLDM, Make-An-Audio
    • These models (e.g., Ho et al., 2020) generate speech by reversing a noise injection process. NaturalSpeech 2 (Shen et al., 2024) and NaturalSpeech 3 (Ju et al., 2024) use latent diffusion with quantized vectors or factorized attribute subspaces. DEX-TTS (Park et al., 2024a) improves Diffusion Transformer (DiT) networks.
  • 4. Flow-based Methods

    • Audiobox, P-Flow, VoiceBox, FlashSpeech, E2 TTS, F5-TTS, ConvNeXt v2, E1 TTS
    • Flow models (e.g., Rezende and Mohamed, 2015; Lipman et al., 2023) use invertible flows for direct, high-fidelity generation. Recent models like Audiobox (Vyas et al., 2023) and E2 TTS (Eskimez et al., 2024) adopt flow-matching, framing TTS as speech infilling.
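
As a concrete illustration of the explicit prosody control used by the FastSpeech/FastPitch family, here is a minimal PyTorch sketch of a variance-adaptor-style module with user-controllable pitch, energy, and speaking-rate factors. All class and argument names (VariancePredictor, pitch_scale, speed, hidden sizes) are illustrative assumptions, not code from any of the cited systems.

```python
# Minimal sketch of FastSpeech2-style explicit prosody control (illustrative only).
# Duration, pitch, and energy are predicted per phoneme and can be rescaled by the
# user at inference time, which is what enables direct prosody control.
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Predicts one scalar (duration, pitch, or energy) per input token."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(dim, 1)

    def forward(self, h):                      # h: (batch, time, dim)
        x = self.net(h.transpose(1, 2)).transpose(1, 2)
        return self.proj(x).squeeze(-1)        # (batch, time)

class VarianceAdaptor(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.duration = VariancePredictor(dim)
        self.pitch = VariancePredictor(dim)
        self.energy = VariancePredictor(dim)
        self.pitch_emb = nn.Linear(1, dim)
        self.energy_emb = nn.Linear(1, dim)

    def forward(self, h, pitch_scale=1.0, energy_scale=1.0, speed=1.0):
        # User-controllable scaling factors implement simple prosody control.
        log_dur = self.duration(h)
        dur = torch.clamp((torch.exp(log_dur) / speed).round().long(), min=1)
        pitch = self.pitch(h) * pitch_scale
        energy = self.energy(h) * energy_scale
        h = h + self.pitch_emb(pitch.unsqueeze(-1)) + self.energy_emb(energy.unsqueeze(-1))
        # Length regulator: repeat each phoneme representation by its predicted duration.
        out = [h_b.repeat_interleave(d_b, dim=0) for h_b, d_b in zip(h, dur)]
        return out, dur

# Example: double the pitch contour and slow speech down by 20%.
adaptor = VarianceAdaptor()
hidden = torch.randn(1, 10, 256)               # 10 phoneme encodings
frames, durations = adaptor(hidden, pitch_scale=2.0, speed=0.8)
```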

2. Autoregressive Approaches (AR)

  • These models predict the speech sequence sequentially, allowing effective modeling of implicit duration and long-range context but suffering from slower inference.
  • The speech sequence $$y = (y_1, \ldots, y_T)$$ is predicted given input $$x$$ as:

$$ \hat{\theta} = \arg\max_{\theta} \prod_{t=1}^{T} P(y_t \mid y_{<t}, x; \theta), $$

  • where each frame $$y_t$$ depends on all previous outputs $$y_{<t}$$ and the transcript $$x$$.

  • 1. RNN-based Methods

    • Prosody-Tacotron, Global Style Tokens (GST), MsEmoTTS
    • Prosody-Tacotron (Skerry-Ryan et al., 2018) and models with Global Style Tokens (GST) (Wang et al., 2018) extend Tacotron with explicit prosodic controls and style transfer. Emotion-controllable models like MsEmoTTS (Lei et al., 2022) refine emotional intensity.
  • 2. LLM-based Methods

    • VALL-E X, SpearTTS
    • Inspired by in-context learning, these approaches (e.g., VALL-E (Wang et al., 2023a)) use autoregressive decoder-only Transformers with discrete audio tokens (e.g., EnCodec) to generate speech. They frame TTS as a conditional language modeling task, laying groundwork for methods like VALL-E X (Zhang et al., 2023d) and SpearTTS (Kharitonov et al., 2023).
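
To make the autoregressive factorization above concrete, the following is a minimal sketch of TTS as conditional language modeling over discrete codec tokens, in the spirit of VALL-E-style systems. The architecture, vocabulary sizes, and greedy decoding loop are illustrative assumptions, not the actual VALL-E implementation.

```python
# Minimal sketch of autoregressive TTS as conditional language modeling over
# discrete audio tokens (illustrative; not the actual VALL-E code).
import torch
import torch.nn as nn

class TokenTTS(nn.Module):
    def __init__(self, text_vocab=256, audio_vocab=1024, dim=512, layers=6):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, dim)
        self.audio_emb = nn.Embedding(audio_vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, audio_vocab)

    def forward(self, text_ids, audio_ids):
        # Concatenate text and audio tokens; a causal mask enforces y_t | y_{<t}, x.
        x = torch.cat([self.text_emb(text_ids), self.audio_emb(audio_ids)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.decoder(x, mask=mask)
        return self.head(h[:, text_ids.size(1):])   # logits for audio positions only

@torch.no_grad()
def generate(model, text_ids, prompt_audio_ids, max_new=100):
    """Greedy decoding: the audio prompt acts as an in-context voice-cloning cue."""
    audio = prompt_audio_ids
    for _ in range(max_new):
        logits = model(text_ids, audio)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        audio = torch.cat([audio, next_tok], dim=1)
    return audio   # codec tokens; a neural codec decoder (e.g., EnCodec) turns them into audio

model = TokenTTS()
text = torch.randint(0, 256, (1, 20))      # phoneme/character ids
prompt = torch.randint(0, 1024, (1, 30))   # codec tokens of a short reference clip
tokens = generate(model, text, prompt, max_new=50)
```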

Control Strategies

  • Control strategies evolve from basic attribute manipulation to sophisticated, instruction-guided synthesis.

  • 1. Style Tagging

    • Controls specific attributes (pitch, energy, emotion) using discrete labels (e.g., StyleTagging-TTS (Kim et al., 2021)) or continuous signals (e.g., DiffStyleTTS (Liu et al., 2025a) for prosody contours).
    • Latent feature modification (e.g., Cauliflow (Abbas et al., 2022)) also falls here.
  • 2. Reference Speech Prompt

    • Customizes synthesized voice using short reference audio clips.
    • Examples include MetaStyleSpeech (Min et al., 2021) for zero-shot performance and SC VALL-E (Kim et al., 2023) for controlling acoustic features via style tokens.
  • 3. Natural Language Descriptions

    • Uses free-form text to describe desired speech attributes, offering a user-friendly interface (e.g., PromptTTS (Guo et al., 2023), InstructTTS (Yang et al., 2024b), NansyTTS (Yamamoto et al., 2024)); a minimal conditioning sketch follows this list.
  • 4. Instruction-Guided Synthesis

    • Leverages Large Language Models (LLMs) to interpret a single natural language prompt conveying both content and style descriptions.
    • VoxInstruct (Zhou et al., 2024) and CosyVoice (Du et al., 2024) enable precise control over speaker identity, emotion, and paralinguistic cues.
  • 5. Instruction-Guided Editing

    • Supports speech editing via user instructions, such as insertion, deletion, or substitution, while maintaining naturalness (e.g., VoiceCraft (Peng et al., 2024b), InstructSpeech (Huang et al., 2024a)).
  • 6. Research Trend

    • The progression shows a clear trajectory toward more expressive, personalized, and intuitive TTS, primarily driven by LLM integration.
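
The strategies above differ mainly in how the conditioning signal is produced. The sketch below contrasts a discrete style tag with a free-form natural-language description mapped into the same conditioning space; the class name, encoder width, and fusion scheme are illustrative assumptions rather than any specific system from the survey.

```python
# Sketch: turning either a discrete style tag or a free-form description into a
# conditioning vector for a TTS decoder (illustrative; not a specific system).
import torch
import torch.nn as nn

STYLE_TAGS = ["neutral", "happy", "sad", "angry", "whisper"]

class StyleConditioner(nn.Module):
    def __init__(self, dim=256, text_encoder=None):
        super().__init__()
        self.tag_emb = nn.Embedding(len(STYLE_TAGS), dim)     # style tagging
        self.text_encoder = text_encoder                      # e.g., a frozen sentence encoder
        self.text_proj = nn.Linear(384, dim)                  # 384 = assumed encoder width

    def from_tag(self, tag: str) -> torch.Tensor:
        return self.tag_emb(torch.tensor([STYLE_TAGS.index(tag)]))

    def from_description(self, description: str) -> torch.Tensor:
        # Natural-language description control, e.g. "a calm, low-pitched voice, slight reverb".
        if self.text_encoder is None:                         # placeholder when no encoder is loaded
            return torch.zeros(1, self.text_proj.out_features)
        emb = self.text_encoder(description)                  # (1, 384), assumed interface
        return self.text_proj(emb)

# The resulting vector is typically added to (or cross-attended by) the acoustic decoder,
# so the same backbone can serve tag-based, description-based, and instruction-based control.
cond = StyleConditioner()
style_vec = cond.from_tag("happy")
```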

Feature Representations

  • The choice of feature representations critically affects flexibility, naturalness, and controllability.

  • 1. Speech Attribute Disentanglement

    • Aims to isolate distinct speech factors (speaker identity, emotion, prosody) into separate latent representations using techniques like adversarial training (Goodfellow et al., 2020) and information bottlenecks (Lu et al., 2023).
  • 2. Continuous Representations

    • Model speech in a continuous feature space (e.g., Mel Spectrograms), preserving acoustic details and inherently encoding prosody, pitch, and emotion. Used in GAN-based, VAE-based, flow-based, and diffusion-based methods.
  • 3. Discrete Tokens

    • Use quantized acoustic units or phoneme-like tokens, typically obtained from neural codecs (e.g., Zeghidour et al., 2021) or learned embeddings. Discrete tokens are computationally efficient, require fewer samples for learning, and are well suited to LLM-style training.
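
To make the continuous-vs-discrete distinction concrete, here is a minimal sketch of nearest-neighbour vector quantization, the basic operation behind codec-style discrete tokens; the codebook size and feature dimension are illustrative.

```python
# Sketch: quantizing continuous frame features into discrete tokens
# (the basic idea behind codec tokens used by LLM-style TTS; sizes are illustrative).
import torch

def vector_quantize(frames: torch.Tensor, codebook: torch.Tensor):
    """frames: (T, D) continuous features; codebook: (K, D) learned code vectors.
    Returns discrete token ids (T,) and the quantized (reconstructed) features (T, D)."""
    dists = torch.cdist(frames, codebook)      # squared-distance search over the codebook, (T, K)
    ids = dists.argmin(dim=1)                  # discrete tokens, LLM-friendly
    return ids, codebook[ids]

torch.manual_seed(0)
frames = torch.randn(100, 64)                  # e.g., 100 frames of a continuous latent
codebook = torch.randn(1024, 64)               # e.g., a 1024-entry codebook
tokens, recon = vector_quantize(frames, codebook)
# Continuous features keep fine acoustic detail; the integer `tokens` sequence is
# compact and can be modeled directly by an autoregressive language model.
```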

Datasets and Evaluation Methods

Datasets

  • Fully controllable TTS requires diverse and finely annotated datasets.

  • 1. Tag-based Datasets

    • Contain speech annotated with predefined discrete attribute labels (e.g., IEMOCAP, RAVDESS, ESD).
  • 2. Description-based Datasets

    • Pair speech samples with rich, free-form textual descriptions capturing nuanced attributes (e.g., PromptSpeech, TextrolSpeech, Parler-TTS).
  • 3. Dialogue Datasets

    • Contain multi-turn conversational speech, essential for dynamic and contextually appropriate speech (e.g., Taskmaster-1, DailyTalk).

Evaluation Methods

  • Objective Metrics: Enable automated and reproducible evaluation.

  • 1. Mel Cepstral Distortion (MCD): Quantifies spectral distance.

$$ MCD = \frac{10}{\ln 10} \cdot \sqrt{2 \sum_{d=1}^{D}(c_{d}^{(syn)} - c_{d}^{(ref)})^2}, $$

where $$c_{d}^{(syn)}$$ and $$c_{d}^{(ref)}$$ are the $$d$$-th mel-cepstral coefficients (MCCs) of the synthesized and reference speech, and $$D$$ is the number of coefficients.
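
A minimal NumPy implementation of the MCD formula above, assuming the two utterances are already time-aligned (e.g., via dynamic time warping) and that scores are averaged over frames, as is common practice:

```python
# Mel Cepstral Distortion between time-aligned MCC matrices (frames x D).
import numpy as np

def mcd(mcc_syn: np.ndarray, mcc_ref: np.ndarray) -> float:
    """mcc_syn, mcc_ref: (num_frames, D) mel-cepstral coefficients, already aligned."""
    diff = mcc_syn - mcc_ref
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())      # usually reported as the average over frames

# Example with random coefficients standing in for real MCC features.
rng = np.random.default_rng(0)
print(mcd(rng.normal(size=(200, 13)), rng.normal(size=(200, 13))))
```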

  • 2. Fréchet DeepSpeech Distance (FDSD): Measures distributional distance in a pre-trained speech recognition model's embedding space.

$$ FDSD = ||\mu_s - \mu_r||^2 + Tr(\Sigma_s + \Sigma_r - 2(\Sigma_s \Sigma_r)^{1/2}), $$

where $$\mu_s, \Sigma_s$$ are the mean and covariance of the synthesized-speech embeddings, and $$\mu_r, \Sigma_r$$ those of real speech.
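
The Fréchet distance itself can be computed from embedding statistics as in the following generic sketch; obtaining the speech-recognizer embeddings that FDSD specifies is outside its scope.

```python
# Fréchet distance between Gaussian fits of two embedding sets (generic sketch;
# FDSD additionally specifies that the embeddings come from a speech recognizer).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(emb_syn: np.ndarray, emb_real: np.ndarray) -> float:
    mu_s, mu_r = emb_syn.mean(axis=0), emb_real.mean(axis=0)
    sigma_s = np.cov(emb_syn, rowvar=False)
    sigma_r = np.cov(emb_real, rowvar=False)
    covmean = sqrtm(sigma_s @ sigma_r).real   # discard tiny imaginary parts from numerical error
    return float(np.sum((mu_s - mu_r) ** 2) + np.trace(sigma_s + sigma_r - 2 * covmean))

rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(500, 64)), rng.normal(size=(500, 64))))
```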

  • 3. Word Error Rate (WER): Quantifies intelligibility.

$$ WER = \frac{S + D + I}{N}, $$

where $$S$$, $$D$$, and $$I$$ are the numbers of substitutions, deletions, and insertions, and $$N$$ is the total number of words in the reference transcript.
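
WER reduces to a word-level edit-distance (Levenshtein) alignment; a minimal sketch:

```python
# Word Error Rate via word-level Levenshtein distance (substitutions + deletions + insertions).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution (or match)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 1 sub + 1 del -> 2/6 ≈ 0.33
```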

  • 4. Cosine Similarity: Assesses speaker similarity using speaker embeddings.

$$ CosSim(e_1, e_2) = \frac{e_1 \cdot e_2}{||e_1|| ||e_2||}, $$

where $$e_1, e_2$$ are speaker embeddings.
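
The similarity itself is a one-liner once speaker embeddings are available; how they are extracted (e.g., from a pretrained speaker-verification model) is outside this sketch.

```python
# Cosine similarity between two speaker embeddings.
import numpy as np

def cosine_similarity(e1: np.ndarray, e2: np.ndarray) -> float:
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

rng = np.random.default_rng(0)
print(cosine_similarity(rng.normal(size=192), rng.normal(size=192)))  # 192-dim embeddings assumed
```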

  • 5. Perceptual Evaluation of Speech Quality (PESQ): Evaluates intelligibility and distortion by modeling human auditory perception.

$$ PESQ = a_0 + a_1 \cdot D_{frame} + a_2 \cdot D_{time}, $$

where $$D_{frame}$$ is the frame-by-frame distortion, $$D_{time}$$ the time-domain distortion, and $$a_0, a_1, a_2$$ are regression coefficients.
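
In practice PESQ is rarely re-implemented from the regression form above; a commonly used open-source implementation is the `pesq` package. A usage sketch, assuming ref.wav and deg.wav are 16 kHz mono recordings of the reference and synthesized speech:

```python
# Typical PESQ usage via the `pesq` package (pip install pesq); file names are assumptions.
import soundfile as sf
from pesq import pesq

ref, fs = sf.read("ref.wav")      # clean reference speech
deg, _ = sf.read("deg.wav")       # synthesized / degraded speech at the same sample rate
score = pesq(fs, ref, deg, "wb")  # wide-band mode; use "nb" for 8 kHz narrow-band
print(score)                      # MOS-LQO-style score, roughly 1.0 - 4.5
```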

  • 6. Signal-to-Noise Ratio (SNR): Measures signal power to noise power ratio.

$$ SNR = 10 \log_{10}(\frac{P_{signal}}{P_{noise}}), $$

where $$P_{signal} = \frac{1}{N}\sum_{n=1}^{N} x[n]^2$$ is the power of the clean signal $$x[n]$$ and $$P_{noise} = \frac{1}{N}\sum_{n=1}^{N} e[n]^2$$ the power of the noise/error $$e[n]$$.
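
A direct NumPy translation of the definition, taking the noise term $$e[n]$$ as the difference between the degraded and clean signals:

```python
# Signal-to-noise ratio in dB, given the clean signal x[n] and a degraded version of it.
import numpy as np

def snr_db(clean: np.ndarray, degraded: np.ndarray) -> float:
    noise = degraded - clean
    p_signal = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12     # avoid division by zero
    return float(10.0 * np.log10(p_signal / p_noise))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
print(snr_db(clean, clean + 0.05 * rng.normal(size=clean.shape)))  # about 23 dB
```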

  • 7. Subjective Metrics: Assess perceptual quality based on human judgments.
    • Mean Opinion Score (MOS): Rates synthesized speech on a 1-5 scale for naturalness, quality, etc.
    • Comparison MOS (CMOS): Assesses relative quality between paired samples.

$$ MOS/CMOS = \frac{1}{N}\sum_{i=1}^{N}s_i, $$

where $$s_i$$ is the score given by the $$i$$-th listener and $$N$$ is the number of listeners.

  • AB/ABX Tests: Compares two samples (AB) or two samples against a reference (ABX) for preference or closeness.

$$ Score_{AB}/Score_{ABX} = \frac{N_m}{N}, $$

where $$N_m$$ is the number of listeners preferring model $$m$$ and $$N$$ is the total number of listeners.
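
Both subjective scores reduce to simple aggregation over listener responses; a minimal sketch covering MOS averaging and AB preference rates:

```python
# Aggregating subjective listening-test results: MOS is a mean of 1-5 ratings,
# an AB preference score is the fraction of listeners choosing a given model.
from collections import Counter

def mean_opinion_score(ratings):                 # e.g., [4, 5, 4, 3, 5] on a 1-5 scale
    return sum(ratings) / len(ratings)

def ab_preference(choices, model):               # e.g., ["A", "B", "A", "A"], model="A"
    return Counter(choices)[model] / len(choices)

print(mean_opinion_score([4, 5, 4, 3, 5]))       # 4.2
print(ab_preference(["A", "B", "A", "A"], "A"))  # 0.75
```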

  • 8. Model-based Evaluation (Appendix A.5): The survey proposes a Google Gemini-based evaluation pipeline to assess instruction following, naturalness, and expressiveness, reporting higher correlation with human judgment than traditional metrics such as NISQA and UTMOS on specific controllable-TTS tasks.
