- Good open source models with training and data
- Good open weight models
- Good open source tools
- Good fully open
- Bibliography
- parler-tts from HF, FKA Stable Speech
- Fully open source
- Repro of “Natural language guidance of high-fidelity text-to-speech models with synthetic annotations”
- Dataset: originally 45k hours of audiobook data, may have changed?
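- A minimal usage sketch, assuming the parler-tts package and the mini checkpoint on the Hugging Face Hub; the class name, checkpoint id, and argument names are recalled from the repo README and may have drifted, so treat them as assumptions:

```python
# Sketch only: checkpoint id and argument names are assumptions from memory of the README.
import torch
import soundfile as sf
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

model_id = "parler-tts/parler_tts_mini_v0.1"
model = ParlerTTSForConditionalGeneration.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Hey, how are you doing today?"                                   # what to say
description = "A female speaker with a calm voice and very clear audio."  # how to say it

desc_ids = tokenizer(description, return_tensors="pt").input_ids
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    audio = model.generate(input_ids=desc_ids, prompt_input_ids=prompt_ids)

sf.write("out.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)
```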
- Natural language guidance of high-fidelity text-to-speech models with synthetic annotations, 2023
- Site, Paper
- Competes with Audiobox, not with plain TTS systems
- Text + Description → Transformer → RVQ Tokens → DAC Decoder → Audio
- Data and Labeling:
- Uses a 45,000 hour dataset derived from LibriVox audiobooks
- Automatically labels multiple attributes of speech (see the binning sketch after this list):
- Gender (using existing classifier)
- Accent (using a classifier trained on 53 accents)
- Recording quality (SNR and reverb measurements)
- Pitch and speaking rate (analyzed from audio)
- Audio fidelity characteristics
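- The description input is built from these automatic labels. A minimal sketch of the binning idea (continuous measurements mapped to discrete categories with natural-language names); the bin edges, category names, and the simple template are illustrative assumptions, not the paper's actual thresholds (the paper composes richer descriptions from the discrete tags):

```python
# Illustrative only: bin edges, labels, and the template are made up, not the paper's.
def bucket(value, edges, labels):
    """Map a continuous measurement to a discrete natural-language label."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]

def describe(gender, accent, pitch_hz, words_per_min, snr_db):
    pitch = bucket(pitch_hz, [120, 180, 240],
                   ["very low-pitched", "low-pitched", "moderately pitched", "high-pitched"])
    rate = bucket(words_per_min, [110, 150, 190],
                  ["very slowly", "slowly", "at a moderate pace", "quickly"])
    quality = bucket(snr_db, [15, 30], ["noisy", "slightly noisy", "very clear"])
    return (f"A {gender} speaker with a {accent} accent, {pitch}, "
            f"speaking {rate} in a {quality} recording.")

print(describe("female", "British", 210, 165, 35))
# -> "A female speaker with a British accent, moderately pitched,
#     speaking at a moderate pace in a very clear recording."
```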
- Model Architecture (as shown in Figure 1):
- Takes two inputs:
- Transcript text (what should be spoken)
- Description text (how it should be spoken: style, accent, quality, etc.)
- Uses a decoder-only Transformer as the main language model
- Employs cross-attention to connect the description to the audio generation
- Uses the Descript Audio Codec (DAC) for high-quality audio output
- Novel Technical Approaches:
- Combines large-scale training data (45k hours) with a small amount (1%) of high-fidelity audio
- Uses advanced audio codec models (DAC) instead of traditional encoders
- Automatically generates natural language descriptions of speech characteristics
- Converts continuous measurements (like pitch, speed) into discrete categories with natural language labels
- Uses Descript Audio Codec (DAC) as the token vocab
- How DAC Sets the Vocabulary:
- Has 9 separate codebooks
- Each codebook defines its own set of discrete tokens
- Frame rate of 86Hz means 86 sets of tokens per second
- Model predicts tokens from these fixed codebooks
- Training:
- The Transformer predicts tokens from DAC's pre-defined codebooks (see the shape sketch at the end of this section)
- No need to learn its own audio tokenization
- The DAC decoder is pre-trained and kept frozen
- The DAC universal model works across domains (speech, environmental sound, music, etc.), making it broadly applicable to generative modeling of audio in general
- This is roughly the audio analogue of how many text-to-image models use pre-trained image encoders/decoders (like VQGAN) to define their token vocabulary
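- A sketch of what "DAC sets the vocabulary" means for sequence shapes, assuming the 44.1 kHz DAC configuration the notes describe (9 codebooks at ~86 Hz); the per-codebook size of 1024 is my assumption from the DAC release, not something stated above:

```python
# Shape arithmetic only; CODEBOOK_SIZE = 1024 is an assumption from the DAC release.
N_CODEBOOKS = 9        # residual VQ depth: 9 tokens per audio frame
CODEBOOK_SIZE = 1024   # assumed entries per codebook (the fixed vocabulary)
FRAME_RATE_HZ = 86     # DAC token frame rate

seconds = 10
frames = FRAME_RATE_HZ * seconds     # 860 frames
tokens = N_CODEBOOKS * frames        # 7,740 discrete tokens for 10 s of audio

# The decoder-only Transformer fills one slot per (frame, codebook), each drawn from
# that codebook's fixed vocabulary; the frozen DAC decoder maps the 9 x frames grid
# of codes back to a waveform.
print(f"{seconds}s of audio -> {frames} frames x {N_CODEBOOKS} codebooks = {tokens} tokens")
```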
- Audiobox: Unified Audio Generation with Natural Language Prompts, Meta 2023
- Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale, Meta 2023
- Architecture:
- Core is a Transformer model (24 layers, 16 attention heads, 1024 dim for base model)
- Uses continuous normalizing flows (CNF) with "flow-matching" rather than diffusion
- Split into two components:
- Audio Model: Generates 80-dim log Mel spectrograms at 100Hz (80 frequency bins per frame)
- Duration Model: Predicts phoneme timings
- Training Data & Scale:
- 60K hours English audiobooks + 50K hours multilingual (6 languages)
- Raw audio -> 80-dim log Mel spectrogram "frames" at 100Hz (100 frames per second of audio)
- Text: Uses phonemes as text tokens (from Montreal Forced Aligner)
- Matching transcripts from audiobooks
- Converted to phonemes (not word or LLM tokens) using Montreal Forced Aligner
- Forced alignment gives timing information between phonemes and audio
- Trained on 32 GPUs, ~500K-750K updates
- Not filtered or enhanced, uses "in-the-wild" data
- Training Method:
- Masks random segments of speech
- Randomly masks 70-100% of frames
- Model learns to predict the masked regions given the surrounding context and the text (see the masking sketch after this list), conditioning on:
- Unmasked audio context
- Full phoneme sequence
- Alignment information
- No explicit style tokens/embeddings - style captured implicitly from context
- Uses classifier-free guidance for better quality
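- A minimal sketch of how one training example could be assembled under the masking scheme above; tensor names, the frame-level phoneme expansion, and the contiguous-span masking are my assumptions about the setup, not Meta's code:

```python
# Illustrative data assembly; not Meta's actual pipeline.
import torch

T, MEL_DIM = 1600, 80                      # up to 16 s of 100 Hz frames, 80 mel bins
mel = torch.randn(T, MEL_DIM)              # ground-truth log Mel spectrogram
phone_ids = torch.randint(0, 100, (T,))    # phoneme id per frame (from forced alignment)

# Mask a random contiguous span covering 70-100% of the frames.
mask_len = int(T * torch.empty(1).uniform_(0.7, 1.0).item())
start = torch.randint(0, T - mask_len + 1, (1,)).item()
mask = torch.zeros(T, dtype=torch.bool)
mask[start:start + mask_len] = True

# Model input: unmasked audio context + the full phoneme sequence; masked frames zeroed.
audio_context = mel.clone()
audio_context[mask] = 0.0

# Training target: reconstruct (via flow matching) only the masked frames.
target = mel[mask]                         # [mask_len, 80]
print(audio_context.shape, phone_ids.shape, target.shape)
```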
- Key Technical Points:
- Non-autoregressive: Can generate in parallel unlike typical TTS
- Flow-matching: More efficient than diffusion (can generate high quality with just 8-16 steps)
- Doesn't use discrete tokens like VALL-E, works directly with continuous spectrograms
- Final audio generated via HiFi-GAN vocoder
- Uses bi-directional context (can look both forward and backward)
- The innovation isn't in a radical new architecture, but in applying flow-matching to speech generation at scale and showing it can learn general speech patterns without task-specific training.
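- A sketch of why 8-16 steps is enough work: sampling just integrates a learned velocity field from noise to spectrogram, e.g. with Euler steps plus classifier-free guidance. The model interface and guidance formula here are generic assumptions about the recipe, not Voicebox's exact implementation:

```python
# Generic flow-matching sampler: Euler ODE integration with classifier-free guidance.
import torch

def sample(velocity_model, cond, shape, steps=16, guidance=1.0):
    x = torch.randn(shape)                         # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        v = velocity_model(x, t, cond)             # conditional velocity estimate
        if guidance != 1.0:
            v_uncond = velocity_model(x, t, None)  # drop conditioning
            v = v_uncond + guidance * (v - v_uncond)
        x = x + dt * v                             # Euler step along the flow
    return x                                       # generated spectrogram [B, T, 80]

# Stand-in "model" (pulls x toward zero) just to show the loop runs end to end.
dummy = lambda x, t, cond: -x
spec = sample(dummy, cond=None, shape=(1, 1000, 80), steps=8)
print(spec.shape)   # torch.Size([1, 1000, 80]); HiFi-GAN would turn this into a waveform
```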
- Data format:
- Length: Up to 1600 frames (16 seconds) per chunk
- Not tiny slivers, but substantial segments that can contain full sentences
- Input includes:
- Full spectrogram (80-dim × N frames @ 100Hz)
- Complete phoneme sequence
- Frame-level alignment between them
- Mask indicating which parts to predict
- Not designed for real-time use: it's non-autoregressive, generating whole segments at once
- Needs both past AND future context
- Inference speed: Can generate 10s of audio in ~0.31s (with 2 steps) to ~6s (with 64 steps)
- Throughput is fast but latency makes it unsuitable for real-time
- Inference-time shapes depend on the task, but for the most common case (zero-shot TTS), the structure looks like this (see the sketch after this list):
- Input:
- Text/Phoneme Sequence:
- Target text converted to phoneme sequence
- Length: M phonemes
- Audio Context (if doing voice cloning/style transfer):
- Log Mel spectrogram: [N × 80]
- N frames (typically 3s worth @ 100Hz = ~300 frames)
- Mask:
- Binary mask same length as target sequence
- Indicates which parts need to be generated
- Output:
- Generated spectrogram: [T × 80]
- T frames determined by duration model
- Then converted to waveform via HiFi-GAN vocoder
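- The same structure as a runnable shape walk-through; `duration_model`, `audio_model`, and `hifigan` are hypothetical stand-ins for the real components, used only to trace tensor shapes:

```python
# Shape walk-through only; the three callables below are hypothetical stand-ins.
import torch

def duration_model(phonemes, context):              # -> frames per phoneme
    return torch.full((phonemes.shape[0],), 8)       # pretend each phoneme lasts 80 ms

def audio_model(phonemes, durations, context, mask):
    return torch.randn(int(durations.sum()), 80)     # [T, 80] generated spectrogram

def hifigan(mel):
    return torch.randn(mel.shape[0] * 160)           # ~10 ms of 16 kHz audio per frame

M = 42
phonemes = torch.randint(0, 100, (M,))   # phoneme ids for the target text
context = torch.randn(300, 80)           # ~3 s of reference audio (log Mel @ 100 Hz)

durations = duration_model(phonemes, context)            # [M]
T = int(durations.sum())                                 # total frames to generate
mask = torch.ones(T, dtype=torch.bool)                   # every target frame is generated

mel = audio_model(phonemes, durations, context, mask)    # [T, 80]
wav = hifigan(mel)                                       # 1-D waveform tensor
print(mel.shape, wav.shape)
```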
- Comparison
- VALL-E Approach:
- Uses Encodec to convert audio into discrete tokens (8 codebooks @ 75Hz)
- Works like a language model but on these quantized audio tokens
- Training/inference happens in discrete token space
- Must predict tokens sequentially in autoregressive way
- Voicebox Approach:
- Built on continuous normalizing flows over spectrograms rather than discrete tokens
- The key loss function is the flow-matching objective, which is surprisingly simple:
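- The notes break off here; for reference, this is my transcription of the standard conditional flow-matching objective (optimal-transport path) from the flow-matching literature that Voicebox builds on, with c standing for the conditioning (phoneme sequence plus unmasked audio context):

```latex
% Conditional flow-matching loss with the OT path.
% x_1: data (the masked spectrogram frames), x_0 ~ N(0, I), t ~ U[0, 1].
\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{t,\,x_1,\,x_0}
    \Big\| v_\theta(x_t,\, t,\, c) - \big(x_1 - (1 - \sigma_{\min})\, x_0\big) \Big\|^2 ,
\qquad
x_t = \big(1 - (1 - \sigma_{\min})\, t\big)\, x_0 + t\, x_1 .
```

- The model only has to regress this simple interpolation velocity; sampling then integrates it, as in the Euler sketch above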
- WaveNet