• Good open source models with training and data

    • parler-tts from HF
  • Good open weight models

    • xTTS: from Coqui (now defunct), in the TTS library; includes training code, at least for fine-tuning
    • PyPI TTS library: includes a number of notable models
    • Tortoise TTS: inspired by DALLE and diffusion https://github.com/neonbjb/tortoise-tts
      • Tortoise-TTS Fully Explained | Part 1 | Architecture Design - YouTube
    • [D] What is the best open source text to speech model? : r/MachineLearning
  • Good open source tools

    • vocode: STT → LLM → TTS
  • Good fully open

  • Bibliography

  • parler-tts from HF, FKA Stable Speech

    • Fully open source
    • Repro of “Natural language guidance of high-fidelity text-to-speech models with synthetic annotations”
    • Dataset: originally 45k hours of audiobook data, may have changed?
      • A filtered version of the LibriTTS-R dataset, a 1K-hour high-quality speech dataset.
      • The English subset of Multilingual LibriSpeech.
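    • A minimal usage sketch, roughly following the parler-tts README (checkpoint and argument names are assumptions and may have changed since):

```python
# Sketch: description-conditioned generation with parler-tts.
# Assumes `pip install parler-tts soundfile`; checkpoint name taken from the README
# and may have changed.
import torch
import soundfile as sf
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

device = "cuda:0" if torch.cuda.is_available() else "cpu"
repo = "parler-tts/parler_tts_mini_v0.1"

model = ParlerTTSForConditionalGeneration.from_pretrained(repo).to(device)
tokenizer = AutoTokenizer.from_pretrained(repo)

prompt = "Hey, how are you doing today?"  # transcript: what should be spoken
description = "A female speaker with a slightly low-pitched voice and very clear audio."  # how it should sound

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# The decoder cross-attends to the description while generating DAC tokens,
# which are then decoded to a waveform.
audio = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
sf.write("parler_out.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)
```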
  • Natural language guidance of high-fidelity text-to-speech models with synthetic annotations, 2023

    • Site, Paper
    • Competes with Audiobox, not with TTS
    • Text + Description → Transformer → RVQ Tokens → DAC Decoder → Audio
    1. Data and Labeling:
      • Uses a 45,000 hour dataset derived from LibriVox audiobooks
      • Automatically labels multiple attributes of speech:
        • Gender (using an existing classifier)
        • Accent (using a classifier trained on 53 accents)
        • Recording quality (SNR and reverb measurements)
        • Pitch and speaking rate (analyzed from audio)
        • Audio fidelity characteristics
    2. Model Architecture (as shown in Figure 1):
      • Takes two inputs:
        1. Transcript text (what should be spoken)
        2. Description text (how it should be spoken: style, accent, quality, etc.)
      • Uses a decoder-only Transformer as the main language model
      • Employs cross-attention to connect the description to the audio generation
      • Uses the Descript Audio Codec (DAC) for high-quality audio output
    3. Novel Technical Approaches:
      • Combines large-scale training data (45k hours) with a small amount (1%) of high-fidelity audio
      • Uses advanced audio codec models (DAC) instead of traditional encoders
      • Automatically generates natural language descriptions of speech characteristics
      • Converts continuous measurements (like pitch, speed) into discrete categories with natural language labels
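      • A toy sketch of that last step, with made-up bin edges and wording (the paper's actual bins and templates differ):

```python
# Toy sketch: turn continuous speech measurements into discrete, natural-language
# labels and compose a description. Bin edges and phrasing are invented for
# illustration; they are not the paper's actual bins or templates.
def bin_value(value, edges, labels):
    """Map a continuous value to a label using sorted bin edges."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]

def describe(pitch_hz, words_per_min, snr_db):
    pitch = bin_value(pitch_hz, [140, 200], ["low-pitched", "moderate-pitched", "high-pitched"])
    rate = bin_value(words_per_min, [120, 160], ["slowly", "at a moderate pace", "quickly"])
    quality = bin_value(snr_db, [20, 40], ["noisy", "fairly clean", "very clear"])
    return f"A {pitch} speaker talking {rate} in a {quality} recording."

print(describe(pitch_hz=110, words_per_min=170, snr_db=45))
# -> "A low-pitched speaker talking quickly in a very clear recording."
```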
    • Uses Descript Audio Codec (DAC) as the token vocab
      • How DAC Sets the Vocabulary:
        • Has 9 separate codebooks
        • Each codebook defines its own set of discrete tokens
        • Frame rate of 86Hz means 86 sets of tokens per second
        • Model predicts tokens from these fixed codebooks
      • Training
        • The Transformer predicts tokens from DAC's pre-defined codebooks; it never learns its own audio tokenization, and the DAC decoder is pre-trained and kept fixed
      • The universal DAC model works across all domains (speech, environmental sound, music, etc.), making it broadly applicable to generative modeling of audio
      • This is the audio analogue of how many text-to-image models use a pre-trained image encoder/decoder (e.g. VQGAN) to define their token vocabulary
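      • Back-of-the-envelope on the resulting vocabulary and bitrate, assuming the 44.1 kHz DAC config (9 codebooks at ~86 Hz as noted above; the 1024-code codebook size is my assumption from the DAC release):

```python
# Sketch of the token grid the LM predicts when DAC defines the vocabulary.
# Assumes the 44.1 kHz DAC configuration: 9 codebooks x 1024 codes at ~86 Hz.
import math

n_codebooks = 9       # parallel RVQ codebooks
codebook_size = 1024  # discrete codes per codebook (assumption)
frame_rate_hz = 86    # token frames per second of audio

seconds = 10
grid_shape = (n_codebooks, frame_rate_hz * seconds)
print(grid_shape)                              # (9, 860) token grid for 10 s of audio

bits_per_token = math.log2(codebook_size)      # 10 bits per code
bitrate_kbps = n_codebooks * frame_rate_hz * bits_per_token / 1000
print(round(bitrate_kbps, 2))                  # ~7.74 kbps codec bitrate
```

      • The Transformer only has to fill in entries of this fixed grid; the frozen DAC decoder turns the grid back into a waveform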
  • Audiobox: Unified Audio Generation with Natural Language Prompts, Meta 2023

    • Paper
  • Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale, Meta 2023

    • Paper

    • Architecture:
      • Core is a Transformer model (24 layers, 16 attention heads, 1024 dim for base model)
      • Uses continuous normalizing flows (CNF) with "flow-matching" rather than diffusion
      • Split into two components:
        1. Audio Model: Generates 80-dim log Mel spectrograms at 100Hz (80 frequency bins per frame)
        2. Duration Model: Predicts phoneme timings
    • Training Data & Scale:
      • 60K hours English audiobooks + 50K hours multilingual (6 languages)
      • Raw audio → 80-dim log Mel spectrogram “frames” at 100 Hz (100 frames per second of audio)
      • Text: Uses phonemes as text tokens (from Montreal Forced Aligner)
        • Matching transcripts from audiobooks
        • Converted to phonemes (not word or LLM tokens) using Montreal Forced Aligner
        • Forced alignment gives timing information between phonemes and audio
      • Trained on 32 GPUs, ~500K-750K updates
      • Not filtered or enhanced, uses "in-the-wild" data
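      • A small sketch of the per-example shapes implied above (10 ms hop follows from the 100 Hz frame rate; phoneme durations are toy values):

```python
# Sketch of per-example tensors for 1 second of Voicebox training audio:
# 80-dim log Mel frames at 100 Hz, with forced-aligned phoneme durations.
import numpy as np

frame_rate_hz = 100          # mel frames per second (10 ms hop)
n_mels = 80
seconds = 1

mel = np.zeros((frame_rate_hz * seconds, n_mels))   # (100, 80) mel frames

# Forced alignment assigns each phoneme a duration in frames; durations sum
# to the number of mel frames, keeping phonemes and audio aligned.
phonemes = ["HH", "AH", "L", "OW"]                   # toy phoneme sequence
durations = np.array([20, 30, 25, 25])               # frames per phoneme (toy values)
assert durations.sum() == mel.shape[0]

# Frame-level phoneme sequence the audio model conditions on:
frame_phonemes = np.repeat(phonemes, durations)      # length 100
print(mel.shape, frame_phonemes.shape)               # (100, 80) (100,)
```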
    • Training Method:
      • Masks random segments of speech
        • Randomly masks 70-100% of frames
        • Model learns to predict the masked regions given conditioning that includes:
          • Unmasked audio context
          • Full phoneme sequence
          • Alignment information
      • No explicit style tokens/embeddings - style captured implicitly from context
      • Uses classifier-free guidance for better quality
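      • A heavily simplified sketch of one masked flow-matching training step (toy model and shapes, not the paper's code; the sigma_min term and classifier-free guidance are omitted):

```python
# Toy sketch of Voicebox-style masked flow-matching training (not the paper's code).
# A tiny transformer regresses the flow-matching velocity on masked mel frames,
# conditioned on the unmasked frames and frame-level phoneme IDs.
import torch
import torch.nn as nn

T, n_mels, n_phones, d = 100, 80, 64, 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
in_proj = nn.Linear(2 * n_mels + 1, d)     # [noisy mel, unmasked context mel, time t]
phone_emb = nn.Embedding(n_phones, d)
out_proj = nn.Linear(d, n_mels)

x1 = torch.randn(1, T, n_mels)             # target mel frames (stand-in data)
phones = torch.randint(0, n_phones, (1, T))
mask = torch.rand(1, T, 1) < 0.9           # paper masks 70-100% of frames

x0 = torch.randn_like(x1)                  # noise sample
t = torch.rand(1, 1, 1)                    # flow time in [0, 1]
xt = (1 - t) * x0 + t * x1                 # linear interpolation path (sigma_min ~ 0)
context = x1 * (~mask)                     # unmasked frames are visible as-is

h = in_proj(torch.cat([xt, context, t.expand(1, T, 1)], dim=-1)) + phone_emb(phones)
v_pred = out_proj(encoder(h))              # predicted velocity field

target_v = x1 - x0                         # flow-matching regression target
loss = ((v_pred - target_v) ** 2 * mask).sum() / mask.sum()  # loss only on masked frames
loss.backward()
```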
    • Key Technical Points:
      • Non-autoregressive: Can generate in parallel unlike typical TTS
      • Flow-matching: More efficient than diffusion (can generate high quality with just 8-16 steps)
      • Doesn't use discrete tokens like VALL-E, works directly with continuous spectrograms
      • Final audio generated via HiFi-GAN vocoder
      • Uses bi-directional context (can look both forward and backward)
    • The innovation isn't in a radical new architecture, but in applying flow-matching to speech generation at scale and showing it can learn general speech patterns without task-specific training.
    • Data format: 80-dim log Mel frames at 100 Hz plus frame-aligned phoneme sequences (as above)
    • Not designed for real-time streaming; it's non-autoregressive and fills in whole utterances at once
    • Inference time: the shape depends on the task, but for the most common case (zero-shot TTS) the structure is roughly:
      • Duration model predicts frame counts for the target phonemes
      • Audio model input: [reference mel frames | masked frames for the target] plus the concatenated reference + target phoneme sequence
      • The flow ODE is solved in a few steps (8-16, per above) to in-fill the masked frames; HiFi-GAN then vocodes the spectrogram to audio
    • Comparison
    • On the normalizing flows
    • The key loss function is the Flow Matching objective, which is surprisingly simple:
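      • Roughly, in the conditional flow-matching (OT-path) form, with x₁ a data sample, x₀ ~ N(0, I), and the conditioning/masking details omitted:

```latex
% Conditional flow matching with the optimal-transport path:
% interpolate noise x_0 toward data x_1 and regress the model's vector field
% v_theta onto the closed-form target velocity u_t.
\begin{aligned}
x_t &= \bigl(1 - (1 - \sigma_{\min})\,t\bigr)\,x_0 + t\,x_1 \\
u_t(x_t \mid x_1) &= x_1 - (1 - \sigma_{\min})\,x_0 \\
\mathcal{L}_{\mathrm{CFM}} &= \mathbb{E}_{t,\,x_1,\,x_0}\,\bigl\| v_\theta(x_t, t) - u_t(x_t \mid x_1) \bigr\|^2
\end{aligned}
```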
  • WaveNet