- Good open source models with training and data
- Good open weight models
- Good open source tools
- Good fully open
- Bibliography
- parler-tts from HF, FKA Stable Speech
- Fully open source
- Repro of “Natural language guidance of high-fidelity text-to-speech models with synthetic annotations”
- Dataset: originally 45k hours of audiobook data, may have changed?
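- A minimal usage sketch, assuming the parler-tts package and the mini checkpoint on the Hugging Face Hub; the class name, checkpoint id, and argument names are recalled from the repo README and may have drifted, so treat them as assumptions:

```python
# Sketch only: checkpoint id and argument names are assumptions from memory of the README.
import torch
import soundfile as sf
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

model_id = "parler-tts/parler_tts_mini_v0.1"
model = ParlerTTSForConditionalGeneration.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Hey, how are you doing today?"                                   # what to say
description = "A female speaker with a calm voice and very clear audio."  # how to say it

desc_ids = tokenizer(description, return_tensors="pt").input_ids
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    audio = model.generate(input_ids=desc_ids, prompt_input_ids=prompt_ids)

sf.write("out.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)
```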
- Natural language guidance of high-fidelity text-to-speech models with synthetic annotations, 2023
- Site, Paper
- Competes with Audiobox, not with plain TTS systems
- Text + Description → Transformer → RVQ Tokens → DAC Decoder → Audio
- Data and Labeling:
- Uses a 45,000 hour dataset derived from LibriVox audiobooks
- Automatically labels multiple attributes of speech (see the binning sketch after this list):
- Gender (using existing classifier)
- Accent (using a classifier trained on 53 accents)
- Recording quality (SNR and reverb measurements)
- Pitch and speaking rate (analyzed from audio)
- Audio fidelity characteristics
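- The description input is built from these automatic labels. A minimal sketch of the binning idea (continuous measurements mapped to discrete categories with natural-language names); the bin edges, category names, and the simple template are illustrative assumptions, not the paper's actual thresholds (the paper composes richer descriptions from the discrete tags):

```python
# Illustrative only: bin edges, labels, and the template are made up, not the paper's.
def bucket(value, edges, labels):
    """Map a continuous measurement to a discrete natural-language label."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]

def describe(gender, accent, pitch_hz, words_per_min, snr_db):
    pitch = bucket(pitch_hz, [120, 180, 240],
                   ["very low-pitched", "low-pitched", "moderately pitched", "high-pitched"])
    rate = bucket(words_per_min, [110, 150, 190],
                  ["very slowly", "slowly", "at a moderate pace", "quickly"])
    quality = bucket(snr_db, [15, 30], ["noisy", "slightly noisy", "very clear"])
    return (f"A {gender} speaker with a {accent} accent, {pitch}, "
            f"speaking {rate} in a {quality} recording.")

print(describe("female", "British", 210, 165, 35))
# -> "A female speaker with a British accent, moderately pitched,
#     speaking at a moderate pace in a very clear recording."
```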
- Model Architecture (as shown in Figure 1):
- Takes two inputs:
- Transcript text (what should be spoken)
- Description text (how it should be spoken: style, accent, quality, etc.)
- Uses a decoder-only Transformer as the main language model
- Employs cross-attention to connect the description to the audio generation
- Uses the Descript Audio Codec (DAC) for high-quality audio output
- Novel Technical Approaches:
- Combines large-scale training data (45k hours) with a small amount (1%) of high-fidelity audio
- Uses advanced audio codec models (DAC) instead of traditional encoders
- Automatically generates natural language descriptions of speech characteristics
- Converts continuous measurements (like pitch, speed) into discrete categories with natural language labels
- Uses Descript Audio Codec (DAC) as the token vocab
- How DAC Sets the Vocabulary:
- Has 9 separate codebooks
- Each codebook defines its own set of discrete tokens
- Frame rate of 86Hz means 86 sets of tokens per second
- Model predicts tokens from these fixed codebooks
- Training:
- The Transformer predicts tokens from DAC's pre-defined codebooks (see the shape sketch at the end of this section)
- No need to learn its own audio tokenization
- The DAC decoder is pre-trained and kept frozen
- The DAC universal model works across domains (speech, environmental sound, music, etc.), making it broadly applicable to generative modeling of audio in general
- This is roughly the audio analogue of how many text-to-image models use pre-trained image encoders/decoders (like VQGAN) to define their token vocabulary
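- A sketch of what "DAC sets the vocabulary" means for sequence shapes, assuming the 44.1 kHz DAC configuration the notes describe (9 codebooks at ~86 Hz); the per-codebook size of 1024 is my assumption from the DAC release, not something stated above:

```python
# Shape arithmetic only; CODEBOOK_SIZE = 1024 is an assumption from the DAC release.
N_CODEBOOKS = 9        # residual VQ depth: 9 tokens per audio frame
CODEBOOK_SIZE = 1024   # assumed entries per codebook (the fixed vocabulary)
FRAME_RATE_HZ = 86     # DAC token frame rate

seconds = 10
frames = FRAME_RATE_HZ * seconds     # 860 frames
tokens = N_CODEBOOKS * frames        # 7,740 discrete tokens for 10 s of audio

# The decoder-only Transformer fills one slot per (frame, codebook), each drawn from
# that codebook's fixed vocabulary; the frozen DAC decoder maps the 9 x frames grid
# of codes back to a waveform.
print(f"{seconds}s of audio -> {frames} frames x {N_CODEBOOKS} codebooks = {tokens} tokens")
```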
- Audiobox: Unified Audio Generation with Natural Language Prompts, Meta 2023
- Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale, Meta 2023
- Architecture:
- Core is a Transformer model (24 layers, 16 attention heads, 1024 dim for base model)
- Uses continuous normalizing flows (CNF) with "flow-matching" rather than diffusion
- Split into two components:
- Audio Model: Generates 80-dim log Mel spectrograms at 100Hz (80 frequency bins per frame)
- Duration Model: Predicts phoneme timings
- Training Data & Scale:
- 60K hours English audiobooks + 50K hours multilingual (6 languages)
- Raw audio -> 80-dim log Mel spectrogram "frames" at 100Hz (100 frames per second of audio)
- Text: Uses phonemes as text tokens (from Montreal Forced Aligner)
- Matching transcripts from audiobooks
- Converted to phonemes (not word or LLM tokens) using Montreal Forced Aligner
- Forced alignment gives timing information between phonemes and audio
- Trained on 32 GPUs, ~500K-750K updates
- Not filtered or enhanced, uses "in-the-wild" data
- Training Method:
- Masks random segments of speech
- Randomly masks 70-100% of frames
- Model learns to predict the masked regions given the surrounding context and the text (see the masking sketch after this list), conditioning on:
- Unmasked audio context
- Full phoneme sequence
- Alignment information
- No explicit style tokens/embeddings - style captured implicitly from context
- Uses classifier-free guidance for better quality
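- A minimal sketch of how one training example could be assembled under the masking scheme above; tensor names, the frame-level phoneme expansion, and the contiguous-span masking are my assumptions about the setup, not Meta's code:

```python
# Illustrative data assembly; not Meta's actual pipeline.
import torch

T, MEL_DIM = 1600, 80                      # up to 16 s of 100 Hz frames, 80 mel bins
mel = torch.randn(T, MEL_DIM)              # ground-truth log Mel spectrogram
phone_ids = torch.randint(0, 100, (T,))    # phoneme id per frame (from forced alignment)

# Mask a random contiguous span covering 70-100% of the frames.
mask_len = int(T * torch.empty(1).uniform_(0.7, 1.0).item())
start = torch.randint(0, T - mask_len + 1, (1,)).item()
mask = torch.zeros(T, dtype=torch.bool)
mask[start:start + mask_len] = True

# Model input: unmasked audio context + the full phoneme sequence; masked frames zeroed.
audio_context = mel.clone()
audio_context[mask] = 0.0

# Training target: reconstruct (via flow matching) only the masked frames.
target = mel[mask]                         # [mask_len, 80]
print(audio_context.shape, phone_ids.shape, target.shape)
```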
- Key Technical Points:
- Non-autoregressive: Can generate in parallel unlike typical TTS
- Flow-matching: More efficient than diffusion (can generate high quality with just 8-16 steps)
- Doesn't use discrete tokens like VALL-E, works directly with continuous spectrograms
- Final audio generated via HiFi-GAN vocoder
- Uses bi-directional context (can look both forward and backward)
- The innovation isn't in a radical new architecture, but in applying flow-matching to speech generation at scale and showing it can learn general speech patterns without task-specific training.
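- A sketch of why 8-16 steps is enough work: sampling just integrates a learned velocity field from noise to spectrogram, e.g. with Euler steps plus classifier-free guidance. The model interface and guidance formula here are generic assumptions about the recipe, not Voicebox's exact implementation:

```python
# Generic flow-matching sampler: Euler ODE integration with classifier-free guidance.
import torch

def sample(velocity_model, cond, shape, steps=16, guidance=1.0):
    x = torch.randn(shape)                         # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        v = velocity_model(x, t, cond)             # conditional velocity estimate
        if guidance != 1.0:
            v_uncond = velocity_model(x, t, None)  # drop conditioning
            v = v_uncond + guidance * (v - v_uncond)
        x = x + dt * v                             # Euler step along the flow
    return x                                       # generated spectrogram [B, T, 80]

# Stand-in "model" (pulls x toward zero) just to show the loop runs end to end.
dummy = lambda x, t, cond: -x
spec = sample(dummy, cond=None, shape=(1, 1000, 80), steps=8)
print(spec.shape)   # torch.Size([1, 1000, 80]); HiFi-GAN would turn this into a waveform
```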
- Data format:
- Length: Up to 1600 frames (16 seconds) per chunk
- Not tiny slivers, but substantial segments that can contain full sentences
- Input includes:
- Full spectrogram (80-dim × N frames @ 100Hz)
- Complete phoneme sequence
- Frame-level alignment between them
- Mask indicating which parts to predict
- Not designed for real-time use: it's non-autoregressive, generating whole segments at once
- Needs both past AND future context
- Inference speed: Can generate 10s of audio in ~0.31s (with 2 steps) to ~6s (with 64 steps)
- Throughput is fast but latency makes it unsuitable for real-time
- Inference-time shapes depend on the task, but for the most common case (zero-shot TTS), the structure looks like this (see the sketch after this list):
- Input:
- Text/Phoneme Sequence:
- Target text converted to phoneme sequence
- Length: M phonemes
- Audio Context (if doing voice cloning/style transfer):
- Log Mel spectrogram: [N × 80]
- N frames (typically 3s worth @ 100Hz = ~300 frames)
- Mask:
- Binary mask same length as target sequence
- Indicates which parts need to be generated
- Output:
- Generated spectrogram: [T × 80]
- T frames determined by duration model
- Then converted to waveform via HiFi-GAN vocoder
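- The same structure as a runnable shape walk-through; `duration_model`, `audio_model`, and `hifigan` are hypothetical stand-ins for the real components, used only to trace tensor shapes:

```python
# Shape walk-through only; the three callables below are hypothetical stand-ins.
import torch

def duration_model(phonemes, context):              # -> frames per phoneme
    return torch.full((phonemes.shape[0],), 8)       # pretend each phoneme lasts 80 ms

def audio_model(phonemes, durations, context, mask):
    return torch.randn(int(durations.sum()), 80)     # [T, 80] generated spectrogram

def hifigan(mel):
    return torch.randn(mel.shape[0] * 160)           # ~10 ms of 16 kHz audio per frame

M = 42
phonemes = torch.randint(0, 100, (M,))   # phoneme ids for the target text
context = torch.randn(300, 80)           # ~3 s of reference audio (log Mel @ 100 Hz)

durations = duration_model(phonemes, context)            # [M]
T = int(durations.sum())                                 # total frames to generate
mask = torch.ones(T, dtype=torch.bool)                   # every target frame is generated

mel = audio_model(phonemes, durations, context, mask)    # [T, 80]
wav = hifigan(mel)                                       # 1-D waveform tensor
print(mel.shape, wav.shape)
```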
- Comparison
- VALL-E Approach:
- Uses Encodec to convert audio into discrete tokens (8 codebooks @ 75Hz)
- Works like a language model but on these quantized audio tokens
- Training/inference happens in discrete token space
- Must predict tokens sequentially in autoregressive way
- Voicebox Approach:
- Built on continuous normalizing flows over spectrograms rather than discrete tokens
- The key loss function is the flow-matching objective, which is surprisingly simple:
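- The notes break off here; for reference, this is my transcription of the standard conditional flow-matching objective (optimal-transport path) from the flow-matching literature that Voicebox builds on, with c standing for the conditioning (phoneme sequence plus unmasked audio context):

```latex
% Conditional flow-matching loss with the OT path.
% x_1: data (the masked spectrogram frames), x_0 ~ N(0, I), t ~ U[0, 1].
\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{t,\,x_1,\,x_0}
    \Big\| v_\theta(x_t,\, t,\, c) - \big(x_1 - (1 - \sigma_{\min})\, x_0\big) \Big\|^2 ,
\qquad
x_t = \big(1 - (1 - \sigma_{\min})\, t\big)\, x_0 + t\, x_1 .
```

- The model only has to regress this simple interpolation velocity; sampling then integrates it, as in the Euler sketch above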
- WaveNet