- Whisper, OpenAI 2022
- Architecture: a standard encoder-decoder Transformer. The encoder takes log-mel spectrogram frames (through a small convolutional stem), and the decoder autoregressively emits text tokens while cross-attending to the encoder output (a minimal sketch follows these notes).
- Data is the highlight
- Prior approaches used either large-scale unsupervised audio (~1M hours) with no text decoder, or small clean labeled sets (~5k hours), later scaled up to 10-30k hours of noisier data. This work uses 680k hours of weakly supervised audio-transcript pairs from the web, filtered with heuristics, closing the gap.
- Heuristics: transcripts that are all upper-case or all lower-case are unlikely to be human-generated and are dropped; the transcript language, detected with CLD2, must match the language of the audio; manual inspection during error analysis filters further (expensive). A rough sketch of these filters follows these notes.
- All audio is re-sampled to 16,000 Hz, and an 80-channel log-magnitude Mel spectrogram is computed on 25-millisecond windows with a stride of 10 milliseconds. So each frame is 80-dimensional; the 10 ms stride is 160 samples, and a 30 s chunk of 480,000 samples gives 480,000 / 160 = 3,000 frames (the encoder's conv stem then downsamples by 2x to 1,500 positions). Feature computation sketched after these notes.
- Maximum context size: 30-second audio chunks.
- Training format: a single multitask token sequence per 30 s segment, marking whether the segment contains speech or no speech, and whether timestamps are predicted or not (an illustrative token sequence is sketched below).
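
A minimal PyTorch sketch of the encoder-decoder setup described above (not the released Whisper code): the conv stem details, layer sizes, vocabulary size, and the omitted positional embeddings are all simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderDecoderASR(nn.Module):
    """Toy Whisper-style model: conv stem over log-mel frames, Transformer
    encoder, and a causal Transformer decoder with cross-attention."""

    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_layers=4, vocab_size=51865):
        super().__init__()
        # Conv stem over the 80-channel mel input; the second conv strides by 2,
        # halving the time axis (e.g. 3000 frames -> 1500 encoder positions).
        self.conv1 = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)  # has cross-attention built in
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)
        # Positional embeddings are omitted here for brevity.

    def forward(self, mel, tokens):
        # mel: (batch, 80, n_frames); tokens: (batch, seq_len) of text-token ids
        x = F.gelu(self.conv1(mel))
        x = F.gelu(self.conv2(x)).transpose(1, 2)       # (batch, n_frames // 2, d_model)
        memory = self.encoder(x)                         # audio representation
        y = self.token_emb(tokens)
        t = tokens.size(1)                               # causal mask for autoregressive decoding
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = self.decoder(y, memory, tgt_mask=causal)     # decoder cross-attends to the audio
        return self.lm_head(h)                           # next-token logits over the vocabulary

model = EncoderDecoderASR()
logits = model(torch.randn(1, 80, 3000), torch.randint(0, 51865, (1, 10)))
print(logits.shape)  # torch.Size([1, 10, 51865])
```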
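
A rough sketch of the two automatic filters mentioned above, assuming the pycld2 binding for CLD2; `audio_language` is a hypothetical label coming from an audio-side language detector, not part of this code.

```python
import pycld2

def keep_pair(transcript: str, audio_language: str) -> bool:
    """Return True if an (audio, transcript) pair survives the automatic filters."""
    # Filter 1: all-upper-case or all-lower-case text is unlikely to be
    # human-written, so drop it.
    if transcript.isupper() or transcript.islower():
        return False
    # Filter 2: the transcript language detected by CLD2 must match the
    # language of the audio (audio_language is an ISO code such as "en").
    _, _, details = pycld2.detect(transcript)
    detected = details[0][1]  # ISO code of the top-ranked detected language
    return detected == audio_language

print(keep_pair("THIS IS ALL CAPS SO IT GETS DROPPED", "en"))  # False
print(keep_pair("This is a normally cased human-written transcript about the weather in London today.", "en"))  # likely True, depends on CLD2's detection
```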
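
A sketch of the feature extraction with those numbers, assuming librosa as the DSP library (Whisper ships its own mel filterbank and a slightly different log scaling and normalization, so the exact values here are illustrative): at 16 kHz, the 25 ms window is 400 samples and the 10 ms stride is 160 samples.

```python
import numpy as np
import librosa

SAMPLE_RATE = 16_000
N_FFT = 400        # 25 ms window at 16 kHz
HOP_LENGTH = 160   # 10 ms stride at 16 kHz
N_MELS = 80

def log_mel(audio: np.ndarray) -> np.ndarray:
    """80-channel log mel spectrogram of a 16 kHz waveform."""
    mel = librosa.feature.melspectrogram(
        y=audio, sr=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS
    )
    # Log with a floor to avoid log(0); Whisper additionally clamps and
    # rescales this, which is omitted here.
    return np.log10(np.maximum(mel, 1e-10))

# A 30 s chunk: 480,000 samples -> roughly 480,000 / 160 = 3,000 frames of 80 dims.
chunk = np.zeros(30 * SAMPLE_RATE, dtype=np.float32)
features = log_mel(chunk)
print(features.shape)  # (80, ~3000); exact frame count depends on padding/centering
```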
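
A rough illustration of the multitask training format: one token sequence per segment, with special tokens selecting speech vs. no-speech and timestamped vs. plain transcription. The exact token spellings below follow the paper's format diagram and should be treated as assumptions rather than the released tokenizer's vocabulary.

```python
from typing import List, Optional

def build_target(transcript: Optional[str], language: str = "en",
                 task: str = "transcribe", timestamps: bool = False) -> List[str]:
    """Assemble the decoder target sequence for one 30 s segment."""
    seq = ["<|startoftranscript|>"]
    if transcript is None:
        # Voice-activity case: the segment contains no speech.
        return seq + ["<|nospeech|>", "<|endoftext|>"]
    seq += [f"<|{language}|>", f"<|{task}|>"]
    if timestamps:
        # Timestamp tokens bracket the text; the fixed values here are
        # placeholders for illustration only.
        seq += ["<|0.00|>", transcript, "<|4.20|>"]
    else:
        seq += ["<|notimestamps|>", transcript]
    return seq + ["<|endoftext|>"]

print(build_target("hello world"))
print(build_target("hello world", timestamps=True))
print(build_target(None))  # no-speech segment
```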