• Whisper, OpenAI 2022

    • Architecture is a standard encoder-decoder Transformer: the encoder takes log-Mel spectrogram frames (through a small two-layer conv stem), and the decoder autoregressively outputs text tokens while attending to the encoder states via cross-attention (sketch below).

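A minimal sketch of this layout, assuming PyTorch; layer sizes and the vocab size are illustrative, and positional encodings are omitted for brevity (the paper uses sinusoidal encoder positions and learned decoder positions), so this is not the released implementation.

```python
# Minimal sketch of the Whisper layout: Transformer encoder over log-Mel frames,
# autoregressive text decoder attending to the encoder via cross-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWhisper(nn.Module):
    def __init__(self, n_mels=80, d_model=384, n_heads=6, n_layers=4, vocab_size=51_865):
        super().__init__()
        # Conv stem over the spectrogram; the second conv has stride 2, halving the time axis.
        self.conv1 = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mel: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 80, 3000) log-Mel frames; tokens: (batch, text_len) decoder input ids.
        x = F.gelu(self.conv1(mel))
        x = F.gelu(self.conv2(x)).transpose(1, 2)              # (batch, 1500, d_model)
        y = self.token_emb(tokens)                             # (batch, text_len, d_model)
        causal = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.transformer(src=x, tgt=y, tgt_mask=causal)  # decoder cross-attends to x
        return self.lm_head(out)                               # logits over text tokens

logits = TinyWhisper()(torch.randn(1, 80, 3000), torch.zeros(1, 8, dtype=torch.long))
```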

    • Data is the highlight
      • Prior approaches used either large-scale unsupervised audio (~1M hours), which pre-trains only an encoder and still needs fine-tuning, or small, clean labeled sets (~5k hours, later scaled to 10k-30k hours). This work uses 680k hours of weakly supervised web data, filtered with heuristics, closing the gap.
      • Heuristics: transcripts that are all upper-case or all lower-case are unlikely to be human-generated (probably another ASR system's output); CLD2 checks that the detected transcript language matches the spoken language; manual inspection during error analysis adds further (expensive) filters. A sketch follows below.
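A hypothetical sketch of these filtering heuristics; the detector interface and exact rules are assumptions, not the paper's pipeline.

```python
# Hypothetical sketch of the transcript-filtering heuristics described above.
import re

def looks_machine_generated(transcript: str) -> bool:
    # Transcripts that are entirely upper-case or entirely lower-case are unlikely
    # to be human-written and are probably the output of another ASR system.
    letters = re.sub(r"[^A-Za-z]", "", transcript)
    return bool(letters) and (letters.isupper() or letters.islower())

def language_matches(transcript: str, spoken_lang: str, detect_lang) -> bool:
    # detect_lang stands in for a CLD2-style text language detector (assumed callable).
    # The paper keeps mismatched pairs with English transcripts as X->en translation
    # data; that exception is omitted here for brevity.
    return detect_lang(transcript) == spoken_lang

def keep_pair(transcript: str, spoken_lang: str, detect_lang) -> bool:
    return (not looks_machine_generated(transcript)
            and language_matches(transcript, spoken_lang, detect_lang))
```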
    • All audio is re-sampled to 16,000 Hz, and an 80-channel log-magnitude Mel spectrogram is computed on 25-millisecond windows with a stride of 10 milliseconds. So each frame is 80-dim; the 10 ms stride is 160 samples, and a 30 s chunk is 480,000 samples, giving 480,000 / 160 = 3,000 frames (1,500 encoder positions after the stride-2 conv stem). See the sketch below.
      • Max context size of 30 s chunks.
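A feature-extraction sketch checking the frame arithmetic, assuming librosa for the Mel spectrogram (the released code uses its own STFT pipeline, so padding may shift the count by one):

```python
# 80-channel log-Mel spectrogram of a 30 s chunk at 16 kHz.
import numpy as np
import librosa

SAMPLE_RATE = 16_000
N_FFT = 400         # 25 ms window: 0.025 * 16_000 = 400 samples
HOP_LENGTH = 160    # 10 ms stride: 0.010 * 16_000 = 160 samples
N_MELS = 80
CHUNK_SECONDS = 30

audio = np.zeros(SAMPLE_RATE * CHUNK_SECONDS, dtype=np.float32)  # 480,000-sample placeholder

mel = librosa.feature.melspectrogram(
    y=audio, sr=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS,
)
log_mel = librosa.power_to_db(mel)

# 480,000 samples / 160-sample hop = 3,000 frames of 80 dims each
# (librosa's centered padding gives 3,001; Whisper's own code drops the extra frame).
print(log_mel.shape)  # (80, 3001)
```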
    • Training format: special tokens in the decoder prompt encode each case; a chunk either contains no speech or speech, and a speech chunk is transcribed either without or with timestamps (example sequences below).
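A sketch of the decoder target sequences for these cases; the token strings mirror the paper's multitask format and the open-source tokenizer, but treat the exact literals here as illustrative assumptions.

```python
# Illustrative decoder target sequences for the multitask training format.
# Timestamp tokens are quantized (20 ms steps in the paper).

# Speech absent in the 30 s chunk: predict the no-speech token and stop.
no_speech = ["<|startoftranscript|>", "<|nospeech|>", "<|endoftext|>"]

# Speech present, transcribed without timestamps.
speech_no_timestamps = [
    "<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>",
    "hello", " world", "<|endoftext|>",
]

# Speech present, with start/end timestamp tokens wrapped around each text segment.
speech_with_timestamps = [
    "<|startoftranscript|>", "<|en|>", "<|transcribe|>",
    "<|0.00|>", "hello", " world", "<|1.20|>",
    "<|endoftext|>",
]
```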