https://github.com/suno-ai/bark
AudioCraft
Discrete autoencoder → autoregressive transformer over the resulting audio tokens, also conditioned on text
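
A toy sketch of that two-stage pipeline, to make the data flow concrete. This is not AudioCraft's or Bark's API: `CODEBOOK`, `encode`, `decode`, `next_token_weights`, and `generate` are all hypothetical names, the "autoencoder" is just a nearest-neighbor lookup in a fixed scalar codebook, and the "transformer" is a hand-written rule standing in for a learned model.

```python
import random

# Assumption: a tiny fixed scalar codebook stands in for a learned discrete autoencoder.
CODEBOOK = [-1.0, -0.5, 0.0, 0.5, 1.0]


def encode(samples):
    """Map each audio sample to the index of its nearest codebook entry."""
    return [
        min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - s))
        for s in samples
    ]


def decode(tokens):
    """Map token indices back to codebook values (the reconstructed signal)."""
    return [CODEBOOK[t] for t in tokens]


def next_token_weights(prefix, text):
    """Hypothetical stand-in for the transformer.

    Returns unnormalized weights over the next token given the tokens
    generated so far (prefix) and the text prompt (conditioning).
    """
    # Toy rule: the text picks a starting bias; later tokens stay near the previous one.
    bias = len(text) % len(CODEBOOK)
    prev = prefix[-1] if prefix else bias
    return [1.0 / (1 + abs(i - prev)) for i in range(len(CODEBOOK))]


def generate(text, n_tokens, seed=0):
    """Sample tokens one at a time, each conditioned on the prefix and the text."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(n_tokens):
        weights = next_token_weights(tokens, text)
        tokens.append(rng.choices(range(len(CODEBOOK)), weights=weights, k=1)[0])
    return tokens


if __name__ == "__main__":
    tokens = generate("a dog barking", n_tokens=8)
    print(tokens)          # discrete token sequence
    print(decode(tokens))  # decoded back to a (toy) waveform
```

In a real system the codebook and the next-token distribution are both learned, and the autoencoder typically emits several token streams per time step rather than one.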