What architecture is this model, how do they represent audio/images/video, do they generate these autoregressively as tokens, and how do they train it?

Based on the paper, here's how BAGEL works:

Architecture

BAGEL uses a Mixture-of-Transformer-Experts (MoT) architecture with:

- Two transformer experts in every layer: one for multimodal understanding and one for multimodal generation, each with its own attention projections and FFN weights
- Shared self-attention: tokens from both experts attend over one unified sequence, so understanding and generation information mixes in every layer
- A language-model backbone initialized from Qwen2.5, with the generation expert added as a parallel copy of the decoder parameters
- Two separate vision encoders: a ViT (SigLIP2) that produces semantic tokens for understanding, and a frozen VAE (from FLUX) that produces latent tokens for generation
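
To make the shared-attention MoT idea concrete, here is a minimal PyTorch sketch written from the description above, not taken from the paper's code; the module names, sizes, and the simple "compute both experts, then select" routing are illustrative, and attention masking is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTLayer(nn.Module):
    """One decoder layer with two experts (0 = understanding, 1 = generation)
    that keep separate weights but share a single self-attention call."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv   = nn.ModuleList(nn.Linear(d_model, 3 * d_model) for _ in range(2))
        self.proj  = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(2))
        self.ffn   = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(2)
        )
        self.norm1 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(2))
        self.norm2 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(2))

    def _per_expert(self, mods, x, expert):
        # Run every token through both experts, then keep the routed output.
        out0, out1 = mods[0](x), mods[1](x)
        return torch.where(expert[..., None] == 0, out0, out1)

    def forward(self, x, expert):
        # x: (B, T, d_model); expert: (B, T) with 0 = und. token, 1 = gen. token
        B, T, D = x.shape
        h = self._per_expert(self.norm1, x, expert)
        q, k, v = self._per_expert(self.qkv, h, expert).chunk(3, dim=-1)
        split = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # Shared attention: one attention call over the whole mixed sequence.
        a = F.scaled_dot_product_attention(split(q), split(k), split(v))
        a = a.transpose(1, 2).reshape(B, T, D)
        x = x + self._per_expert(self.proj, a, expert)
        x = x + self._per_expert(self.ffn, self._per_expert(self.norm2, x, expert), expert)
        return x

# Example: 6 text/ViT tokens routed to the understanding expert, 4 VAE latent
# tokens routed to the generation expert, all attending over one sequence.
layer = MoTLayer()
x = torch.randn(1, 10, 256)
expert = torch.tensor([[0] * 6 + [1] * 4])
print(layer(x, expert).shape)  # torch.Size([1, 10, 256])
```
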
Representation of Different Modalities

Images:

- For understanding, an image is encoded by the ViT (SigLIP2) into semantic patch tokens that are consumed by the understanding expert.
- For generation, the image is encoded by the frozen VAE (from FLUX) into continuous latent tokens handled by the generation expert; images are represented as these continuous latents rather than as discrete codebook tokens.
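
As a rough, shape-only sketch of the two image pathways (stand-in modules, not the actual SigLIP2 or FLUX weights; all sizes are illustrative):

```python
import torch
import torch.nn as nn

# Stand-ins for the two image encoders.
vit_patchify = nn.Conv2d(3, 768, kernel_size=14, stride=14)  # ViT-style patch embed -> semantic tokens
vae_encode   = nn.Conv2d(3, 16, kernel_size=8, stride=8)     # VAE-style 8x downsample -> latent channels

img = torch.randn(1, 3, 224, 224)

# Understanding path: patch tokens fed to the understanding expert.
und_tokens = vit_patchify(img).flatten(2).transpose(1, 2)     # (1, 256, 768)

# Generation path: continuous latents the generation expert learns to produce.
gen_latents = vae_encode(img).flatten(2).transpose(1, 2)      # (1, 784, 16)

print(und_tokens.shape, gen_latents.shape)
```
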
Video:

- Video is handled as a sequence of frames: each frame is encoded with the same image encoders, and the frame tokens are interleaved with text in the sequence, so there is no separate video tokenizer.

Audio:

- The paper does not describe an audio pathway; BAGEL's modalities are text, images, and video frames.

Generation Method