what architecture is this model, how do they represent audio/images/video, do they autoregressively generate these as tokens, and how do they train?
Based on the paper, here's how BAGEL works:
Architecture
BAGEL uses a Mixture-of-Transformer-Experts (MoT) architecture with:
- Two transformer experts: one dedicated to understanding, one to generation
- Shared self-attention across both experts at every layer (a bottleneck-free design; see the sketch after this list)
- 7B active parameters per forward pass (14B total)
- The Qwen2.5 LLM as initialization
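Below is a minimal PyTorch sketch of the shared-attention MoT idea, not the paper's implementation: each token is projected by its own expert's QKV and FFN weights, while a single attention operation runs over the full sequence so understanding and generation tokens can attend to each other. All names (`ExpertWeights`, `MoTLayer`, `is_gen`) and dimensions are illustrative; layer norms, RoPE, and attention masking are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertWeights(nn.Module):
    """One expert's private parameters: QKV, output projection, and FFN."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))


class MoTLayer(nn.Module):
    """Two experts with separate weights but one shared self-attention."""
    def __init__(self, dim: int = 1024, hidden: int = 4096, n_heads: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.und = ExpertWeights(dim, hidden)   # understanding expert
        self.gen = ExpertWeights(dim, hidden)   # generation expert

    def forward(self, x: torch.Tensor, is_gen: torch.Tensor) -> torch.Tensor:
        # x: [seq, dim]; is_gen: [seq] bool, True for generation (VAE) tokens.
        seq, dim = x.shape
        head_dim = dim // self.n_heads

        # Each token's Q/K/V comes from its own expert's projections.
        qkv = x.new_empty(seq, 3 * dim)
        qkv[~is_gen] = self.und.qkv(x[~is_gen])
        qkv[is_gen] = self.gen.qkv(x[is_gen])
        q, k, v = [t.view(seq, self.n_heads, head_dim).transpose(0, 1)
                   for t in qkv.chunk(3, dim=-1)]

        # Shared self-attention: one attention over the full sequence, so
        # understanding and generation tokens interact at every layer.
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(0, 1).reshape(seq, dim)

        # Output projection and FFN are again per-expert (norms omitted).
        h = torch.empty_like(x)
        h[~is_gen] = x[~is_gen] + self.und.out(attn[~is_gen])
        h[is_gen] = x[is_gen] + self.gen.out(attn[is_gen])
        y = torch.empty_like(x)
        y[~is_gen] = h[~is_gen] + self.und.ffn(h[~is_gen])
        y[is_gen] = h[is_gen] + self.gen.ffn(h[is_gen])
        return y


layer = MoTLayer()
x = torch.randn(10, 1024)
is_gen = torch.tensor([False] * 6 + [True] * 4)   # 6 text/ViT tokens, 4 VAE tokens
print(layer(x, is_gen).shape)   # torch.Size([10, 1024])
```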
Representation of Different Modalities
Images:
- For understanding: a SigLIP2-so400m/14 ViT encoder (384 base resolution, supporting inputs up to 980×980)
- For generation: the pre-trained VAE from FLUX, with 8× spatial downsampling and 16 latent channels (see the sketch after this list)
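Here is a shape-level sketch of the two image pathways, under illustrative assumptions (the dimensions, the 2×2 latent patching, and the `vit_proj`/`vae_proj` projection layers are placeholders, not details from the paper): ViT patch features become understanding tokens, while FLUX-VAE latents are patchified into generation tokens, and both are projected to the LLM width.

```python
import torch
import torch.nn as nn

LLM_DIM = 3584       # Qwen2.5-7B hidden size (illustrative)
VIT_DIM = 1152       # SigLIP so400m feature width (assumption)
VAE_CHANNELS = 16    # FLUX VAE latent channels
PATCH = 2            # 2x2 grouping of VAE latents into one token (assumption)

vit_proj = nn.Linear(VIT_DIM, LLM_DIM)                       # -> understanding tokens
vae_proj = nn.Linear(VAE_CHANNELS * PATCH * PATCH, LLM_DIM)  # -> generation tokens

def understanding_tokens(vit_features: torch.Tensor) -> torch.Tensor:
    # vit_features: [n_patches, VIT_DIM] patch embeddings from the SigLIP2 ViT.
    return vit_proj(vit_features)

def generation_tokens(vae_latent: torch.Tensor) -> torch.Tensor:
    # vae_latent: [VAE_CHANNELS, H/8, W/8] from the FLUX VAE (8x downsampling).
    c, h, w = vae_latent.shape
    patches = vae_latent.unfold(1, PATCH, PATCH).unfold(2, PATCH, PATCH)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * PATCH * PATCH)
    return vae_proj(patches)

# Example: a 512x512 image -> [16, 64, 64] latent -> 32*32 = 1024 generation tokens.
print(generation_tokens(torch.randn(16, 64, 64)).shape)  # torch.Size([1024, 3584])
```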
Video:
- Treated as sequences of images with temporal relationships
- Inter-frame captions capture the temporal dynamics between consecutive frames (see the sketch after this list)
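A toy sketch of what such an interleaved layout could look like, assuming hypothetical token lists; the paper only states that videos are handled as image sequences with inter-frame captions describing the changes between frames.

```python
from typing import List

def interleave_video(frame_tokens: List[List[int]],
                     inter_frame_captions: List[List[int]]) -> List[int]:
    """frame_tokens[i]: vision tokens for frame i;
    inter_frame_captions[i]: text tokens describing the change from frame i to i+1."""
    assert len(inter_frame_captions) == len(frame_tokens) - 1
    seq: List[int] = []
    for i, tokens in enumerate(frame_tokens):
        seq.extend(tokens)                         # frame i
        if i < len(inter_frame_captions):
            seq.extend(inter_frame_captions[i])    # caption for the i -> i+1 transition
    return seq

# Example: 3 frames and 2 transition captions.
print(interleave_video([[101, 102], [103, 104], [105, 106]], [[7, 8], [9, 10]]))
# [101, 102, 7, 8, 103, 104, 9, 10, 105, 106]
```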
Audio:
- Not mentioned in the paper; BAGEL appears to handle only text, images, and video
Generation Method