what architecture is this model, how do they represent audio/images/video, do they autoregressively generate these as tokens, and how do they train?
Based on the paper, here's how BAGEL works:
Architecture
BAGEL uses a Mixture-of-Transformer-Experts (MoT) architecture with:
- Two transformer experts: one dedicated to understanding, one to generation
- Shared self-attention across both experts at every layer (a bottleneck-free design; see the sketch after this list)
- 7B active parameters per forward pass (14B total)
- The Qwen2.5 LLM as initialization
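Below is a minimal PyTorch sketch of the shared-attention MoT idea, not the paper's implementation: each token is projected by its own expert's QKV and FFN weights, while a single attention operation runs over the full sequence so understanding and generation tokens can attend to each other. All names (`ExpertWeights`, `MoTLayer`, `is_gen`) and dimensions are illustrative; layer norms, RoPE, and attention masking are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertWeights(nn.Module):
    """One expert's private parameters: QKV, output projection, and FFN."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))


class MoTLayer(nn.Module):
    """Two experts with separate weights but one shared self-attention."""
    def __init__(self, dim: int = 1024, hidden: int = 4096, n_heads: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.und = ExpertWeights(dim, hidden)   # understanding expert
        self.gen = ExpertWeights(dim, hidden)   # generation expert

    def forward(self, x: torch.Tensor, is_gen: torch.Tensor) -> torch.Tensor:
        # x: [seq, dim]; is_gen: [seq] bool, True for generation (VAE) tokens.
        seq, dim = x.shape
        head_dim = dim // self.n_heads

        # Each token's Q/K/V comes from its own expert's projections.
        qkv = x.new_empty(seq, 3 * dim)
        qkv[~is_gen] = self.und.qkv(x[~is_gen])
        qkv[is_gen] = self.gen.qkv(x[is_gen])
        q, k, v = [t.view(seq, self.n_heads, head_dim).transpose(0, 1)
                   for t in qkv.chunk(3, dim=-1)]

        # Shared self-attention: one attention over the full sequence, so
        # understanding and generation tokens interact at every layer.
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(0, 1).reshape(seq, dim)

        # Output projection and FFN are again per-expert (norms omitted).
        h = torch.empty_like(x)
        h[~is_gen] = x[~is_gen] + self.und.out(attn[~is_gen])
        h[is_gen] = x[is_gen] + self.gen.out(attn[is_gen])
        y = torch.empty_like(x)
        y[~is_gen] = h[~is_gen] + self.und.ffn(h[~is_gen])
        y[is_gen] = h[is_gen] + self.gen.ffn(h[is_gen])
        return y


layer = MoTLayer()
x = torch.randn(10, 1024)
is_gen = torch.tensor([False] * 6 + [True] * 4)   # 6 text/ViT tokens, 4 VAE tokens
print(layer(x, is_gen).shape)   # torch.Size([10, 1024])
```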
Representation of Different Modalities
Images:
- For understanding: a SigLIP2-so400m/14 ViT encoder (384 base resolution, supporting inputs up to 980×980)
- For generation: the pre-trained VAE from FLUX, with 8× spatial downsampling and 16 latent channels (see the sketch after this list)
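Here is a shape-level sketch of the two image pathways, under illustrative assumptions (the dimensions, the 2×2 latent patching, and the `vit_proj`/`vae_proj` projection layers are placeholders, not details from the paper): ViT patch features become understanding tokens, while FLUX-VAE latents are patchified into generation tokens, and both are projected to the LLM width.

```python
import torch
import torch.nn as nn

LLM_DIM = 3584       # Qwen2.5-7B hidden size (illustrative)
VIT_DIM = 1152       # SigLIP so400m feature width (assumption)
VAE_CHANNELS = 16    # FLUX VAE latent channels
PATCH = 2            # 2x2 grouping of VAE latents into one token (assumption)

vit_proj = nn.Linear(VIT_DIM, LLM_DIM)                       # -> understanding tokens
vae_proj = nn.Linear(VAE_CHANNELS * PATCH * PATCH, LLM_DIM)  # -> generation tokens

def understanding_tokens(vit_features: torch.Tensor) -> torch.Tensor:
    # vit_features: [n_patches, VIT_DIM] patch embeddings from the SigLIP2 ViT.
    return vit_proj(vit_features)

def generation_tokens(vae_latent: torch.Tensor) -> torch.Tensor:
    # vae_latent: [VAE_CHANNELS, H/8, W/8] from the FLUX VAE (8x downsampling).
    c, h, w = vae_latent.shape
    patches = vae_latent.unfold(1, PATCH, PATCH).unfold(2, PATCH, PATCH)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * PATCH * PATCH)
    return vae_proj(patches)

# Example: a 512x512 image -> [16, 64, 64] latent -> 32*32 = 1024 generation tokens.
print(generation_tokens(torch.randn(16, 64, 64)).shape)  # torch.Size([1024, 3584])
```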
Video:
- Treated as sequences of images with temporal relationships
- Inter-frame captions capture the temporal dynamics between consecutive frames (see the sketch after this list)
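A toy sketch of what such an interleaved layout could look like, assuming hypothetical token lists; the paper only states that videos are handled as image sequences with inter-frame captions describing the changes between frames.

```python
from typing import List

def interleave_video(frame_tokens: List[List[int]],
                     inter_frame_captions: List[List[int]]) -> List[int]:
    """frame_tokens[i]: vision tokens for frame i;
    inter_frame_captions[i]: text tokens describing the change from frame i to i+1."""
    assert len(inter_frame_captions) == len(frame_tokens) - 1
    seq: List[int] = []
    for i, tokens in enumerate(frame_tokens):
        seq.extend(tokens)                         # frame i
        if i < len(inter_frame_captions):
            seq.extend(inter_frame_captions[i])    # caption for the i -> i+1 transition
    return seq

# Example: 3 frames and 2 transition captions.
print(interleave_video([[101, 102], [103, 104], [105, 106]], [[7, 8], [9, 10]]))
# [101, 102, 7, 8, 103, 104, 9, 10, 105, 106]
```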
Audio:
- Not mentioned in the paper; BAGEL appears to handle only text, images, and video
Generation Method