Overview
Queue
- MMaDA: Multimodal Large Diffusion Language Models
LLaDA, 2025
mask out random tokens, forward pass to fill them in, re-mask a fraction of the predictions, repeat until everything is unmasked
Training Phase
During training, LLaDA samples a masking ratio t uniformly from [0, 1], then masks each token independently with probability t to create a partially masked sequence x_t. A Transformer without causal masking predicts all masked tokens simultaneously and is trained with cross-entropy loss only on the masked positions, weighted by 1/t so the objective upper-bounds the negative log-likelihood.
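A minimal PyTorch sketch of this training step, assuming a generic bidirectional (non-causal) Transformer `model`, a placeholder `MASK_ID`, and a hypothetical helper name `llada_training_step`; this is an illustration of the objective described above, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336  # assumption: id of a reserved [MASK] token; substitute your tokenizer's actual id

def llada_training_step(model, x0, mask_id=MASK_ID):
    """One masked-diffusion training step in the spirit of LLaDA.

    Assumes `model` is a bidirectional (non-causal) Transformer mapping
    token ids of shape (B, L) to logits of shape (B, L, V).
    """
    B, L = x0.shape
    # Sample a masking ratio t ~ U(0, 1) per sequence (clamped for numerical safety).
    t = torch.rand(B, device=x0.device).clamp_min(1e-3)        # (B,)
    # Mask each token independently with probability t.
    masked = torch.rand(B, L, device=x0.device) < t[:, None]   # (B, L) bool
    xt = torch.where(masked, torch.full_like(x0, mask_id), x0)

    logits = model(xt)                                          # (B, L, V), no causal mask
    # Cross-entropy only on masked positions, weighted by 1/t.
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         x0.reshape(-1), reduction="none").view(B, L)
    loss = (ce * masked / t[:, None]).sum() / (B * L)
    return loss
```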
Inference/Sampling Phase
Here's the step-by-step process (a minimal sampling sketch follows the list):
- Start fully masked: Begin with a fully masked response at t=1 (for conditional generation, the prompt tokens stay unmasked throughout)
- Iterative unmasking: At each step from time t to s (where s < t), the model:
- Feeds the partially masked sequence to the mask predictor
- Predicts all masked tokens simultaneously
- Crucially, re-masks a fraction s/t of the predicted tokens so the sequence stays consistent with the forward process at noise level s
- Remasking strategies: The paper explores different remasking approaches:
- Random remasking: The baseline approach that randomly selects which predicted tokens to remask
- Low-confidence remasking: Re-masks the fraction s/t of predicted tokens with the lowest prediction confidence
- Semi-autoregressive: For instruction-following models, divides the sequence into blocks and generates them left-to-right, applying the reverse process within each block
- Continue until unmasked: This process continues until t=0, where all tokens are unmasked.
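A minimal sketch of this reverse process with low-confidence remasking, under the same assumptions as above (`model`, `MASK_ID`, and the function name `llada_sample` are placeholders, not the paper's code):

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336  # assumption: id of a reserved [MASK] token

@torch.no_grad()
def llada_sample(model, prompt, gen_len=128, steps=64, mask_id=MASK_ID):
    """Reverse-process sampling with low-confidence remasking (sketch).

    `model`: bidirectional Transformer, token ids (1, L) -> logits (1, L, V).
    `prompt`: LongTensor of shape (1, P); the response starts fully masked.
    """
    device = prompt.device
    x = torch.cat([prompt,
                   torch.full((1, gen_len), mask_id, dtype=torch.long, device=device)], dim=1)
    gen_slice = slice(prompt.size(1), x.size(1))

    for step in range(steps):
        t = 1.0 - step / steps           # current noise level
        s = 1.0 - (step + 1) / steps     # next (lower) noise level
        still_masked = x[0, gen_slice] == mask_id
        if not still_masked.any():
            break

        logits = model(x)[0, gen_slice]               # (gen_len, V)
        probs = F.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)                # confidence and argmax per position

        # Tentatively fill every masked position with its prediction.
        filled = x[0, gen_slice].clone()
        filled[still_masked] = pred[still_masked]

        # Low-confidence remasking: send a fraction s/t of the currently
        # masked positions (those predicted least confidently) back to [MASK].
        n_masked = int(still_masked.sum())
        n_remask = int(round(n_masked * (s / t)))
        if n_remask > 0:
            conf_masked = torch.where(still_masked, conf,
                                      torch.full_like(conf, float("inf")))
            remask_idx = conf_masked.topk(n_remask, largest=False).indices
            filled[remask_idx] = mask_id

        x[0, gen_slice] = filled
    return x
```

Random remasking would replace the lowest-confidence selection with a uniform random choice of positions; the semi-autoregressive variant runs this same loop block by block from left to right, keeping earlier blocks fixed.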
The key insight is that unlike autoregressive models that generate one token at a time from left to right, LLaDA can predict multiple tokens simultaneously and iteratively refine the entire sequence through this masking/unmasking process. This enables bidirectional reasoning and helps address limitations like the "reversal curse" that affects standard left-to-right models.
Block diffusion (BD3-LM), 2025
Diffusion
