Overview
Queue
- MMaDA: Multimodal Large Diffusion Language Models
LLaDA, 2025
mask out random tokens, forward pass to fill them in, re-mask a fraction of the predictions, repeat until everything is unmasked
Training Phase
During training, LLaDA samples a masking ratio t uniformly from [0, 1], then masks each token independently with probability t to create a partially masked sequence x_t. A Transformer without causal masking predicts all masked tokens simultaneously and is trained with cross-entropy loss only on the masked positions, weighted by 1/t so the objective upper-bounds the negative log-likelihood.
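A minimal PyTorch sketch of this training step, assuming a generic bidirectional (non-causal) Transformer `model`, a placeholder `MASK_ID`, and a hypothetical helper name `llada_training_step`; this is an illustration of the objective described above, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336  # assumption: id of a reserved [MASK] token; substitute your tokenizer's actual id

def llada_training_step(model, x0, mask_id=MASK_ID):
    """One masked-diffusion training step in the spirit of LLaDA.

    Assumes `model` is a bidirectional (non-causal) Transformer mapping
    token ids of shape (B, L) to logits of shape (B, L, V).
    """
    B, L = x0.shape
    # Sample a masking ratio t ~ U(0, 1) per sequence (clamped for numerical safety).
    t = torch.rand(B, device=x0.device).clamp_min(1e-3)        # (B,)
    # Mask each token independently with probability t.
    masked = torch.rand(B, L, device=x0.device) < t[:, None]   # (B, L) bool
    xt = torch.where(masked, torch.full_like(x0, mask_id), x0)

    logits = model(xt)                                          # (B, L, V), no causal mask
    # Cross-entropy only on masked positions, weighted by 1/t.
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         x0.reshape(-1), reduction="none").view(B, L)
    loss = (ce * masked / t[:, None]).sum() / (B * L)
    return loss
```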
Inference/Sampling Phase
Here's the step-by-step process (a minimal sampling sketch follows the list):
- Start fully masked: Begin with a fully masked response at t=1 (for conditional generation, the prompt tokens stay unmasked throughout)
- Iterative unmasking: At each step from time t to s (where s < t), the model:
- Feeds the partially masked sequence to the mask predictor
- Predicts all masked tokens simultaneously
- Crucially, re-masks a fraction s/t of the predicted tokens so the sequence stays consistent with the forward process at noise level s
- Remasking strategies: The paper explores different remasking approaches:
- Random remasking: The baseline approach that randomly selects which predicted tokens to remask
- Low-confidence remasking: Re-masks the fraction s/t of predicted tokens with the lowest prediction confidence
- Semi-autoregressive: For instruction-following models, divides the sequence into blocks and generates them left-to-right, applying the reverse process within each block
- Continue until unmasked: This process continues until t=0, where all tokens are unmasked.
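A minimal sketch of this reverse process with low-confidence remasking, under the same assumptions as above (`model`, `MASK_ID`, and the function name `llada_sample` are placeholders, not the paper's code):

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336  # assumption: id of a reserved [MASK] token

@torch.no_grad()
def llada_sample(model, prompt, gen_len=128, steps=64, mask_id=MASK_ID):
    """Reverse-process sampling with low-confidence remasking (sketch).

    `model`: bidirectional Transformer, token ids (1, L) -> logits (1, L, V).
    `prompt`: LongTensor of shape (1, P); the response starts fully masked.
    """
    device = prompt.device
    x = torch.cat([prompt,
                   torch.full((1, gen_len), mask_id, dtype=torch.long, device=device)], dim=1)
    gen_slice = slice(prompt.size(1), x.size(1))

    for step in range(steps):
        t = 1.0 - step / steps           # current noise level
        s = 1.0 - (step + 1) / steps     # next (lower) noise level
        still_masked = x[0, gen_slice] == mask_id
        if not still_masked.any():
            break

        logits = model(x)[0, gen_slice]               # (gen_len, V)
        probs = F.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)                # confidence and argmax per position

        # Tentatively fill every masked position with its prediction.
        filled = x[0, gen_slice].clone()
        filled[still_masked] = pred[still_masked]

        # Low-confidence remasking: send a fraction s/t of the currently
        # masked positions (those predicted least confidently) back to [MASK].
        n_masked = int(still_masked.sum())
        n_remask = int(round(n_masked * (s / t)))
        if n_remask > 0:
            conf_masked = torch.where(still_masked, conf,
                                      torch.full_like(conf, float("inf")))
            remask_idx = conf_masked.topk(n_remask, largest=False).indices
            filled[remask_idx] = mask_id

        x[0, gen_slice] = filled
    return x
```

Random remasking would replace the lowest-confidence selection with a uniform random choice of positions; the semi-autoregressive variant runs this same loop block by block from left to right, keeping earlier blocks fixed.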
The key insight is that unlike autoregressive models that generate one token at a time from left to right, LLaDA can predict multiple tokens simultaneously and iteratively refine the entire sequence through this masking/unmasking process. This enables bidirectional reasoning and helps address limitations like the "reversal curse" that affects standard left-to-right models.
Block diffusion (BD3-LM), 2025
Diffusion
