Overview

Queue

LLaDA, 2025

iteratively mask out random tokens, forward pass to fill in, add back some masks, repeat

Training Phase

During training, LLaDA randomly samples a masking ratio t from [0,1], then masks each token independently with probability t to create a partially masked sequence xt. A Transformer (without causal masking) predicts all masked tokens simultaneously, trained with cross-entropy loss only on the masked positions.
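
A minimal sketch of this training step in PyTorch, assuming a bidirectional Transformer `model` that returns per-position logits and a reserved mask token id `MASK_ID` (both names are placeholders, not from the paper); the 1/t weighting follows the masked-diffusion objective, while the normalization over masked tokens is just one practical choice:

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, MASK_ID):
    """One LLaDA-style masked-diffusion training step on clean token ids x0 of shape (B, L)."""
    B, L = x0.shape
    # Sample one masking ratio t per sequence, uniform on (0, 1].
    t = torch.rand(B, 1, device=x0.device).clamp(min=1e-3)
    # Mask each token independently with probability t.
    is_masked = torch.rand(B, L, device=x0.device) < t
    xt = torch.where(is_masked, torch.full_like(x0, MASK_ID), x0)
    # Non-causal forward pass: logits for every position at once, shape (B, L, V).
    logits = model(xt)
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), x0.view(-1), reduction="none"
    ).view(B, L)
    # Cross-entropy only on masked positions, reweighted by 1/t;
    # averaging over the number of masked tokens is an assumed normalization.
    loss = (ce * is_masked / t).sum() / is_masked.sum().clamp(min=1)
    return loss
```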

Inference/Sampling Phase

Here's the step-by-step process:

  1. Start fully masked: Begin with a fully masked sequence at t=1.
  2. Iterative unmasking: At each step from time t to s (where s < t), the model predicts all masked tokens simultaneously, then remasks an expected fraction s/t of those predictions so the sequence matches the lower noise level s.
  3. Remasking strategies: The paper explores different remasking approaches: purely random remasking, low-confidence remasking (deterministically remask the predictions the model is least confident about), and, for the instruct model, semi-autoregressive remasking that generates block by block from left to right; a sketch of the loop with low-confidence remasking follows this list.
  4. Continue until unmasked: This process continues until t=0, where all tokens are unmasked.
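
A sketch of the sampling loop under the same assumed names (`model`, `MASK_ID`), using a linear step schedule, greedy decoding, and low-confidence remasking; the schedule and single-sequence batching are simplifications for illustration, not the paper's exact recipe:

```python
import torch

@torch.no_grad()
def sample(model, length, MASK_ID, steps=64, device="cpu"):
    # Start fully masked (t = 1).
    x = torch.full((1, length), MASK_ID, dtype=torch.long, device=device)
    for i in range(steps):
        t = 1.0 - i / steps            # current noise level
        s = 1.0 - (i + 1) / steps      # next (smaller) noise level
        still_masked = x == MASK_ID
        logits = model(x)              # predict every position at once
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1) # greedy prediction + its confidence (a simplification)
        # Tentatively fill every masked position with its prediction.
        x = torch.where(still_masked, pred, x)
        if s <= 0:
            break                      # t = 0: everything stays unmasked
        # Low-confidence remasking: keep the most confident predictions and
        # return roughly a fraction s/t of the just-filled tokens to [MASK].
        n_masked = int(still_masked.sum())
        n_remask = int(n_masked * s / t)
        if n_remask > 0:
            # Never remask tokens that were already committed in earlier steps.
            conf = conf.masked_fill(~still_masked, float("inf"))
            remask_idx = conf.topk(n_remask, largest=False).indices
            x[0, remask_idx[0]] = MASK_ID
    return x
```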

The key insight is that unlike autoregressive models that generate one token at a time from left to right, LLaDA can predict multiple tokens simultaneously and iteratively refine the entire sequence through this masking/unmasking process. This enables bidirectional reasoning and helps address limitations like the "reversal curse" that affects standard left-to-right models.

Blockwise diffusion (BD3-LM), 2025

Diffusion

mdlm.gif