I'll explain the core technique from the SEDD (Score Entropy Discrete Diffusion) paper in concrete terms.
Traditional language models use autoregressive modeling (predicting the next token given previous tokens). SEDD introduces an alternative approach based on learning probability ratios rather than absolute probabilities.
Instead of learning the probability distribution p(x) directly, SEDD learns the ratios between probabilities:
Concrete Score: s_θ(x)_y ≈ p_data(y)/p_data(x)
This is the discrete analogue of the score function ∇_x log p used in continuous diffusion models.
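To make the ratio concrete, here is a minimal sketch that computes the target concrete score for a toy distribution (the 4-state distribution and the function name `concrete_score` are made up for illustration, not from the paper):

```python
import numpy as np

# Toy data distribution over 4 states (hypothetical values for illustration).
p_data = np.array([0.5, 0.25, 0.15, 0.10])

def concrete_score(p, x):
    """Return the vector of ratios p(y)/p(x) for every state y, given current state x."""
    return p / p[x]

# The concrete score at x=1 says how much more (or less) likely each other
# state y is relative to x: p(0)/p(1) = 2.0, p(2)/p(1) = 0.6, and so on.
print(concrete_score(p_data, x=1))  # [2.0, 1.0, 0.6, 0.4]
```

The model s_θ never sees p_data directly; it is trained so that its output at x approximates this ratio vector.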
SEDD introduces a new loss function called "score entropy" to learn these ratios:
Score Entropy: Σ_{y≠x} [ s_θ(x)_y − (p_data(y)/p_data(x)) log s_θ(x)_y ]
This loss is minimized exactly when s_θ(x)_y equals the true ratio p_data(y)/p_data(x), and the log term keeps the learned ratios positive; it plays a role for concrete scores analogous to the one cross-entropy plays for ordinary probabilities.
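As a rough illustration, the sketch below evaluates the expression above for a predicted ratio vector against the true ratios of a toy state (all numbers are made up; the paper's additional constant term, which shifts the minimum to zero, is omitted here to match the formula as written):

```python
import numpy as np

def score_entropy(s_pred, s_true):
    """Score entropy over the states y ≠ x, following the formula above:
    sum_y  s_pred[y] - s_true[y] * log(s_pred[y]),
    where s_true[y] = p_data(y)/p_data(x) is the true ratio.
    Each summand is minimized at s_pred[y] = s_true[y]."""
    return np.sum(s_pred - s_true * np.log(s_pred))

# Toy true ratios p(y)/p(x) for three other states y (made-up numbers).
s_true = np.array([2.0, 0.6, 0.4])

# A mismatched prediction incurs a higher loss than the exact ratios.
print(score_entropy(np.array([1.0, 1.0, 1.0]), s_true))  # ~3.00 (worse)
print(score_entropy(s_true, s_true))                      # ~2.29 (minimum over s_pred)
```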
The model uses a continuous-time Markov chain that gradually corrupts text: