I'll explain the core technique from the SEDD (Score Entropy Discrete Diffusion) paper in concrete terms.
Traditional language models use autoregressive modeling (predicting the next token given previous tokens). SEDD introduces an alternative approach based on learning probability ratios rather than absolute probabilities.
Instead of learning the probability distribution p(x) directly, SEDD learns the ratios between probabilities:
Concrete Score: s_θ(x)_y ≈ p_data(y)/p_data(x)
This is the discrete analogue of the score function ∇_x log p used in continuous diffusion models.
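To make the ratio concrete, here is a minimal sketch that computes the target concrete score for a toy distribution (the 4-state distribution and the function name `concrete_score` are made up for illustration, not from the paper):

```python
import numpy as np

# Toy data distribution over 4 states (hypothetical values for illustration).
p_data = np.array([0.5, 0.25, 0.15, 0.10])

def concrete_score(p, x):
    """Return the vector of ratios p(y)/p(x) for every state y, given current state x."""
    return p / p[x]

# The concrete score at x=1 says how much more (or less) likely each other
# state y is relative to x: p(0)/p(1) = 2.0, p(2)/p(1) = 0.6, and so on.
print(concrete_score(p_data, x=1))  # [2.0, 1.0, 0.6, 0.4]
```

The model s_θ never sees p_data directly; it is trained so that its output at x approximates this ratio vector.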
SEDD introduces a new loss function called "score entropy" to learn these ratios:
Score Entropy: Σ_{y≠x} [ s_θ(x)_y − (p_data(y)/p_data(x)) log s_θ(x)_y ]
This loss is minimized exactly when s_θ(x)_y equals the true ratio p_data(y)/p_data(x), and the log term keeps the learned ratios positive; it plays a role for concrete scores analogous to the one cross-entropy plays for ordinary probabilities.
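As a rough illustration, the sketch below evaluates the expression above for a predicted ratio vector against the true ratios of a toy state (all numbers are made up; the paper's additional constant term, which shifts the minimum to zero, is omitted here to match the formula as written):

```python
import numpy as np

def score_entropy(s_pred, s_true):
    """Score entropy over the states y ≠ x, following the formula above:
    sum_y  s_pred[y] - s_true[y] * log(s_pred[y]),
    where s_true[y] = p_data(y)/p_data(x) is the true ratio.
    Each summand is minimized at s_pred[y] = s_true[y]."""
    return np.sum(s_pred - s_true * np.log(s_pred))

# Toy true ratios p(y)/p(x) for three other states y (made-up numbers).
s_true = np.array([2.0, 0.6, 0.4])

# A mismatched prediction incurs a higher loss than the exact ratios.
print(score_entropy(np.array([1.0, 1.0, 1.0]), s_true))  # ~3.00 (worse)
print(score_entropy(s_true, s_true))                      # ~2.29 (minimum over s_pred)
```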
The model uses a continuous-time Markov chain that gradually corrupts text: