Insights
In a nutshell, the different levels of noise at which a diffusion model operates allow it to focus on different spatial frequency components of the image at each iterative refinement step. When sampling an image, the model effectively builds it up from low frequencies to high frequencies, first filling in large-scale structure and then adding progressively more fine-grained details.
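To make the frequency picture concrete: white noise has a flat power spectrum, while natural images have power spectra that fall off roughly as $1/f^2$, so raising the noise level drowns out the high frequencies first. Here is a small NumPy sketch of this (my illustration, not from the original text, using a synthetic $1/f^2$ image as a stand-in for a natural image):

```python
import numpy as np

# Synthetic "natural image": amplitude spectrum ~ 1/f, i.e. power ~ 1/f^2.
rng = np.random.default_rng(0)
n = 256
fx, fy = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n))
f = np.hypot(fx, fy)   # radial frequency in cycles/pixel
f[0, 0] = np.inf       # zeroes out the DC component below
image = np.fft.ifft2(np.exp(2j * np.pi * rng.random((n, n))) / f).real
image /= image.std()   # normalize to unit variance

# White Gaussian noise with std sigma has a flat power spectrum:
# each unnormalized FFT bin has expected power n^2 * sigma^2.
signal_power = np.abs(np.fft.fft2(image)) ** 2
for sigma in [0.1, 0.3, 1.0, 3.0]:
    snr = signal_power / (n**2 * sigma**2)  # per-frequency signal-to-noise ratio
    cutoff = f[snr > 1].max() if (snr > 1).any() else 0.0
    print(f"sigma={sigma}: signal dominates up to |f| ~ {cutoff:.3f} cycles/pixel")
```

As $\sigma$ grows, the printed cutoff frequency drops: at high noise levels only the coarse structure of the image remains visible through the noise.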
During training, we sample a noise level for each training example, add noise at that level, and then try to predict the noise. The relative weights with which we sample the different noise levels therefore determine the degree to which the model focuses on large-scale versus fine-grained structure. The most commonly used formulation, with uniform weighting of the noise levels, yields a very different objective from the likelihood loss with which e.g. autoregressive models are trained.
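Written out (in my notation, not taken from the post), this objective is a noise-level-weighted denoising loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0,\; \epsilon \sim \mathcal{N}(0, I),\; t}\left[ w(t)\, \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\; t\right) \right\|^2 \right],$$

where $t$ is drawn uniformly and the commonly used "simple" loss corresponds to $w(t) = 1$; choosing $w(t)$ differently shifts model capacity between large-scale structure (high noise levels) and fine detail (low noise levels).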
It turns out that there is a particular weighting which corresponds directly to the likelihood loss [11], but this puts significantly more weight on very low noise levels. Since low noise levels correspond to high spatial frequencies, this also indirectly explains why likelihood-based autoregressive models in pixel space never really took off: they end up spending way too much of their capacity on perceptually meaningless detail, and never get around to modelling larger-scale structure.
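For reference, in the DDPM parameterization of Ho et al. (2020), the variational (likelihood) bound corresponds to the per-timestep weight

$$w(t) = \frac{\beta_t^2}{2 \sigma_t^2\, \alpha_t\, (1 - \bar\alpha_t)},$$

with $\sigma_t^2$ the variance of the reverse-process noise; this puts much more relative weight on small $t$ (low noise levels) than the uniform $w(t) = 1$ of the simple loss.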
Relative to the likelihood loss, uniform weighting across noise levels in diffusion models yields an objective that is much more closely aligned with the human visual system. I don’t believe this was actually known when people first started training diffusion models on images – it was just a lucky coincidence! But we understand this pretty well now, and I think it is one of the two main reasons why this modelling approach completely took over in the space of two years. (The other reason is of course classifier-free guidance, which you can read more about in my previous blog post on the topic.)
More insights
Peyman Milanfar on X:
> There’s a single formula that makes all of your diffusion models possible: Tweedie’s.
>
> Say $\mathbf{x}$ is a noisy version of $\mathbf{u}$ with $\mathbf{e} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$:
>
> $$\mathbf{x} = \mathbf{u} + \mathbf{e}$$
>
> The MMSE estimate of $\mathbf{u}$ is $\mathbb{E}[\mathbf{u} \mid \mathbf{x}]$, which would seem to require $P(\mathbf{u} \mid \mathbf{x})$. Yet Tweedie says $P(\mathbf{x})$ is all you need.
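The formula itself (the standard statement; it is not spelled out in the excerpt above):

$$\mathbb{E}[\mathbf{u} \mid \mathbf{x}] = \mathbf{x} + \sigma^2\, \nabla_{\mathbf{x}} \log p(\mathbf{x}),$$

so the MMSE denoiser only requires the score of the noisy marginal $p(\mathbf{x})$ – which is exactly the quantity a diffusion model learns to estimate.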
Train a model $\epsilon_\theta(x_t, t)$ to predict the noise $\epsilon$, given the noised image $x_t$ and a timestep embedding of $t$.
Conceptually, this model finds the direction in which to move $x_t$ to make it more like a plausible image: the predicted noise is proportional to the negative score, $\epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar\alpha_t}\, \nabla_{x_t} \log q(x_t)$.
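A minimal PyTorch-style sketch of this training step, under standard DDPM assumptions; `model`, `alphas_cumprod`, and the shapes are placeholders of my choosing:

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, alphas_cumprod):
    """One DDPM-style training step: sample a noise level, noise x0, predict the noise.

    model:          eps_theta(x_t, t) -> predicted noise, same shape as x_t
    x0:             batch of clean images, shape (B, C, H, W)
    alphas_cumprod: precomputed alpha_bar_t for t = 0..T-1, shape (T,)
    """
    B, T = x0.shape[0], alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)   # uniform weighting over noise levels
    eps = torch.randn_like(x0)                        # the noise the model must predict
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    # One-step noising via the "nice property" of Gaussians (see below):
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    return F.mse_loss(model(x_t, t), eps)             # "simple" loss: w(t) = 1
```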
Sampling: at each timestep, use $\epsilon_\theta(x_t, t)$ to predict the noise, subtract a scaled version of it from $x_t$, and add back a smaller amount of fresh noise (see the update below).
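Filling in the update this note gestures at, the standard DDPM sampling step (Ho et al., 2020) is:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),$$

with $z = 0$ at the final step ($t = 1$).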
In one step (the “nice property” of Gaussians, a.k.a. the “reparameterization trick”), we can noise $x_0$ directly to any timestep: $q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\, x_0,\; (1 - \bar\alpha_t) I\right)$, i.e. $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
First we need to understand what we’re modelling: the forward distribution $q$. The forward direction is easy, each step just adds a little Gaussian noise: $q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(\sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t I\right)$. The reverse direction, $q(x_{t-1} \mid x_t)$, is intractable, which is why we learn $p_\theta(x_{t-1} \mid x_t)$ to approximate it.
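The learned reverse process then defines the generative model (standard formulation, added here for completeness):

$$p_\theta(x_0) = \int p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)\, dx_{1:T}, \qquad p(x_T) = \mathcal{N}(0, I).$$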