Diffusion models can be interpreted as learning an energy function at each noise level. Here's how:
Diffusion models learn the score function: $\nabla_x \log p_t(x)$
This is directly related to energy: writing $p_t(x) \propto e^{-E_t(x)}$, we get $\nabla_x \log p_t(x) = -\nabla_x E_t(x)$, since the normalizing constant vanishes under the gradient.
So the denoising network is actually learning the negative gradient of an implicit energy function!
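This identity is easy to check on a toy case. Here's a minimal sketch (assuming PyTorch; a standard Gaussian stands in for $p_t$, whose energy is $E(x) = \tfrac{1}{2}\|x\|^2$ up to a constant) verifying that the score matches the negative energy gradient:

```python
import torch

# Toy p_t: standard Gaussian, so E(x) = 0.5 * ||x||^2 (up to a constant)
def energy(x):
    return 0.5 * (x ** 2).sum()

x = torch.randn(3, requires_grad=True)

# Analytic score of the standard Gaussian: ∇_x log p(x) = -x
score = -x

# Negative energy gradient, computed with autograd
(neg_energy_grad,) = torch.autograd.grad(-energy(x), x)

assert torch.allclose(score, neg_energy_grad)  # score == -∇E(x)
```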
For a diffusion model at timestep t:
```python
# What the model learns: the score of the noised data distribution
score = denoising_network(x_noisy, t)  # approximates ∇_x log p_t(x)

# Implicit energy interpretation: the score is the negative energy gradient
energy_gradient = -score

# The energy itself could be recovered by integrating this gradient
# (though sampling never needs it explicitly)
```
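Expanding on that last comment: because the score is a gradient field, energy *differences* can be recovered as a line integral, $E_t(x_b) - E_t(x_a) = -\int_0^1 s_t(x_a + \alpha(x_b - x_a))^\top (x_b - x_a)\, d\alpha$, where $s_t$ is the score. A rough numpy sketch, with `score_fn` as a hypothetical stand-in for the trained network:

```python
import numpy as np

def energy_difference(score_fn, x_a, x_b, t, n_steps=256):
    """Estimate E_t(x_b) - E_t(x_a) via a line integral of -score
    along the straight segment from x_a to x_b (trapezoidal rule)."""
    alphas = np.linspace(0.0, 1.0, n_steps)
    direction = x_b - x_a
    # ∇E = -score, so integrating its projection onto the path
    # accumulates the change in energy
    integrand = np.array([-score_fn(x_a + a * direction, t) @ direction
                          for a in alphas])
    return np.trapz(integrand, alphas)

# Sanity check with the analytic Gaussian score (score = -x, E = 0.5*||x||^2):
# energy_difference(lambda x, t: -x, np.zeros(2), np.ones(2), t=0) ≈ 1.0
```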
The denoising process can then be viewed as noisy gradient descent on this energy:
```python
# Standard diffusion sampling step (score parameterization): follow the score
x_t = x_{t+1} + step_size * denoising_network(x_{t+1}, t+1) + noise

# Energy-based interpretation: substitute score = -∇E
x_t = x_{t+1} + step_size * (-∇E_{t+1}(x_{t+1})) + noise
#   = x_{t+1} - step_size * ∇E_{t+1}(x_{t+1}) + noise
```
This update is Langevin dynamics, a classic way to sample from an energy-based model (EBM)!
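A self-contained sketch makes this concrete. Using a toy quadratic energy $E(x) = \tfrac{1}{2}x^2$ in place of a trained network (so the target $p(x) \propto e^{-E(x)}$ is a standard Gaussian), the Langevin update below recovers the right mean and standard deviation; note the standard $\sqrt{2\eta}$ scaling on the noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy energy: E(x) = 0.5 * x^2, so exp(-E) is a standard Gaussian
def grad_energy(x):
    return x

def langevin_sample(n_steps=500, step_size=0.01):
    x = rng.normal()
    for _ in range(n_steps):
        # Langevin update: descend the energy, plus sqrt(2 * step) noise
        x = x - step_size * grad_energy(x) + np.sqrt(2 * step_size) * rng.normal()
    return x

samples = np.array([langevin_sample() for _ in range(1000)])
print(samples.mean(), samples.std())  # ≈ 0 and ≈ 1 for a standard Gaussian
```

Diffusion samplers differ in their step-size schedule and noise scaling across timesteps, but the structure of each update is the same.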