Fréchet inception distance (FID) score: lower is better
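For reference, FID fits a Gaussian to Inception-v3 features of real and generated images and measures the Fréchet distance between the two:

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the feature means and covariances for real and generated images.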
DALL-E 3, OpenAI 2023
Stable Diffusion 3, Stability 2024
Diffusion transformer architecture, but with separate MLPs for the text and image embeddings that feed into the same attention. Reminiscent of MoEs (but without routing).
Uses “flow matching”, a newer, more general approach
Rectified flow for straight paths
Start from ODE
They formulate all other design points within the same framework
Fortunately it’s also one of the simpler formulations, compared to EDM (Karras), cosine (Nichol), log-normal, and uniform (the last actually being the simplest)
Classic flows are curvy, so sampling needs more steps than it would if the paths were straighter.
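A minimal sketch of the rectified-flow training objective (my own illustrative code; the signature `model(x_t, t)` predicting a velocity is an assumption): linearly interpolate between data and noise, then regress the model onto the constant difference between the two endpoints, which is exactly what makes the target paths straight.

import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0):
    # Illustrative sketch: assumes model(x_t, t) predicts a velocity field.
    eps = torch.randn_like(x0)                      # noise endpoint
    t = torch.rand(x0.shape[0], device=x0.device)   # uniform timestep (the simplest choice)
    t_b = t.view(-1, *[1]*(x0.ndim - 1))            # broadcast over the image dims
    x_t = (1 - t_b)*x0 + t_b*eps                    # straight-line interpolation between data and noise
    v_target = eps - x0                             # constant velocity along that line
    return F.mse_loss(model(x_t, t), v_target)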
Trained on ImageNet, CC12M. Recaptioned with CogVLM. Deduped with faiss.
Autoencoder with increased hidden (latent) dimensions; this matters because the autoencoder's reconstruction quality is an upper bound on how good the generated images can be
Pretrain on low res, fine tune on high res, same as in SD2
Pretrain → fine tune → DPO on 128 captions from PartiPrompts (simple but realistic captions → high-quality images)
Comprehensive paper did grid search over design space
Diffusion Transformers (DiT) 2023
VQ-VAE
ControlNet, 2023
Train with paired data, e.g. edges, depth maps, normal maps, pose, style, etc.
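A minimal sketch of the zero-convolution idea at the heart of ControlNet (illustrative only, not the full wiring): a trainable copy of the UNet encoder processes the conditioning image, and its features are added into the frozen UNet through 1x1 convolutions initialized to zero, so training starts from exactly the original model's behavior.

import torch.nn as nn

class ZeroConv(nn.Module):
    # 1x1 convolution with weight and bias initialized to zero, so the control
    # branch initially contributes nothing to the frozen base model.
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        return self.conv(x)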
DDIM
Can thus choose different sampling steps/skip steps
Altogether:
Also parameterizes the stochasticity (re-added noise) with $\sigma$. $\eta$ is a knob to control this: 0 means $\sigma=0$, 1 means $\sigma$ is its original DDPM value.
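Written out (standard DDIM notation; $\bar\alpha_t$ is the cumulative product of the $\alpha$s, $\hat{x}_0$ the model's estimate of the clean image, and $z\sim\mathcal{N}(0,I)$), this is the update the code below implements:

$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\;\epsilon_\theta(x_t,t) + \sigma_t z, \qquad \hat{x}_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}}$$

$$\sigma_t = \eta\,\sqrt{\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}}\,\sqrt{1-\frac{\bar\alpha_t}{\bar\alpha_{t-1}}}$$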
Code—notice how we parameterize the skip size:
import torch
from fastprogress import progress_bar  # assumed import; any progress bar would do

def ddim_step(x_t, t, noise, abar_t, abar_t1, bbar_t, bbar_t1, eta):
    # Variance of the re-added noise; eta=0 gives deterministic DDIM, eta=1 recovers the DDPM value.
    vari = (bbar_t1/bbar_t) * (1 - abar_t/abar_t1)
    sig = vari.sqrt()*eta
    # Estimate the clean image x_0 from the current sample and the predicted noise.
    x_0_hat = (x_t - bbar_t.sqrt()*noise) / abar_t.sqrt()
    # Step to the previous timestep: scaled x_0 estimate plus the deterministic noise direction.
    x_t = abar_t1.sqrt()*x_0_hat + (bbar_t1 - sig**2).sqrt()*noise
    if t > 0: x_t += sig * torch.randn(x_t.shape).to(x_t)  # re-add stochastic noise
    return x_t

@torch.no_grad()
def sample(f, model, sz, n_steps, skips=1, eta=1.):
    # `abar` (cumulative product of alphas, indexed by timestep) is assumed to be defined globally.
    tsteps = list(reversed(range(0, n_steps, skips)))  # `skips` sets how many timesteps we jump per step
    x_t = torch.randn(sz).to(model.device)
    preds = []
    for i, t in enumerate(progress_bar(tsteps)):
        abar_t1 = abar[tsteps[i+1]] if t > 0 else torch.tensor(1)
        noise = model(x_t, t).sample  # diffusers-style model output
        x_t = f(x_t, t, noise, abar[t], abar_t1, 1-abar[t], 1-abar_t1, eta)
        preds.append(x_t.float().cpu())
    return preds
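Example call (hypothetical sizes; assumes a diffusers-style UNet and a 1000-step $\bar\alpha$ schedule):

samples = sample(ddim_step, model, sz=(16, 3, 32, 32), n_steps=1000, skips=10, eta=0.)
imgs = samples[-1]  # the last prediction is the fully denoised batch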
Classifier-free guidance
Issues with classifier guidance:
Train a joint model over labeled as well as unlabeled images: basically, randomly decide for each training sample whether to include its label or drop it (all absorbed into the same UNet)
Start with the classifier-guidance term; expanding it with Bayes' rule leaves you with both the conditional generator p(x|y) and the unconditional p(x), and we train both in the one model
Incorporate both the conditional and unconditional noise predictions in the sampling process
At each sampling timestep, linearly combine the two predictions, moving away from the unconditional one toward (and beyond) the class-conditional one (sketch below)
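A minimal sketch of that combination at sampling time (illustrative; the conditional model signature and the `null_y` "no label" input are assumptions): guidance_scale = 1 recovers the purely conditional prediction, and larger values push further away from the unconditional one.

def cfg_noise(model, x_t, t, y, null_y, guidance_scale):
    # null_y is the "no label" / empty conditioning the model saw when labels
    # were randomly dropped during training.
    eps_cond = model(x_t, t, y).sample          # conditional prediction eps(x_t | y)
    eps_uncond = model(x_t, t, null_y).sample   # unconditional prediction eps(x_t)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)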
GLIDE
VAE
VQ-VAE
Quantization: force the latent representation to use a codebook
Quantize each latent vector to the closest entry (by, say, L2 distance) in a learnable codebook, i.e. a fixed-size set of vectors.
Need a way to backprop through the quantization. In PyTorch, ensure the gradient at the quantized output is copied straight over to the encoder output:
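A minimal sketch of the straight-through estimator (illustrative names): the forward pass uses the quantized vector, while the backward pass copies the gradient straight to the encoder output, bypassing the non-differentiable nearest-neighbor lookup.

import torch

def quantize(z_e, codebook):
    # z_e: encoder outputs, shape (N, D); codebook: learnable entries, shape (K, D).
    dists = torch.cdist(z_e, codebook)   # pairwise L2 distances to all codebook entries
    idx = dists.argmin(dim=1)            # index of the nearest codebook vector
    z_q = codebook[idx]                  # quantized latents
    # Straight-through: value of z_q in the forward pass, gradient of z_e in the backward pass.
    return z_e + (z_q - z_e).detach()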
UNet: just a popular CNN architecture (encoder-decoder with skip connections), typically built from wide ResNet blocks
Google Imagen, 2022
Simpler architecture than DALL-E (2), scaled up the right way
Basically: you start with a frozen LLM text encoder (they use T5) rather than CLIP, so you're not limited to training on image-text pairs. Feed the text encoding into a text-to-image diffusion model (a UNet).
TODO is the text encoding a fixed-size embedding? So the first model is T5 adapted as an embedding model somehow?
Uses a memory-efficient U-Net: more model parameters at low resolutions, where features are more semantically processed
Cascaded DDPMs: one generates at low resolution, then super-resolution models upscale.
TODO something about using positional encoders in the timesteps, and also pooling the text encoding into the diffusion timesteps, and classifier free guidance with static/dynamic thresholding
DALL-E 2, aka unCLIP, OpenAI 2022
DALL-E, OpenAI 2021
Cascaded diffusion models, 2022
Stable Diffusion, Stability 2022