• Diffusion foundations

  • Fréchet inception distance (FID) score: lower is better

    • Compare Inception-network activation statistics of real vs. generated images: fit a Gaussian to each set of activations and compute the Fréchet distance between the two (see sketch below)
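    • A minimal sketch of the distance itself, assuming the Inception activations have already been reduced to means and covariances (mu/cov names are illustrative):

      import numpy as np
      from scipy.linalg import sqrtm

      def frechet_distance(mu1, cov1, mu2, cov2):
          # FID between two Gaussians fit to real vs. generated activations:
          # ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^(1/2))
          covmean = sqrtm(cov1 @ cov2)
          if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
              covmean = covmean.real
          diff = mu1 - mu2
          return float(diff @ diff + np.trace(cov1 + cov2 - 2 * covmean))
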
  • DALL-E 3, OpenAI 2023

    • Recaptioning technique
    • Key hypothesis: existing models struggle to follow detailed image descriptions and often ignore words or confuse prompt meanings because training captions are noisy and vague; the fix is just more detailed, accurate training captions, which a strong captioning model can generate.
    • At sample time: use GPT-4 to “upsample” prompt
    • The base is a latent-diffusion (SD-style) model
      • 256x256 → 32x32 latents + diffusion (not VAE!) decoder
      • Might have SR diffusion
  • Stable Diffusion 3, Stability 2024

    • Diffusion-transformer architecture, but with separate MLP/projection weights for the text and image token streams, feeding into the same attention. Reminiscent of MoE, but with no routing (rough sketch at the end of this sub-list).

      • Uses an ensemble of text encoders (2 CLIPs and 1 T5), but aggressive dropout during training (46.3%, dropping entire encoders) means at inference you can load a subset; no need for all 3
      • Dropping T5 causes more misspellings
      • They parameterize the family of model sizes by the number of DiT layers $d$ (hidden size and head count scale with $d$)
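      • A rough sketch of that joint attention, assuming the two token streams each keep their own projection weights (the separate MLPs follow the same pattern); all names are illustrative:

        import torch
        import torch.nn as nn

        class JointAttention(nn.Module):
            # MM-DiT idea: text and image tokens get separate weights
            # but attend jointly as one concatenated sequence.
            def __init__(self, dim, n_heads):
                super().__init__()
                self.n_heads = n_heads
                self.qkv_img = nn.Linear(dim, 3 * dim)  # image-stream weights
                self.qkv_txt = nn.Linear(dim, 3 * dim)  # text-stream weights
                self.out_img = nn.Linear(dim, dim)
                self.out_txt = nn.Linear(dim, dim)

            def forward(self, img, txt):  # img: (B, Ti, D), txt: (B, Tt, D)
                B, Ti, D = img.shape
                qkv = torch.cat([self.qkv_img(img), self.qkv_txt(txt)], dim=1)
                q, k, v = qkv.chunk(3, dim=-1)
                heads = lambda x: x.view(B, -1, self.n_heads, D // self.n_heads).transpose(1, 2)
                att = nn.functional.scaled_dot_product_attention(heads(q), heads(k), heads(v))
                att = att.transpose(1, 2).reshape(B, -1, D)
                return self.out_img(att[:, :Ti]), self.out_txt(att[:, Ti:])
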
    • Uses flow matching, a newer and more general framing of diffusion-style training

    • Rectified flow for straight paths

      • Start from an ODE, $\mathrm{d}x_t = v(x_t, t)\,\mathrm{d}t$, with a learned velocity field $v$

      • They formulate all other design points within the same framework

      • Fortunately it’s also one of the simpler formulations, compared to EDM (Karras) and cosine (Nichol); of the timestep-sampling densities they try (log-normal, uniform), uniform is actually the simplest

      • Classic flows follow curvy paths, so sampling needs more steps than it would if the paths were straight; rectified flow instead trains on straight paths (see the sketch below)

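      • A minimal sketch of a rectified-flow training step, assuming a velocity-prediction network model(x_t, t) (name illustrative): the data-to-noise path is a straight line, and the target is its constant velocity.

        import torch
        import torch.nn.functional as F

        def rf_loss(model, x0):
            # straight-line interpolation between data x0 and noise x1
            x1 = torch.randn_like(x0)
            t = torch.rand(x0.shape[0], device=x0.device)  # uniform t (SD3 actually prefers log-normal sampling)
            tb = t.view(-1, *([1] * (x0.dim() - 1)))       # broadcast t over image dims
            x_t = (1 - tb) * x0 + tb * x1
            v_target = x1 - x0                             # constant velocity of the straight path
            return F.mse_loss(model(x_t, t), v_target)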

    • Trained on ImageNet, CC12M. Recaptioned with CogVLM. Deduped with faiss.

    • Autoencoder latent dimension increased (4 → 16 channels), which matters since autoencoder reconstruction quality is an upper bound on how good generated images can be

    • Pretrain on low res, fine-tune on high res, same as in SD2

    • Pretrain → fine-tune → DPO on 128 captions from PartiPrompts (simple but realistic captions → high-quality images)

    • Comprehensive paper: did a grid search over the design space (objectives, schedules, model sizes)

    • Paper tweets

    • hu-po

    • Rectified flow

    • Sander on diffusion distillation

  • Diffusion Transformers (DiT) 2023

    • Start with ViT
    • Operate in noised latent space
    • Patchify with strided 2x2 or 4x4 convolutions
    • Naive method
      • Must condition on timestep, class. So in addition to patch sequence, could concat timestep embedding and class.
      • But want text, not class. So concat the text embedding. But now it’s messy, doesn’t work as well. [Why?]
    • Also can do cross-attention on the text embeddings
      • Works well but computationally expensive: it adds another attention layer on top of self-attention (roughly 15% more FLOPs)
    • So, introduced adaLN-Zero: condition via adaptive layer norm on the timestep and class embeddings (sketch at the end of this list)
      • Let conditioning define the scale/shift params, just via MLP
      • An additional conditioned gate scale multiplies the block output before the residual (skip) sum, determining how important the attention result is; it’s initialized to zero so each block starts as the identity (the “Zero”)
    • Transformers just more scalable
    • https://www.youtube.com/watch?v=fWUwDEi1qlA
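    • A minimal sketch of one adaLN-Zero block, assuming a conditioning vector c (timestep + class embedding); names are illustrative:

      import torch
      import torch.nn as nn

      class AdaLNZeroBlock(nn.Module):
          def __init__(self, dim, n_heads):
              super().__init__()
              self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
              self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
              self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
              self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
              # MLP regresses per-block shift/scale/gate from the conditioning vector
              self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
              nn.init.zeros_(self.ada[1].weight)  # "Zero": gates start at 0,
              nn.init.zeros_(self.ada[1].bias)    # so each block starts as the identity

          def forward(self, x, c):  # x: (B, T, D) tokens; c: (B, D) conditioning
              s1, b1, g1, s2, b2, g2 = self.ada(c).unsqueeze(1).chunk(6, dim=-1)
              h = self.norm1(x) * (1 + s1) + b1                       # conditioned scale/shift
              x = x + g1 * self.attn(h, h, h, need_weights=False)[0]  # conditioned gate on the skip sum
              h = self.norm2(x) * (1 + s2) + b2
              return x + g2 * self.mlp(h)
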
  • ControlNet, 2023

    • Train with paired data: condition on e.g. edges, depth maps, normal maps, pose, style, etc. (sketch below)

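    • The mechanism, as a sketch rather than the exact implementation: freeze the pretrained UNet, make a trainable copy of its encoder blocks, feed the control image into the copy, and connect the two through 1x1 “zero convolutions” so training starts as a no-op. The block wrapper below is hypothetical:

      import copy
      import torch.nn as nn

      def zero_conv(ch):
          # 1x1 conv initialized to zero: the control branch starts as a no-op
          conv = nn.Conv2d(ch, ch, 1)
          nn.init.zeros_(conv.weight)
          nn.init.zeros_(conv.bias)
          return conv

      class ControlledBlock(nn.Module):
          def __init__(self, block, ch):  # block: one pretrained UNet encoder block
              super().__init__()
              self.frozen = block.requires_grad_(False)  # original weights, locked
              self.copy = copy.deepcopy(block)           # trainable copy
              self.zin, self.zout = zero_conv(ch), zero_conv(ch)

          def forward(self, x, cond):  # cond: encoded control image (edges, depth, pose, ...)
              h = self.copy(x + self.zin(cond))
              return self.frozen(x) + self.zout(h)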

  • DDIM

    • Actually feels cleaner and simpler than DDPM—people seem to prefer this formulation
    • Start with standard forward. WARNING: they change notations/variable names, somewhat confusing!

    • Pretty straightforward: rewind x_t to x_0 then jump forward to x_{t-1}


    • Can thus choose different sampling steps/skip steps

    • Altogether:

      • We’re taking small steps: each step rewinds $x_t$ to an estimate of $x_0$, then jumps forward to $x_{t-1}$:

        $\hat{x}_0 = \dfrac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}$

        $x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t, t) + \sigma_t z, \quad z \sim \mathcal{N}(0, I)$

      • Notice the weighting: $\hat{x}_0$ is weighted by $\sqrt{\bar\alpha_{t-1}}$, the fresh noise by $\sigma_t$, and the predicted noise (the direction pointing back toward $x_t$) by the rest, $\sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}$.

    • Also parameterizes the stochasticity (re-added noise) with $\sigma_t = \eta\,\sqrt{\tfrac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}}\sqrt{1-\tfrac{\bar\alpha_t}{\bar\alpha_{t-1}}}$. $\eta$ is the knob to control this: $\eta=0$ means $\sigma_t=0$ (deterministic DDIM), $\eta=1$ means $\sigma_t$ is its original DDPM value.

    • Code—notice how we parameterize the skip size:

      import torch
      from fastprogress import progress_bar

      # abar: precomputed cumulative product of alphas, a 1-D tensor indexed by timestep
      def ddim_step(x_t, t, noise, abar_t, abar_t1, bbar_t, bbar_t1, eta):
          # sigma_t from the DDIM paper; eta=0 -> deterministic, eta=1 -> DDPM-level noise
          vari = (bbar_t1/bbar_t) * (1 - abar_t/abar_t1)
          sig = vari.sqrt() * eta
          # rewind x_t to the predicted clean image...
          x_0_hat = (x_t - bbar_t.sqrt()*noise) / abar_t.sqrt()
          # ...then jump forward to the previous timestep, re-adding sig worth of fresh noise
          x_t = abar_t1.sqrt()*x_0_hat + (bbar_t1 - sig**2).sqrt()*noise
          if t > 0: x_t += sig * torch.randn(x_t.shape).to(x_t)
          return x_t

      @torch.no_grad()
      def sample(f, model, sz, n_steps, skips=1, eta=1.):
          # skips > 1 subsamples the timesteps -- the whole point of DDIM sampling
          tsteps = list(reversed(range(0, n_steps, skips)))
          x_t = torch.randn(sz).to(model.device)
          preds = []
          for i, t in enumerate(progress_bar(tsteps)):
              abar_t1 = abar[tsteps[i+1]] if t > 0 else torch.tensor(1.)
              noise = model(x_t, t).sample  # predicted epsilon (diffusers-style output)
              x_t = f(x_t, t, noise, abar[t], abar_t1, 1-abar[t], 1-abar_t1, eta)
              preds.append(x_t.float().cpu())
          return preds
      
    • fastai

  • Muse 2023

  • Cold diffusion: none of the Gaussian technicalities matter; any degradation can play the role of the noise (blurring, masking, etc.), see the sketch below


    • https://arxiv.org/pdf/2208.09392.pdf
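    • A sketch of the improved sampling loop from the paper, assuming a degradation operator D(x0, t) (e.g. t steps of blur) and a learned restoration model R(x_t, t); names are illustrative:

      def cold_sample(x_T, R, D, T):
          # repeatedly restore, then re-degrade one step less
          x_t = x_T
          for t in range(T, 0, -1):
              x0_hat = R(x_t, t)                           # estimate the clean image
              x_t = x_t - D(x0_hat, t) + D(x0_hat, t - 1)  # step from t to t-1
          return x_t
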
  • Classifier-free guidance

    • Issues with classifier guidance:
      • Can’t use pretrained classifier since it must be trained on noisy data
      • Hard to justify theoretically (the guidance gradient can act like an adversarial perturbation on the classifier)
    • Train one joint model over labeled and unlabeled data: randomly decide per sample whether to train with the label or with a null label (all absorbed into the one UNet)
    • Start with the classifier-guidance term and apply Bayes’ rule, $\nabla\log p(y|x_t) = \nabla\log p(x_t|y) - \nabla\log p(x_t)$: you’re left with both the conditional score and the unconditional one, and we’re training both in the same model
    • Incorporate both labeled and unlabeled noise predictions in the sampling process
    • At each sampling step, linearly extrapolate from the unconditional prediction toward (and, for guidance weight $w>1$, past) the conditional one (sketch below)
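    • A minimal sketch of the guided prediction, assuming a hypothetical model(x_t, t, y) signature where null_y is the learned “no label” input used during label dropping:

      def cfg_noise_pred(model, x_t, t, y, null_y, w):
          # run the one joint model both ways, then extrapolate:
          # w = 0 -> unconditional, w = 1 -> plain conditional, w > 1 -> amplified guidance
          eps_cond = model(x_t, t, y)         # conditional prediction
          eps_uncond = model(x_t, t, null_y)  # unconditional prediction
          return eps_uncond + w * (eps_cond - eps_uncond)
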
  • GLIDE

  • VAE

  • VQ-VAE

  • UNet: just a popular CNN encoder-decoder architecture with skip connections; diffusion UNets use wide ResNet blocks

  • Google Imagen, 2022

  • DALL-E 2 aka unCLIP, OpenAI 2022

  • DALL-E, OpenAI 2021

  • Cascaded diffusion models, 2022

  • Stable Diffusion, Stability 2022