• References
    • Video diffusion models survey tutorial, 2023
    • Berkeley CS294-158 lecture
  • Approaches
    • (2+1)D factorization as a cascade: break the problem into smaller independent pieces, i.e. a low-resolution, low-frame-rate base model followed by spatial/temporal superres models.
      • Complex pipelines of several models mean more moving pieces
      • E.g. Imagen Video, Make-A-Video, Lumiere [Lumiere doesn’t need the temporal SR though]
    • Per-frame latent space models, i.e. (2+1)D factorization over latent space. No spatial superres needed. (Sampling-time sketch after this list.)
      • Learn per-frame autoencoder (spatial), e.g. VQGAN (for AR), VAE (diffusion)
      • Learn base video model on keyframes (temporal)
      • Learn frame interpolation model(s) to upsample FPS
      • E.g. Align Your Latents, Stable Video Diffusion, Emu Video
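      • A minimal sampling-time sketch of how these pieces compose; the module names (frame_vae, keyframe_model, interpolator) and their sample/decode methods are hypothetical placeholders, not any specific paper's API.

        ```python
        import torch

        def sample_video(frame_vae, keyframe_model, interpolator, prompt,
                         n_key=8, upsample=4):
            # 1) Base video model generates latents for sparse keyframes: (n_key, C, h, w).
            key_latents = keyframe_model.sample(prompt, num_frames=n_key)

            # 2) Frame-interpolation model fills in latents between keyframes,
            #    multiplying the frame count (e.g. 8 -> 32 frames).
            all_latents = interpolator.sample(key_latents, factor=upsample)

            # 3) Decode every frame independently with the per-frame (2D) autoencoder,
            #    so no spatial super-resolution stage is needed.
            frames = torch.stack([frame_vae.decode(z) for z in all_latents])
            return frames  # (n_key * upsample, 3, H, W)
        ```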
    • Both of the above can leverage text-to-image pretraining: freeze the spatial convs/attentions and train only the new 1D (or 3D) temporal convs/attentions. The alternative is full joint training.
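      • A PyTorch-flavored sketch of that freeze-spatial / train-temporal recipe; the zero-initialized residual temporal conv and the helper names are illustrative assumptions, not any particular model's code.

        ```python
        import torch
        import torch.nn as nn

        class TemporalConv(nn.Module):
            """New 1D conv over the time axis, zero-initialized so the inflated
            network initially behaves exactly like the pretrained image model."""
            def __init__(self, channels):
                super().__init__()
                self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
                nn.init.zeros_(self.conv.weight)
                nn.init.zeros_(self.conv.bias)

            def forward(self, x):                      # x: (B, C, T, H, W)
                b, c, t, h, w = x.shape
                y = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
                y = self.conv(y).reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
                return x + y                           # residual: starts as identity

        def inflate_and_freeze(image_model, channels_per_block):
            """Freeze all pretrained (spatial) parameters; only the new temporal
            layers, interleaved with the spatial blocks, remain trainable."""
            for p in image_model.parameters():
                p.requires_grad = False
            return nn.ModuleList(TemporalConv(c) for c in channels_per_block)
        ```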
    • 3D autoencoders. More popular recently. Downsample over time and space together, e.g. 16 × 256 × 256 → 4 × 16 × 16.
      • E.g. VideoGPT (2021) / TATS (2022) which use 3D CNN VQ-VAE/VQ-GAN, learn an AR prior
      • E.g. LVDM (2022), which uses 3D CNN VAE, learn a diffusion prior
      • Problem: how to leverage image-text data, which has more diversity and better text/captions than video data? Approach: treat the first frame differently, e.g. 17 × 256 × 256 → (1 + 4) × 16 × 16. (Causal-conv sketch after this list.)
        • E.g. Phenaki (Sept 2022), 3D ViT-VQ, learn a MaskGit prior
        • E.g. MAGVIT-v2 (Oct 2023) / VideoPoet (Dec 2023), Causal 3D CNN LFQ, learn a MaskGit prior / AR prior
        • E.g. WALT (Dec 2023), Causal 3D CNN VAE, learn a diffusion prior
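      • The first-frame trick above is typically implemented with causal temporal convolutions, so frame t only sees frames ≤ t and a still image can be encoded as a 1-frame video. A minimal sketch (the class below is an assumed implementation, not any specific paper's code):

        ```python
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class CausalConv3d(nn.Module):
            """3D conv that pads the time axis only on the left, so the output
            at time t depends only on frames <= t; the first frame therefore
            gets a latent of its own."""
            def __init__(self, in_ch, out_ch, kernel=(3, 3, 3), stride=(1, 1, 1)):
                super().__init__()
                kt, kh, kw = kernel
                # pad order for 5D input: (W_left, W_right, H_left, H_right, T_left, T_right)
                self.pad = (kw // 2, kw // 2, kh // 2, kh // 2, kt - 1, 0)
                self.conv = nn.Conv3d(in_ch, out_ch, kernel, stride=stride)

            def forward(self, x):                      # x: (B, C, T, H, W)
                return self.conv(F.pad(x, self.pad))

        # Two such layers with temporal stride 2 map a (1 + 4k)-frame clip to
        # (1 + k) latent frames, e.g. 17 frames -> 5, matching (1 + 4) x 16 x 16.
        ```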
  • Improvement areas: scale; representation (latent spaces); data (filtering, recaptioning, finetuning on extremely high-quality subsets)
  • Possibly look into
    • https://videogigagan.github.io/: Towards Detail-rich Video Super-Resolution (HN)
  • Open-Sora
    • Bold name given results, but at least it's an open attempt
    • Would ideally use a MAGVIT-style spatiotemporal VAE, but none was open-sourced, so v1 falls back to a 2D image VAE (see quotes below)
    • Details here, easy read https://github.com/hpcaitech/Open-Sora/blob/main/docs/report_v1.md
      • However, we found that there is no open-source high-quality spatial-temporal VAE model. MAGVIT's 4x4x4 VAE is not open-sourced, while VideoGPT's 2x4x4 VAE has a low quality in our experiments. Thus, we decided to use a 2D VAE (from Stability-AI) in our first version

      • As shown in the figure, we insert a temporal attention right after each spatial attention in STDiT (ST stands for spatial-temporal). This is similar to variant 3 in Latte's paper

      • PixArt-α is an efficiently trained high-quality image generation model with T5-conditioned DiT structure. We initialize our model with PixArt-α and initialize the projection layer of inserted temporal attention with zero
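
      • A sketch of what the quoted design amounts to: a temporal self-attention inserted after each spatial attention, with the new output projection zero-initialized so training starts from the pretrained image model's behavior. The module layout here is illustrative, not Open-Sora's actual code.

        ```python
        import torch
        import torch.nn as nn

        class SpatialTemporalBlock(nn.Module):
            """Spatial attention over tokens within each frame, followed by a new
            temporal attention over the same spatial location across frames."""
            def __init__(self, dim, heads=8):
                super().__init__()
                self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                # Zero-init the temporal output projection so the block initially
                # reduces to the pretrained spatial-only behavior.
                nn.init.zeros_(self.temporal_attn.out_proj.weight)
                nn.init.zeros_(self.temporal_attn.out_proj.bias)

            def forward(self, x):                  # x: (B, T, N, D), frames x tokens
                b, t, n, d = x.shape
                s = x.reshape(b * t, n, d)         # attend within each frame
                s = s + self.spatial_attn(s, s, s, need_weights=False)[0]
                s = s.reshape(b, t, n, d)

                u = s.permute(0, 2, 1, 3).reshape(b * n, t, d)   # attend across time
                u = u + self.temporal_attn(u, u, u, need_weights=False)[0]
                return u.reshape(b, n, t, d).permute(0, 2, 1, 3)
        ```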

    • There are also other attempts: https://github.com/PKU-YuanGroup/Open-Sora-Plan, https://github.com/mini-sora/minisora
  • Sora, OpenAI 2024
    • Another big step up in quality, duration

    • Scaled up diffusion transformer (see Image generation)

      • Diffusion transformer over space time patches representing codes in latent space
      • Learn a VAE → encode → patchify → flatten
      • Spatial-temporal VAE to also reduce the temporal dimension (as noted above, no good open-source spatio-temporal VAE exists)
    • Flattening into a token sequence supports different resolutions, durations, and aspect ratios (patchify sketch after this list)

      • Trains at native resolutions for better results
      • vs. trying to do this with a UNet
      • Probably requires factorized position embeddings, one per axis (x, y, t)
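    • A rough sketch of the encode → patchify → flatten step with one position embedding per axis; patch sizes, maximum extents, and the embedding details are guesses for illustration, not Sora's actual design.

      ```python
      import torch
      import torch.nn as nn

      class SpacetimePatchify(nn.Module):
          """Turn VAE latents (B, C, T, H, W) into a flat token sequence whose
          length varies with duration/resolution, enabling native-size training."""
          def __init__(self, in_ch, dim, patch=(1, 2, 2), max_t=64, max_h=64, max_w=64):
              super().__init__()
              self.proj = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)
              # One learned position embedding per axis, summed (factorized 3D pos-emb).
              self.pos_t = nn.Parameter(torch.zeros(max_t, dim))
              self.pos_h = nn.Parameter(torch.zeros(max_h, dim))
              self.pos_w = nn.Parameter(torch.zeros(max_w, dim))

          def forward(self, z):                    # z: (B, C, T, H, W) latents
              x = self.proj(z)                     # (B, D, t, h, w) patch embeddings
              b, d, t, h, w = x.shape
              pos = (self.pos_t[:t, None, None, :]
                     + self.pos_h[None, :h, None, :]
                     + self.pos_w[None, None, :w, :])      # (t, h, w, D)
              x = x.permute(0, 2, 3, 4, 1) + pos           # add per-axis embeddings
              return x.reshape(b, t * h * w, d)            # flat token sequence
      ```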
    • Sora is not an autoregressive model like GPT, but a diffusion transformer. From the technical report[1], it is clear that it predicts the entire sequence of spatiotemporal patches at once. [Like Lumiere]

    • Speculation: trained with random conditioning (masking) to support video prediction and infilling (sketch after the examples below). [Like Lumiere, Phenaki]

      • Image to video

      • Extend video

      • Extend generated video (can also loop)

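    • A sketch of what such random conditioning could look like at training time: some frames are kept clean as conditioning, the rest are noised and must be predicted, and the pattern is sampled randomly so one model covers text-to-video, image-to-video, and extension. Entirely speculative, matching the note above, not Sora's actual recipe.

      ```python
      import torch

      def sample_conditioning_mask(batch, frames, p_image2video=0.3, p_extend=0.3):
          """Return a (batch, frames) bool mask; True = frame is given as clean
          conditioning, False = frame must be generated (noised during training)."""
          mask = torch.zeros(batch, frames, dtype=torch.bool)
          for i in range(batch):
              r = torch.rand(1).item()
              if r < p_image2video:
                  mask[i, 0] = True                # condition on first frame only
              elif r < p_image2video + p_extend:
                  k = torch.randint(1, frames, (1,)).item()
                  mask[i, :k] = True               # condition on a prefix -> extend video
              # else: plain text-to-video, no frames given
          return mask

      # During training: frames where mask is True keep their clean latents; the
      # rest get diffusion noise, and the loss is computed only on the noised frames.
      ```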

    • Uses recaptioning technique

    • Can be prompted with images

    • https://www.youtube.com/watch?v=fWUwDEi1qlA

  • Lumiere, Google 2024
    • Big step up in quality

    • Space-time UNet (STUNet) generates the entire temporal duration of the video at once

      • Contrast with generating distant keyframes and interpolating. One claimed problem: frames can be locally coherent (between two keyframes) but not globally coherent (across different keyframe-interpolated segments), e.g. walking steps that drift between segments.

      • Looks like a UNet, but with a new T dimension; resolution is halved in both space and time at the downsample stages.

      • Built on pretrained text-to-image diffusion model, using “inflation”. Start with pretrained spatial layer(s), add another 2D conv, then 1D temporal (so (2+1)D conv). Similar with attention at bottom layers.
    • Second stage is a superresolution network. Completely separate problem. 128 × 128 → 1024 × 1024

      • Base model is 80 x 128 x 128 (16fps)
      • At temporal window boundaries, don't make the windows disjoint; allow overlap. "Multidiffusion" here just means averaging the predictions in the overlapping regions, and the mean is what minimizes the MSE loss (sketch below).

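      • The overlap trick above reduces to averaging predictions wherever the temporal windows overlap (the mean is the MSE-minimizing merge). A toy sketch with invented window sizes:

        ```python
        import torch

        def merge_overlapping_windows(window_preds, starts, total_frames):
            """Average per-window SR predictions into one sequence.
            window_preds: list of (T_w, C, H, W) tensors; starts: start frame of
            each window. Overlapping frames are combined by a plain mean."""
            c, h, w = window_preds[0].shape[1:]
            acc = torch.zeros(total_frames, c, h, w)
            count = torch.zeros(total_frames, 1, 1, 1)
            for pred, s in zip(window_preds, starts):
                t_w = pred.shape[0]
                acc[s:s + t_w] += pred
                count[s:s + t_w] += 1
            return acc / count.clamp(min=1)

        # e.g. 16-frame SR windows with stride 12 over the 80-frame base output:
        # starts = [0, 12, 24, ...]; the 4-frame overlaps are averaged.
        ```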

    • Limited to <5s

    • Dataset: 30M text-video pairs

    • https://www.youtube.com/watch?v=Pl8BET_K1mc

  • WALT, Google 2023
    • 3D CNN autoencoder
      • Downsamples over space and time, so H × W × T × C → 32 × 32 × 4 × 4
      • First frame is encoded as image with no temporal downsampling, to support joint image-video representations
    • Diffusion transformer model on latent space
      • Space-time factorized attn for better efficiency
      • Zero terminal SNR (noise schedule rescaled so the final timestep is pure noise; sketch after this list)
      • A variant of the AdaLN conditioning block (AdaLN-LoRA)
      • Latent self-conditioning
      • 3B params
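      • A sketch of the zero-terminal-SNR rescaling mentioned above: adjust the beta schedule so the cumulative signal at the final timestep is exactly zero, i.e. the last training step is pure noise, matching what the sampler actually sees. This follows the commonly published recipe; WALT's actual schedule may differ.

        ```python
        import torch

        def rescale_zero_terminal_snr(betas):
            """Rescale a diffusion beta schedule so sqrt(alpha_bar_T) == 0."""
            alphas = 1.0 - betas
            alphas_bar = torch.cumprod(alphas, dim=0)
            sqrt_ab = alphas_bar.sqrt()

            sqrt_ab_0 = sqrt_ab[0].clone()
            sqrt_ab_T = sqrt_ab[-1].clone()
            sqrt_ab -= sqrt_ab_T                                  # shift last value to 0
            sqrt_ab *= sqrt_ab_0 / (sqrt_ab_0 - sqrt_ab_T)        # keep first value fixed

            alphas_bar = sqrt_ab ** 2
            alphas = torch.cat([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
            return 1.0 - alphas
        ```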
    • https://youtu.be/DsEDMjdxOv4?si=yJ1OtkDsBRPP_wVe&t=5855
  • VideoPoet, Google 2023
    • Autoregressive (LLM-style) transformer over discrete video tokens
    • MAGVIT-v2 tokenizer (LFQ; see note above)
  • Emu Video, Meta 2023
    • Factorized text-to-video: first generate an image, then a video conditioned on the image and text
  • ControlVideo, 2023
  • CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers (ICLR 2023)
    • https://github.com/THUDM/CogVideo?tab=readme-ov-file
  • Stable Video Diffusion, Stability 2023
  • Video LDM (”Align your latents”), 2023
  • Phenaki, Google 2022
  • Imagen video, Google 2022
  • Make-A-Video, Meta 2022
  • MAGVIT
  • Image evaluation metrics
  • Video evaluation metrics