However, we found that there is no open-source, high-quality spatial-temporal VAE. MAGVIT's 4x4x4 VAE is not open-sourced, and VideoGPT's 2x4x4 VAE showed low quality in our experiments. Thus, we decided to use a 2D VAE (from Stability-AI) in our first version.
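A rough sketch of how a 2D image VAE can be applied to video, folding the time axis into the batch axis and encoding one frame at a time; it assumes the diffusers AutoencoderKL interface, and the model id and tensor layout are illustrative:

```python
# Sketch: per-frame encoding of a video with a 2D image VAE by folding time into the batch dim.
# Assumes the diffusers AutoencoderKL interface; model id and layout are illustrative.
import torch
from diffusers import AutoencoderKL
from einops import rearrange

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema")

@torch.no_grad()
def encode_video(video):  # video: [B, C, T, H, W], pixel values in [-1, 1]
    frames = rearrange(video, "b c t h w -> (b t) c h w")
    latents = vae.encode(frames).latent_dist.sample() * vae.config.scaling_factor
    return rearrange(latents, "(b t) c h w -> b c t h w", b=video.shape[0])
```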
As shown in the figure, we insert a temporal attention layer right after each spatial attention layer in STDiT (ST stands for spatial-temporal). This is similar to variant 3 in Latte's paper.
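A minimal sketch of one such block: spatial self-attention over the patches within each frame, followed by temporal self-attention over frames at each spatial location. The module names and the use of nn.MultiheadAttention are illustrative, not the actual STDiT code:

```python
# Sketch of a spatial-then-temporal attention block (illustrative, not the actual STDiT code).
import torch.nn as nn
from einops import rearrange

class STBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, t, s):  # x: [B, T*S, D] — T frames, S spatial tokens per frame
        # Spatial attention: tokens attend within their own frame.
        xs = rearrange(x, "b (t s) d -> (b t) s d", t=t, s=s)
        n = self.norm_s(xs)
        x = x + rearrange(self.spatial_attn(n, n, n)[0], "(b t) s d -> b (t s) d", t=t)
        # Temporal attention: tokens attend across frames at the same spatial location.
        xt = rearrange(x, "b (t s) d -> (b s) t d", t=t, s=s)
        n = self.norm_t(xt)
        return x + rearrange(self.temporal_attn(n, n, n)[0], "(b s) t d -> b (t s) d", s=s)
```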
PixArt-α is an efficiently trained, high-quality image generation model with a T5-conditioned DiT structure. We initialize our model with PixArt-α and initialize the projection layer of each inserted temporal attention to zero.
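A sketch of that zero-initialization, assuming the STBlock layout above (out_proj is the output projection of PyTorch's nn.MultiheadAttention). With a zeroed projection, each temporal attention starts as a no-op, so the model initially behaves like the pretrained image model:

```python
# Sketch: zero-init the output projection of every inserted temporal attention so the temporal
# branch contributes nothing at the start of training (illustrative, follows the STBlock above).
import torch.nn as nn

def zero_init_temporal(model):
    for module in model.modules():
        if isinstance(module, STBlock):
            nn.init.zeros_(module.temporal_attn.out_proj.weight)
            if module.temporal_attn.out_proj.bias is not None:
                nn.init.zeros_(module.temporal_attn.out_proj.bias)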
Another big step up in quality and duration
Scaled up diffusion transformer (see Image generation)
Flattening into patches can handle different resolutions and durations
Sora is not an autoregressive model like GPT, but a diffusion transformer. From the technical report[1], it is clear that it predicts the entire sequence of spatiotemporal patches at once. [Like Lumiere]
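A sketch of what "a sequence of spacetime patches" can look like in code: the video latent is cut into small 3D patches and flattened into tokens, all of which the transformer denoises in one pass. Patch sizes and dimensions are illustrative:

```python
# Sketch: flatten a video latent into spacetime-patch tokens (sizes are illustrative).
import torch
import torch.nn as nn

class SpacetimePatchify(nn.Module):
    def __init__(self, in_ch=4, dim=1152, pt=2, ph=2, pw=2):
        super().__init__()
        # One 3D conv both cuts the latent into (pt, ph, pw) patches and embeds them.
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=(pt, ph, pw), stride=(pt, ph, pw))

    def forward(self, z):  # z: [B, C, T, H, W], any size divisible by the patch size
        x = self.proj(z)   # [B, D, T/pt, H/ph, W/pw]
        return x.flatten(2).transpose(1, 2)  # [B, N, D]: one token per spacetime patch
```

The token count N simply grows or shrinks with the input's resolution and duration, which is why the flattened patch representation handles different resolutions and timescales.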
Speculation: trained with random conditioning (masking) to support video prediction and infilling. [Like Lumiere, Phenaki]
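If that speculation is right, the training-time masking could look roughly like this sketch, where a random subset of frames is kept as clean conditioning and the rest are generated; the mode split and probabilities are illustrative:

```python
# Sketch: sample a per-video frame mask so one model learns unconditional generation,
# prediction (condition on leading frames), infilling (condition on both ends), and
# image-to-video (condition on a single frame). Modes and probabilities are illustrative.
import torch

def sample_frame_mask(batch, t, p_cond=0.5):
    """Return a [batch, t] bool mask; True means the frame is given as conditioning."""
    mask = torch.zeros(batch, t, dtype=torch.bool)
    for b in range(batch):
        if torch.rand(1).item() > p_cond:
            continue                                          # unconditional sample
        mode = torch.randint(0, 3, (1,)).item()
        if mode == 0:                                         # video prediction
            mask[b, : torch.randint(1, t, (1,)).item()] = True
        elif mode == 1:                                       # infilling
            k = torch.randint(1, max(2, t // 2), (1,)).item()
            mask[b, :k] = True
            mask[b, t - k :] = True
        else:                                                 # image-to-video
            mask[b, torch.randint(0, t, (1,)).item()] = True
    return mask
```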
Image to video
Extend video
Extend generated video (can also loop)
Uses recaptioning technique
Can be prompted with images
Big step up in quality
Space-time U-Net (STUNet) generates the entire temporal duration of the video at once
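A minimal sketch of the factorized space-time downsampling such a U-Net relies on to process the whole clip at once; the layer choices are illustrative, not Lumiere's exact blocks:

```python
# Sketch: a factorized space-time down-block that reduces H, W and T, so the U-Net bottleneck
# sees a heavily compressed view of the whole clip (illustrative, not Lumiere's code).
import torch.nn as nn

class SpaceTimeDown(nn.Module):
    def __init__(self, ch_in, ch_out):
        super().__init__()
        self.spatial = nn.Conv3d(ch_in, ch_out, kernel_size=(1, 3, 3),
                                 stride=(1, 2, 2), padding=(0, 1, 1))   # halve H and W
        self.temporal = nn.Conv3d(ch_out, ch_out, kernel_size=(3, 1, 1),
                                  stride=(2, 1, 1), padding=(1, 0, 0))  # halve T
        self.act = nn.SiLU()

    def forward(self, x):  # x: [B, C, T, H, W]
        return self.act(self.temporal(self.act(self.spatial(x))))
```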
The second stage is a super-resolution network, treated as a completely separate problem: 128 × 128 → 1024 × 1024
Limited to <5s
Dataset: 30M text-video pairs