References
Approaches
Improvement areas: scale, representation (latent spaces), data (filtering, recaptioning, finetuning on extremely high-quality data)
Possibly look into
Open-Sora
However, we found that there is no open-source high-quality spatial-temporal VAE model. MAGVIT's 4x4x4 VAE is not open-sourced, while VideoGPT's 2x4x4 VAE had low quality in our experiments. Thus, we decided to use a 2D VAE (from Stability AI) in our first version.
As shown in the figure, we insert a temporal attention layer right after each spatial attention layer in STDiT (ST stands for spatial-temporal). This is similar to variant 3 in the Latte paper.
PixArt-α is an efficiently trained, high-quality image generation model with a T5-conditioned DiT structure. We initialize our model with PixArt-α and zero-initialize the projection layer of each inserted temporal attention.
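A minimal PyTorch sketch of that block structure (an illustration, not Open-Sora's actual code): spatial attention mixes patches within each frame, the inserted temporal attention mixes frames at each spatial location, and its output projection is zero-initialized so the block initially behaves like the pretrained image model. Cross-attention to the T5 text embedding and the MLP are omitted.

```python
import torch
import torch.nn as nn

class STDiTBlock(nn.Module):
    """Sketch of a spatial-temporal DiT block: spatial attention over patches
    within each frame, then temporal attention over frames at each spatial
    location. The temporal attention's output projection is zero-initialized,
    so the newly inserted layer contributes nothing at the start of training
    and the pretrained image behaviour is preserved."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Zero-init: residual path passes through unchanged initially.
        nn.init.zeros_(self.temporal_attn.out_proj.weight)
        nn.init.zeros_(self.temporal_attn.out_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim)
        b, t, s, d = x.shape

        # Spatial attention: mix patches within each frame independently.
        xs = x.reshape(b * t, s, d)
        h = self.norm1(xs)
        xs = xs + self.spatial_attn(h, h, h, need_weights=False)[0]
        x = xs.reshape(b, t, s, d)

        # Temporal attention: mix frames at each spatial location independently.
        xt = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        h = self.norm2(xt)
        xt = xt + self.temporal_attn(h, h, h, need_weights=False)[0]
        return xt.reshape(b, s, t, d).permute(0, 2, 1, 3)

# Tiny smoke test: 2 videos, 8 frames, 16 patches, width 64.
block = STDiTBlock(dim=64, num_heads=4)
print(block(torch.randn(2, 8, 16, 64)).shape)  # torch.Size([2, 8, 16, 64])
```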
Sora, OpenAI 2024
Another big step up in quality, duration
Scaled up diffusion transformer (see Image generation)
Flattening videos into spacetime patches lets the model handle different resolutions and durations (see the patch sketch after this list)
Sora is not an autoregressive model like GPT, but a diffusion transformer. From the technical report[1], it is clear that it predicts the entire sequence of spatiotemporal patches at once. [Like Lumiere]
Speculation: trained with random conditioning (masking) to support video prediction and infilling. [Like Lumiere, Phenaki]
Image to video
Extend video
Extend generated video (can also loop)
Uses recaptioning technique
Can be prompted with images
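A minimal sketch of the spacetime-patch idea (Sora's implementation is not public; patch sizes below are illustrative assumptions): flattening a video latent into a sequence of spacetime patches means videos of different resolutions and durations simply become token sequences of different lengths for a single transformer.

```python
import torch

def to_spacetime_patches(latent: torch.Tensor, pt: int = 1, ph: int = 2, pw: int = 2):
    """Flatten a video latent of shape (C, T, H, W) into a sequence of
    spacetime patches of size (pt, ph, pw). The patch sizes here are
    illustrative assumptions, not Sora's actual values."""
    c, t, h, w = latent.shape
    assert t % pt == 0 and h % ph == 0 and w % pw == 0
    x = latent.reshape(c, t // pt, pt, h // ph, ph, w // pw, pw)
    # -> (num_t, num_h, num_w, pt, ph, pw, c), then flatten to (tokens, dim)
    x = x.permute(1, 3, 5, 2, 4, 6, 0)
    return x.reshape(-1, pt * ph * pw * c)

# Two clips of different size/duration become token sequences of different lengths.
short_small = to_spacetime_patches(torch.randn(4, 8, 32, 32))
long_large = to_spacetime_patches(torch.randn(4, 16, 64, 64))
print(short_small.shape, long_large.shape)  # (2048, 16) and (16384, 16)
```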
Lumiere, Google 2024
Big step up in quality
Space-time U-Net (STUNet) generates the entire temporal duration of the video at once (sketch of the idea after this list)
Second stage is a spatial super-resolution network, treated as a completely separate problem: 128 × 128 -> 1024 × 1024
Limited to <5s
Dataset: 30M text-video pairs
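A minimal sketch of the space-time U-Net idea (an assumption-level illustration, not Lumiere's actual architecture): unlike image-derived video models that only downsample spatially, the network downsamples and upsamples in time as well, processing the full clip duration in a single pass.

```python
import torch
import torch.nn as nn

class SpaceTimeDownUp(nn.Module):
    """Toy space-time down/upsampling block: stride-2 3D convolutions halve
    the duration and spatial resolution together, and a transposed 3D
    convolution restores them, so the whole clip is processed at once."""

    def __init__(self, ch: int = 32):
        super().__init__()
        self.down = nn.Conv3d(ch, ch * 2, kernel_size=3, stride=2, padding=1)
        self.mid = nn.Conv3d(ch * 2, ch * 2, kernel_size=3, padding=1)
        self.up = nn.ConvTranspose3d(ch * 2, ch, kernel_size=4, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        skip = x
        x = self.up(self.mid(self.down(x)))
        return x + skip  # U-Net-style skip connection

m = SpaceTimeDownUp()
print(m(torch.randn(1, 32, 16, 64, 64)).shape)  # torch.Size([1, 32, 16, 64, 64])
```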
WALT, Google 2023
Video Poet, Google 2023
Emu Video, Meta 2023
ControlVideo, 2023
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers, ICLR 2023
Stable Video Diffusion, Stability 2023
Initialize from SD2.1
Same first author as Video LDM work. Same architecture.
Add temporal layers, initialized from scratch
Similar approach to prior work like Emu Video or Lumiere, which also start from an image model, but those train only the temporal layers with the spatial params frozen, whereas SVD finetunes everything (see the sketch after this list)
Finetune on video data
Proprietary dataset, Large Video Dataset (LVD)
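A minimal sketch of the two finetuning strategies contrasted above (layer naming is an assumption): Emu Video / Lumiere-style training freezes the pretrained spatial weights and trains only the inserted temporal layers, while SVD-style finetuning updates every parameter on video data.

```python
import torch.nn as nn

class TinyVideoBlock(nn.Module):
    """Toy stand-in for one block of a video model: pretrained spatial layers
    plus newly inserted temporal layers (hypothetical names)."""
    def __init__(self, dim: int = 8):
        super().__init__()
        self.spatial_attn = nn.Linear(dim, dim)   # pretrained from the image model
        self.temporal_attn = nn.Linear(dim, dim)  # inserted, trained from scratch

def trainable_params(model: nn.Module, finetune_all: bool):
    """finetune_all=True: SVD-style, train every parameter on video data.
    finetune_all=False: Emu Video / Lumiere-style, freeze spatial weights and
    train only layers whose name marks them as temporal (an assumed convention)."""
    selected = []
    for name, p in model.named_parameters():
        train_it = finetune_all or "temporal" in name
        p.requires_grad = train_it
        if train_it:
            selected.append(p)
    return selected

block = TinyVideoBlock()
print(len(trainable_params(block, finetune_all=True)))   # 4 tensors (all weights and biases)
print(len(trainable_params(block, finetune_all=False)))  # 2 tensors (temporal layer only)
```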
Video LDM ("Align your latents"), 2023
Phenaki, Google 2022
Transformers, no diffusion
First, C-ViViT to encode videos into ‘video tokens’
Autoencoder
Trained on short videos, but can produce videos of arbitrary length by autoregressively conditioning on previously generated video tokens
First frame is encoded as spatial patches; subsequent frames as temporal patches
Spatial transformer layers allow mixing across spatial patches. Temporal transformer layers allow looking back over time.
Use decoder to produce output in pixel space again.
MaskGIT: text vectors concatenated with video token vectors → continued video token vectors
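A minimal sketch of MaskGIT-style parallel decoding over discrete video tokens (the token model, names, and schedule here are placeholders, not Phenaki's actual ones): start from an all-masked sequence, predict every masked position in parallel, commit the most confident predictions, and re-mask the rest for the next step.

```python
import torch

def maskgit_decode(predict_logits, text_emb, seq_len, mask_id, steps=8):
    """Iterative parallel decoding: predict all masked tokens at once, keep the
    most confident predictions, and repeat until the sequence is fully decoded.
    Uses a simple linear unmasking schedule for illustration."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        masked = tokens == mask_id
        logits = predict_logits(tokens, text_emb)      # (1, seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        conf = conf.masked_fill(~masked, -1.0)         # only consider still-masked slots
        target_known = max(1, int(seq_len * (step + 1) / steps))
        num_new = target_known - int((~masked).sum())
        if num_new > 0:
            idx = conf.topk(num_new, dim=-1).indices
            tokens = tokens.scatter(1, idx, pred.gather(1, idx))
    return tokens

# Toy usage with a random "model": 16 tokens, vocab of 32, mask id 32.
seq_len, vocab, mask_id = 16, 32, 32
fake_model = lambda tok, txt: torch.randn(tok.shape[0], tok.shape[1], vocab)
out = maskgit_decode(fake_model, text_emb=None, seq_len=seq_len, mask_id=mask_id)
print(out.shape, (out == mask_id).sum().item())  # torch.Size([1, 16]) 0
```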
Imagen Video, Google 2022
Make-A-Video, Meta 2022
MAGVIT
Image evaluation metrics
Video evaluation metrics