However, we found that there is no open-source, high-quality spatial-temporal VAE. MAGVIT's 4x4x4 VAE is not open-sourced, and VideoGPT's 2x4x4 VAE showed low quality in our experiments. Thus, we decided to use a 2D VAE (from Stability-AI) in our first version.
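A rough sketch of how a 2D image VAE can be applied to video, folding the time axis into the batch axis and encoding one frame at a time; it assumes the diffusers AutoencoderKL interface, and the model id and tensor layout are illustrative:

```python
# Sketch: per-frame encoding of a video with a 2D image VAE by folding time into the batch dim.
# Assumes the diffusers AutoencoderKL interface; model id and layout are illustrative.
import torch
from diffusers import AutoencoderKL
from einops import rearrange

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema")

@torch.no_grad()
def encode_video(video):  # video: [B, C, T, H, W], pixel values in [-1, 1]
    frames = rearrange(video, "b c t h w -> (b t) c h w")
    latents = vae.encode(frames).latent_dist.sample() * vae.config.scaling_factor
    return rearrange(latents, "(b t) c h w -> b c t h w", b=video.shape[0])
```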
As shown in the figure, we insert a temporal attention layer right after each spatial attention layer in STDiT (ST stands for spatial-temporal). This is similar to variant 3 in Latte's paper.
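A minimal sketch of one such block: spatial self-attention over the patches within each frame, followed by temporal self-attention over frames at each spatial location. The module names and the use of nn.MultiheadAttention are illustrative, not the actual STDiT code:

```python
# Sketch of a spatial-then-temporal attention block (illustrative, not the actual STDiT code).
import torch.nn as nn
from einops import rearrange

class STBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, t, s):  # x: [B, T*S, D] — T frames, S spatial tokens per frame
        # Spatial attention: tokens attend within their own frame.
        xs = rearrange(x, "b (t s) d -> (b t) s d", t=t, s=s)
        n = self.norm_s(xs)
        x = x + rearrange(self.spatial_attn(n, n, n)[0], "(b t) s d -> b (t s) d", t=t)
        # Temporal attention: tokens attend across frames at the same spatial location.
        xt = rearrange(x, "b (t s) d -> (b s) t d", t=t, s=s)
        n = self.norm_t(xt)
        return x + rearrange(self.temporal_attn(n, n, n)[0], "(b s) t d -> b (t s) d", s=s)
```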
PixArt-α is an efficiently trained, high-quality image generation model with a T5-conditioned DiT structure. We initialize our model with PixArt-α and initialize the projection layer of each inserted temporal attention to zero.
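A sketch of that zero-initialization, assuming the STBlock layout above (out_proj is the output projection of PyTorch's nn.MultiheadAttention). With a zeroed projection, each temporal attention starts as a no-op, so the model initially behaves like the pretrained image model:

```python
# Sketch: zero-init the output projection of every inserted temporal attention so the temporal
# branch contributes nothing at the start of training (illustrative, follows the STBlock above).
import torch.nn as nn

def zero_init_temporal(model):
    for module in model.modules():
        if isinstance(module, STBlock):
            nn.init.zeros_(module.temporal_attn.out_proj.weight)
            if module.temporal_attn.out_proj.bias is not None:
                nn.init.zeros_(module.temporal_attn.out_proj.bias)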
Another big step up in quality and duration
Scaled up diffusion transformer (see Image generation)
Flattening into patches can handle different resolutions and durations
Sora is not an autoregressive model like GPT, but a diffusion transformer. From the technical report[1], it is clear that it predicts the entire sequence of spatiotemporal patches at once. [Like Lumiere]
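A sketch of what "a sequence of spacetime patches" can look like in code: the video latent is cut into small 3D patches and flattened into tokens, all of which the transformer denoises in one pass. Patch sizes and dimensions are illustrative:

```python
# Sketch: flatten a video latent into spacetime-patch tokens (sizes are illustrative).
import torch
import torch.nn as nn

class SpacetimePatchify(nn.Module):
    def __init__(self, in_ch=4, dim=1152, pt=2, ph=2, pw=2):
        super().__init__()
        # One 3D conv both cuts the latent into (pt, ph, pw) patches and embeds them.
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=(pt, ph, pw), stride=(pt, ph, pw))

    def forward(self, z):  # z: [B, C, T, H, W], any size divisible by the patch size
        x = self.proj(z)   # [B, D, T/pt, H/ph, W/pw]
        return x.flatten(2).transpose(1, 2)  # [B, N, D]: one token per spacetime patch
```

The token count N simply grows or shrinks with the input's resolution and duration, which is why the flattened patch representation handles different resolutions and timescales.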
Speculation: trained with random conditioning (masking) to support video prediction and infilling. [Like Lumiere, Phenaki]
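If that speculation is right, the training-time masking could look roughly like this sketch, where a random subset of frames is kept as clean conditioning and the rest are generated; the mode split and probabilities are illustrative:

```python
# Sketch: sample a per-video frame mask so one model learns unconditional generation,
# prediction (condition on leading frames), infilling (condition on both ends), and
# image-to-video (condition on a single frame). Modes and probabilities are illustrative.
import torch

def sample_frame_mask(batch, t, p_cond=0.5):
    """Return a [batch, t] bool mask; True means the frame is given as conditioning."""
    mask = torch.zeros(batch, t, dtype=torch.bool)
    for b in range(batch):
        if torch.rand(1).item() > p_cond:
            continue                                          # unconditional sample
        mode = torch.randint(0, 3, (1,)).item()
        if mode == 0:                                         # video prediction
            mask[b, : torch.randint(1, t, (1,)).item()] = True
        elif mode == 1:                                       # infilling
            k = torch.randint(1, max(2, t // 2), (1,)).item()
            mask[b, :k] = True
            mask[b, t - k :] = True
        else:                                                 # image-to-video
            mask[b, torch.randint(0, t, (1,)).item()] = True
    return mask
```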
Image to video
Extend video
Extend generated video (can also loop)
Uses recaptioning technique
Can be prompted with images
Big step up in quality
Space-time U-Net (STUNet) generates the entire temporal duration of the video at once
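A minimal sketch of the factorized space-time downsampling such a U-Net relies on to process the whole clip at once; the layer choices are illustrative, not Lumiere's exact blocks:

```python
# Sketch: a factorized space-time down-block that reduces H, W and T, so the U-Net bottleneck
# sees a heavily compressed view of the whole clip (illustrative, not Lumiere's code).
import torch.nn as nn

class SpaceTimeDown(nn.Module):
    def __init__(self, ch_in, ch_out):
        super().__init__()
        self.spatial = nn.Conv3d(ch_in, ch_out, kernel_size=(1, 3, 3),
                                 stride=(1, 2, 2), padding=(0, 1, 1))   # halve H and W
        self.temporal = nn.Conv3d(ch_out, ch_out, kernel_size=(3, 1, 1),
                                  stride=(2, 1, 1), padding=(1, 0, 0))  # halve T
        self.act = nn.SiLU()

    def forward(self, x):  # x: [B, C, T, H, W]
        return self.act(self.temporal(self.act(self.spatial(x))))
```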
The second stage is a super-resolution network, treated as a completely separate problem: 128 × 128 → 1024 × 1024
Limited to <5s
Dataset: 30M text-video pairs