Concretely explain how the core technique works. What are they doing differently than the main approaches that came before?

CoCa introduces a fundamentally different approach to image-text foundation models by unifying three previously separate pretraining paradigms into a single model. Here's what they're doing differently:

Core Technical Innovation: Decoupled Decoder Architecture

CoCa (Contrastive Captioners) is a simple model family with a modified encoder-decoder architecture trained with both a contrastive loss and a captioning (generative) loss. The decoder transformer is decoupled into two parts: a unimodal decoder and a multimodal decoder.

The key innovation is splitting the text decoder into two distinct parts (see the sketch after this list):

  1. Unimodal decoder layers (bottom half): cross-attention is omitted in these layers, so they encode text-only representations
  2. Multimodal decoder layers (top half): these layers cross-attend to the image encoder's outputs to learn fused image-text representations
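
A minimal PyTorch sketch of this decoupling, under assumptions not in the text (the class name `DecoupledTextDecoder`, layer counts, and dimensions are illustrative, and standard `nn.Transformer*Layer` blocks stand in for the paper's transformer layers):

```python
import torch
import torch.nn as nn

class DecoupledTextDecoder(nn.Module):
    # Bottom half: causal self-attention only (no cross-attention) -> text-only
    # representations. Top half: cross-attends to image tokens -> multimodal features.
    def __init__(self, dim=512, n_heads=8, n_uni=6, n_multi=6):
        super().__init__()
        # Encoder layers plus a causal mask behave as self-attention-only decoder layers.
        self.uni = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            for _ in range(n_uni)])
        # Standard decoder layers whose cross-attention "memory" is the image tokens.
        self.multi = nn.ModuleList([
            nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
            for _ in range(n_multi)])

    def forward(self, text_tokens, image_tokens):
        T = text_tokens.size(1)
        causal = torch.triu(                  # (T, T) additive causal attention mask
            torch.full((T, T), float("-inf"), device=text_tokens.device), diagonal=1)
        h = text_tokens
        for layer in self.uni:                # unimodal half: text only
            h = layer(h, src_mask=causal)
        text_only = h                         # feeds the contrastive objective
        for layer in self.multi:              # multimodal half: fuse with image
            h = layer(h, image_tokens, tgt_mask=causal)
        return text_only, h                   # h feeds the captioning head
```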

How It Works Differently

Previous Approaches' Limitations:

  1. Single-encoder models (e.g., ResNet, ViT): these models rely on image annotations treated as labeled vectors (such as one-hot class labels), so they bake in no knowledge of free-form natural language
  2. Dual-encoder models (e.g., CLIP, ALIGN): these models lack a joint component that fuses image and text representations, so they are not directly applicable to joint vision-language understanding tasks such as visual question answering (VQA)
  3. Encoder-decoder models (e.g., SimVLM): these models do not produce text-only representations aligned with image embeddings, making them less feasible and efficient for crossmodal alignment tasks

CoCa's Unified Approach:

CoCa applies the contrastive objective between the outputs of the image encoder and the unimodal text decoder, and the captioning objective at the output of the multimodal decoder. Furthermore, CoCa is trained on both image annotation data and noisy image-text data by treating all labels simply as text.
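
A minimal sketch of how the two objectives might be combined, assuming pooled image/text embeddings and next-token logits have already been computed; the function name `coca_loss`, the temperature, and the loss weights here are placeholders, not values from the paper's code:

```python
import torch
import torch.nn.functional as F

def coca_loss(img_emb, txt_emb, caption_logits, caption_targets,
              temperature=0.07, lambda_con=1.0, lambda_cap=2.0):
    # Contrastive objective: symmetric InfoNCE between pooled image embeddings
    # and the unimodal text decoder's pooled outputs (matched pairs on the diagonal).
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(sim.size(0), device=sim.device)
    con = (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2
    # Captioning objective: autoregressive next-token cross-entropy on the
    # multimodal decoder's output logits, shape (B, T, vocab).
    cap = F.cross_entropy(caption_logits.flatten(0, 1), caption_targets.flatten())
    return lambda_con * con + lambda_cap * cap         # weighted sum of both losses
```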

The architecture processes inputs as follows (a toy forward pass follows the list):

  1. Image path: Image → Image Encoder → visual features
  2. Text path: Text → Unimodal Decoder (causal self-attention only) → text-only representations, used for the contrastive loss
  3. Fusion: Multimodal Decoder cross-attends from the text representations to the visual features → joint image-text representations, used for the captioning loss
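
To make the data flow concrete, here is a toy forward pass through the `DecoupledTextDecoder` sketch above (all shapes and stand-in tensors are assumptions for illustration):

```python
import torch

B, T, N, D = 8, 32, 196, 512           # batch, caption length, image tokens, width
image_tokens = torch.randn(B, N, D)    # stand-in for image-encoder outputs
text_tokens = torch.randn(B, T, D)     # stand-in for embedded caption tokens

decoder = DecoupledTextDecoder(dim=D)
text_only, multimodal = decoder(text_tokens, image_tokens)
# text_only  (B, T, D): pooled and contrasted against pooled image features
# multimodal (B, T, D): projected to vocabulary logits for the captioning loss
```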