- Comparing: AIMv2 (Apple 2024), PaliGemma (Google 2024), Florence-2 (Microsoft 2024), and Chameleon (Meta 2024)
- Summary:
Architecture:
- AIMV2: Uses a vision encoder (trained with prefix attention masking) and a multimodal decoder in an autoregressive setup; much simpler than typical vision-language pre-training setups.
- PaliGemma: Combines a SigLIP vision encoder (400M params) with a Gemma-2B language model decoder, connected by a simple linear projection layer.
- Florence-2: Employs a DaViT vision encoder and standard transformer encoder-decoder architecture for multimodal processing.
Architectural Approach:
- Chameleon: Pure token-based early fusion. Everything becomes tokens that flow through one unified architecture
- AIMV2: Two-stage with vision encoder and multimodal decoder, autoregressive for both images and text
- PaliGemma: Combines pre-trained SigLIP vision encoder with Gemma LLM through simple linear projection
- Florence-2: Vision encoder (DaViT) with transformer encoder-decoder for multimodal processing
Training Objectives:
- Chameleon: Long unified training with text/image tokens together from the start
- AIMV2: Trained autoregressively to generate both image patches and text tokens, with unified loss function
- PaliGemma: Uses sequence-to-sequence training with text prompts as input and generated text as output
- Florence-2: Trained with multiple objectives spanning different granularities (image-level, region-level, and fine-grained visual-semantic alignment)
Training Data:
- Chameleon: ~9.2T tokens total across modalities
- AIMV2: Trained on ~12B examples from a mix of image-text pairs, with simple data filtering
- PaliGemma: Built on pre-trained SigLIP and Gemma models, fine-tuned with additional multimodal data
- Florence-2: Trained on custom FLD-5B dataset with 5.4B comprehensive annotations across 126M images
Key Differentiating Aspects:
- AIMV2 focuses on simplicity and unified autoregressive training
- PaliGemma leverages strong pre-trained components with minimal additional architecture
- Florence-2 emphasizes comprehensive multi-task training with carefully curated diverse annotations
Model Size:
- AIMV2: Ranges from 300M to 3B parameters
- PaliGemma: ~3B parameters total (SigLIP ~400M vision encoder + Gemma-2B language model, which itself has ~2.5B params)
- Florence-2: Two variants - 232M and 771M parameters
The main philosophical differences are:
- AIMV2 prioritizes architectural simplicity and unified training
- PaliGemma focuses on effective composition of pre-trained models
- Florence-2 emphasizes comprehensive multi-task capabilities through diverse training data
Architecture/Loss Details:
AIMV2:
- Uses prefix attention in the vision encoder, allowing bidirectional attention over the prefix ("input") tokens
- Single unified autoregressive loss for both image patches and text
- Generates both image patches and text tokens autoregressively with same decoder
- Simple linear projection connects vision encoder to decoder
PaliGemma:
- Linear projection layer maps SigLIP vision tokens into the Gemma embedding space (sketched after this list)
- Uses sequence-to-sequence training with standard cross-entropy loss
- Takes image + text prompt as input, generates text output
- Special location tokens for spatial tasks (1024 quantized coordinate bins)
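
A minimal sketch of that connection strategy, assuming SigLIP-So400M patch tokens (1152-d) projected into Gemma-2B's 2048-d embedding space; the class name, shapes, and defaults are illustrative, not the released code:

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Single linear layer mapping vision-encoder tokens into the LLM embedding space."""
    def __init__(self, vision_dim=1152, llm_dim=2048):  # assumed SigLIP-So400M / Gemma-2B widths
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_tokens):            # (B, N_img, vision_dim)
        return self.proj(vision_tokens)          # (B, N_img, llm_dim)

# The decoder then sees [projected image tokens | text prompt embeddings | target tokens].
B, n_img, n_txt = 2, 256, 16
image_tokens = torch.randn(B, n_img, 1152)       # stand-in for SigLIP patch embeddings
prompt_embeds = torch.randn(B, n_txt, 2048)      # stand-in for Gemma token embeddings
decoder_inputs = torch.cat([VisionToLLMProjector()(image_tokens), prompt_embeds], dim=1)
print(decoder_inputs.shape)                      # torch.Size([2, 272, 2048])
```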
Architectural Simplicity of AIMV2:
- Single unified autoregressive objective vs multiple specialized objectives
- No need for complex contrastive learning or specialized heads
- Same decoder handles both modalities
- Simpler training process: no need for very large batches or specialized inter-GPU communication, as contrastive objectives require (a loss sketch follows this list)
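
A sketch of what the single unified objective can look like, assuming L2 regression on the next image patch and cross-entropy on the next text token; the regression target, normalization, and equal weighting are assumptions, not the paper's exact recipe:

```python
import torch.nn.functional as F

def unified_autoregressive_loss(pred_patches, target_patches,
                                text_logits, target_text_ids,
                                image_weight=1.0, text_weight=1.0):
    """One objective over both modalities: regress the next image patch and
    predict the next text token with the same decoder (weights are assumed)."""
    image_loss = F.mse_loss(pred_patches, target_patches)        # (B, N_patches, patch_dim)
    text_loss = F.cross_entropy(text_logits.flatten(0, 1),       # (B*T, vocab)
                                target_text_ids.flatten())       # (B*T,)
    return image_weight * image_loss + text_weight * text_loss
```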
Tasks Each Can Handle:
AIMV2:
- Image/text generation
- Captioning
- Visual question answering
- Object detection
- Visual grounding
PaliGemma:
- Captioning
- Visual question answering
- Object detection
- Referring expression segmentation
- Visual grounding
- Video understanding
Florence-2:
- Image-level understanding (classification, captioning)
- Region-level tasks (detection, segmentation)
- Fine-grained visual-semantic alignment
- Visual grounding
Image Generation:
- Among AIMV2, PaliGemma, and Florence-2, only AIMV2 has a generative image pathway, since its decoder is trained to predict image patches (Chameleon, covered above, also generates images via discrete image tokens)
- PaliGemma and Florence-2 only generate text; they are focused on understanding/analysis rather than generation
Training Approach:
- AIMV2 trains vision encoder and decoder jointly from scratch with unified objective
- PaliGemma leverages pre-trained SigLIP and Gemma models
- Florence-2 uses comprehensive multi-task pre-training on custom dataset
Handling Segmentation:
AIMV2:
- Can generate segmentation masks as part of autoregressive sequence
PaliGemma:
- Uses 128 special mask tokens (<seg000> to <seg127>) produced by a VQ-VAE that encodes binary masks
- Predicts mask tokens as part of text sequence
Florence-2:
- Handles segmentation through polygon representation with coordinate sequences
- Outputs coordinates as a text sequence using special location tokens (see the coordinate-quantization sketch after this list)
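
Both spatial-token schemes reduce to quantizing normalized coordinates into a fixed number of bins and emitting them as ordinary text tokens. A rough sketch; the bin count, token naming, and y-before-x ordering are model-specific assumptions here:

```python
def coords_to_loc_tokens(points, num_bins=1024):
    """Quantize normalized (x, y) coordinates in [0, 1] into discrete location tokens.
    Bin count varies by model (e.g. 1024 <loc> tokens for PaliGemma, 1000 bins for
    Florence-2); token naming and coordinate ordering here are illustrative."""
    tokens = []
    for x, y in points:
        bx = min(int(x * num_bins), num_bins - 1)
        by = min(int(y * num_bins), num_bins - 1)
        tokens += [f"<loc{by:04d}>", f"<loc{bx:04d}>"]
    return tokens

# A box or polygon becomes plain text the decoder can emit alongside normal tokens:
print(" ".join(coords_to_loc_tokens([(0.12, 0.40), (0.83, 0.91)])))
# <loc0409> <loc0122> <loc0931> <loc0849>
```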
Key Difference Summary:
- AIMV2 is truly generative and trained from scratch with unified objective
- PaliGemma leverages strong pre-trained components effectively
- Florence-2 focuses on comprehensive understanding through diverse supervision
Key Differentiators:
Chameleon:
- True early fusion with a unified token space (see the interleaving sketch after this list)
- Most flexible for mixed modal generation
- Simpler architecture but harder to train
AIMV2:
- Truly generative for both modalities
- Single unified autoregressive objective
- More efficient training process
PaliGemma:
- Leverages strong pre-trained components
- Simple but effective connection strategy
- Most efficient to train
Florence-2:
- Focus on comprehensive annotations
- Broad task coverage through multi-task learning
- Strong zero-shot capabilities
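
A toy sketch of the early-fusion idea behind Chameleon: discrete codes from a VQ image tokenizer are shifted into a shared vocabulary and concatenated with text token ids, so one decoder-only transformer models the flat sequence. Vocabulary sizes and interleaving order below are illustrative, not the actual config:

```python
import torch

TEXT_VOCAB = 32000               # illustrative sizes, not Chameleon's real config
IMAGE_CODEBOOK = 8192            # discrete codes from a VQ image tokenizer
IMAGE_TOKEN_OFFSET = TEXT_VOCAB  # image codes live in a shifted range of one shared vocab

def early_fusion_sequence(text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
    """Text tokens and (offset) image tokens form one flat sequence that a single
    decoder-only transformer models autoregressively; any interleaving order works."""
    return torch.cat([text_ids, image_codes + IMAGE_TOKEN_OFFSET, text_ids])

seq = early_fusion_sequence(torch.randint(0, TEXT_VOCAB, (12,)),
                            torch.randint(0, IMAGE_CODEBOOK, (64,)))
print(seq.shape)                 # torch.Size([88]) -- one unified token stream
```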
- PaliGemma, 2024

- SigLIP vision model and the Gemma language model
- Decoder-only transformer (Gemma) with full bidirectional attention over the image tokens and text prefix, and causal attention over the generated suffix (mask sketch below)
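A minimal sketch of that attention pattern, assuming a boolean mask where True means "may attend"; the prefix/suffix sizes are illustrative:

```python
import torch

def prefix_lm_mask(n_prefix: int, n_suffix: int) -> torch.Tensor:
    """Boolean attention mask: full bidirectional attention over the prefix
    (image tokens + text prompt), causal attention over the generated suffix."""
    n = n_prefix + n_suffix
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal baseline
    mask[:n_prefix, :n_prefix] = True                       # prefix attends bidirectionally
    return mask

print(prefix_lm_mask(n_prefix=3, n_suffix=2).int())
```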

- Florence-2, Microsoft 2023
- Unified vision-task model: captioning, object detection, grounding, and segmentation
- Paper
- SAM2
- Segment Anything Model (SAM)

- ViViT
- Meta Chameleon, Meta 2024
- CogVLM, 2023
- Detection Transformer (DETR): End-to-End Object Detection with Transformers, Meta 2020
- LLaVA-Interactive: visual prompting
- LLaVA-Plus:
- Obsidian: 3B, MLLM vision, based on StableLM
- TinyGPT-V: 2.7B, MLLM vision, based on Phi-2
- BakLLaVA: builds on the original LLaVA implementation with better base models, a modified training process, custom datasets, and significant architecture changes.
- LLaVA: 7B, MLLM vision
- MiniGPT-4: 7B, MLLM vision
- SigLIP, Google, 2023
- BLIP-2 / Q-Former TODO
- BLIP TODO
- CLIP (Contrastive Language-Image Pre-Training) learns a joint image-text embedding space with a contrastive objective over batches of (image, text) pairs: each image's matching caption is the positive and every other caption in the batch acts as a negative (loss sketched below). It is not a per-pair binary classifier; that framing is closer to SigLIP's sigmoid loss.
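
A sketch of the symmetric contrastive (InfoNCE) loss this describes; SigLIP's key change is replacing this batch softmax with an independent sigmoid loss per (image, text) pair:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings: each image's positive is
    its own caption, every other caption in the batch is a negative (and vice versa).
    In the real model the temperature (logit scale) is learned; 0.07 is its init value."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (B, B) cosine-similarity matrix
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) +           # image -> matching text
            F.cross_entropy(logits.t(), targets)) / 2    # text -> matching image
```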