- Comparing: AIMv2 (Apple 2024), PaliGemma (Google 2024), Florence-2 (Microsoft 2024), and Chameleon (Meta 2024)
- Summary:
Architecture:
- AIMV2: Uses a vision encoder (trained with prefix attention masking) and a multimodal decoder in an autoregressive setup; much simpler than typical vision-language pre-training setups.
- PaliGemma: Combines a SigLIP vision encoder (400M params) with a Gemma-2B language model decoder, connected by a simple linear projection layer.
- Florence-2: Employs a DaViT vision encoder and standard transformer encoder-decoder architecture for multimodal processing.
Architectural Approach:
- Chameleon: Pure token-based early fusion. Everything becomes tokens that flow through one unified architecture
- AIMV2: Two-stage with vision encoder and multimodal decoder, autoregressive for both images and text
- PaliGemma: Combines pre-trained SigLIP vision encoder with Gemma LLM through simple linear projection
- Florence-2: Vision encoder (DaViT) with transformer encoder-decoder for multimodal processing
Training Objectives:
- Chameleon: Long unified training with text/image tokens together from the start
- AIMV2: Trained autoregressively to generate both image patches and text tokens, with unified loss function
- PaliGemma: Uses sequence-to-sequence training with text prompts as input and generated text as output
- Florence-2: Trained with multiple objectives spanning different granularities (image-level, region-level, and fine-grained visual-semantic alignment)
Training Data:
- Chameleon: ~9.2T tokens total across modalities
- AIMV2: Trained on ~12B examples from a mix of image-text pairs, with simple data filtering
- PaliGemma: Built on pre-trained SigLIP and Gemma models, fine-tuned with additional multimodal data
- Florence-2: Trained on custom FLD-5B dataset with 5.4B comprehensive annotations across 126M images
Key Differentiating Aspects:
- AIMV2 focuses on simplicity and unified autoregressive training
- PaliGemma leverages strong pre-trained components with minimal additional architecture
- Florence-2 emphasizes comprehensive multi-task training with carefully curated diverse annotations
Model Size:
- AIMV2: Ranges from 300M to 3B parameters
- PaliGemma: ~3B parameters total (SigLIP ~400M vision encoder + Gemma-2B language model, which itself has ~2.5B params)
- Florence-2: Two variants - 232M and 771M parameters
The main philosophical differences are:
- AIMV2 prioritizes architectural simplicity and unified training
- PaliGemma focuses on effective composition of pre-trained models
- Florence-2 emphasizes comprehensive multi-task capabilities through diverse training data
Architecture/Loss Details:
AIMV2:
- Uses prefix attention in the vision encoder, allowing bidirectional attention over the prefix ("input") tokens
- Single unified autoregressive loss for both image patches and text
- Generates both image patches and text tokens autoregressively with same decoder
- Simple linear projection connects vision encoder to decoder
PaliGemma:
- Linear projection layer maps SigLIP vision tokens into the Gemma embedding space (sketched after this list)
- Uses sequence-to-sequence training with standard cross-entropy loss
- Takes image + text prompt as input, generates text output
- Special location tokens for spatial tasks (1024 quantized coordinate bins)
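
A minimal sketch of that connection strategy, assuming SigLIP-So400M patch tokens (1152-d) projected into Gemma-2B's 2048-d embedding space; the class name, shapes, and defaults are illustrative, not the released code:

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Single linear layer mapping vision-encoder tokens into the LLM embedding space."""
    def __init__(self, vision_dim=1152, llm_dim=2048):  # assumed SigLIP-So400M / Gemma-2B widths
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_tokens):            # (B, N_img, vision_dim)
        return self.proj(vision_tokens)          # (B, N_img, llm_dim)

# The decoder then sees [projected image tokens | text prompt embeddings | target tokens].
B, n_img, n_txt = 2, 256, 16
image_tokens = torch.randn(B, n_img, 1152)       # stand-in for SigLIP patch embeddings
prompt_embeds = torch.randn(B, n_txt, 2048)      # stand-in for Gemma token embeddings
decoder_inputs = torch.cat([VisionToLLMProjector()(image_tokens), prompt_embeds], dim=1)
print(decoder_inputs.shape)                      # torch.Size([2, 272, 2048])
```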
Architectural Simplicity of AIMV2:
- Single unified autoregressive objective vs multiple specialized objectives
- No need for complex contrastive learning or specialized heads
- Same decoder handles both modalities
- Simpler training process: no need for very large batches or specialized inter-GPU communication, as contrastive objectives require (a loss sketch follows this list)
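
A sketch of what the single unified objective can look like, assuming L2 regression on the next image patch and cross-entropy on the next text token; the regression target, normalization, and equal weighting are assumptions, not the paper's exact recipe:

```python
import torch.nn.functional as F

def unified_autoregressive_loss(pred_patches, target_patches,
                                text_logits, target_text_ids,
                                image_weight=1.0, text_weight=1.0):
    """One objective over both modalities: regress the next image patch and
    predict the next text token with the same decoder (weights are assumed)."""
    image_loss = F.mse_loss(pred_patches, target_patches)        # (B, N_patches, patch_dim)
    text_loss = F.cross_entropy(text_logits.flatten(0, 1),       # (B*T, vocab)
                                target_text_ids.flatten())       # (B*T,)
    return image_weight * image_loss + text_weight * text_loss
```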
Tasks Each Can Handle:
AIMV2:
- Image/text generation
- Captioning
- Visual question answering
- Object detection
- Visual grounding
PaliGemma:
- Captioning
- Visual question answering
- Object detection
- Referring expression segmentation
- Visual grounding
- Video understanding
Florence-2:
- Image-level understanding (classification, captioning)
- Region-level tasks (detection, segmentation)
- Fine-grained visual-semantic alignment
- Visual grounding
Image Generation:
- Among AIMV2, PaliGemma, and Florence-2, only AIMV2 has a generative image pathway, since its decoder is trained to predict image patches (Chameleon, covered above, also generates images via discrete image tokens)
- PaliGemma and Florence-2 only generate text; they are focused on understanding/analysis rather than generation
Training Approach:
- AIMV2 trains vision encoder and decoder jointly from scratch with unified objective
- PaliGemma leverages pre-trained SigLIP and Gemma models
- Florence-2 uses comprehensive multi-task pre-training on custom dataset
Handling Segmentation:
AIMV2:
- Can generate segmentation masks as part of autoregressive sequence
PaliGemma:
- Uses 128 special mask tokens (<seg000> to <seg127>) produced by a VQ-VAE that encodes binary masks
- Predicts mask tokens as part of text sequence
Florence-2:
- Handles segmentation through polygon representation with coordinate sequences
- Outputs coordinates as a text sequence using special location tokens (see the coordinate-quantization sketch after this list)
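
Both spatial-token schemes reduce to quantizing normalized coordinates into a fixed number of bins and emitting them as ordinary text tokens. A rough sketch; the bin count, token naming, and y-before-x ordering are model-specific assumptions here:

```python
def coords_to_loc_tokens(points, num_bins=1024):
    """Quantize normalized (x, y) coordinates in [0, 1] into discrete location tokens.
    Bin count varies by model (e.g. 1024 <loc> tokens for PaliGemma, 1000 bins for
    Florence-2); token naming and coordinate ordering here are illustrative."""
    tokens = []
    for x, y in points:
        bx = min(int(x * num_bins), num_bins - 1)
        by = min(int(y * num_bins), num_bins - 1)
        tokens += [f"<loc{by:04d}>", f"<loc{bx:04d}>"]
    return tokens

# A box or polygon becomes plain text the decoder can emit alongside normal tokens:
print(" ".join(coords_to_loc_tokens([(0.12, 0.40), (0.83, 0.91)])))
# <loc0409> <loc0122> <loc0931> <loc0849>
```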
Key Difference Summary:
- AIMV2 is truly generative and trained from scratch with unified objective
- PaliGemma leverages strong pre-trained components effectively
- Florence-2 focuses on comprehensive understanding through diverse supervision
Key Differentiators:
Chameleon:
- True early fusion with a unified token space (see the interleaving sketch after this list)
- Most flexible for mixed modal generation
- Simpler architecture but harder to train
AIMV2:
- Truly generative for both modalities
- Single unified autoregressive objective
- More efficient training process
PaliGemma:
- Leverages strong pre-trained components
- Simple but effective connection strategy
- Most efficient to train
Florence-2:
- Focus on comprehensive annotations
- Broad task coverage through multi-task learning
- Strong zero-shot capabilities
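
A toy sketch of the early-fusion idea behind Chameleon: discrete codes from a VQ image tokenizer are shifted into a shared vocabulary and concatenated with text token ids, so one decoder-only transformer models the flat sequence. Vocabulary sizes and interleaving order below are illustrative, not the actual config:

```python
import torch

TEXT_VOCAB = 32000               # illustrative sizes, not Chameleon's real config
IMAGE_CODEBOOK = 8192            # discrete codes from a VQ image tokenizer
IMAGE_TOKEN_OFFSET = TEXT_VOCAB  # image codes live in a shifted range of one shared vocab

def early_fusion_sequence(text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
    """Text tokens and (offset) image tokens form one flat sequence that a single
    decoder-only transformer models autoregressively; any interleaving order works."""
    return torch.cat([text_ids, image_codes + IMAGE_TOKEN_OFFSET, text_ids])

seq = early_fusion_sequence(torch.randint(0, TEXT_VOCAB, (12,)),
                            torch.randint(0, IMAGE_CODEBOOK, (64,)))
print(seq.shape)                 # torch.Size([88]) -- one unified token stream
```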
- PaliGemma, 2024

- SigLIP vision model and the Gemma language model
- Decoder-only transformer (Gemma) with full bidirectional attention over the image tokens and text prefix, and causal attention over the generated suffix (mask sketch below)
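A minimal sketch of that attention pattern, assuming a boolean mask where True means "may attend"; the prefix/suffix sizes are illustrative:

```python
import torch

def prefix_lm_mask(n_prefix: int, n_suffix: int) -> torch.Tensor:
    """Boolean attention mask: full bidirectional attention over the prefix
    (image tokens + text prompt), causal attention over the generated suffix."""
    n = n_prefix + n_suffix
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal baseline
    mask[:n_prefix, :n_prefix] = True                       # prefix attends bidirectionally
    return mask

print(prefix_lm_mask(n_prefix=3, n_suffix=2).int())
```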

- Florence-2, Microsoft 2023
- Unified vision-task model: captioning, object detection, grounding, and segmentation
- Paper
- SAM2
- Segment Anything Model (SAM)

- ViViT
- Meta Chameleon, Meta 2024
- CogVLM, 2023
- Detection Transformer (DETR): End-to-End Object Detection with Transformers, Meta 2020
- LLaVA-Interactive: visual prompting
- LLaVA-Plus:
- Obsidian: 3B, MLLM vision, based on StableLM
- TinyGPT-V: 2.7B, MLLM vision, based on Phi-2
- BakLLaVA: builds on the original LLaVA implementation with better base models, a modified training process, custom datasets, and significant architecture changes.
- LLaVA: 7B, MLLM vision
- MiniGPT-4: 7B, MLLM vision
- SigLIP, Google, 2023
- BLIP-2 / Q-Former TODO
- BLIP TODO
- CLIP (Contrastive Language-Image Pre-Training) learns a joint image-text embedding space with a contrastive objective over batches of (image, text) pairs: each image's matching caption is the positive and every other caption in the batch acts as a negative (loss sketched below). It is not a per-pair binary classifier; that framing is closer to SigLIP's sigmoid loss.
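
A sketch of the symmetric contrastive (InfoNCE) loss this describes; SigLIP's key change is replacing this batch softmax with an independent sigmoid loss per (image, text) pair:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings: each image's positive is
    its own caption, every other caption in the batch is a negative (and vice versa).
    In the real model the temperature (logit scale) is learned; 0.07 is its init value."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (B, B) cosine-similarity matrix
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) +           # image -> matching text
            F.cross_entropy(logits.t(), targets)) / 2    # text -> matching image
```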