ViViT (Video Vision Transformer)
Meta Chameleon
LLaVA-Interactive: visual prompting
LLaVA-Plus: tool use (learning to plug in and use skills)
Obsidian: 3B, MLLM vision, based on StableLM
TinyGPT-V: 2.7B, MLLM vision, based on Phi-2
It stands out by requiring only a 24 GB GPU for training and an 8 GB GPU or CPU for inference.
The authors train on a single 3090 GPU (24 GB).
Training time: Stage 1 ~8 hours, Stage 2 ~4 hours, Stage 3 ~20 minutes, Stage 4 ~8 hours or more.
Stage 4 is still experimental: although it scores well on benchmarks, it does not actually perform well, and the authors recommend training only through Stage 3.
BakLLaVA: improves on the original LLaVA implementation with better base models, a modified training process, custom datasets, and significant architecture changes.
LLaVA: 7B, MLLM vision
Vision encoder = CLIP ViT-L/14. LM = LLaMA. Stage 1: train only the projection W. Stage 2: unfreeze the LM and train ‘end-to-end’ (the vision encoder stays frozen throughout).
LLaVA utilizes the LLaMA model, renowned for its efficacy in open-source language-only instruction-tuning projects. For visual content processing, LLaVA relies on the pre-trained CLIP visual encoder ViT-L/14, which excels in the realm of visual comprehension. The encoder extracts visual features from input images and connects them to language embeddings through a trainable projection matrix. This projection effectively translates visual features into language embedding tokens, thereby bridging the gap between text and images.
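A minimal PyTorch sketch of that projection step (a sketch, not LLaVA's actual code; the dims, 1024 for CLIP ViT-L/14 features and 4096 for a 7B LLaMA's embeddings, and the random features are illustrative):

import torch
import torch.nn as nn

vision_dim, lm_dim = 1024, 4096                      # CLIP feature size -> LM embedding size
projection = nn.Linear(vision_dim, lm_dim)           # the trainable W from stage 1

# Stage 1: only W is trained; the CLIP encoder and the LM stay frozen.
# Stage 2: the LM is unfrozen too, but the CLIP encoder remains frozen.
visual_features = torch.randn(1, 256, vision_dim)    # stand-in for frozen CLIP patch features
visual_tokens = projection(visual_features)          # (1, 256, lm_dim) language-embedding tokens
# These tokens are concatenated with the text token embeddings and fed to the LM.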
LLaVA-1.5 achieves SoTA on 11 benchmarks with only simple modifications to the original LLaVA; it uses only public data, trains in ~1 day on a single 8×A100 node, and surpasses methods trained on billion-scale data.
MiniGPT-4: 7B, MLLM vision
BLIP-2 / Q-Former TODO
BLIP TODO
CLIP (Contrastive Language-Image Pre-Training) is trained contrastively on batches of N (image, text) pairs: each matching pair is a positive and every other image-text combination in the batch is a negative.
sim = I_e @ I_t.T * exp(t)   # (N, N) image-text embedding similarities, scaled by learned temperature e^t
labels = arange(N)           # matching pairs lie on the diagonal
loss = (crossent(sim, labels, axis=0) + crossent(sim, labels, axis=1)) / 2
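A runnable NumPy version of the same loss (a sketch; variable names follow the pseudocode above, and embeddings are assumed L2-normalized):

import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def clip_loss(I_e, I_t, t):
    # I_e, I_t: L2-normalized image/text embeddings, shape (N, d); t: learned log-temperature.
    N = I_e.shape[0]
    sim = I_e @ I_t.T * np.exp(t)              # (N, N) scaled similarities
    log_p_i2t = sim - logsumexp(sim, axis=1)   # per image, softmax over all texts
    log_p_t2i = sim - logsumexp(sim, axis=0)   # per text, softmax over all images
    diag = np.arange(N)
    loss_i = -log_p_i2t[diag, diag].mean()     # image -> text cross-entropy on the diagonal
    loss_t = -log_p_t2i[diag, diag].mean()     # text -> image cross-entropy on the diagonal
    return (loss_i + loss_t) / 2

With perfectly matched unit embeddings and t = 0, clip_loss(np.eye(4), np.eye(4), 0.0) ≈ 0.74, below the log N ≈ 1.39 of a random pairing; a larger t sharpens the softmax and drives the loss toward 0.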
KOSMOS-1: multimodal (image and text) (HN)