Vocab

Open models

They're all transformers. Two camps: (1) fine-tune a big pretrained VLM and output actions as discrete text tokens, or (2) train a smaller transformer from scratch with a diffusion/flow action head. Sizes range from 27M to 55B parameters.


Key patterns: the VLM-init approach wins at scale (RT-2 found 5B-from-scratch "very poor"). Action output is the big fork — discrete tokens are simple (just next-token) but coarse; diffusion/flow handles multi-modal continuous distributions better. Training compute is modest by LLM standards — OpenVLA's 21.5k A100-hours is a small finetune, not a pretrain.

Diffusion for robotics

Diffusion policy (Chi et al. '23; also used by Octo)

Same math as image gen — literally. Just the thing being denoised is a tiny vector instead of a million pixels.

In image gen: start with N(0,I) noise the shape of an image, learn ε_θ(x_noisy, t, cond) to predict the noise, iteratively subtract it → clean image.
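The training side of that in one screenful (a minimal numpy sketch; the schedule values and shapes are illustrative, not from any particular paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# DDPM forward noising: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
# The network's training target is eps itself (MSE on predicted noise).
T = 100
betas = np.linspace(1e-4, 0.02, T)       # illustrative linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)     # \bar{alpha}_t

x0 = rng.normal(size=(32, 32))           # stand-in for a clean image
t = 50
eps = rng.normal(size=x0.shape)          # the noise the model must learn to predict
x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps

# Loss for a model eps_theta would be: mean((eps_theta(x_t, t, cond) - eps)**2).
# Sanity check: a perfect noise prediction recovers x0 exactly by inverting the mix.
x0_hat = (x_t - np.sqrt(1 - alphas_bar[t]) * eps) / np.sqrt(alphas_bar[t])
```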

In robotics: start with N(0,I) noise the shape of an action chunk — say [H, 7] for H future timesteps of 7-DOF actions. Same network architecture (usually a small transformer or UNet-1d), conditioned on the observation (image + instruction embedding) instead of a text prompt. Iteratively denoise → a clean action sequence. Send the first few actions to the robot, re-plan.
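Sampling, sketched for the action-chunk case. Everything here is a toy: `eps_theta` is an oracle that already knows the clean chunk (so the loop provably converges), where a real policy would be a trained network conditioned on the observation. The reverse update is the deterministic DDIM-style step rather than stochastic DDPM, purely to keep the sketch short:

```python
import numpy as np

rng = np.random.default_rng(0)

H, DOF = 16, 7                                 # action chunk: H future 7-DOF actions
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

a_clean = rng.uniform(-1, 1, size=(H, DOF))    # stand-in for the "true" action chunk

def eps_theta(x_t, t, obs=None):
    # Hypothetical denoiser. Oracle form for the demo; a real one is a small
    # transformer / 1-D UNet conditioned on obs (image + instruction embedding).
    return (x_t - np.sqrt(alphas_bar[t]) * a_clean) / np.sqrt(1 - alphas_bar[t])

x = rng.normal(size=(H, DOF))                  # start from N(0, I) noise, shape [H, 7]
for t in range(T - 1, -1, -1):
    eps_hat = eps_theta(x, t)
    # Predict the clean chunk, then re-noise to the previous (less noisy) level.
    x0_pred = (x - np.sqrt(1 - alphas_bar[t]) * eps_hat) / np.sqrt(alphas_bar[t])
    ab_prev = alphas_bar[t - 1] if t > 0 else 1.0
    x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1 - ab_prev) * eps_hat

# x is now the denoised action chunk; execute x[:k] on the robot, then re-plan.
```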

Why bother? Action distributions are multi-modal. A brick can be grasped from the left or right — two equally valid answers. MSE regression averages them to "grasp from the middle" which fails. A Gaussian policy (what alwin's path does) is unimodal too. Diffusion happily represents "50% left, 50% right" because it's modeling the full p(action | obs) distribution. Same reason it works for images: "a cat" has many valid pixels.
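The averaging failure is easy to see numerically. Toy setup (mine, not from any paper): demos are split between two grasp angles, and the MSE-optimal deterministic prediction is their mean, which is neither demonstrated mode:

```python
import numpy as np

rng = np.random.default_rng(0)

# Half the demos grasp from the left (-1), half from the right (+1).
demos = np.concatenate([np.full(500, -1.0), np.full(500, +1.0)])

mse_policy = demos.mean()             # MSE-optimal point estimate: "the middle"

# A generative policy models p(action | obs) and can commit to one mode per sample.
sampled = rng.choice(demos, size=10)  # stand-in for drawing from the learned distribution
```

`mse_policy` lands at 0.0 (the failing middle grasp), while every `sampled` action is a valid -1 or +1.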

Flow matching (π₀)

Instead of learning to predict noise at each step, learn a velocity field v_θ(x, t, cond) that transports a sample from noise to data along a straight-ish path. Sample by ODE-integrating dx/dt = v_θ(x, t) from t=0 (noise) to t=1 (action). Same conditioning setup, fewer sampling steps (π₀ uses ~10 vs DDPM's 50-100) → higher control frequency, which matters when the robot needs actions at 50-100Hz.
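The sampler is just an ODE solver. A minimal sketch with plain Euler steps: `v_theta` here is the ideal velocity for a linear (rectified) path x_t = (1-t)·noise + t·data, written as an oracle that knows the target chunk so the integration demonstrably lands on it; a real model is trained to regress this field from data:

```python
import numpy as np

rng = np.random.default_rng(0)

H, DOF = 16, 7
a_target = rng.uniform(-1, 1, size=(H, DOF))   # stand-in for the "true" action chunk

def v_theta(x, t, obs=None):
    # Hypothetical velocity field. For the linear path, the ideal velocity
    # given the current point is (data - x) / (1 - t): point straight at the data.
    return (a_target - x) / (1.0 - t)

steps = 10                                     # ~10 Euler steps vs DDPM's 50-100
dt = 1.0 / steps
x = rng.normal(size=(H, DOF))                  # t=0: pure noise
for k in range(steps):
    t = k * dt
    x = x + dt * v_theta(x, t)                 # Euler-integrate dx/dt = v_theta(x, t)
# t=1: x is the action chunk
```

Fewer function evaluations per sample is the whole point: 10 network calls instead of 50-100 is what buys the 50-100Hz control rate.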