Vocab

Open models

They're all transformers. Two camps: (1) fine-tune a big pretrained VLM and output actions as discrete text tokens, or (2) train a smaller transformer from scratch with a diffusion/flow action head. Sizes range from 27M to 55B parameters.


Key patterns: the VLM-init approach wins at scale (RT-2 found 5B-from-scratch "very poor"). Action output is the big fork — discrete tokens are simple (just next-token) but coarse; diffusion/flow handles multi-modal continuous distributions better. Training compute is modest by LLM standards — OpenVLA's 21.5k A100-hours is a small finetune, not a pretrain.

Diffusion for robotics

Diffusion policy (Chi et al. '23; also used by Octo)

Same math as image gen — literally. Just the thing being denoised is a tiny vector instead of a million pixels.

In image gen: start with N(0,I) noise the shape of an image, learn ε_θ(x_noisy, t, cond) to predict the noise, iteratively subtract it → clean image.
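The training side of that in one screenful (a minimal numpy sketch; the schedule values and shapes are illustrative, not from any particular paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# DDPM forward noising: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
# The network's training target is eps itself (MSE on predicted noise).
T = 100
betas = np.linspace(1e-4, 0.02, T)       # illustrative linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)     # \bar{alpha}_t

x0 = rng.normal(size=(32, 32))           # stand-in for a clean image
t = 50
eps = rng.normal(size=x0.shape)          # the noise the model must learn to predict
x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps

# Loss for a model eps_theta would be: mean((eps_theta(x_t, t, cond) - eps)**2).
# Sanity check: a perfect noise prediction recovers x0 exactly by inverting the mix.
x0_hat = (x_t - np.sqrt(1 - alphas_bar[t]) * eps) / np.sqrt(alphas_bar[t])
```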

In robotics: start with N(0,I) noise the shape of an action chunk — say [H, 7] for H future timesteps of 7-DOF actions. Same network architecture (usually a small transformer or UNet-1d), conditioned on the observation (image + instruction embedding) instead of a text prompt. Iteratively denoise → a clean action sequence. Send the first few actions to the robot, re-plan.
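Sampling, sketched for the action-chunk case. Everything here is a toy: `eps_theta` is an oracle that already knows the clean chunk (so the loop provably converges), where a real policy would be a trained network conditioned on the observation. The reverse update is the deterministic DDIM-style step rather than stochastic DDPM, purely to keep the sketch short:

```python
import numpy as np

rng = np.random.default_rng(0)

H, DOF = 16, 7                                 # action chunk: H future 7-DOF actions
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

a_clean = rng.uniform(-1, 1, size=(H, DOF))    # stand-in for the "true" action chunk

def eps_theta(x_t, t, obs=None):
    # Hypothetical denoiser. Oracle form for the demo; a real one is a small
    # transformer / 1-D UNet conditioned on obs (image + instruction embedding).
    return (x_t - np.sqrt(alphas_bar[t]) * a_clean) / np.sqrt(1 - alphas_bar[t])

x = rng.normal(size=(H, DOF))                  # start from N(0, I) noise, shape [H, 7]
for t in range(T - 1, -1, -1):
    eps_hat = eps_theta(x, t)
    # Predict the clean chunk, then re-noise to the previous (less noisy) level.
    x0_pred = (x - np.sqrt(1 - alphas_bar[t]) * eps_hat) / np.sqrt(alphas_bar[t])
    ab_prev = alphas_bar[t - 1] if t > 0 else 1.0
    x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1 - ab_prev) * eps_hat

# x is now the denoised action chunk; execute x[:k] on the robot, then re-plan.
```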

Why bother? Action distributions are multi-modal. A brick can be grasped from the left or right — two equally valid answers. MSE regression averages them to "grasp from the middle" which fails. A Gaussian policy (what alwin's path does) is unimodal too. Diffusion happily represents "50% left, 50% right" because it's modeling the full p(action | obs) distribution. Same reason it works for images: "a cat" has many valid pixels.
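The averaging failure is easy to see numerically. Toy setup (mine, not from any paper): demos are split between two grasp angles, and the MSE-optimal deterministic prediction is their mean, which is neither demonstrated mode:

```python
import numpy as np

rng = np.random.default_rng(0)

# Half the demos grasp from the left (-1), half from the right (+1).
demos = np.concatenate([np.full(500, -1.0), np.full(500, +1.0)])

mse_policy = demos.mean()             # MSE-optimal point estimate: "the middle"

# A generative policy models p(action | obs) and can commit to one mode per sample.
sampled = rng.choice(demos, size=10)  # stand-in for drawing from the learned distribution
```

`mse_policy` lands at 0.0 (the failing middle grasp), while every `sampled` action is a valid -1 or +1.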

Flow matching (π₀)

Instead of learning to predict noise at each step, learn a velocity field v_θ(x, t, cond) that transports a sample from noise to data along a straight-ish path. Sample by ODE-integrating dx/dt = v_θ(x, t) from t=0 (noise) to t=1 (action). Same conditioning setup, fewer sampling steps (π₀ uses ~10 vs DDPM's 50-100) → higher control frequency, which matters when the robot needs actions at 50-100Hz.
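The sampler is just an ODE solver. A minimal sketch with plain Euler steps: `v_theta` here is the ideal velocity for a linear (rectified) path x_t = (1-t)·noise + t·data, written as an oracle that knows the target chunk so the integration demonstrably lands on it; a real model is trained to regress this field from data:

```python
import numpy as np

rng = np.random.default_rng(0)

H, DOF = 16, 7
a_target = rng.uniform(-1, 1, size=(H, DOF))   # stand-in for the "true" action chunk

def v_theta(x, t, obs=None):
    # Hypothetical velocity field. For the linear path, the ideal velocity
    # given the current point is (data - x) / (1 - t): point straight at the data.
    return (a_target - x) / (1.0 - t)

steps = 10                                     # ~10 Euler steps vs DDPM's 50-100
dt = 1.0 / steps
x = rng.normal(size=(H, DOF))                  # t=0: pure noise
for k in range(steps):
    t = k * dt
    x = x + dt * v_theta(x, t)                 # Euler-integrate dx/dt = v_theta(x, t)
# t=1: x is the action chunk
```

Fewer function evaluations per sample is the whole point: 10 network calls instead of 50-100 is what buys the 50-100Hz control rate.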