The core technique presented in this paper is Chain-of-Action-Thought (COAT) reasoning, which enhances LLMs' ability to perform autoregressive search through self-reflection and self-exploration during problem-solving.
COAT introduces special meta-action tokens that guide the model's reasoning process:

- <|continue|>: continue building on the current line of reasoning
- <|reflect|>: pause to verify the correctness of prior steps
- <|explore|>: identify flaws and explore an alternative solution

The key innovation is training the model through a two-stage process: a small-scale format-tuning stage that teaches the COAT reasoning format, followed by large-scale self-improvement via reinforcement learning.
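To make the token mechanics concrete, here is a small illustrative sketch of how a COAT trajectory could be segmented by its meta-action tokens. The token strings match the paper; the `split_coat_trajectory` helper and the sample trajectory are hypothetical, not from the Satori codebase.

```python
# Hypothetical helper: segment a COAT trajectory into (meta-action, step)
# pairs. Only the three token strings come from the paper.
import re

META_ACTIONS = ("<|continue|>", "<|reflect|>", "<|explore|>")

def split_coat_trajectory(text: str):
    """Return (action, segment) pairs for each meta-action span."""
    pattern = "(" + "|".join(re.escape(t) for t in META_ACTIONS) + ")"
    parts = re.split(pattern, text)  # capture group keeps the delimiters
    # parts looks like ['', token, segment, token, segment, ...]
    return [(parts[i], parts[i + 1].strip())
            for i in range(1, len(parts) - 1, 2)]

trajectory = (
    "<|continue|>Set up 9/s + t = 4."
    "<|reflect|>Check units: t is in hours."
    "<|explore|>Try eliminating t by subtracting the equations."
)
for action, step in split_coat_trajectory(trajectory):
    print(action, step)
```

Splitting with a capturing group keeps each delimiter token in the output, so every reasoning segment stays paired with the meta-action that produced it.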
Here's a mathematical reasoning example showing COAT in action:
Problem: Every morning Aya goes for a 9-kilometer walk and then spends t hours in a coffee shop. At speed s km/h the outing takes 4 hours in total; at speed s + 2 km/h it takes 2.4 hours, including the same coffee stop. Find the total time (including the stop) when she walks at s + 0.5 km/h.
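Before looking at the model's trace, the algebra the problem requires can be sanity-checked with a short script (illustrative only, not part of Satori's pipeline):

```python
# Sketch: checking the arithmetic behind the walk problem.
# With t = coffee-shop time in hours, the two trips give:
#   9/s       + t = 4
#   9/(s + 2) + t = 2.4
# Subtracting eliminates t and rearranges to 1.6*s**2 + 3.2*s - 18 = 0.
import math

a, b, c = 1.6, 3.2, -18.0
s = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)  # positive root: s ≈ 2.5 km/h
t = 4 - 9 / s                                      # coffee time ≈ 0.4 h = 24 min

total_hours = 9 / (s + 0.5) + t                    # ≈ 3.4 h = 204 minutes
print(round(s, 3), round(t * 60, 3), round(total_hours * 60, 3))
```

So at s + 0.5 = 3 km/h the walk itself takes 3 hours, and adding the 24-minute stop gives 204 minutes in total.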
Satori's Solution with COAT: