The core technique presented in this paper is Chain-of-Action-Thought (COAT) reasoning, which enhances LLMs' ability to perform autoregressive search through self-reflection and self-exploration during problem-solving.
COAT introduces special meta-action tokens that guide the model's reasoning process:

- <|continue|>: continue building on the current line of reasoning
- <|reflect|>: pause to verify the correctness of prior steps
- <|explore|>: identify flaws and explore an alternative solution

The key innovation is training the model through a two-stage process: a small-scale format-tuning stage that teaches the COAT reasoning format, followed by large-scale self-improvement via reinforcement learning.
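To make the token mechanics concrete, here is a small illustrative sketch of how a COAT trajectory could be segmented by its meta-action tokens. The token strings match the paper; the `split_coat_trajectory` helper and the sample trajectory are hypothetical, not from the Satori codebase.

```python
# Hypothetical helper: segment a COAT trajectory into (meta-action, step)
# pairs. Only the three token strings come from the paper.
import re

META_ACTIONS = ("<|continue|>", "<|reflect|>", "<|explore|>")

def split_coat_trajectory(text: str):
    """Return (action, segment) pairs for each meta-action span."""
    pattern = "(" + "|".join(re.escape(t) for t in META_ACTIONS) + ")"
    parts = re.split(pattern, text)  # capture group keeps the delimiters
    # parts looks like ['', token, segment, token, segment, ...]
    return [(parts[i], parts[i + 1].strip())
            for i in range(1, len(parts) - 1, 2)]

trajectory = (
    "<|continue|>Set up 9/s + t = 4."
    "<|reflect|>Check units: t is in hours."
    "<|explore|>Try eliminating t by subtracting the equations."
)
for action, step in split_coat_trajectory(trajectory):
    print(action, step)
```

Splitting with a capturing group keeps each delimiter token in the output, so every reasoning segment stays paired with the meta-action that produced it.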
Here's a mathematical reasoning example showing COAT in action:
Problem: Every morning Aya goes for a 9-kilometer walk and then spends t hours in a coffee shop. At speed s km/h the outing takes 4 hours in total; at speed s + 2 km/h it takes 2.4 hours, including the same coffee stop. Find the total time (including the stop) when she walks at s + 0.5 km/h.
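Before looking at the model's trace, the algebra the problem requires can be sanity-checked with a short script (illustrative only, not part of Satori's pipeline):

```python
# Sketch: checking the arithmetic behind the walk problem.
# With t = coffee-shop time in hours, the two trips give:
#   9/s       + t = 4
#   9/(s + 2) + t = 2.4
# Subtracting eliminates t and rearranges to 1.6*s**2 + 3.2*s - 18 = 0.
import math

a, b, c = 1.6, 3.2, -18.0
s = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)  # positive root: s ≈ 2.5 km/h
t = 4 - 9 / s                                      # coffee time ≈ 0.4 h = 24 min

total_hours = 9 / (s + 0.5) + t                    # ≈ 3.4 h = 204 minutes
print(round(s, 3), round(t * 60, 3), round(total_hours * 60, 3))
```

So at s + 0.5 = 3 km/h the walk itself takes 3 hours, and adding the 24-minute stop gives 204 minutes in total.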
Satori's Solution with COAT: