Hierarchical Reasoning Model, 2025 | Notion

At the start of training, the recurrently computed hidden states are essentially random noise.

The Bootstrap Problem and Its Solution

Initial Training Dynamics

At the very beginning:

The hidden states z_L and z_H are indeed gibberish
The networks receive: (garbage_state, other_garbage_state, actual_input)
But crucially, they still have access to the actual input x̃

What likely happens early in training:

Networks learn to mostly ignore the noisy hidden states
They focus on the actual input signal x̃
Essentially become feedforward networks at first
Gradually learn to incorporate hidden state information as it becomes meaningful

Why This Can Work