Paper

Required: see Hyper-Connections, 2024

Problem: training instability (loss spikes, exploding gradient norms)

mHC's key insight is to project H^res onto the set of doubly stochastic matrices, i.e. matrices where:

- every entry is non-negative
- each row sums to 1
- each column sums to 1

This is achieved with the Sinkhorn-Knopp algorithm: alternately normalize rows to sum to 1, then columns to sum to 1. They run a fixed 20 iterations rather than iterating to exact convergence.
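A minimal sketch of that projection in PyTorch (my own code, not the paper's; the function name `sinkhorn_knopp` and the `exp()` step used to make entries positive are assumptions):

```python
import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Approximately project a square matrix onto the doubly stochastic set."""
    m = torch.exp(logits)  # assumption: exponentiate so all entries are positive
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)  # normalize rows to sum to 1
        m = m / m.sum(dim=0, keepdim=True)  # normalize columns to sum to 1
    return m
```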

(This is a differentiable part of the forward pass, so gradients flow back through the normalization iterations.)
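A quick way to check this with the sketch above (hypothetical usage; `H_res` here is just a random stand-in for H^res):

```python
# Hypothetical usage: verify autograd flows through the projection.
H_res = torch.randn(4, 4, requires_grad=True)
P = sinkhorn_knopp(H_res)

print(P.sum(dim=0))  # columns sum to exactly 1 (last normalization step)
print(P.sum(dim=1))  # rows sum to ~1 (approximate after 20 iterations)

loss = P.pow(2).sum()     # any scalar loss with a nontrivial gradient
loss.backward()           # backprop through all 20 normalization steps
print(H_res.grad.norm())  # nonzero: the projection is differentiable
```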

The result: gain magnitudes stay around 1.0-1.6 instead of 3000 (Figure 7).


My Q: did they try just normal layer norm, etc.?