Required background: see Hyper-Connections (2024)
Problem: training instability (loss spikes, exploding gradient norms)
mHC's key insight is to project H^res onto the set of doubly stochastic matrices: matrices where all entries are nonnegative, every row sums to 1, and every column sums to 1.
This is achieved using the Sinkhorn-Knopp algorithm: repeatedly normalize rows to sum to 1, then columns to sum to 1, until convergence (they use 20 iterations).
(This is a differentiable part of the forward pass.)
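A minimal PyTorch sketch of the idea, assuming H^res is parameterized by an unconstrained square matrix of logits. The function name `sinkhorn_knopp` and the exp() step to force positive entries are my choices for illustration, not necessarily the paper's exact parameterization:

```python
import torch

def sinkhorn_knopp(logits: torch.Tensor, num_iters: int = 20,
                   eps: float = 1e-8) -> torch.Tensor:
    """Approximately project a square matrix onto the doubly stochastic set.

    Every op here is differentiable, so the projection can sit inside
    the forward pass and gradients flow through the normalizations.
    """
    # exp() guarantees positive entries before normalizing (my assumption
    # about the parameterization, not confirmed from the paper).
    M = logits.exp()
    for _ in range(num_iters):
        M = M / (M.sum(dim=1, keepdim=True) + eps)  # rows   -> sum to 1
        M = M / (M.sum(dim=0, keepdim=True) + eps)  # columns -> sum to 1
    return M

# Quick sanity check on a random 4x4 matrix.
H_res = sinkhorn_knopp(torch.randn(4, 4, requires_grad=True))
print(H_res.sum(dim=1))  # each row sum    ~ 1
print(H_res.sum(dim=0))  # each column sum ~ 1
```

Note that since the loop ends on a column normalization, column sums are exact and row sums are only approximately 1; the row error shrinks as iterations increase, which is presumably why 20 iterations suffices.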
The result: gain magnitudes stay around 1.0-1.6 instead of 3000 (Figure 7).
My Q: did they try just a normal layer norm, etc.?