Paper

Required: see Hyper-Connections, 2024

Problem: training instability (loss spikes, exploding gradient norms)

mHC's key insight is to project H^res onto the set of doubly stochastic matrices, i.e. matrices where:

- every entry is non-negative
- each row sums to 1
- each column sums to 1

This is achieved with the Sinkhorn-Knopp algorithm: alternately normalize rows to sum to 1, then columns to sum to 1. They run a fixed 20 iterations rather than iterating to exact convergence.
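A minimal sketch of that projection in PyTorch (my own code, not the paper's; the function name `sinkhorn_knopp` and the `exp()` step used to make entries positive are assumptions):

```python
import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Approximately project a square matrix onto the doubly stochastic set."""
    m = torch.exp(logits)  # assumption: exponentiate so all entries are positive
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)  # normalize rows to sum to 1
        m = m / m.sum(dim=0, keepdim=True)  # normalize columns to sum to 1
    return m
```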

(This is a differentiable part of the forward pass, so gradients flow back through the normalization iterations.)
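A quick way to check this with the sketch above (hypothetical usage; `H_res` here is just a random stand-in for H^res):

```python
# Hypothetical usage: verify autograd flows through the projection.
H_res = torch.randn(4, 4, requires_grad=True)
P = sinkhorn_knopp(H_res)

print(P.sum(dim=0))  # columns sum to exactly 1 (last normalization step)
print(P.sum(dim=1))  # rows sum to ~1 (approximate after 20 iterations)

loss = P.pow(2).sum()     # any scalar loss with a nontrivial gradient
loss.backward()           # backprop through all 20 normalization steps
print(H_res.grad.norm())  # nonzero: the projection is differentiable
```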

The result: gain magnitudes stay around 1.0-1.6 instead of 3000 (Figure 7).


My Q: did they try just normal layer norm, etc.?