
Instead of a single hidden vector h ∈ ℝᵈ, create n copies to form a "hyper hidden matrix":
H = [h₁, h₂, ..., hₙ]ᵀ ∈ ℝⁿˣᵈ
At initialization, all n copies are identical to the input embedding.
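Here is a minimal sketch of building the hyper hidden matrix from a single token embedding, assuming PyTorch; the values of n and d are illustrative, not taken from the paper:

```python
import torch

n, d = 4, 512                           # number of copies n and hidden size d (illustrative)
h = torch.randn(d)                      # input embedding for one token, h in R^d

# Hyper hidden matrix H in R^{n x d}: at initialization, n identical copies of h
H = h.unsqueeze(0).expand(n, -1).clone()
assert H.shape == (n, d)
assert torch.equal(H[0], H[1])          # all rows start out identical
```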
The hyper-connection is parameterized by a learnable (n+1)×(n+1) matrix:
HC = | 0     β₁     β₂     ...  βₙ    |
     | α₁,₀  α₁,₁   α₁,₂   ...  α₁,ₙ  |
     | α₂,₀  α₂,₁   α₂,₂   ...  α₂,ₙ  |
     | ...                            |
     | αₙ,₀  αₙ,₁   αₙ,₂   ...  αₙ,ₙ  |
This decomposes into three pieces:
Aₘ = (α₁,₀, α₂,₀, ..., αₙ,₀)ᵀ ∈ ℝⁿ: the first column, which weights the n hidden vectors when forming the layer input (width-connection).
B = (β₁, β₂, ..., βₙ) ∈ ℝⁿ: the first row, which weights how the layer output is written back into each of the n hidden vectors (depth-connection).
Aᵣ ∈ ℝⁿˣⁿ: the remaining block of αᵢ,ⱼ (i, j ≥ 1), which mixes the n hidden vectors among themselves.
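A sketch of how these three pieces could be held as learnable parameters, again assuming PyTorch; the class name, the full_matrix helper, and the initialization shown are illustrative assumptions, not the paper's exact scheme:

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Learnable parts of one layer's hyper-connection matrix HC, shape (n+1) x (n+1)."""

    def __init__(self, n: int):
        super().__init__()
        # Illustrative initialization (an assumption): read the layer input from
        # stream 1, write the layer output to every stream, carry each stream through.
        A_m = torch.zeros(n)
        A_m[0] = 1.0
        self.A_m = nn.Parameter(A_m)              # first column: width-connection weights
        self.B = nn.Parameter(torch.ones(n))      # first row:    depth-connection weights
        self.A_r = nn.Parameter(torch.eye(n))     # trailing n x n block: stream mixing

    def full_matrix(self) -> torch.Tensor:
        """Assemble the full (n+1) x (n+1) matrix HC from its three parts."""
        n = self.B.numel()
        HC = torch.zeros(n + 1, n + 1)
        HC[0, 1:] = self.B
        HC[1:, 0] = self.A_m
        HC[1:, 1:] = self.A_r                     # HC[0, 0] stays 0, matching the layout above
        return HC
```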
For each layer T (attention or FFN):
Width-connection: compute the layer input as a weighted sum of the n hidden vectors:
h₀ = Aₘᵀ · H (the n rows of H are combined into a single d-dimensional input)
Layer computation:
output = T(h₀) (pass the combined input through the attention or FFN block)
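A self-contained sketch of these two steps, assuming PyTorch; the function name and the toy identity layer are illustrative, not from the paper:

```python
import torch

def width_connect_and_layer(H: torch.Tensor, A_m: torch.Tensor, T) -> torch.Tensor:
    """Width-connection followed by the layer computation for one layer T.

    H:   hyper hidden matrix, shape (n, d)
    A_m: width-connection weights, shape (n,) (the first column of HC)
    T:   the layer itself (attention or FFN), mapping R^d -> R^d
    """
    h0 = A_m @ H          # width-connection: (n,) @ (n, d) -> (d,)
    output = T(h0)        # layer computation
    return output

# Toy usage with an identity "layer" (purely illustrative):
n, d = 4, 8
H = torch.randn(n, d)
A_m = torch.zeros(n)
A_m[0] = 1.0                                            # read only from stream 1
out = width_connect_and_layer(H, A_m, T=lambda x: x)
assert torch.allclose(out, H[0])
```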