
Instead of a single hidden vector h ∈ ℝᵈ, create n copies to form a "hyper hidden matrix":
H = [h₁, h₂, ..., hₙ]ᵀ ∈ ℝⁿˣᵈ
At initialization, all n copies are identical to the input embedding.
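Here is a minimal sketch of building the hyper hidden matrix from a single token embedding, assuming PyTorch; the values of n and d are illustrative, not taken from the paper:

```python
import torch

n, d = 4, 512                           # number of copies n and hidden size d (illustrative)
h = torch.randn(d)                      # input embedding for one token, h in R^d

# Hyper hidden matrix H in R^{n x d}: at initialization, n identical copies of h
H = h.unsqueeze(0).expand(n, -1).clone()
assert H.shape == (n, d)
assert torch.equal(H[0], H[1])          # all rows start out identical
```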
The hyper-connection is parameterized by a learnable (n+1)×(n+1) matrix:
HC = | 0     β₁     β₂     ...  βₙ    |
     | α₁,₀  α₁,₁   α₁,₂   ...  α₁,ₙ  |
     | α₂,₀  α₂,₁   α₂,₂   ...  α₂,ₙ  |
     | ...                            |
     | αₙ,₀  αₙ,₁   αₙ,₂   ...  αₙ,ₙ  |
This decomposes into three pieces:
Aₘ = (α₁,₀, α₂,₀, ..., αₙ,₀)ᵀ ∈ ℝⁿ: the first column, which weights the n hidden vectors when forming the layer input (width-connection).
B = (β₁, β₂, ..., βₙ) ∈ ℝⁿ: the first row, which weights how the layer output is written back into each of the n hidden vectors (depth-connection).
Aᵣ ∈ ℝⁿˣⁿ: the remaining block of αᵢ,ⱼ (i, j ≥ 1), which mixes the n hidden vectors among themselves.
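A sketch of how these three pieces could be held as learnable parameters, again assuming PyTorch; the class name, the full_matrix helper, and the initialization shown are illustrative assumptions, not the paper's exact scheme:

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Learnable parts of one layer's hyper-connection matrix HC, shape (n+1) x (n+1)."""

    def __init__(self, n: int):
        super().__init__()
        # Illustrative initialization (an assumption): read the layer input from
        # stream 1, write the layer output to every stream, carry each stream through.
        A_m = torch.zeros(n)
        A_m[0] = 1.0
        self.A_m = nn.Parameter(A_m)              # first column: width-connection weights
        self.B = nn.Parameter(torch.ones(n))      # first row:    depth-connection weights
        self.A_r = nn.Parameter(torch.eye(n))     # trailing n x n block: stream mixing

    def full_matrix(self) -> torch.Tensor:
        """Assemble the full (n+1) x (n+1) matrix HC from its three parts."""
        n = self.B.numel()
        HC = torch.zeros(n + 1, n + 1)
        HC[0, 1:] = self.B
        HC[1:, 0] = self.A_m
        HC[1:, 1:] = self.A_r                     # HC[0, 0] stays 0, matching the layout above
        return HC
```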
For each layer T (attention or FFN):
Width-connection: compute the layer input as a weighted sum of the n hidden vectors:
h₀ = Aₘᵀ · H (the n rows of H are combined into a single d-dimensional input)
Layer computation:
output = T(h₀) (pass the combined input through the attention or FFN block)
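A self-contained sketch of these two steps, assuming PyTorch; the function name and the toy identity layer are illustrative, not from the paper:

```python
import torch

def width_connect_and_layer(H: torch.Tensor, A_m: torch.Tensor, T) -> torch.Tensor:
    """Width-connection followed by the layer computation for one layer T.

    H:   hyper hidden matrix, shape (n, d)
    A_m: width-connection weights, shape (n,) (the first column of HC)
    T:   the layer itself (attention or FFN), mapping R^d -> R^d
    """
    h0 = A_m @ H          # width-connection: (n,) @ (n, d) -> (d,)
    output = T(h0)        # layer computation
    return output

# Toy usage with an identity "layer" (purely illustrative):
n, d = 4, 8
H = torch.randn(n, d)
A_m = torch.zeros(n)
A_m[0] = 1.0                                            # read only from stream 1
out = width_connect_and_layer(H, A_m, T=lambda x: x)
assert torch.allclose(out, H[0])
```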