
Instead of a single hidden vector h ∈ ℝᵈ, create n copies to form a "hyper hidden matrix":
H = [h₁, h₂, ..., hₙ]ᵀ ∈ ℝⁿˣᵈ
At initialization, all n copies are identical to the input embedding.
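As a quick illustration (a hypothetical PyTorch sketch, not from the paper; `make_hyper_hidden` is an invented name), the hyper hidden matrix is just the hidden state broadcast n times:

```python
import torch

def make_hyper_hidden(h: torch.Tensor, n: int) -> torch.Tensor:
    """Replicate a hidden state h of shape (..., d) into a hyper hidden
    matrix H of shape (..., n, d); all n rows start out identical."""
    return h.unsqueeze(-2).expand(*h.shape[:-1], n, h.shape[-1]).clone()

# Example: batch of 2 sequences, 5 tokens, d = 8, expansion rate n = 4
h0 = torch.randn(2, 5, 8)
H0 = make_hyper_hidden(h0, n=4)  # (2, 5, 4, 8); H0[..., i, :] == h0 for every i
```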
The hyper-connection is parameterized, per layer, by a learnable (n+1) × (n+1) matrix:

HC = | 0     β₁    β₂    ...  βₙ   |
     | α₁,₀  α₁,₁  α₁,₂  ...  α₁,ₙ |
     | α₂,₀  α₂,₁  α₂,₂  ...  α₂,ₙ |
     | ...                         |
     | αₙ,₀  αₙ,₁  αₙ,₂  ...  αₙ,ₙ |
This decomposes into three blocks:

HC = | 0   B  |
     | Aₘ  Aᵣ |

where B = (β₁, ..., βₙ) ∈ ℝ¹ˣⁿ distributes the layer output across the n copies, Aₘ = (α₁,₀, ..., αₙ,₀)ᵀ ∈ ℝⁿˣ¹ mixes the n copies into the single input fed to the layer (the depth connections), and Aᵣ = (αᵢ,ⱼ) ∈ ℝⁿˣⁿ mixes the copies with one another (the width connections).
Initialized, for the k-th layer in the stack, as:

Aₘ = eᵢ with i = (k mod n) + 1,   Aᵣ = eₙₓₙ,   B = (1, 1, ..., 1)

"where eₙₓₙ denotes an n × n identity matrix, eᵢ ∈ ℝⁿˣ¹ represents the i-th column of eₙₓₙ"

This initialization reduces the network to a standard Pre-Norm residual network: the layer output is added to every copy, Aᵣ carries each copy forward unchanged, and successive layers read their input from the copies in round-robin order.
For each layer T (attention or FFN):