Paper

Step 1: Expand the Hidden State

Instead of a single hidden vector h ∈ ℝᵈ, create n copies to form a "hyper hidden matrix":

H = [h₁, h₂, ..., hₙ]ᵀ ∈ ℝⁿˣᵈ

At initialization, all n copies are identical to the input embedding.
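The expansion step above can be sketched in a few lines of NumPy (the function name is illustrative, not from the paper):

```python
import numpy as np

def expand_hidden(h: np.ndarray, n: int) -> np.ndarray:
    """Replicate a single hidden vector h (shape (d,)) into a hyper
    hidden matrix H (shape (n, d)); all n rows start as identical
    copies of the input embedding."""
    return np.tile(h, (n, 1))

d, n = 8, 4
h = np.arange(d, dtype=float)   # stand-in for an input embedding
H = expand_hidden(h, n)
print(H.shape)                  # (4, 8)
```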

Step 2: Define the Connection Matrix

The hyper-connection is parameterized by one learnable matrix per layer:

HC = | 0     β₁    β₂   ...  βₙ   |
     | α₁,₀  α₁,₁  α₁,₂ ...  α₁,ₙ |
     | α₂,₀  α₂,₁  α₂,₂ ...  α₂,ₙ |
     | ⋮     ⋮     ⋮    ⋱    ⋮    |
     | αₙ,₀  αₙ,₁  αₙ,₂ ...  αₙ,ₙ |

so HC ∈ ℝ⁽ⁿ⁺¹⁾ˣ⁽ⁿ⁺¹⁾.

This decomposes into three sets of weights: the first row (β₁, ..., βₙ) controls how the layer's output is written back into the n copies (depth connections); the first column (α₁,₀, ..., αₙ,₀) controls how the copies are mixed into the layer's input; and the remaining n × n block of αᵢ,ⱼ mixes the copies among themselves (width connections).

Initialized so that the network starts out equivalent to a plain residual network (the paper recovers standard residual connections as a special case):

"where eₙₓₙ denotes an n × n identity matrix, eᵢ ∈ ℝⁿˣ¹ represents the i-th column of eₙₓₙ"
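A minimal sketch of one residual-equivalent initialization consistent with this notation: β = all-ones, α_{·,0} = eₖ, and the n × n block set to the identity. The specific choice of k and of β here is an assumption; check the paper for the exact scheme.

```python
import numpy as np

def init_hc(n: int, k: int) -> np.ndarray:
    """Build an (n+1) x (n+1) hyper-connection matrix that behaves like
    a plain residual connection while all n copies are identical.
    Assumed scheme (verify against the paper): beta = 1s, the layer
    reads copy k (alpha_{.,0} = e_k), and the n x n block is I."""
    hc = np.zeros((n + 1, n + 1))
    hc[0, 1:] = 1.0            # beta: write the layer output to every copy
    hc[k + 1, 0] = 1.0         # alpha_{k,0} = 1: read copy k as the input
    hc[1:, 1:] = np.eye(n)     # identity: each copy carries itself forward
    return hc

hc = init_hc(n=4, k=0)
print(hc.shape)                # (5, 5)
```

Because every copy receives the same update under this initialization, the copies stay identical, and the network computes exactly what a residual network would.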

Step 3: The Forward Pass

For each layer T (attention or FFN):
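Reading the HC matrix blockwise, the per-layer update can be sketched as follows. The index convention (row 0 = β, column 0 = input-mixing weights, remainder = copy-to-copy mixing) is my reading of the matrix above; confirm against the paper.

```python
import numpy as np

def hyper_forward(H: np.ndarray, hc: np.ndarray, layer) -> np.ndarray:
    """One hyper-connected layer step for H in R^{n x d}.

    hc[1:, 0]  (alpha_{i,0}) mixes the n copies into the layer input,
    hc[0, 1:]  (beta_i)      writes the layer output back into each copy,
    hc[1:, 1:] (alpha_{i,j}) mixes the copies among themselves.
    """
    beta  = hc[0, 1:]            # (n,)
    a_in  = hc[1:, 0]            # (n,)
    a_res = hc[1:, 1:]           # (n, n)
    x = a_in @ H                 # layer input: weighted sum of the copies
    y = layer(x)                 # T(x): the attention or FFN sublayer
    return a_res @ H + np.outer(beta, y)

# Usage with the residual-equivalent initialization (beta = 1s,
# read copy 0, identity mixing) and a toy "layer" T(x) = 2x:
n, d = 4, 8
hc = np.zeros((n + 1, n + 1))
hc[0, 1:] = 1.0
hc[1, 0] = 1.0
hc[1:, 1:] = np.eye(n)
H = np.tile(np.arange(d, dtype=float), (n, 1))
H2 = hyper_forward(H, hc, layer=lambda x: 2 * x)
# every copy receives the same update, so the copies stay identical
```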