Paper

Step 1: Expand the Hidden State

Instead of a single hidden vector h ∈ ℝᵈ, create n copies to form a "hyper hidden matrix":

H = [h₁, h₂, ..., hₙ]ᵀ ∈ ℝⁿˣᵈ

At initialization, all n copies are identical to the input embedding.
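
A minimal sketch of this expansion (PyTorch; expand_hidden_state is a hypothetical helper name, and the batch/sequence dimensions are omitted so the shapes stay easy to read):

```python
import torch

def expand_hidden_state(h: torch.Tensor, n: int) -> torch.Tensor:
    """Replicate a hidden vector h of shape (d,) into the hyper hidden
    matrix H of shape (n, d); at initialization every row equals h."""
    return h.unsqueeze(0).repeat(n, 1)

# Example with d = 4 and expansion rate n = 2
h = torch.randn(4)                   # input embedding, shape (d,)
H = expand_hidden_state(h, n=2)      # shape (n, d); both rows identical to h
```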

Step 2: Define the Connection Matrix

The hyper-connection is parameterized by a learnable (n+1)×(n+1) matrix:

HC = | 0     β₁    β₂    ...  βₙ   |
     | α₁,₀  α₁,₁  α₁,₂  ...  α₁,ₙ |
     | α₂,₀  α₂,₁  α₂,₂  ...  α₂,ₙ |
     | ...   ...   ...   ...  ...  |
     | αₙ,₀  αₙ,₁  αₙ,₂  ...  αₙ,ₙ |

This decomposes into three blocks:

  - B = (β₁, β₂, ..., βₙ) ∈ ℝ¹ˣⁿ, the depth-connection weights that spread the layer output back across the n hidden vectors.
  - Aₘ = (α₁,₀, α₂,₀, ..., αₙ,₀)ᵀ ∈ ℝⁿˣ¹, the width-connection weights that mix the n hidden vectors into the single layer input.
  - Aᵣ = (αᵢ,ⱼ) for i, j ≥ 1, an n×n matrix of residual weights that carries the hidden vectors forward and lets them exchange information.
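
One way this could look in code (a sketch under assumptions: the class name HyperConnection, storing everything in a single (n+1)×(n+1) parameter, and the residual-like initialization are illustrative choices, not the paper's implementation):

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Static hyper-connection weights for one layer (illustrative sketch)."""
    def __init__(self, n: int):
        super().__init__()
        # One learnable (n+1) x (n+1) matrix. Keeping the top-left entry
        # fixed at 0 (e.g. via masking) is omitted here for brevity.
        hc = torch.zeros(n + 1, n + 1)
        hc[0, 1:] = 1.0                # B: add the layer output to every copy
        hc[1, 0] = 1.0                 # A_m: read the layer input from copy 1
        hc[1:, 1:] = torch.eye(n)      # A_r: identity, plain carry-over
        self.hc = nn.Parameter(hc)     # this init mimics an ordinary residual

    @property
    def B(self):      # depth-connection weights, shape (1, n)
        return self.hc[0:1, 1:]

    @property
    def A_m(self):    # width-connection weights, shape (n, 1)
        return self.hc[1:, 0:1]

    @property
    def A_r(self):    # residual weights among the n hidden vectors, shape (n, n)
        return self.hc[1:, 1:]
```

With these slices, the width-connection of Step 3 is just hc.A_m.T @ H: a (1, n) × (n, d) product that yields the (1, d) layer input.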

Step 3: The Forward Pass

For each layer T (attention or FFN):

  1. Width-connection — Compute the layer input as a weighted sum of the n hidden vectors:

    h₀ = Aₘᵀ · H ∈ ℝ¹ˣᵈ (combines the n hidden vectors into a single layer input)

  2. Layer computation:

    output = T(h₀) (pass through attention or FFN)