Background

Linear attention

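Linear attention maintains a matrix-valued state that is updated purely additively and read out with the query:

$$S_t = S_{t-1} + v_t k_t^\top, \qquad o_t = S_t q_t.$$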

Delta rule

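The delta rule replaces the purely additive update with an error-correcting one: the value currently stored for key kt is retrieved, and its prediction error drives the update, scaled by a per-token step size βt:

$$S_t = S_{t-1} - \beta_t\,(S_{t-1} k_t - v_t)\,k_t^\top.$$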

Widrow-Hoff

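This is the classic Widrow-Hoff (LMS) rule: each update is one step of online gradient descent on the per-token reconstruction loss

$$\mathcal{L}_t(S) = \tfrac{1}{2}\,\lVert S k_t - v_t \rVert^2,$$

since $\nabla_S \mathcal{L}_t(S_{t-1}) = (S_{t-1} k_t - v_t)\,k_t^\top$, which recovers the delta-rule update above with learning rate βt.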

Parallelizing

The paper describes a technique for parallelizing DeltaNet, which traditionally uses a sequential update rule that's difficult to train efficiently on modern hardware. The core innovation is a reparameterization that enables chunkwise parallel training.

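To make this concrete, here is a minimal NumPy sketch (my own illustration, not the paper's code): a naive sequential implementation of the delta-rule recurrence, and an equivalent computation that solves for all the "pseudo" value vectors at once with one triangular solve, which is the reformulation the steps below describe. Shapes and variable names are chosen for illustration only.

```python
import numpy as np

def delta_rule_sequential(K, V, beta):
    """Naive recurrence: S_t = S_{t-1} - beta_t * (S_{t-1} k_t - v_t) k_t^T.
    Each step depends on the previous state, so it runs one token at a time."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))
    for t in range(T):
        S = S - beta[t] * np.outer(S @ K[t] - V[t], K[t])
    return S

def delta_rule_via_pseudo_values(K, V, beta):
    """Additive form: S_T = sum_i u_i k_i^T with u_i = beta_i (v_i - S_{i-1} k_i).
    Substituting S_{i-1} = sum_{j<i} u_j k_j^T gives a lower-triangular linear
    system for all u_i at once, so no per-step state is materialized:
        (I + diag(beta) * strict_tril(K K^T)) U = diag(beta) V
    """
    T = K.shape[0]
    A = beta[:, None] * np.tril(K @ K.T, k=-1)   # strictly lower-triangular part
    U = np.linalg.solve(np.eye(T) + A, beta[:, None] * V)
    return U.T @ K                               # sum_i u_i k_i^T

# Quick check that the two forms agree.
rng = np.random.default_rng(0)
T, d_k, d_v = 8, 4, 3
K = rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_v))
beta = rng.uniform(0.1, 1.0, size=T)
print(np.allclose(delta_rule_sequential(K, V, beta),
                  delta_rule_via_pseudo_values(K, V, beta)))  # True
```

The paper's actual algorithm applies this idea per chunk rather than over the full sequence, so only the states at chunk boundaries are ever materialized.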
Here's how the core technique works:

  1. The Problem: DeltaNet uses the delta rule for updates: St = St-1 - βt(St-1kt - vt)kt^T. Each step needs the previous hidden state St-1, so the recurrence is inherently sequential and cannot be parallelized over the sequence length.
  2. Key Insight #1 - Additive Representation: The authors observe that St can be rewritten as a sum of outer products: St = Σ_{i=1..t} ui ki^T, where ui are "pseudo" value vectors defined as ui = βi(vi - Si-1ki).
  3. Key Insight #2 - Householder Transformation: The delta rule update can be viewed as applying a generalized Householder transformation to the previous state: St = St-1(I - βtktkt^T) + βtvtkt^T.
  4. The WY Representation: They exploit a memory-efficient representation for products of Householder matrices, which allows them to compute the pseudo values ui without explicitly materializing the hidden states.
  5. Chunkwise Parallel Form: They split the sequence into chunks and derive chunk-level recurrences. Within each chunk, they use: