- Linear Transformers Are Secretly Fast Weight Programmers, 2021: paper
- Parallelizing Linear Transformers with the Delta Rule over Sequence Length, 2024: paper
Background
Linear attention
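In the fast-weight view (the 2021 paper), linear attention maintains a matrix-valued state S_t that is updated purely additively, S_t = S_{t-1} + v_t k_t^T, and read out as o_t = S_t q_t.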

Delta rule
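DeltaNet replaces the additive write with an error-correcting one: retrieve the current prediction S_{t-1} k_t and write only the correction, S_t = S_{t-1} - β_t (S_{t-1} k_t - v_t) k_t^T, with a learned write strength β_t.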

Widrow-Hoff
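The delta rule is the classical Widrow-Hoff (least-mean-squares) rule: each update is one gradient step on 1/2 ||S k_t - v_t||^2 with step size β_t, since the gradient of that loss with respect to S is (S k_t - v_t) k_t^T.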


Parallelizing
The paper describes a technique for parallelizing DeltaNet, which traditionally uses a sequential update rule that's difficult to train efficiently on modern hardware. The core innovation is a reparameterization that enables chunkwise parallel training.
Here's how the core technique works:
- The Problem: DeltaNet uses the delta rule for its update: S_t = S_{t-1} - β_t (S_{t-1} k_t - v_t) k_t^T, which requires materializing the hidden state S_{t-1} to compute the update. This sequential dependency prevents parallelization over the sequence length.
- Key Insight #1 - Additive Representation: The authors observe that S_t can be rewritten as a sum of outer products: S_t = Σ_{i=1}^{t} u_i k_i^T, where the u_i are "pseudo" value vectors defined as u_i = β_i (v_i - S_{i-1} k_i).
- Key Insight #2 - Householder Transformation: The delta rule update can be viewed as applying a generalized Householder transformation to the previous state: S_t = S_{t-1} (I - β_t k_t k_t^T) + β_t v_t k_t^T.
- The WY Representation: They exploit the WY representation, a memory-efficient form for products of Householder matrices, which allows them to compute the pseudo-value vectors u_i without explicitly materializing the intermediate hidden states.
- Chunkwise Parallel Form: They split the sequence into chunks and derive chunk-level recurrences. Within each chunk [t], they use (see the sketch after this list):
- P_[t]^r = ∏_{i=1}^{r} (I - β_i k_i k_i^T): The decay factor, a product of the generalized Householder matrices
- H_[t]^r = Σ_{i=1}^{r} u_i k_i^T: The chunk contributions
- S_[t]^r = S_[t]^0 P_[t]^r + H_[t]^r: The chunk-level update, where S_[t]^0 is the state carried over from the previous chunk
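To make the recurrences concrete, here is a minimal NumPy sketch (not the paper's implementation; function names, shapes, and the chunk size are illustrative). It obtains each chunk's pseudo-vectors from a single unit-lower-triangular solve over key dot products, applies the chunk-level update S ← S P + H, and checks that this matches the naive token-by-token delta rule.

```python
import numpy as np


def delta_rule_sequential(K, V, beta):
    """Reference recurrence, one token at a time:
    S_t = S_{t-1} - beta_t * (S_{t-1} k_t - v_t) k_t^T
    K: (T, d_k), V: (T, d_v), beta: (T,) -> final state (d_v, d_k)."""
    S = np.zeros((V.shape[1], K.shape[1]))
    for k, v, b in zip(K, V, beta):
        S = S - b * np.outer(S @ k - v, k)
    return S


def delta_rule_chunkwise(K, V, beta, chunk=4):
    """Chunkwise form via the WY-style representation (illustrative sketch).
    Per chunk, the pseudo-vectors come from one unit-lower-triangular solve
    built only from key dot products, so no per-token state is materialized;
    the chunk is then applied in a single step: S <- S @ P + H."""
    T, d_k = K.shape
    S = np.zeros((V.shape[1], d_k))
    for s in range(0, T, chunk):
        Kc, Vc, bc = K[s:s + chunk], V[s:s + chunk], beta[s:s + chunk]
        # Strictly lower-triangular matrix of scaled key overlaps:
        # M[r, i] = beta_r * (k_r . k_i) for i < r.
        M = np.tril(bc[:, None] * (Kc @ Kc.T), -1)
        A = np.eye(len(Kc)) + M
        # Solve (I + M) W = diag(beta) Kc and (I + M) U = diag(beta) Vc.
        # Rows w_i, u_i are the WY pseudo-vectors (chunk-local: measured
        # against a zero state at the chunk start).
        W = np.linalg.solve(A, bc[:, None] * Kc)
        U = np.linalg.solve(A, bc[:, None] * Vc)
        P = np.eye(d_k) - W.T @ Kc   # P = prod_i (I - beta_i k_i k_i^T)
        H = U.T @ Kc                 # H = sum_i u_i k_i^T
        S = S @ P + H                # chunk-level update
    return S


# Quick equivalence check on random data (shapes are arbitrary).
rng = np.random.default_rng(0)
T, d_k, d_v = 16, 8, 5
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))
beta = rng.uniform(0.1, 1.0, size=T)

print(np.allclose(delta_rule_sequential(K, V, beta),
                  delta_rule_chunkwise(K, V, beta, chunk=4)))  # True
```

The triangular solve is essentially the UT-transform step used in the paper to form the WY representation; the full method also computes the per-token outputs o_t = S_t q_t within each chunk, whereas this sketch only verifies that the chunkwise state update is exactly equivalent to the sequential recurrence.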