Setup: generalize attention in transformers to arbitrary similarity functions; usual softmax attention is the special case sim = exp (of the scaled dot product):

$$V_i' = \frac{\sum_{j=1}^{N} \mathrm{sim}(Q_i, K_j)\, V_j}{\sum_{j=1}^{N} \mathrm{sim}(Q_i, K_j)}$$

With similarity functions that are kernels, do the kernel trick—map the vectors to a feature space via some $\phi$ so that attention is just an inner product:

$$V_i' = \frac{\phi(Q_i)^\top \sum_{j=1}^{N} \phi(K_j)\, V_j^\top}{\phi(Q_i)^\top \sum_{j=1}^{N} \phi(K_j)}$$

They settle on this feature map for their experiments:

$$\phi(x) = \mathrm{elu}(x) + 1$$
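A minimal NumPy sketch of this factored attention (function names and shapes are my own; the feature map is elu(x) + 1 as above). The point is that the sums over $j$ are computed once and shared by every query, so cost is linear in sequence length:

```python
import numpy as np

def elu(x):
    # Exponential linear unit; phi(x) = elu(x) + 1 keeps all features positive,
    # so the normalizing denominator is always positive.
    return np.where(x > 0, x, np.exp(x) - 1)

def phi(x):
    return elu(x) + 1

def linear_attention(Q, K, V):
    # Non-causal linear attention:
    #   V'_i = phi(Q_i)^T (sum_j phi(K_j) V_j^T) / (phi(Q_i)^T sum_j phi(K_j))
    Qp, Kp = phi(Q), phi(K)              # (n, d)
    S = Kp.T @ V                         # (d, d_v): sum_j phi(K_j) V_j^T
    z = Kp.sum(axis=0)                   # (d,):     sum_j phi(K_j)
    return (Qp @ S) / (Qp @ z)[:, None]  # (n, d_v)
```

This gives the same result as materializing the full $n \times n$ similarity matrix $\phi(Q)\phi(K)^\top$ and row-normalizing it, without ever forming that matrix.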

Note: exp itself corresponds to an infinite-dimensional feature map, so it can't be computed this way directly—hence the finite-dimensional choice above.

Viewed like this, causal (autoregressive) attention can be described as an RNN: the running sums become a recurrent state.

$$s_i = s_{i-1} + \phi(K_i)\, V_i^\top, \qquad z_i = z_{i-1} + \phi(K_i)$$

$$V_i' = \frac{\phi(Q_i)^\top s_i}{\phi(Q_i)^\top z_i}$$
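A sketch of that recurrence (again my own names/shapes; assumes the same elu+1 feature map). Each step updates a matrix state and a vector state, then reads out with the current query—constant work per token:

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1)

def phi(x):
    return elu(x) + 1

def rnn_attention(Q, K, V):
    # Causal linear attention run as an RNN over positions i:
    #   s_i = s_{i-1} + phi(K_i) V_i^T   (matrix state, d x d_v)
    #   z_i = z_{i-1} + phi(K_i)         (vector state, d)
    #   V'_i = phi(Q_i)^T s_i / (phi(Q_i)^T z_i)
    n, d = Q.shape
    d_v = V.shape[1]
    s = np.zeros((d, d_v))
    z = np.zeros(d)
    out = np.empty((n, d_v))
    for i in range(n):
        q, k = phi(Q[i]), phi(K[i])
        s += np.outer(k, V[i])
        z += k
        out[i] = (q @ s) / (q @ z)
    return out
```

Running this should match causally masked linear attention computed the quadratic way, which is the sense in which the transformer "is" an RNN here.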

transformer-rnn-diagram.svg