Setup: generalize attention in transformers to arbitrary similarity functions; usual softmax attention is the special case sim = exp (of the scaled dot product):

$$V_i' = \frac{\sum_{j=1}^{N} \mathrm{sim}(Q_i, K_j)\, V_j}{\sum_{j=1}^{N} \mathrm{sim}(Q_i, K_j)}$$

With similarity functions that are kernels, do the kernel trick—map the vectors to a feature space via some $\phi$ so that attention is just an inner product:

$$V_i' = \frac{\phi(Q_i)^\top \sum_{j=1}^{N} \phi(K_j)\, V_j^\top}{\phi(Q_i)^\top \sum_{j=1}^{N} \phi(K_j)}$$

They settle on this feature map for their experiments:

$$\phi(x) = \mathrm{elu}(x) + 1$$
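A minimal NumPy sketch of this factored attention (function names and shapes are my own; the feature map is elu(x) + 1 as above). The point is that the sums over $j$ are computed once and shared by every query, so cost is linear in sequence length:

```python
import numpy as np

def elu(x):
    # Exponential linear unit; phi(x) = elu(x) + 1 keeps all features positive,
    # so the normalizing denominator is always positive.
    return np.where(x > 0, x, np.exp(x) - 1)

def phi(x):
    return elu(x) + 1

def linear_attention(Q, K, V):
    # Non-causal linear attention:
    #   V'_i = phi(Q_i)^T (sum_j phi(K_j) V_j^T) / (phi(Q_i)^T sum_j phi(K_j))
    Qp, Kp = phi(Q), phi(K)              # (n, d)
    S = Kp.T @ V                         # (d, d_v): sum_j phi(K_j) V_j^T
    z = Kp.sum(axis=0)                   # (d,):     sum_j phi(K_j)
    return (Qp @ S) / (Qp @ z)[:, None]  # (n, d_v)
```

This gives the same result as materializing the full $n \times n$ similarity matrix $\phi(Q)\phi(K)^\top$ and row-normalizing it, without ever forming that matrix.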

Note: exp itself corresponds to an infinite-dimensional feature map, so it can't be computed this way directly—hence the finite-dimensional choice above.

Viewed like this, causal (autoregressive) attention can be described as an RNN: the running sums become a recurrent state.

$$s_i = s_{i-1} + \phi(K_i)\, V_i^\top, \qquad z_i = z_{i-1} + \phi(K_i)$$

$$V_i' = \frac{\phi(Q_i)^\top s_i}{\phi(Q_i)^\top z_i}$$
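A sketch of that recurrence (again my own names/shapes; assumes the same elu+1 feature map). Each step updates a matrix state and a vector state, then reads out with the current query—constant work per token:

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1)

def phi(x):
    return elu(x) + 1

def rnn_attention(Q, K, V):
    # Causal linear attention run as an RNN over positions i:
    #   s_i = s_{i-1} + phi(K_i) V_i^T   (matrix state, d x d_v)
    #   z_i = z_{i-1} + phi(K_i)         (vector state, d)
    #   V'_i = phi(Q_i)^T s_i / (phi(Q_i)^T z_i)
    n, d = Q.shape
    d_v = V.shape[1]
    s = np.zeros((d, d_v))
    z = np.zeros(d)
    out = np.empty((n, d_v))
    for i in range(n):
        q, k = phi(Q[i]), phi(K[i])
        s += np.outer(k, V[i])
        z += k
        out[i] = (q @ s) / (q @ z)
    return out
```

Running this should match causally masked linear attention computed the quadratic way, which is the sense in which the transformer "is" an RNN here.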

transformer-rnn-diagram.svg