Setup: generalize attention in transformers to an arbitrary similarity function; in standard softmax attention, sim is the exponential of the (scaled) query-key dot product.
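Concretely, the generalized attention this starts from (my rendering; d is the key dimension):

```latex
% Row i of the attention output with an arbitrary similarity function;
% softmax attention is the special case sim(q, k) = exp(q^T k / sqrt(d)).
V'_i = \frac{\sum_{j} \mathrm{sim}(q_i, k_j)\, v_j}{\sum_{j} \mathrm{sim}(q_i, k_j)},
\qquad \mathrm{sim}(q, k) = \exp\!\big(q^\top k / \sqrt{d}\big)
```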
With a similarity function that is a kernel, apply the kernel trick: map the query and key vectors to a feature space where the similarity is just an inner product, sim(q, k) = φ(q) · φ(k).
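A minimal sketch of attention written this way (numpy; the name `linear_attention` and the generic `phi` argument are mine, non-causal for simplicity):

```python
import numpy as np

def linear_attention(Q, K, V, phi):
    # Kernel-trick attention: sim(q, k) = phi(q) . phi(k).
    # Q, K: (n, d) queries/keys; V: (n, d_v) values; phi maps (n, d) -> (n, r).
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                    # (r, d_v): sum_j phi(k_j) v_j^T
    Z = Qf @ Kf.sum(axis=0)          # (n,): normalizer sum_j phi(q_i) . phi(k_j)
    return (Qf @ KV) / Z[:, None]    # (n, d_v) attention output
```

Because the matrix products are associative, Kᵀ V is computed once and reused for every query, so the cost is linear in sequence length rather than quadratic.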
They settle on this mapping for their experiments.
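A sketch of that mapping, assuming it is the φ(x) = elu(x) + 1 feature map from the paper (function name is mine):

```python
def elu_feature_map(X):
    # phi(x) = elu(x) + 1 applied elementwise; strictly positive, so the
    # attention weights and the normalizer stay positive.
    return np.where(X > 0, X + 1.0, np.exp(X))   # elu(x)+1 = x+1 if x>0 else exp(x)

# e.g. linear_attention(Q, K, V, elu_feature_map)
```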
Note that exp itself corresponds to an infinite-dimensional feature map, so it cannot be used directly as a finite φ.
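One way to see this (my derivation, for the unscaled exponential kernel): Taylor-expand exp and read each term as an inner product of tensor-power features.

```latex
% Each (q^T k)^n is an inner product of the degree-n tensor powers, so the
% exponential kernel's feature map stacks all of them -- infinitely many.
\exp(q^\top k)
  = \sum_{n=0}^{\infty} \frac{(q^\top k)^n}{n!}
  = \sum_{n=0}^{\infty}
      \left\langle \frac{q^{\otimes n}}{\sqrt{n!}},\,
                   \frac{k^{\otimes n}}{\sqrt{n!}} \right\rangle
  = \langle \varphi(q), \varphi(k) \rangle,
\qquad
\varphi(x) = \Big( \tfrac{x^{\otimes n}}{\sqrt{n!}} \Big)_{n=0}^{\infty}
```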
Viewed this way, the (causally masked) transformer can be described as an RNN.
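A sketch of that RNN view, continuing the code above and assuming causal masking (function name is mine); the two running sums are the recurrent state:

```python
def causal_linear_attention_rnn(Q, K, V, phi):
    # RNN form of causally masked linear attention: process one token at a time.
    # State: S = sum_{j<=i} phi(k_j) v_j^T, z = sum_{j<=i} phi(k_j).
    Qf, Kf = phi(Q), phi(K)
    S = np.zeros((Kf.shape[1], V.shape[1]))   # recurrent "memory" state
    z = np.zeros(Kf.shape[1])                 # recurrent normalizer state
    out = np.zeros_like(V, dtype=float)
    for i in range(len(V)):
        S += np.outer(Kf[i], V[i])            # add this step's key-value outer product
        z += Kf[i]                            # update the normalizer
        out[i] = (Qf[i] @ S) / (Qf[i] @ z)    # query the accumulated state
    return out
```

Each step only touches the fixed-size state (S, z), which is what makes constant-memory autoregressive generation possible in this view.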