Explainer

1. Locality-Sensitive Hashing (LSH) Attention

The main innovation is replacing the O(L²) attention mechanism, where L is the sequence length, with an O(L log L) approximation.
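To get a rough sense of the scaling (ignoring constant factors): at a sequence length of L = 65,536 tokens, full attention evaluates L² ≈ 4.3 billion query-key scores per head, whereas L · log₂ L is only about 1 million.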

How the LSH mechanism works:

  1. Shared Q-K: Set queries and keys to be identical (Q = K) and unit-normalized, so a query and its most relevant keys land in the same hash bucket
  2. Hash function: Use random projections to assign nearby vectors to the same bucket
  3. Attention within buckets: Only compute attention between items in the same hash bucket

Implementation details:
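A minimal sketch of steps 1–3, assuming angular LSH via random projections (the hashing scheme described in the Reformer paper). The function names are illustrative, and this toy version uses a single hash round with a dense loop over buckets; the paper's implementation additionally sorts tokens by bucket, processes them in fixed-size chunks, and hashes multiple times to reduce the chance of missing a relevant key.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Angular LSH: hash unit vectors with a random projection (single round)."""
    d = x.shape[-1]
    proj = rng.normal(size=(d, n_buckets // 2))        # random projection directions
    rotated = x @ proj                                 # (seq_len, n_buckets // 2)
    # Concatenate with the negated projections and take the argmax, so that
    # vectors pointing in similar directions tend to get the same bucket id.
    return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

def lsh_attention(x, w_qk, w_v, n_buckets=16, rng=None):
    """Toy single-head LSH attention: shared Q = K, attention only within buckets."""
    if rng is None:
        rng = np.random.default_rng(0)
    qk = x @ w_qk                                      # shared query/key projection
    qk = qk / np.linalg.norm(qk, axis=-1, keepdims=True)   # unit-normalize
    v = x @ w_v
    buckets = lsh_buckets(qk, n_buckets, rng)

    out = np.zeros_like(v)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]                # tokens sharing this bucket
        scores = qk[idx] @ qk[idx].T / np.sqrt(qk.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ v[idx]                    # attention restricted to the bucket
    return out

# Example: 128 tokens, model dimension 32.
rng = np.random.default_rng(0)
x = rng.normal(size=(128, 32))
w_qk, w_v = rng.normal(size=(32, 32)), rng.normal(size=(32, 32))
print(lsh_attention(x, w_qk, w_v).shape)  # (128, 32)
```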

2. Reversible Layers

Eliminates the need to store per-layer activations for backpropagation, because each layer's inputs can be recomputed exactly from its outputs during the backward pass, so memory no longer grows with the number of layers:
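A minimal sketch of the reversible coupling (in the RevNet style that the Reformer adapts), with F and G standing in for the attention and feed-forward sub-layers; the point is only that the block's inputs can be recovered exactly from its outputs, so they need not be stored:

```python
import numpy as np

# Toy sub-layers standing in for attention (F) and feed-forward (G); any
# deterministic functions work, since invertibility comes from the coupling
# structure, not from the sub-layers themselves.
F = lambda x: np.tanh(x)
G = lambda x: np.maximum(x, 0.0)

def rev_forward(x1, x2):
    """Reversible residual block: activations are split into two halves."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_recompute_inputs(y1, y2):
    """Recover the inputs from the outputs during the backward pass."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.default_rng(0).normal(size=(2, 4, 8))
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_recompute_inputs(y1, y2)
print(np.allclose(r1, x1) and np.allclose(r2, x2))  # True
```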

Standard Transformer: