Surveys
Possibly of interest
Mamba, 2023
State space models
Memorizing transformers, 2022
With the fixed context size of standard transformers, you must chunk up attention when dealing with long docs:
Instead, allow attending over the full history (past chunks), but with a sparser sort of attention, using kNN search that's O(log n) per query for a total of O(n log n)
On first chunk, start stashing memory, but runs identically to standard attention
On next chunk, keep stashing, but now can attend over first chunk
Perform top-k attention over memory
Simply use a sigmoid gate (per attn head) to interpolate between context and memory. [In the paper the gate is just a learned per-head scalar bias passed through a sigmoid; it's not a function of the input token.]
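The lookup-plus-gate step could be sketched roughly like this (a numpy toy for a single query and head; `gated_memory_attention` and `gate_bias` are my names, not the paper's):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_memory_attention(q, local_k, local_v, mem_k, mem_v, gate_bias, top_k=4):
    """One query attending over the local context plus a top-k subset of memory.

    q: (d,) query; local_k/local_v: (n, d); mem_k/mem_v: (m, d) cached from
    past chunks. gate_bias: learned per-head scalar; sigmoid(gate_bias)
    mixes memory attention vs. local attention.
    """
    # Standard attention over the current context window.
    local_out = softmax(local_k @ q) @ local_v
    # kNN lookup: keep only the top_k highest-scoring memory entries,
    # then do ordinary attention over just those.
    mem_scores = mem_k @ q
    idx = np.argsort(mem_scores)[-top_k:]
    mem_out = softmax(mem_scores[idx]) @ mem_v[idx]
    # Per-head sigmoid gate interpolates the two results.
    g = 1.0 / (1.0 + np.exp(-gate_bias))
    return g * mem_out + (1.0 - g) * local_out
```

A very negative `gate_bias` recovers plain local attention; a very positive one attends only to retrieved memory.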
Overall architecture
Their initial TPU impl (bounded by 16GB/s mem bandwidth) handles 500k tokens. Next version will be “much larger”, can use external stores/web scale data/etc.
Improves perf—here, context window is 500 tokens
You can just add memory as a fine-tuning step! (Or upgrade to a larger mem)
State space models
Transformers as search index
$\infty$-former, 2022
Idea: compress each dimension of the token embeddings across all tokens with curves (they use a bunch of RBFs, but imagine it's a degree-k polynomial). I.e. use a fixed number of params (k+1 polynomial coefficients) to capture, say, the 3rd embedding dimension across all tokens (thus any number of tokens)
Can also re-encode [historical encoding ++ new data]—this starts feeling much more like an RNN/LSTM
Also need updated attention that attends over this. Augment your ("current") discrete tokens with tokens from your RBF history. (Note your attention is now non-square, and has different input vs output sizes.) This gives you keys. For values, integrate over Gaussian regions.
“Sticky memory” is simply adjusting the resampling from the historical re-encoding + concatenation. Instead of uniform, oversample the areas that have had a lot of attention (a lot of the Gaussians).
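A toy version of the compression idea (my sketch, not the paper's code; function and parameter names like `compress` and `num_basis` are made up): fit RBF coefficients by least squares so that any number of tokens becomes a fixed `num_basis` coefficients per embedding dimension, queryable at any position in [0, 1].

```python
import numpy as np

def rbf_basis(t, centers, width=0.05):
    # Gaussian RBFs evaluated at positions t in [0, 1].
    t = np.atleast_1d(t)
    return np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2 * width**2))

def compress(x, num_basis=16, width=0.05):
    """Fit a fixed-size continuous approximation to an (n, d) sequence.

    Each embedding dimension becomes a curve over [0, 1]; however many
    tokens n we have, we keep only num_basis coefficients per dimension.
    """
    n = x.shape[0]
    t = np.linspace(0, 1, n)
    centers = np.linspace(0, 1, num_basis)
    B = rbf_basis(t, centers, width)            # (n, num_basis)
    coeffs, *_ = np.linalg.lstsq(B, x, rcond=None)  # (num_basis, d)
    return centers, coeffs

def evaluate(t_query, centers, coeffs, width=0.05):
    # Reconstruct (approximate) embeddings at arbitrary positions in [0, 1].
    return rbf_basis(t_query, centers, width) @ coeffs
```

Re-encoding [historical encoding ++ new data] then amounts to sampling the old curve, concatenating the new tokens, and fitting again.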
Longformer, Allen AI 2021
Introduces sliding window attention and sparse global attention
Global “pins” special tokens like [CLS] and [SEP]—anyone else can read this and it can read from anyone else. CLS for instance is the special reserved token for the class label output from a classification transformer.
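Sliding window plus global tokens is just a sparse attention mask; a minimal sketch (`longformer_mask` is my name, not the paper's):

```python
import numpy as np

def longformer_mask(n, window=2, global_idx=(0,)):
    """Boolean attention mask: True where query i may attend to key j.

    window: each token attends to neighbors within +/- window positions.
    global_idx: "pinned" positions (e.g. [CLS]) that attend everywhere
    and are attended to by every other position.
    """
    i = np.arange(n)
    mask = np.abs(i[:, None] - i[None, :]) <= window  # sliding window band
    for g in global_idx:
        mask[g, :] = True   # global token reads from all positions
        mask[:, g] = True   # all positions read from the global token
    return mask
```

Cost is O(n * window) plus O(n) per global token, instead of O(n^2).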
Sparse transformer, 2019