Surveys
Possibly of interest
Byte Latent Transformers, Meta 2024 (blog)
How BLT (Byte Latent Transformer) works, focusing on its key innovation of dynamic patching based on entropy: a small byte-level LM scores how predictable each next byte is, and patch boundaries are placed where that entropy spikes.
The paper walks through the sequence "Daenerys Targaryen is in Game of Thrones, a fantasy epic by George R.R. Martin." Roughly: hard-to-predict bytes (like the start of a rare name) open new, short patches, while highly predictable continuations are absorbed into long ones.
The key insight is that BLT dynamically adjusts its processing granularity based on the complexity of the text, unlike fixed tokenizers that always process predefined units regardless of their predictability. This allows it to be both efficient (using larger patches for predictable sequences) and precise (breaking into smaller units when needed for complex predictions).
This approach helps solve issues with traditional tokenizers while maintaining computational efficiency - it can handle unusual inputs robustly since it works at the byte level, but still processes predictable sequences efficiently by grouping them into larger patches.
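A minimal sketch of the entropy-threshold patching rule, assuming a small byte-level LM is available as a `next_byte_probs(prefix)` callable returning a {byte: probability} dict; that callable, the threshold value, and the greedy boundary rule are illustrative simplifications, not the paper's exact implementation.

```python
import math

def byte_entropies(text: bytes, next_byte_probs) -> list[float]:
    """Shannon entropy (bits) of the model's next-byte distribution at each
    position. `next_byte_probs(prefix)` is an assumed interface to a small
    byte-level LM and returns {byte_value: probability}."""
    ents = []
    for i in range(len(text)):
        probs = next_byte_probs(text[:i])
        ents.append(-sum(p * math.log2(p) for p in probs.values() if p > 0))
    return ents

def entropy_patches(text: bytes, next_byte_probs, threshold: float = 2.0) -> list[bytes]:
    """Greedy entropy-based patching: start a new patch whenever the
    next-byte entropy exceeds `threshold`, so predictable spans collapse
    into long patches and surprising bytes open fresh, short ones."""
    ents = byte_entropies(text, next_byte_probs)
    patches, start = [], 0
    for i, h in enumerate(ents):
        if i > start and h > threshold:
            patches.append(text[start:i])
            start = i
    patches.append(text[start:])
    return patches
```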
Mamba, 2023
State space models
Memorizing transformers, 2022
With fixed context size of standard transformers, must chunk up attention when dealing with long docs:
Instead, allow attending over the full history (past chunks), but with a sparser form of attention: approximate kNN search over the stashed (key, value) pairs, O(log n) per query for O(n log n) total
On first chunk, start stashing memory, but runs identically to standard attention
On next chunk, keep stashing, but now can attend over first chunk
Perform top-k attention over memory
Simply use a sigmoid gate (per attention head) to mix context attention and memory attention; see the sketch after this list. [In the paper the gate is a learned per-head scalar bias, not a function of the current token.]
Overall architecture
Their initial TPU impl (bounded by 16GB/s mem bandwidth) handles 500k tokens. Next version will be “much larger”, can use external stores/web scale data/etc.
Improves perf—here, context window is 500 tokens
You can just add memory as a fine-tuning step! (Or upgrade to a larger memory)
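Rough numpy sketch of one head of the gated local + kNN-memory attention described above; the exact dot-product top-k here stands in for the approximate kNN index, and names like `gate_bias` and the normalization details are my own simplifications rather than the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_memory_attention(q, local_k, local_v, mem_k, mem_v, gate_bias, top_k=32):
    """One head of gated memory attention (a sketch):
      - standard attention over the current chunk (local_k / local_v),
      - top-k attention over the cached memory (mem_k / mem_v),
      - a per-head sigmoid gate (learned scalar bias) mixing the two.
    Shapes: q, local_k, local_v: (T, d); mem_k, mem_v: (M, d)."""
    d = q.shape[-1]

    # Local (in-context) attention over the current chunk.
    local_out = softmax(q @ local_k.T / np.sqrt(d)) @ local_v

    # Retrieve top-k memory entries per query (exact search stands in for kNN).
    scores = q @ mem_k.T / np.sqrt(d)                      # (T, M)
    idx = np.argsort(-scores, axis=-1)[:, :top_k]          # (T, top_k)
    top_scores = np.take_along_axis(scores, idx, axis=-1)  # (T, top_k)
    top_v = mem_v[idx]                                     # (T, top_k, d)
    mem_out = np.einsum('tk,tkd->td', softmax(top_scores), top_v)

    # Per-head gate: a learned scalar bias squashed through a sigmoid.
    g = 1.0 / (1.0 + np.exp(-gate_bias))
    return g * mem_out + (1.0 - g) * local_out
```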
State space models
Transformers as search index
$\infty$-former, 2022
Idea: compress each dimension of the token representations across all tokens into a curve (they use a bank of RBFs, but imagine fitting a degree-k polynomial), i.e. use a fixed number of params (k+1 coefficients) to capture, say, the 3rd coordinate across all tokens (and thus across any number of tokens)
Can also re-encode [historical encoding ++ new data]—this starts feeling much more like an RNN/LSTM
Also need an updated attention that attends over this. Augment your ("current") discrete tokens with tokens sampled from the RBF history. (Note your attention is now non-square, with different input vs. output sizes.) This gives you keys. For values, integrate over Gaussian regions of the continuous signal.
“Sticky memory” is simply adjusting how you resample from the historical re-encoding before concatenation: instead of uniform sampling, oversample the regions that received a lot of attention (a lot of the Gaussian mass).
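A toy numpy sketch of the compression idea: fit a fixed bank of Gaussian RBFs to each hidden dimension by least squares, so the memory's parameter count no longer depends on how many tokens went in. The function names, basis count, and width are illustrative choices, not the paper's exact recipe.

```python
import numpy as np

def rbf_basis(t, centers, width):
    """Gaussian RBF features evaluated at positions t (normalized to [0, 1])."""
    return np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

def compress_memory(X, n_basis=16, width=0.05):
    """Treat each hidden dimension of X (shape: n_tokens x d) as a 1-D signal
    over normalized time and fit it with a fixed number of RBFs by least
    squares. The returned coefficients have shape (n_basis, d) regardless of
    how many tokens went in."""
    n, _ = X.shape
    t = np.linspace(0.0, 1.0, n)
    centers = np.linspace(0.0, 1.0, n_basis)
    B = rbf_basis(t, centers, width)                 # (n, n_basis)
    coeffs, *_ = np.linalg.lstsq(B, X, rcond=None)   # (n_basis, d)
    return coeffs, centers, width

def read_memory(coeffs, centers, width, query_t):
    """Evaluate the continuous memory at arbitrary positions in [0, 1], e.g.
    to resample it (uniformly, or oversampling high-attention regions for
    sticky memory) before concatenating with new tokens and re-encoding."""
    B = rbf_basis(np.asarray(query_t, dtype=float), centers, width)
    return B @ coeffs                                # (len(query_t), d)
```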
Longformer, Allen AI 2020
Introduces sliding window attention, sparse global attention
Global attention “pins” special tokens like [CLS] and [SEP]: every other token can attend to them, and they can attend to every other token. [CLS], for instance, is the reserved token whose output a classification transformer uses for the class label.
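A small sketch of the resulting attention pattern, written as a plain boolean mask rather than the efficient banded implementation; `longformer_mask` and its parameters are made-up names for illustration.

```python
import numpy as np

def longformer_mask(seq_len, window, global_positions):
    """Boolean attention mask: True means query i may attend to key j.
    Each token sees a +/- window//2 local neighborhood; tokens at
    `global_positions` (e.g. [CLS], [SEP]) see everyone and are seen by everyone."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = np.abs(i - j) <= window // 2      # sliding-window attention
    for g in global_positions:
        mask[g, :] = True                    # global token reads every position
        mask[:, g] = True                    # every position reads the global token
    return mask

# e.g. longformer_mask(seq_len=512, window=64, global_positions=[0])
# gives the token at position 0 (say, [CLS]) full bidirectional access.
```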
Sparse transformer, 2019