- Surveys
- Possibly of interest
- DenseFormers: add skip connections from all internal representations, not just x0 (toy sketch of this and Hyper-Connections after this list)
- Universal transformers, 2019
- Memformer
- Transformer-XL, 2019 (cf. Recurrent Memory Transformer below)
- RWKV
- H3 (Hungry Hungry Hippos)
- Multiple independent sample heads
- State space models
- Two views of SSMs: "linear RNN" or long convolution (equivalence sketched after this list)
- Recurrent Memory Transformer, 2022
- mHC: Manifold-Constrained Hyper-Connections, DeepSeek 2025
- BigMac, 2025
- Before the MoE MLP, do a single global down-projection, then the all-to-all comms on the narrowed activations, then the up-projection, instead of doing comms first with each expert running its own down/up projections. Can double the MLP size while staying iso-FLOPs (shapes sketched after this list).
- Native Sparse Attention, DeepSeek 2025
- DeepSeekMoE, 2024
- Hyper-Connections, 2024
- residual stream = learned combination of all prior layers' outputs (toy sketch after this list)
- Focused Transformer (FoT), 2023
- Makes the Memorizing Transformer work better: instead of incorporating frozen retrieved KVs, incorporate KV chunks from irrelevant docs as negatives, forcing contrastive training by backpropping through them (toy sketch after this list)
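
A minimal numpy check of the two-views SSM note above, under my own toy dimensions (nothing here is from a specific paper): the linear-RNN scan and the long convolution with kernel k_j = C A^j B produce identical outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, T = 4, 10
A = rng.normal(size=(d_state, d_state)) * 0.3   # small weights keep the recurrence stable
B = rng.normal(size=(d_state, 1))
C = rng.normal(size=(1, d_state))
u = rng.normal(size=T)

# View 1: "linear RNN" -- a sequential scan over x_t = A x_{t-1} + B u_t, y_t = C x_t
x = np.zeros((d_state, 1))
y_rnn = []
for t in range(T):
    x = A @ x + B * u[t]
    y_rnn.append((C @ x).item())

# View 2: long convolution -- unroll the recurrence into kernel k_j = C A^j B
kernel = [(C @ np.linalg.matrix_power(A, j) @ B).item() for j in range(T)]
y_conv = [sum(kernel[j] * u[t - j] for j in range(t + 1)) for t in range(T)]

assert np.allclose(y_rnn, y_conv)  # same outputs, two computation orders
```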
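A shapes-only sketch of the BigMac note above; the names (`all_to_all`, `W_in`, `W_out`) and dimensions are my own illustration, not the paper's code. The point is only where the down-projection sits relative to the comms.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_small, d_ff, n_tok = 1024, 256, 4096, 8
tokens = rng.normal(size=(n_tok, d_model))

def all_to_all(x):
    # stand-in for the cross-device expert dispatch; comms cost scales with x.size
    print(f"floats on the wire: {x.size}")
    return x

# Standard MoE (single expert shown): comms happen at full d_model width,
# then the expert applies its own projections.
W_up = rng.normal(size=(d_model, d_ff))
W_down = rng.normal(size=(d_ff, d_model))
h = all_to_all(tokens)                 # 8 * 1024 floats on the wire
out_standard = np.maximum(h @ W_up, 0.0) @ W_down

# BigMac-style rearrangement per the note: global down-projection first,
# so the all-to-all moves much narrower activations, then up-project.
W_in = rng.normal(size=(d_model, d_small))
W_out = rng.normal(size=(d_small, d_model))
h = all_to_all(tokens @ W_in)          # 8 * 256 floats on the wire, 4x less comms
out_bigmac = np.maximum(h, 0.0) @ W_out
# The comms/FLOPs saved are what lets the MLP be made wider at iso-FLOPs.
```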
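A toy sketch of the DenseFormers / Hyper-Connections items above, in my own minimal formulation: the residual stream after each block is a learned weighted combination of x0 and all prior block outputs, rather than a plain running sum. The `alpha` weights and the tanh block are stand-ins for the learned pieces.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 16, 4
blocks = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_layers)]
# alpha[i, j]: weight that the stream after block i puts on output j (learned in practice)
alpha = np.tril(rng.normal(size=(n_layers + 1, n_layers + 1)))

x0 = rng.normal(size=d)
outs = [x0]          # x0 plus every block output produced so far
stream = x0
for i, W in enumerate(blocks):
    h = np.tanh(W @ stream)          # stand-in for an attention/MLP block
    outs.append(h)
    # residual stream = learned combination of ALL prior outputs,
    # not just `stream + h` as in a vanilla transformer
    stream = sum(alpha[i + 1, j] * o for j, o in enumerate(outs))
```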
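A toy sketch of the FoT note above; this is my own construction, not the paper's training code. Keys from irrelevant docs share one softmax with the current context's keys, and because they are not frozen, the usual LM loss backprops through them and pushes queries away from the distractors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_cur, n_neg = 8, 6, 10
q = rng.normal(size=d)
k_cur = rng.normal(size=(n_cur, d))   # keys from the current, relevant context
k_neg = rng.normal(size=(n_neg, d))   # KV chunks from irrelevant docs (negatives)

# Positives and negatives in one softmax; unlike the Memorizing Transformer's
# frozen retrieved KVs, gradients flow into k_neg, sharpening the separation.
scores = np.concatenate([k_cur, k_neg]) @ q / np.sqrt(d)
attn = np.exp(scores - scores.max())
attn /= attn.sum()
print(f"attention mass on relevant keys: {attn[:n_cur].sum():.2f}")
```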
- Byte Latent Transformers, Meta 2024
My summary: the key idea is to define “tokens” dynamically. Patches are groups of bytes: runs of low-entropy bytes that end when the next byte's entropy spikes, as determined by a small byte-level LM. Once you have these patches, you still need their embeddings/encodings; this is done by a “Local Encoder” that is just N cross-attention layers over the byte-level embeddings.
Let's analyze how BLT processes the example shown in the paper for the sequence "Daenerys Targaryen is in Game of Thrones, a fantasy epic by George R.R. Martin."
Here's how it works, step by step (toy sketch below):
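
A minimal sketch of the patching step, with loud assumptions: `entropy_of_next_byte` is a made-up stand-in for the paper's small byte-level LM and the threshold is arbitrary; only the control flow follows the summary above.

```python
text = b"Daenerys Targaryen is in Game of Thrones, a fantasy epic by George R.R. Martin."

def entropy_of_next_byte(prefix: bytes) -> float:
    # Made-up stand-in for the small byte-level LM. Crude heuristic for
    # illustration only: the next byte is unpredictable (high entropy) at the
    # start of the text or right after a space, predictable mid-word.
    return 4.0 if not prefix or prefix.endswith(b" ") else 1.0

THRESHOLD = 2.0  # arbitrary; in the paper this is tuned against the real LM's entropies

patches, current = [], bytearray()
for i in range(len(text)):
    # Step 1: score how surprising byte i is given everything before it
    h = entropy_of_next_byte(text[:i])
    # Step 2: an entropy spike closes the current patch and starts a new one
    if h > THRESHOLD and current:
        patches.append(bytes(current))
        current = bytearray()
    current.append(text[i])
patches.append(bytes(current))

print([p.decode() for p in patches])
# -> ['Daenerys ', 'Targaryen ', 'is ', 'in ', 'Game ', 'of ', ...]
# Step 3: the Local Encoder cross-attends over the byte embeddings inside each
# patch to produce one embedding per patch for the global model.
```

With the real entropy model, predictable continuations like the “aenerys” after “D” form long low-entropy runs, so patch sizes adapt to the data instead of coming from a fixed tokenizer.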