Byte Latent Transformer (BLT), Meta 2024
My summary: key idea is to define “tokens” dynamically. Define patches as groups of bytes: each patch starts at a high-entropy byte and runs through the following low-entropy bytes, with entropy estimated by a small byte-level LM. Once you have these patches, you still need to figure out their embeddings/encoding. This is done by a “Local Encoder” that is essentially N cross-attention layers pooling the byte-level embeddings into patch representations (the small byte-level LM is only used to pick patch boundaries; the Local Encoder embeds the bytes itself).
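To make the patching rule concrete, here is a minimal sketch (mine, not from the paper), assuming the small byte-level LM has already produced one next-byte entropy per byte and that a single global threshold is used; the entropy values below are made up for illustration.

```python
def entropy_patches(byte_entropies, threshold):
    """Start a new patch at every byte whose next-byte entropy exceeds the
    threshold; otherwise the byte extends the current patch.
    Returns (start, end) index pairs over the byte sequence."""
    boundaries = [0]
    for i, h in enumerate(byte_entropies):
        if i > 0 and h > threshold:
            boundaries.append(i)
    boundaries.append(len(byte_entropies))
    return list(zip(boundaries[:-1], boundaries[1:]))

# Hypothetical entropies for the six bytes of "George": the "G" is hard to
# predict, the byte after it is still uncertain, and the rest of the name is easy.
entropies = [3.1, 1.8, 0.3, 0.2, 0.2, 0.3]
print(entropy_patches(entropies, threshold=1.5))  # [(0, 1), (1, 6)] -> "G" | "eorge"
```

If I remember right, the paper also has a relative variant that starts a patch when a byte's entropy jumps relative to the previous byte's, instead of using one global threshold.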
Let's analyze how BLT processes the example shown in the paper for the sequence "Daenerys Targaryen is in Game of Thrones, a fantasy epic by George R.R. Martin."
Here's how it works step by step:
- Initial Byte Processing
  - The sequence is processed as raw bytes rather than as predefined tokens
- Entropy-Based Patching
  - A small byte-level language model estimates the entropy (uncertainty) of predicting each next byte
  - The model identifies high-entropy points where the next byte is hard to predict, for example:
    - The "G" in "George" has high entropy because it's the start of a new name
    - The "e" later in the name also shows high entropy
  - Bytes whose entropy exceeds a threshold become patch boundaries (this is the rule in the sketch above)
- Dynamic Patch Creation
  - Between these boundaries, the bytes are grouped into patches of varying sizes:
    - "G" becomes a single-byte patch because it's a high-uncertainty prediction
    - "eorge" becomes a larger patch since the following characters have low entropy
  - This creates efficient, context-aware groupings rather than tokens drawn from a fixed vocabulary
- Patch Representation
  - Each entropy-based patch needs a single representation that the Latent Transformer can consume
  - Cross-attention pools the byte representations inside a patch into that patch representation
  - The cross-attention is part of the Local Encoder module (see the encoder sketch after this list)
- Patch Processing
  - The (large) Latent Transformer processes these patch representations as its sequence elements
  - More compute is allocated to difficult predictions (like the start of names), which end up in their own small patches
  - Less compute is spent on predictable stretches, which get folded into long patches
- Byte-Level Decoding
  - The Local Decoder maps the patch representations back to byte-level predictions (see the decoder sketch after this list)
  - This keeps the input and output at byte-level precision while most of the compute runs at the patch level
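For the Patch Representation step, here is a rough PyTorch sketch of the pooling idea: one query per patch cross-attends over the byte embeddings belonging to that patch. The class name, the single attention layer, and using the mean of the patch's bytes as the query are my simplifications; the actual Local Encoder stacks several such layers and may initialize the patch queries differently.

```python
import torch
import torch.nn as nn

class LocalEncoderSketch(nn.Module):
    """Toy version of the Local Encoder's pooling step: each patch's query
    cross-attends over the byte embeddings that belong to that patch."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, byte_embs, patches):
        # byte_embs: (seq_len, d_model) byte representations
        # patches:   list of (start, end) pairs from the entropy patcher
        patch_reprs = []
        for start, end in patches:
            chunk = byte_embs[start:end].unsqueeze(0)         # (1, patch_len, d)
            query = chunk.mean(dim=1, keepdim=True)           # (1, 1, d), my choice of query init
            pooled, _ = self.cross_attn(query, chunk, chunk)  # patch query reads its own bytes
            patch_reprs.append(pooled.squeeze(0))
        return torch.cat(patch_reprs, dim=0)                  # (n_patches, d_model)

byte_embs = torch.randn(6, 64)            # e.g. the six bytes of "George"
patches = [(0, 1), (1, 6)]                # "G" | "eorge" from the entropy patcher
print(LocalEncoderSketch()(byte_embs, patches).shape)  # torch.Size([2, 64])
```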
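For the Byte-Level Decoding step, the mirror image: byte positions cross-attend over the patch representations coming out of the Latent Transformer, and a linear head scores the 256 possible next bytes. Again my simplification: the real Local Decoder stacks several layers and masks the cross-attention so a byte only sees patches up to its own position.

```python
import torch
import torch.nn as nn

class LocalDecoderSketch(nn.Module):
    """Toy version of the Local Decoder: each byte position attends over the
    patch representations, then a linear head predicts the next byte (0-255)."""

    def __init__(self, d_model=64, n_heads=4, n_byte_values=256):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.next_byte_head = nn.Linear(d_model, n_byte_values)

    def forward(self, byte_embs, patch_reprs):
        # byte_embs:   (seq_len, d_model)   byte-level hidden states
        # patch_reprs: (n_patches, d_model) output of the Latent Transformer
        q = byte_embs.unsqueeze(0)                    # (1, seq_len, d)
        kv = patch_reprs.unsqueeze(0)                 # (1, n_patches, d)
        mixed, _ = self.cross_attn(q, kv, kv)         # bytes read from patches
        return self.next_byte_head(mixed.squeeze(0))  # (seq_len, 256) next-byte logits

# Toy shapes matching the encoder sketch above.
byte_embs = torch.randn(6, 64)
patch_reprs = torch.randn(2, 64)
print(LocalDecoderSketch()(byte_embs, patch_reprs).shape)  # torch.Size([6, 256])
```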