Byte Latent Transformers, Meta 2024

My summary: the key idea is to define “tokens” dynamically. A patch is a group of bytes that begins at a high-entropy byte (one a small byte-level LM finds hard to predict) and continues through the following run of low-entropy bytes. Once you have these patches, you still need to produce an embedding for each one. That is the job of the “Local Encoder”: N cross-attention layers that pool the byte-level embeddings inside a patch into a single patch representation (the small byte-level LM is only used to place the patch boundaries; the byte embeddings themselves are learned by the Local Encoder).
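
Here is a rough sketch of what the global-threshold patching could look like in code. `SmallByteLM` (a GRU stand-in, untrained here) and the threshold value are my placeholders, not the paper's actual entropy model, so the boundaries this prints are arbitrary; the point is only the mechanism: start a new patch wherever the next-byte entropy spikes.

```python
"""Sketch of BLT-style entropy-based patching (global-threshold variant).
SmallByteLM and the threshold are illustrative stand-ins, not the paper's model."""
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallByteLM(nn.Module):
    """Placeholder byte-level LM: logits over the next byte at every position."""

    def __init__(self, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(256, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)  # causal by construction
        self.head = nn.Linear(d_model, 256)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:  # (B, T) -> (B, T, 256)
        h, _ = self.rnn(self.embed(byte_ids))
        return self.head(h)


def next_byte_entropies(model: SmallByteLM, byte_ids: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the model's next-byte distribution at each position."""
    with torch.no_grad():
        logits = model(byte_ids.unsqueeze(0)).squeeze(0)  # (T, 256)
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)               # (T,)


def split_into_patches(text: str, model: SmallByteLM, threshold: float = 2.0) -> list[bytes]:
    """Start a new patch at byte i+1 whenever the entropy of predicting it exceeds the threshold."""
    data = text.encode("utf-8")
    ids = torch.tensor(list(data), dtype=torch.long)
    ent = next_byte_entropies(model, ids)                 # ent[i] scores byte i+1
    starts = [0] + [i + 1 for i in range(len(data) - 1) if ent[i] > threshold]
    ends = starts[1:] + [len(data)]
    return [data[s:e] for s, e in zip(starts, ends)]


patches = split_into_patches(
    "Daenerys Targaryen is in Game of Thrones, a fantasy epic by George R.R. Martin.",
    SmallByteLM())
print([p.decode("utf-8", errors="replace") for p in patches])
```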

Let's walk through how BLT processes the example sequence from the paper: "Daenerys Targaryen is in Game of Thrones, a fantasy epic by George R.R. Martin."

Here's how it works step by step (a code sketch of the encoding steps follows the list):

  1. Initial Byte Processing
  2. Entropy-Based Patching
  3. Dynamic Patch Creation
  4. Patch Representation
  5. Patch Processing
  6. Byte-Level Decoding
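
To make steps 4–5 concrete, here is a sketch of a Local-Encoder-style pooling: one query per patch cross-attends over that patch's byte embeddings, and the resulting patch vectors feed a global transformer block. The class names, layer sizes, mean-pooled query initialization, and the example patch boundaries are all my illustrative assumptions; the paper's Local Encoder also mixes in byte n-gram hash embeddings and more machinery.

```python
"""Sketch of steps 4-5: cross-attention pooling of bytes into patch representations,
then a global transformer over the patches. Names and sizes are illustrative."""
import torch
import torch.nn as nn


class LocalEncoder(nn.Module):
    """One query per patch attends over that patch's byte embeddings."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.byte_embed = nn.Embedding(256, d_model)
        self.xattn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)])

    def forward(self, patches: list[bytes]) -> torch.Tensor:  # -> (num_patches, d_model)
        reps = []
        for patch in patches:
            byte_ids = torch.tensor(list(patch), dtype=torch.long)
            byte_emb = self.byte_embed(byte_ids).unsqueeze(0)   # (1, patch_len, d)
            query = byte_emb.mean(dim=1, keepdim=True)          # (1, 1, d) pooled init
            for layer in self.xattn:
                attended, _ = layer(query, byte_emb, byte_emb)  # query attends to bytes
                query = query + attended                        # residual update
            reps.append(query.squeeze(0).squeeze(0))
        return torch.stack(reps)


# Patch representations then feed the large "latent" transformer, which does the
# heavy lifting over far fewer positions than a byte-level model would need.
global_block = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)

# Hypothetical patch boundaries, just to check shapes end to end.
patches = [b"D", b"aenerys ", b"T", b"argaryen ", b"is in ", b"Game of Thrones, "]
patch_reps = LocalEncoder()(patches)                  # (6, 256)
contextualized = global_block(patch_reps.unsqueeze(0))  # (1, 6, 256)
print(contextualized.shape)
```

As I understand it, step 6 runs this pattern in reverse: the Local Decoder's byte-level queries cross-attend to the patch representations coming out of the global transformer in order to predict the next byte.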