- Surveys
- Possibly of interest
- DenseFormers: add skip connections from all internal representations, not just x0 (toy sketch of this and Hyper-Connections after this list)
- Universal transformers, 2019
- Memformer
- Transformer-XL, 2019 (cf. Recurrent Memory Transformer below)
- RWKV
- H3 (Hungry Hungry Hippos)
- Multiple independent sample heads
- State space models
- Two views of SSMs: "linear RNN" or long convolution (equivalence sketched after this list)
- Recurrent Memory Transformer, 2022
- mHC: Manifold-Constrained Hyper-Connections, DeepSeek 2025
- BigMac, 2025
- Before the MoE MLP, do a single global down-projection, then the all-to-all comms on the narrowed activations, then the up-projection, instead of doing comms first with each expert running its own down/up projections. Can double the MLP size while staying iso-FLOPs (shapes sketched after this list).
- Native Sparse Attention, DeepSeek 2025
- DeepSeekMoE, 2024
- Hyper-Connections, 2024
- residual stream = learned combination of all prior layers' outputs (toy sketch after this list)
- Focused Transformer (FoT), 2023
- Makes the Memorizing Transformer work better: instead of incorporating frozen retrieved KVs, incorporate KV chunks from irrelevant docs as negatives, forcing contrastive training by backpropping through them (toy sketch after this list)
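
A minimal numpy check of the two-views SSM note above, under my own toy dimensions (nothing here is from a specific paper): the linear-RNN scan and the long convolution with kernel k_j = C A^j B produce identical outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, T = 4, 10
A = rng.normal(size=(d_state, d_state)) * 0.3   # small weights keep the recurrence stable
B = rng.normal(size=(d_state, 1))
C = rng.normal(size=(1, d_state))
u = rng.normal(size=T)

# View 1: "linear RNN" -- a sequential scan over x_t = A x_{t-1} + B u_t, y_t = C x_t
x = np.zeros((d_state, 1))
y_rnn = []
for t in range(T):
    x = A @ x + B * u[t]
    y_rnn.append((C @ x).item())

# View 2: long convolution -- unroll the recurrence into kernel k_j = C A^j B
kernel = [(C @ np.linalg.matrix_power(A, j) @ B).item() for j in range(T)]
y_conv = [sum(kernel[j] * u[t - j] for j in range(t + 1)) for t in range(T)]

assert np.allclose(y_rnn, y_conv)  # same outputs, two computation orders
```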
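A shapes-only sketch of the BigMac note above; the names (`all_to_all`, `W_in`, `W_out`) and dimensions are my own illustration, not the paper's code. The point is only where the down-projection sits relative to the comms.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_small, d_ff, n_tok = 1024, 256, 4096, 8
tokens = rng.normal(size=(n_tok, d_model))

def all_to_all(x):
    # stand-in for the cross-device expert dispatch; comms cost scales with x.size
    print(f"floats on the wire: {x.size}")
    return x

# Standard MoE (single expert shown): comms happen at full d_model width,
# then the expert applies its own projections.
W_up = rng.normal(size=(d_model, d_ff))
W_down = rng.normal(size=(d_ff, d_model))
h = all_to_all(tokens)                 # 8 * 1024 floats on the wire
out_standard = np.maximum(h @ W_up, 0.0) @ W_down

# BigMac-style rearrangement per the note: global down-projection first,
# so the all-to-all moves much narrower activations, then up-project.
W_in = rng.normal(size=(d_model, d_small))
W_out = rng.normal(size=(d_small, d_model))
h = all_to_all(tokens @ W_in)          # 8 * 256 floats on the wire, 4x less comms
out_bigmac = np.maximum(h, 0.0) @ W_out
# The comms/FLOPs saved are what lets the MLP be made wider at iso-FLOPs.
```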
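A toy sketch of the DenseFormers / Hyper-Connections items above, in my own minimal formulation: the residual stream after each block is a learned weighted combination of x0 and all prior block outputs, rather than a plain running sum. The `alpha` weights and the tanh block are stand-ins for the learned pieces.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 16, 4
blocks = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_layers)]
# alpha[i, j]: weight that the stream after block i puts on output j (learned in practice)
alpha = np.tril(rng.normal(size=(n_layers + 1, n_layers + 1)))

x0 = rng.normal(size=d)
outs = [x0]          # x0 plus every block output produced so far
stream = x0
for i, W in enumerate(blocks):
    h = np.tanh(W @ stream)          # stand-in for an attention/MLP block
    outs.append(h)
    # residual stream = learned combination of ALL prior outputs,
    # not just `stream + h` as in a vanilla transformer
    stream = sum(alpha[i + 1, j] * o for j, o in enumerate(outs))
```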
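A toy sketch of the FoT note above; this is my own construction, not the paper's training code. Keys from irrelevant docs share one softmax with the current context's keys, and because they are not frozen, the usual LM loss backprops through them and pushes queries away from the distractors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_cur, n_neg = 8, 6, 10
q = rng.normal(size=d)
k_cur = rng.normal(size=(n_cur, d))   # keys from the current, relevant context
k_neg = rng.normal(size=(n_neg, d))   # KV chunks from irrelevant docs (negatives)

# Positives and negatives in one softmax; unlike the Memorizing Transformer's
# frozen retrieved KVs, gradients flow into k_neg, sharpening the separation.
scores = np.concatenate([k_cur, k_neg]) @ q / np.sqrt(d)
attn = np.exp(scores - scores.max())
attn /= attn.sum()
print(f"attention mass on relevant keys: {attn[:n_cur].sum():.2f}")
```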
- Byte Latent Transformers, Meta 2024
My summary: the key idea is to define “tokens” dynamically. Patches are groups of bytes: runs of low-entropy bytes that end when the next byte's entropy spikes, as determined by a small byte-level LM. Once you have these patches, you still need their embeddings/encodings; this is done by a “Local Encoder” that is just N cross-attention layers over the byte-level embeddings.
Let's analyze how BLT processes the example shown in the paper for the sequence "Daenerys Targaryen is in Game of Thrones, a fantasy epic by George R.R. Martin."
Here's how it works, step by step (toy sketch below):
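
A minimal sketch of the patching step, with loud assumptions: `entropy_of_next_byte` is a made-up stand-in for the paper's small byte-level LM and the threshold is arbitrary; only the control flow follows the summary above.

```python
text = b"Daenerys Targaryen is in Game of Thrones, a fantasy epic by George R.R. Martin."

def entropy_of_next_byte(prefix: bytes) -> float:
    # Made-up stand-in for the small byte-level LM. Crude heuristic for
    # illustration only: the next byte is unpredictable (high entropy) at the
    # start of the text or right after a space, predictable mid-word.
    return 4.0 if not prefix or prefix.endswith(b" ") else 1.0

THRESHOLD = 2.0  # arbitrary; in the paper this is tuned against the real LM's entropies

patches, current = [], bytearray()
for i in range(len(text)):
    # Step 1: score how surprising byte i is given everything before it
    h = entropy_of_next_byte(text[:i])
    # Step 2: an entropy spike closes the current patch and starts a new one
    if h > THRESHOLD and current:
        patches.append(bytes(current))
        current = bytearray()
    current.append(text[i])
patches.append(bytes(current))

print([p.decode() for p in patches])
# -> ['Daenerys ', 'Targaryen ', 'is ', 'in ', 'Game ', 'of ', ...]
# Step 3: the Local Encoder cross-attends over the byte embeddings inside each
# patch to produce one embedding per patch for the global model.
```

With the real entropy model, predictable continuations like the “aenerys” after “D” form long low-entropy runs, so patch sizes adapt to the data instead of coming from a fixed tokenizer.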