• This page covers research-oriented and “noteworthy”/“landmark” LLMs. See also More LLMs

  • References

    • Stella’s model list https://docs.google.com/spreadsheets/d/1gc6yse74XCwBx028HV_cvdxwXkmXejVjkO-Mz2uwE0k/edit#gid=0
  • Foundations

    • BLEU score (see math notes): measures n-gram precision of generated text against reference text (geometric mean of the 1–4-gram precisions), with a brevity penalty for overly short outputs; see the sketch below
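
    A minimal sentence-level BLEU sketch (uniform 1–4-gram weights, no smoothing); real implementations such as sacrebleu add corpus-level statistics and smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions, times a brevity penalty
    that punishes candidates shorter than the reference."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clipped matches
        if overlap == 0:
            return 0.0  # unsmoothed: an empty n-gram overlap zeroes the score
        log_precisions.append(math.log(overlap / sum(cand.values())))
    geo_mean = math.exp(sum(log_precisions) / max_n)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * geo_mean

print(bleu("the cat sat on the mat".split(), "the cat sat on a mat".split()))  # ~0.54
```
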
  • GPT-4.5

    • 8T params
  • GPT 4o

    • 170 tokens per 512x512 image tile (token-cost sketch after this list)
    • 500B params
    • Speculation
      • https://www.oranlooney.com/post/gpt-cnn/
        • representing 512x512 images as 170 embedding vectors, using a CNN architecture that’s a mixture of CLIP and YOLO to embed the image directly into the transformer’s semantic vector space
        • performance on the 5x5 Zener-card task all but confirms that they’re doing some kind of grid-based tiling
        • plus an off-the-shelf OCR prepass
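
    A back-of-envelope token-cost sketch that mirrors the grid speculation above. It assumes the published API pricing rules (85 base tokens plus 170 per 512x512 tile after resizing), which is pricing behavior, not confirmed internal architecture.

```python
import math

def image_token_cost(width: int, height: int) -> int:
    """Approximate 'high detail' image token cost, assuming the published
    pricing rules: 85 base tokens + 170 per 512x512 tile after resizing."""
    # Fit within 2048x2048, then scale the short side down to 768 (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(image_token_cost(1024, 1024))  # 765  (4 tiles)
print(image_token_cost(2048, 4096))  # 1105 (6 tiles)
```
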
  • Gemma

  • OLMo

  • Google Switch Transformer (Switch-C): 2048 experts (1.6T parameters, ~3.1 TB checkpoint)

  • Phi-2: 2.7B

  • Llama 2

  • MPT: 64k context via ALiBi (sketch below) https://news.ycombinator.com/item?id=35910175
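
    A minimal sketch of the ALiBi attention bias MPT uses to reach long contexts: instead of positional embeddings, each head adds a linear penalty proportional to query-to-key distance, so the same bias extends to sequence lengths longer than those seen in pre-training. Toy shapes, PyTorch assumed.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear distance penalty added to attention logits (ALiBi).
    Slopes form a geometric sequence; this formula assumes n_heads is a power of two."""
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]           # (i, j) -> j - i, <= 0 for past keys
    bias = slopes[:, None, None] * distance[None, :, :]
    return bias                                      # (n_heads, seq_len, seq_len)

scores = torch.randn(8, 16, 16)          # hypothetical (heads, queries, keys) logits
scores = scores + alibi_bias(8, 16)      # add before the causal mask and softmax
```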

  • RedPajama-INCITE: Pythia architecture trained on the RedPajama dataset

  • GPT-4

    • Model sizing has been repeated in several places (geohot, Jensen Huang)

    OpenAI's GPT-4 details have apparently been leaked! Looks very detailed and I suspect it's the real deal - given all I know about how these systems work. Here is a summary (extractive+abstractive) I made based on the original thread (see bottom of the post) + some additional pointers from me. (Disclaimer there are probably errors here)

    ---ARCH---

    • GPT-4 is more than 10x the size of GPT-3 (175B). We believe it has a total of ~1.8 trillion parameters across 120 layers. Mixture of Experts (16 experts, each ~111B), not a dense transformer like e.g. PaLM or GPT-3. They use MQA instead of MHA (classic at this point).

    • Each forward pass (generation of 1 token) only utilizes ~280B parameters and ~560 TFLOPs. This contrasts with the ~1.8 trillion parameters and ~3,700 TFLOP that would be required per forward pass of a purely dense model.
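
    A toy sketch of the top-k expert routing behind those numbers: if roughly two of the 16 ~111B experts are routed per token, two experts' MLPs plus the shared attention parameters land near the ~280B active figure. Dimensions below are placeholders; GPT-4's actual router is not public.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy mixture-of-experts FFN: a router picks the top-k of n_experts per token,
    so only k expert MLPs run per forward pass even though all sit in memory."""
    def __init__(self, d_model=64, d_ff=256, n_experts=16, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weights, chosen = gate.topk(self.k, dim=-1)        # per-token expert choices
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

y = TopKMoE()(torch.randn(10, 64))   # each token only touches 2 of the 16 expert MLPs
```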

    ---DISTRIBUTED---

    • To parallelize across all their A100 GPUs they utilized 8-way tensor parallelism. Beyond that, they are using 15-way pipeline parallelism. Also, apparently they used DeepSpeed ZeRO Stage 1 or block-level FSDP.

    (You can check out my video on all of these strategies here: https://youtube.com/watch?v=hc0u4avAkuM… 3D parallelism is what you're looking for, & ZeRO)
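
    Back-of-envelope check on what that topology implies (my arithmetic, not from the thread): each model replica spans 8 x 15 = 120 GPUs, so ~25,000 A100s leaves room for roughly 200 data-parallel replicas.

```python
tensor_parallel   = 8        # each layer's matmuls split across 8 GPUs
pipeline_parallel = 15       # consecutive layer groups placed on 15 GPU groups
total_gpus        = 25_000   # fleet size quoted in the COST section below

gpus_per_replica = tensor_parallel * pipeline_parallel   # 120 GPUs hold one model copy
data_parallel    = total_gpus // gpus_per_replica        # ~208 replicas train in parallel
print(gpus_per_replica, data_parallel)                   # 120 208
```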

    ---VISION---

    They have a separate vision encoder from the text encoder, with cross-attention. The architecture is similar to Google DeepMind's Flamingo (I used to work on this project :) ). This adds more parameters on top of the 1.8T of GPT-4. It is fine-tuned with another ~2 trillion tokens, after the text-only pre-training.
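
    A minimal sketch of the Flamingo-style gated cross-attention described above: text hidden states attend to vision-encoder outputs, with a zero-initialized gate so the visual pathway is blended in gradually during training. Toy dimensions; GPT-4's actual vision stack is not public.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Text hidden states (queries) attend to vision-encoder outputs (keys/values);
    a tanh gate initialized at zero lets training open the visual pathway gradually."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # starts closed, as in Flamingo

    def forward(self, text_states, vision_tokens):
        attended, _ = self.attn(text_states, vision_tokens, vision_tokens)
        return text_states + torch.tanh(self.gate) * attended

text   = torch.randn(1, 32, 64)    # (batch, text tokens, d_model)
vision = torch.randn(1, 170, 64)   # e.g. 170 image tokens per tile, per the GPT-4o note above
print(GatedCrossAttentionBlock()(text, vision).shape)   # torch.Size([1, 32, 64])
```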

    ---DATA---

    • Trained on ~13T tokens (multiple epochs, not unique). Plus millions of rows of instruction fine-tuning data from ScaleAI & internally (I guess acquired through ChatGPT + their API before they changed the policy).

    • 8k context length for the pre-training phase. The 32k seqlen version of GPT-4 is based on fine-tuning of the 8k model after pre-training. See e.g. MosaicML's tutorial on how to achieve this: https://github.com/mosaicml/llm-foundry/blob/main/TUTORIAL.md

    ---COST---

    • OpenAI’s training compute for GPT-4 was ~2.15e25 FLOPs, on ~25,000 A100s for 90 to 100 days at about 32% to 36% MFU. Part of this extremely low utilization is due to an absurd number of failures requiring restarts from checkpoints.

    If their cost in the cloud was about $1 per A100 hour, the training costs for this run alone would be about $63 million.

    (Today, the pre-training could be done with ~8,192 H100 in ~55 days for $21.5 million at $2 per H100 hour)
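
    Sanity-checking the quoted numbers (my arithmetic; assumes an A100 BF16 dense peak of ~312 TFLOP/s, and the raw GPU-hour figure excludes failed-run overhead, so it lands a bit under the thread's ~$63M):

```python
A100_PEAK_FLOPS = 312e12            # assumed BF16 dense peak per A100
gpus, days, mfu = 25_000, 95, 0.34  # midpoints of the quoted ranges

train_flops = gpus * A100_PEAK_FLOPS * mfu * days * 24 * 3600
print(f"{train_flops:.2e}")         # ~2.2e25, consistent with the quoted ~2.15e25

a100_cost = gpus * days * 24 * 1.0  # $1 per A100-hour
print(a100_cost / 1e6)              # ~57 ($M, before failure/restart overhead)

h100_cost = 8_192 * 55 * 24 * 2.0   # $2 per H100-hour
print(h100_cost / 1e6)              # ~21.6 ($M, matching the ~$21.5M figure)
```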

    ---INFERENCE---

    OpenAI might be using speculative decoding on GPT-4's inference. See this paper: https://arxiv.org/abs/2211.17192
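
    A simplified greedy sketch of the idea in that paper: a small draft model proposes a few tokens cheaply, the large target model verifies the whole block in one forward pass, and the longest agreeing prefix is kept. The paper's full algorithm uses rejection sampling to preserve the target's sampling distribution; `target` and `draft` below are placeholder callables returning per-position next-token logits.

```python
import torch

@torch.no_grad()
def speculative_decode_greedy(target, draft, tokens, n_draft=4, n_new=32):
    """Greedy speculative decoding sketch (assumes a non-empty prompt).
    `draft(seq)` / `target(seq)` are placeholders returning a (len(seq), vocab)
    tensor of next-token logits for every prefix of `seq`."""
    while n_new > 0:
        base = len(tokens)
        proposal = list(tokens)
        for _ in range(n_draft):                            # cheap autoregressive drafting
            proposal.append(int(draft(proposal)[-1].argmax()))
        target_logits = target(proposal)                    # one expensive verification pass
        for i in range(n_draft):
            pos = base + i
            best = int(target_logits[pos - 1].argmax())     # target's pick given proposal[:pos]
            if best == proposal[pos]:
                tokens.append(proposal[pos])                # draft and target agree: keep it
            else:
                tokens.append(best)                         # first disagreement: take target's token
                break
        n_new -= len(tokens) - base                         # at least one token accepted per pass
    return tokens
```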

    The inference runs on a cluster of 128 GPUs. There are multiple of these clusters in multiple datacenters in different locations (it'll be hard for Eliezer to nuke these xD). 8-way tensor parallelism and 16-way pipeline parallelism.

    Original thread: https://archive.is/2RQ8X (strictly speaking the original one has been removed).

    • GPT-4 did not scale up substantially per transformer relative to GPT-3, going from 175B (GPT-3) to roughly 220B per expert transformer.
  • GPT-2

  • Alpaca (Stanford) applies Self-Instruct to LLaMA

    • Self-Instruct gets close to InstructGPT quality with almost no human-labeled data (only a small seed task set); demonstrated on GPT-3 https://arxiv.org/abs/2212.10560 (sketch below)
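
    A simplified sketch of the Self-Instruct bootstrapping loop from that paper: prompt the model with a few existing instructions to write new ones, drop near-duplicates, and grow the pool, which is then turned into (instruction, response) pairs for fine-tuning. `generate` is a placeholder LLM call, and difflib stands in for the paper's ROUGE-L filter.

```python
import random
import difflib

def self_instruct(generate, seed_tasks, rounds=3, per_round=8, sim_cutoff=0.7):
    """Grow an instruction pool by bootstrapping from seed tasks.
    `generate(prompt) -> list[str]` is a placeholder for the LLM call."""
    pool = list(seed_tasks)
    for _ in range(rounds):
        examples = "\n".join(random.sample(pool, min(6, len(pool))))
        prompt = (f"Here are some task instructions:\n{examples}\n"
                  f"Write {per_round} new, different task instructions.")
        for candidate in generate(prompt):
            # Keep only instructions sufficiently different from the existing pool
            # (the paper uses ROUGE-L; SequenceMatcher is a stand-in here).
            if all(difflib.SequenceMatcher(None, candidate, t).ratio() < sim_cutoff
                   for t in pool):
                pool.append(candidate)
    # Next (not shown): have the model answer each instruction, then fine-tune
    # the target model (LLaMA, in Alpaca's case) on the resulting pairs.
    return pool
```
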
  • ChatGPT / InstructGPT

    • Everything We Know About ChatGPT So Far - by swyx


  • Pythia

    • 70M–12B params
  • DistilBERT

    • 65M params
  • BERT

    • 110M params, up to 340M
    • 512 tokens
  • T5 or Flan-T5 (instruction tuned)

    • 16M–11B params
    • 2048 tokens
    • https://twitter.com/amanrsanger/status/1589273062637465602?t=FscugAt7aHB0vLD7XLAtgA&s=19
