This page covers research-oriented and "noteworthy"/"landmark" LLMs. See also More LLMs
References
Foundations
Gemma
OLMo
Google Switch Transformer (Switch-C): 2048 experts, 1.6T parameters (~3.1 TB)
Phi-2: 2.7B
Llama 2
MPT: 64k context via ALiBi (https://news.ycombinator.com/item?id=35910175)
RedPajama-INCITE: Pythia architecture trained on the RedPajama dataset
GPT-4
OpenAI's GPT-4 details have apparently been leaked! It looks very detailed, and I suspect it's the real deal, given all I know about how these systems work. Here is a summary (extractive + abstractive) I made based on the original thread (see the bottom of the post), plus some additional pointers from me. (Disclaimer: there are probably errors here.)
---ARCH---
GPT-4 is more than 10x the size of GPT-3 (175B). We believe it has a total of ~1.8 trillion parameters across 120 layers. It is a Mixture of Experts (16 experts, each ~111B), not a dense transformer like e.g. PaLM or GPT-3. They use MQA (multi-query attention) instead of MHA (classic at this point).
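The exact routing scheme isn't public, but as a rough illustration of the MoE idea (only a few experts actually run for each token), here is a toy top-2 router in PyTorch. All sizes, module names, and the top-2 choice are illustrative, not GPT-4's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoEFFN(nn.Module):
    """Toy mixture-of-experts feed-forward layer: each token only runs its top-k experts."""
    def __init__(self, d_model=64, d_ff=256, n_experts=16, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating network scores every expert per token
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):                                # x: (n_tokens, d_model)
        scores = self.router(x)                          # (n_tokens, n_experts)
        top_w, top_idx = scores.topk(self.k, dim=-1)     # keep only the k best experts per token
        top_w = F.softmax(top_w, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            tok, slot = (top_idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if tok.numel():                                    # unchosen experts do no work at all
                out[tok] += top_w[tok, slot].unsqueeze(-1) * expert(x[tok])
        return out

x = torch.randn(8, 64)
print(ToyMoEFFN()(x).shape)   # torch.Size([8, 64])
```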
Each forward pass (generation of 1 token) only utilizes ~280B parameters and ~560 TFLOPs. This contrasts with the ~1.8 trillion parameters and ~3,700 TFLOPs that would be required per forward pass of a purely dense model.
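Quick back-of-the-envelope check (mine, not from the leak): 16 × ~111B ≈ 1.78T matches the ~1.8T total, ~280B active is consistent with roughly 2 experts per token plus ~58B of always-active shared weights (the 2-per-token routing is my assumption, not stated above), and the FLOP ratio ~560/3,700 ≈ 15% lines up with the parameter ratio 280B/1.8T ≈ 16%:

```python
# Internal consistency check of the numbers above (my own arithmetic; the
# "2 experts per token" routing is an assumption, not stated in this summary).
total_experts  = 16 * 111e9               # ~1.78T in expert weights alone, close to the ~1.8T total
active_experts = 2 * 111e9                # if each token is routed to 2 of the 16 experts
shared         = 280e9 - active_experts   # leaves ~58B of always-active (attention/embedding) parameters
flop_ratio     = 560 / 3700               # quoted MoE vs dense FLOPs per forward pass
param_ratio    = 280e9 / 1.8e12           # active vs total parameters
print(f"{total_experts/1e12:.2f}T expert params, ~{shared/1e9:.0f}B shared, "
      f"FLOP ratio {flop_ratio:.0%} vs param ratio {param_ratio:.0%}")
```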
---DISTRIBUTED---
- To parallelize across all their A100 GPUs, they utilized 8-way tensor parallelism. Beyond that, they used 15-way pipeline parallelism. Apparently they also used DeepSpeed ZeRO Stage 1 or block-level FSDP.
(You can check out my video on all of these strategies here: https://youtube.com/watch?v=hc0u4avAkuM… 3D parallelism and ZeRO are what you're looking for.)
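Rough math on how those degrees compose (my arithmetic, not from the leak; the data-parallel count is just inferred from the ~25,000 A100s mentioned in the cost section below):

```python
# How the reported degrees of parallelism compose (my arithmetic, not from the leak).
tensor_parallel   = 8                                   # 8-way tensor parallelism (typically one 8-GPU node)
pipeline_parallel = 15                                  # 15-way pipeline parallelism across nodes
gpus_per_replica  = tensor_parallel * pipeline_parallel # 120 GPUs hold one model replica
data_parallel     = 25_000 // gpus_per_replica          # ~208 replicas across ~25k A100s (ZeRO/FSDP sharded)
print(gpus_per_replica, data_parallel)                  # 120 208
```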
---VISION---
They have a vision encoder separate from the text encoder, with cross-attention. The architecture is similar to Google DeepMind's Flamingo (I used to work on this project :) ). This adds more parameters on top of GPT-4's 1.8T. It is fine-tuned with another ~2 trillion tokens after the text-only pre-training.
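Flamingo's key trick is gated cross-attention from the language model's token stream onto vision-encoder features; here is a minimal PyTorch sketch of just that idea (sizes and module names are illustrative, not GPT-4's actual architecture):

```python
import torch
import torch.nn as nn

class VisionCrossAttentionBlock(nn.Module):
    """Text hidden states attend to vision-encoder features (Flamingo-style gated cross-attention)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh gate starts at 0, so the pre-trained LM is unchanged at init

    def forward(self, text_h, vision_feats):
        # queries come from the text tokens, keys/values from the image features
        attended, _ = self.xattn(self.norm(text_h), vision_feats, vision_feats)
        return text_h + torch.tanh(self.gate) * attended

text_h = torch.randn(2, 16, 512)        # (batch, text tokens, d_model)
vision_feats = torch.randn(2, 64, 512)  # (batch, image patches, d_model)
print(VisionCrossAttentionBlock()(text_h, vision_feats).shape)  # torch.Size([2, 16, 512])
```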
---DATA---
Trained on ~13T tokens (multiple epochs, so not all unique). Plus millions of rows of instruction fine-tuning data from ScaleAI and internal sources (I guess acquired through ChatGPT and their API before they changed the policy).
8k context length for the pre-training phase. The 32k-seqlen version of GPT-4 is based on fine-tuning the 8k model after pre-training. See e.g. MosaicML's tutorial on how to achieve this: https://github.com/mosaicml/llm-foundry/blob/main/TUTORIAL.md
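The MosaicML tutorial linked above handles long context with ALiBi-style position biases (as in MPT, listed at the top of this page); whether OpenAI did anything similar is unknown, but for reference, a minimal sketch of the ALiBi bias that gets added to attention logits:

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear distance penalties added to attention logits (ALiBi)."""
    # Geometric head slopes 2^(-8/n), 2^(-16/n), ... as in the ALiBi paper (power-of-two head counts).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).clamp(min=0)  # how far each key sits behind each query
    return -slopes[:, None, None] * dist               # (n_heads, seq_len, seq_len), added to QK^T scores

print(alibi_bias(8, 4)[0])  # head 0: zero on the diagonal, growing penalty with distance
```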
---COST---
- OpenAI's training compute for GPT-4 was ~2.15e25 FLOPs, on ~25,000 A100s for 90 to 100 days at about 32% to 36% MFU. Part of this extremely low utilization is due to an absurd number of failures requiring restarts from checkpoints.
If their cost in the cloud was about $1 per A100 hour, the training costs for this run alone would be about $63 million.
(Today, the pre-training could be done with ~8,192 H100s in ~55 days for ~$21.5 million at $2 per H100-hour.)
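Sanity-checking those cost/MFU figures (my arithmetic; assumes the A100's ~312 TFLOPS bf16 peak):

```python
# Back-of-the-envelope check of the figures above (my arithmetic, not from the leak).
a100_hours = 25_000 * 95 * 24                                        # ~25k GPUs for ~90-100 days
print(f"A100 run:  ~${a100_hours * 1 / 1e6:.0f}M at $1/A100-hour")   # ~$57M, same ballpark as the quoted ~$63M
mfu = 2.15e25 / (25_000 * 312e12 * 95 * 24 * 3600)                   # 312 TFLOPS = A100 bf16 peak
print(f"Implied MFU: ~{mfu:.0%}")                                    # ~34%, matching the 32-36% claim
h100_hours = 8_192 * 55 * 24
print(f"H100 redo: ~${h100_hours * 2 / 1e6:.1f}M at $2/H100-hour")   # ~$21.6M
```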
---INFERENCE---
OpenAI might be using speculative decoding for GPT-4's inference. See this paper: https://arxiv.org/abs/2211.17192
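For reference, the idea in that paper: a small draft model proposes a few tokens and the big model verifies them in one forward pass, so most steps cost only a fraction of a full-model decode. A heavily simplified, greedy-only toy (the real algorithm accepts/rejects stochastically to preserve the target model's distribution; all function names here are made up):

```python
from typing import Callable, List

def speculative_decode_greedy(
    target_next: Callable[[List[str]], str],   # big model: most likely next token given a prefix
    draft_next: Callable[[List[str]], str],    # small, cheap draft model with the same interface
    prompt: List[str],
    k: int = 4,
    max_new: int = 12,
) -> List[str]:
    """Greedy-only toy of speculative decoding: the draft proposes k tokens, the target keeps the matching prefix."""
    out = list(prompt)
    while len(out) < len(prompt) + max_new:
        # 1) the cheap draft model guesses k tokens autoregressively
        guesses = []
        for _ in range(k):
            guesses.append(draft_next(out + guesses))
        # 2) the big model checks each guess; in a real system this is one batched forward pass
        accepted = 0
        for i in range(k):
            if target_next(out + guesses[:i]) == guesses[i]:
                accepted += 1
            else:
                break
        # 3) keep the agreed prefix, then append the target's own next token
        #    (its correction, or a bonus token if all k guesses matched)
        out += guesses[:accepted]
        out.append(target_next(out))
    return out

# Toy models: the draft mostly agrees with the target, so several tokens are accepted per round.
target = lambda toks: str(len(toks) % 7)
draft  = lambda toks: str(len(toks) % 7) if len(toks) % 5 else "?"
print(speculative_decode_greedy(target, draft, ["<s>"]))
```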
The inference runs on clusters of 128 GPUs. There are multiple of these clusters in multiple datacenters in different locations (it'll be hard for Eliezer to nuke these xD). They use 8-way tensor parallelism and 16-way pipeline parallelism (8 × 16 = 128 GPUs per cluster).
Original thread: https://archive.is/2RQ8X (strictly speaking, the original has since been removed).
GPT-2
Alpaca (Stanford): applies Self-Instruct to LLaMA
ChatGPT / InstructGPT
Pythia
DistilBERT
BERT
T5 or Flan-T5 (instruction-tuned)
UL2 or Flan-UL2