• This page covers research-oriented and “noteworthy”/“landmark” LLMs. See also More LLMs

  • References

    • Stella’s model list https://docs.google.com/spreadsheets/d/1gc6yse74XCwBx028HV_cvdxwXkmXejVjkO-Mz2uwE0k/edit#gid=0
  • Foundations

    • BLEU score (see math notes): measures n-gram precision of generated text against reference text (geometric mean of the 1–4-gram precisions), with a brevity penalty for overly short outputs; see the sketch below
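
    A minimal sentence-level BLEU sketch (uniform 1–4-gram weights, no smoothing); real implementations such as sacrebleu add corpus-level statistics and smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions, times a brevity penalty
    that punishes candidates shorter than the reference."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clipped matches
        if overlap == 0:
            return 0.0  # unsmoothed: an empty n-gram overlap zeroes the score
        log_precisions.append(math.log(overlap / sum(cand.values())))
    geo_mean = math.exp(sum(log_precisions) / max_n)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * geo_mean

print(bleu("the cat sat on the mat".split(), "the cat sat on a mat".split()))  # ~0.54
```
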
  • GPT-4.5

    • 8T params
  • GPT 4o

    • 170 tokens per 512x512 image tile (token-cost sketch after this list)
    • 500B params
    • Speculation
      • https://www.oranlooney.com/post/gpt-cnn/
        • representing 512x512 images as 170 embedding vectors, using a CNN architecture that’s a mixture of CLIP and YOLO to embed the image directly into the transformer’s semantic vector space
        • performance on the 5x5 Zener-card task all but confirms that they’re doing some kind of grid-based tiling
        • plus an off-the-shelf OCR prepass
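
    A back-of-envelope token-cost sketch that mirrors the grid speculation above. It assumes the published API pricing rules (85 base tokens plus 170 per 512x512 tile after resizing), which is pricing behavior, not confirmed internal architecture.

```python
import math

def image_token_cost(width: int, height: int) -> int:
    """Approximate 'high detail' image token cost, assuming the published
    pricing rules: 85 base tokens + 170 per 512x512 tile after resizing."""
    # Fit within 2048x2048, then scale the short side down to 768 (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(image_token_cost(1024, 1024))  # 765  (4 tiles)
print(image_token_cost(2048, 4096))  # 1105 (6 tiles)
```
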
  • Gemma

  • OLMo

  • Google Switch Transformer (Switch-C): 2048 experts (1.6T parameters, ~3.1 TB checkpoint)

  • Phi-2: 2.7B

  • Llama 2

  • MPT: 64k context via ALiBi (sketch below) https://news.ycombinator.com/item?id=35910175
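
    A minimal sketch of the ALiBi attention bias MPT uses to reach long contexts: instead of positional embeddings, each head adds a linear penalty proportional to query-to-key distance, so the same bias extends to sequence lengths longer than those seen in pre-training. Toy shapes, PyTorch assumed.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear distance penalty added to attention logits (ALiBi).
    Slopes form a geometric sequence; this formula assumes n_heads is a power of two."""
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]           # (i, j) -> j - i, <= 0 for past keys
    bias = slopes[:, None, None] * distance[None, :, :]
    return bias                                      # (n_heads, seq_len, seq_len)

scores = torch.randn(8, 16, 16)          # hypothetical (heads, queries, keys) logits
scores = scores + alibi_bias(8, 16)      # add before the causal mask and softmax
```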

  • RedPajama-INCITE: Pythia architecture trained on the RedPajama dataset

  • GPT-4

    • Model sizing has been repeated in several places (geohot, Jensen Huang)

    OpenAI's GPT-4 details have apparently been leaked! Looks very detailed and I suspect it's the real deal - given all I know about how these systems work. Here is a summary (extractive+abstractive) I made based on the original thread (see bottom of the post) + some additional pointers from me. (Disclaimer there are probably errors here)

    ---ARCH---

    • GPT-4 is more than 10x the size of GPT-3 (175B). We believe it has a total of ~1.8 trillion parameters across 120 layers. Mixture of Experts (16 experts, each ~111B), not a dense transformer like e.g. PaLM or GPT-3. They use MQA instead of MHA (classic at this point).

    • Each forward pass (generation of 1 token) only utilizes ~280B parameters and ~560 TFLOPs. This contrasts with the ~1.8 trillion parameters and ~3,700 TFLOP that would be required per forward pass of a purely dense model.
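
    A toy sketch of the top-k expert routing behind those numbers: if roughly two of the 16 ~111B experts are routed per token, two experts' MLPs plus the shared attention parameters land near the ~280B active figure. Dimensions below are placeholders; GPT-4's actual router is not public.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy mixture-of-experts FFN: a router picks the top-k of n_experts per token,
    so only k expert MLPs run per forward pass even though all sit in memory."""
    def __init__(self, d_model=64, d_ff=256, n_experts=16, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weights, chosen = gate.topk(self.k, dim=-1)        # per-token expert choices
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

y = TopKMoE()(torch.randn(10, 64))   # each token only touches 2 of the 16 expert MLPs
```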

    ---DISTRIBUTED---

    • To parallelize across all their A100 GPUs they utilized 8-way tensor parallelism. Beyond that, they are using 15-way pipeline parallelism. Also, apparently they used DeepSpeed ZeRO Stage 1 or block-level FSDP.

    (You can check out my video on all of these strategies here: https://youtube.com/watch?v=hc0u4avAkuM… 3D parallelism is what you're looking for, & ZeRO)
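
    Back-of-envelope check on what that topology implies (my arithmetic, not from the thread): each model replica spans 8 x 15 = 120 GPUs, so ~25,000 A100s leaves room for roughly 200 data-parallel replicas.

```python
tensor_parallel   = 8        # each layer's matmuls split across 8 GPUs
pipeline_parallel = 15       # consecutive layer groups placed on 15 GPU groups
total_gpus        = 25_000   # fleet size quoted in the COST section below

gpus_per_replica = tensor_parallel * pipeline_parallel   # 120 GPUs hold one model copy
data_parallel    = total_gpus // gpus_per_replica        # ~208 replicas train in parallel
print(gpus_per_replica, data_parallel)                   # 120 208
```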

    ---VISION---

    They have a separate vision encoder from the text encoder, with cross-attention. The architecture is similar to Google DeepMind's Flamingo (I used to work on this project :) ). This adds more parameters on top of the 1.8T of GPT-4. It is fine-tuned with another ~2 trillion tokens, after the text-only pre-training.
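
    A minimal sketch of the Flamingo-style gated cross-attention described above: text hidden states attend to vision-encoder outputs, with a zero-initialized gate so the visual pathway is blended in gradually during training. Toy dimensions; GPT-4's actual vision stack is not public.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Text hidden states (queries) attend to vision-encoder outputs (keys/values);
    a tanh gate initialized at zero lets training open the visual pathway gradually."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # starts closed, as in Flamingo

    def forward(self, text_states, vision_tokens):
        attended, _ = self.attn(text_states, vision_tokens, vision_tokens)
        return text_states + torch.tanh(self.gate) * attended

text   = torch.randn(1, 32, 64)    # (batch, text tokens, d_model)
vision = torch.randn(1, 170, 64)   # e.g. 170 image tokens per tile, per the GPT-4o note above
print(GatedCrossAttentionBlock()(text, vision).shape)   # torch.Size([1, 32, 64])
```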

    ---DATA---

    • Trained on ~13T tokens (multiple epochs, not unique). Plus millions of rows of instruction fine-tuning data from ScaleAI & internally (I guess acquired through ChatGPT + their API before they changed the policy).

    • 8k context length for the pre-training phase. The 32k seqlen version of GPT-4 is based on fine-tuning of the 8k model after pre-training. See e.g. MosaicML's tutorial on how to achieve this: https://github.com/mosaicml/llm-foundry/blob/main/TUTORIAL.md

    ---COST---

    • OpenAI’s training compute for GPT-4 was ~2.15e25 FLOPs, on ~25,000 A100s for 90 to 100 days at about 32% to 36% MFU. Part of this extremely low utilization is due to an absurd number of failures requiring restarts from checkpoints.

    If their cost in the cloud was about $1 per A100 hour, the training costs for this run alone would be about $63 million.

    (Today, the pre-training could be done with ~8,192 H100 in ~55 days for $21.5 million at $2 per H100 hour)
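
    Sanity-checking the quoted numbers (my arithmetic; assumes an A100 BF16 dense peak of ~312 TFLOP/s, and the raw GPU-hour figure excludes failed-run overhead, so it lands a bit under the thread's ~$63M):

```python
A100_PEAK_FLOPS = 312e12            # assumed BF16 dense peak per A100
gpus, days, mfu = 25_000, 95, 0.34  # midpoints of the quoted ranges

train_flops = gpus * A100_PEAK_FLOPS * mfu * days * 24 * 3600
print(f"{train_flops:.2e}")         # ~2.2e25, consistent with the quoted ~2.15e25

a100_cost = gpus * days * 24 * 1.0  # $1 per A100-hour
print(a100_cost / 1e6)              # ~57 ($M, before failure/restart overhead)

h100_cost = 8_192 * 55 * 24 * 2.0   # $2 per H100-hour
print(h100_cost / 1e6)              # ~21.6 ($M, matching the ~$21.5M figure)
```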

    ---INFERENCE---

    OpenAI might be using speculative decoding on GPT-4's inference. See this paper: https://arxiv.org/abs/2211.17192
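
    A simplified greedy sketch of the idea in that paper: a small draft model proposes a few tokens cheaply, the large target model verifies the whole block in one forward pass, and the longest agreeing prefix is kept. The paper's full algorithm uses rejection sampling to preserve the target's sampling distribution; `target` and `draft` below are placeholder callables returning per-position next-token logits.

```python
import torch

@torch.no_grad()
def speculative_decode_greedy(target, draft, tokens, n_draft=4, n_new=32):
    """Greedy speculative decoding sketch (assumes a non-empty prompt).
    `draft(seq)` / `target(seq)` are placeholders returning a (len(seq), vocab)
    tensor of next-token logits for every prefix of `seq`."""
    while n_new > 0:
        base = len(tokens)
        proposal = list(tokens)
        for _ in range(n_draft):                            # cheap autoregressive drafting
            proposal.append(int(draft(proposal)[-1].argmax()))
        target_logits = target(proposal)                    # one expensive verification pass
        for i in range(n_draft):
            pos = base + i
            best = int(target_logits[pos - 1].argmax())     # target's pick given proposal[:pos]
            if best == proposal[pos]:
                tokens.append(proposal[pos])                # draft and target agree: keep it
            else:
                tokens.append(best)                         # first disagreement: take target's token
                break
        n_new -= len(tokens) - base                         # at least one token accepted per pass
    return tokens
```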

    The inference runs on a cluster of 128 GPUs. There are multiple of these clusters in multiple datacenters in different locations (it'll be hard for Eliezer to nuke these xD). 8-way tensor parallelism and 16-way pipeline parallelism.

    Original thread: https://archive.is/2RQ8X (strictly speaking the original one has been removed).

    • GPT-4 did not scale up substantially per transformer relative to GPT-3, going from 175B (GPT-3) to roughly 220B per expert transformer.
  • GPT-2

  • Alpaca (Stanford) applies Self-Instruct to LLaMA

    • Self-Instruct gets close to InstructGPT quality with almost no human-labeled data (only a small seed task set); demonstrated on GPT-3 https://arxiv.org/abs/2212.10560 (sketch below)
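
    A simplified sketch of the Self-Instruct bootstrapping loop from that paper: prompt the model with a few existing instructions to write new ones, drop near-duplicates, and grow the pool, which is then turned into (instruction, response) pairs for fine-tuning. `generate` is a placeholder LLM call, and difflib stands in for the paper's ROUGE-L filter.

```python
import random
import difflib

def self_instruct(generate, seed_tasks, rounds=3, per_round=8, sim_cutoff=0.7):
    """Grow an instruction pool by bootstrapping from seed tasks.
    `generate(prompt) -> list[str]` is a placeholder for the LLM call."""
    pool = list(seed_tasks)
    for _ in range(rounds):
        examples = "\n".join(random.sample(pool, min(6, len(pool))))
        prompt = (f"Here are some task instructions:\n{examples}\n"
                  f"Write {per_round} new, different task instructions.")
        for candidate in generate(prompt):
            # Keep only instructions sufficiently different from the existing pool
            # (the paper uses ROUGE-L; SequenceMatcher is a stand-in here).
            if all(difflib.SequenceMatcher(None, candidate, t).ratio() < sim_cutoff
                   for t in pool):
                pool.append(candidate)
    # Next (not shown): have the model answer each instruction, then fine-tune
    # the target model (LLaMA, in Alpaca's case) on the resulting pairs.
    return pool
```
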
  • ChatGPT / InstructGPT

    • Everything We Know About ChatGPT So Far - by swyx


  • Pythia

    • 70M–12B params
  • DistilBERT

    • 65M params
  • BERT

    • 110M params, up to 340M
    • 512 tokens
  • T5 or Flan-T5 (instruction tuned)

    • 16M–11B params
    • 2048 tokens
    • https://twitter.com/amanrsanger/status/1589273062637465602?t=FscugAt7aHB0vLD7XLAtgA&s=19
