For comparison, for a 3B-parameter model, like “t5-3b”:
Small/fast language models
Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20 #481
With llm.c, which is quite efficient at up to ~60% model flops utilization, reproducing this model on one 8X A100 80GB SXM node takes ~90 minutes. For example, on Lambda this node goes for ~$14/hr, so the total cost of reproducing this model today is about $20. You can train the model with a single GPU too, it would just take proportionally longer (e.g. ~4-24 hours depending on the GPU). In addition, llm.c still has a lot of pending optimizations and people haven't tried to tune the training in the style of cramming, so I'd say we're likely to see significant improvements on this number. So here is the run, training the 12-layer, 12-headed, 768-dimension, 124M Transformer on 10 billion tokens of FineWeb.

… Keep in mind that here we trained for 10B tokens, while GPT-3 models were all trained for 300B tokens. [...] GPT-3 actually didn't change too much at all about the model (context size 1024 -> 2048, I think that's it?).

The 350M model I trained last night was 30B tokens, 14 hours, ~$200. Conveniently, 300B is exactly 10X the tokens so ~$2K would be the estimate. You'd have to wait 140 hours on one box though [6 days]. Getting an H100 box instead of A100 will already cut the time latency down probably by a factor of 2-3X, for free, even without going to fp8 (which we do plan to support) [2 days].
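To make the arithmetic behind the ~90 minutes / ~$20 figure explicit, here is a rough back-of-the-envelope sketch (not llm.c's own accounting): it assumes the standard ~6*N*D training-FLOPs estimate, an A100 BF16 peak of 312 TFLOPS, and the ~60% MFU and ~$14/hr node price quoted above.

```python
# Back-of-the-envelope estimate of the GPT-2 (124M) reproduction above.
# Assumptions: ~6*N*D training FLOPs, 312 TFLOPS BF16 peak per A100, 60% MFU, $14/hr node.
params = 124e6                      # GPT-2 small parameter count
tokens = 10e9                       # FineWeb tokens in the quoted run
train_flops = 6 * params * tokens   # standard forward+backward FLOPs estimate

peak_flops_per_gpu = 312e12         # A100 BF16 dense peak
mfu = 0.60                          # ~60% model flops utilization
n_gpus = 8
node_price_per_hour = 14.0          # ~$14/hr for an 8x A100 80GB SXM node

hours = train_flops / (peak_flops_per_gpu * mfu * n_gpus) / 3600
print(f"~{hours * 60:.0f} min, ~${hours * node_price_per_hour:.0f}")
# -> roughly 80-90 minutes and about $20, consistent with the reported run
```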
1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data, 2024 - trains Llama 2 arch on 57B tokens in 9d on 8 A100s [$4k at $2/A100/h]
HN, 2024
If you read around, training a 7B model costs on the order of $85,000
Could you train a ChatGPT-beating model for $85,000 and run it in a browser?
LLaMA-7B was trained on 82,432 hours of A100-80GB GPUs, consuming 36 MWh
a simple rule of thumb for A100 cloud costs is $1/hour, so 82,432 GPU-hours works out to roughly $82K, which is where the ~$85,000 figure comes from.
llama2.c: GPT2 on tinystories
For the sake of examples of smaller, from-scratch models, I trained a small model series on TinyStories. All of these trained in a few hours on my training setup (4X A100 40GB GPUs). The 110M took around 24 hours.
| model | dim | n_layers | n_heads | n_kv_heads | max context length | parameters | val loss | download |
|---|---|---|---|---|---|---|---|---|
| 260K | 64 | 5 | 8 | 4 | 512 | 260K | 1.297 | stories260K |
| OG | 288 | 6 | 6 | 6 | 256 | 15M | 1.072 | stories15M.bin |
| 42M | 512 | 8 | 8 | 8 | 1024 | 42M | 0.847 | stories42M.bin |
| 110M | 768 | 12 | 12 | 12 | 1024 | 110M | 0.760 | stories110M.bin |
nanoGPT
train.py reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in about 4 days
TinyLlama
Thanks to those optimizations, we achieve a throughput of 24k tokens per second per A100-40G GPU, which translates to 56% model flops utilization without activation checkpointing (We expect the MFU to be even higher on A100-80G). It means you can train a chinchilla-optimal TinyLlama (1.1B param, 22B tokens) in 32 hours with 8 A100. Those optimizations also greatly reduce the memory footprint, allowing us to stuff our 1.1B model into 40GB GPU RAM and train with a per-gpu batch size of 16k tokens. You can also pretrain TinyLlama on 3090/4090 GPUs with a smaller per-gpu batch size. Below is a comparison of the training speed of our codebase with that of Pythia and MPT. (source)
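A quick sanity check using only numbers from the quote above: 22B tokens at 24k tokens/s per GPU across 8 GPUs does come out to roughly 32 hours.

```python
# Wall-clock check for the chinchilla-optimal TinyLlama run (values from the quote above).
tokens = 22e9                 # 22B tokens for the 1.1B-parameter model
tok_per_sec_per_gpu = 24_000  # reported throughput on A100-40G
n_gpus = 8

hours = tokens / (tok_per_sec_per_gpu * n_gpus) / 3600
print(f"~{hours:.0f} hours")  # ~32 hours, matching the claim
```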
https://news.ycombinator.com/item?id=37379984
Why would pretraining a 1.1B model for so long make sense? Doesn't it contradict the Chinchilla Scaling Law?
Above is the training loss curve taken from the Llama 2 paper. Here I quote from that paper: "We observe that after pretraining on 2T Tokens, the models still did not show any sign of saturation". That is why we believe pretraining a 1.1B model for 3T tokens is a reasonable thing to do. Even if the loss curve does not go down eventually, we can still study the phenomenon of saturation and learn something from it.
1.3B model on 1B tokens takes 7.1 hr on 1 node (8xA100) ~= $62 (source: MosaicML)
Numbers
Small/fast vision models
moondream
Each training run currently takes 20 hours on 4x4090, but it was 4x that before I wrote a bunch of custom CUDA kernels to speed up training. And it took 172 experiments to get to this point, so I'm about $20K in the hole on this project. 😬 (source)
TinyGPT-V
We use a single 3090 GPU (24G).
Stage 1 takes about 8 hours of training, Stage 2 about 4 hours, Stage 3 about 20 minutes, and Stage 4 about 8 hours or more.
Stage 4 is currently still experimental: although it scores well on the assessment results, it does not actually perform well in practice, so it is recommended to only go up to Stage 3.
(source)
Obsidian
Pretrain takes around 2.5 hours for Obsidian-3B-V0.5 on 4x A100 (80G), at 336px for the vision module. (source)
LLaVA
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions we use in the paper here.
Pretrain takes around 5.5 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px. It takes around 3.5 hours for LLaVA-v1.5-7B.
Training script with DeepSpeed ZeRO-2: pretrain.sh
- --mm_projector_type mlp2x_gelu: the two-layer MLP vision-language connector.
- --vision_tower openai/clip-vit-large-patch14-336: CLIP ViT-L/14 336px.
Pretrain takes around 20 hours for LLaVA-7B on 8x V100 (32G)
(source)
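For concreteness, the mlp2x_gelu projector named above is just a two-layer MLP with a GELU in between that maps CLIP patch features into the language model's embedding space. Below is a minimal PyTorch sketch of that idea, not LLaVA's actual code; the 1024 (CLIP ViT-L/14-336 feature width) and 4096 (7B-class LM hidden size) dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of an mlp2x_gelu-style vision-language connector:
# two linear layers with a GELU, projecting vision features to the LM hidden size.
vision_dim = 1024   # CLIP ViT-L/14-336 feature width (assumed)
lm_dim = 4096       # hidden size of a 7B Llama-class LM (assumed)

mm_projector = nn.Sequential(
    nn.Linear(vision_dim, lm_dim),
    nn.GELU(),
    nn.Linear(lm_dim, lm_dim),
)

# Each image arrives as a sequence of patch features [n_patches, vision_dim];
# the projector maps them to [n_patches, lm_dim] so they can be spliced into
# the LM's token embedding sequence.
patch_features = torch.randn(1, 576, vision_dim)  # 24x24 patches for a 336px image
image_tokens = mm_projector(patch_features)       # -> [1, 576, lm_dim]
print(image_tokens.shape)
```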
MiniGPT-4
We train MiniGPT-4 with two stages. The first traditional pretraining stage is trained using roughly 5 million aligned image-text pairs, taking about 10 hours on 4 A100s. After the first stage, Vicuna is able to understand the image, but its generation ability is heavily impacted.
The second finetuning stage is trained on this dataset in a conversation template to significantly improve its generation reliability and overall usability. To our surprise, this stage is computationally efficient and takes only around 7 minutes with a single A100.
CLIP
Yes, if you run https://github.com/mlfoundations/open_clip#sample-single-process-running-code for 6,000 GPU-hours on 400M samples for 32 epochs you should get good results.
Training CLIP models from scratch, especially ViT ones, is a data- and compute-hungry endeavour if you want decent results. The README has a SLURM example script that'd reproduce B/32 training on LAION-400m, and enough details to change the batch size / # GPUs for the larger models.
There is also an example 4-GPU torchrun command in the README that'd reproduce the 36.5% results on CC12m. For smaller datasets in the 10-40M range you're likely only going to be able to train smaller ResNets well; the ViTs don't perform until trained on much larger datasets.
Models like CLIP only make sense if trained on a large amount of data with many GPUs; after having been trained that way they can perform okay at zero-shot classification and retrieval.
(source)
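As an aside on the "perform okay at zero-shot classification" point above, a minimal open_clip zero-shot sketch looks like the following. The ViT-B-32 / laion400m_e32 tag is one of the LAION-400M checkpoints open_clip publishes; cat.jpg and the prompts are placeholders.

```python
import torch
import open_clip
from PIL import Image

# Load a pretrained B/32 (e.g. a LAION-400M checkpoint like those discussed above).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("cat.jpg")).unsqueeze(0)      # placeholder image
text = tokenizer(["a photo of a cat", "a photo of a dog"])  # placeholder prompts

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # zero-shot probabilities over the text prompts
```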
When I run on a machine with 8xV100 using the settings from the README, it takes about 12 hours to finish training 30 epochs.
Also, 4xV100 (32 GB) was 3 days for cc12m
With twice the training size (20M) and twice the number of GPUs (8 V100s), it takes 2 hours per epoch, so the total is 60 hours, i.e. 2.5 days.
For 8 V100 GPUs (4 nodes), it takes 2 hours per epoch for a training size of 20M using RN50.
(source)
Small/fast video models
Latte
- In my experience, two to three days (8 A100s) will yield acceptable results on FFS. Perhaps training to 150k steps can achieve acceptable results [on UCF]. (source)
Small/fast audio models
Language models
VLM
Video
CogVideo
We use 13×8 A100s to train the model. The two stages were trained for ~100k iterations in total, which took ~20 days. Inference takes around 25GB of GPU memory with batch size = 1 (on our A100). (source)