For comparison, for a 3B-parameter model, like “t5-3b”:
Small/fast language models
Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20 #481
With llm.c, which is quite efficient at up to ~60% model flops utilization, reproducing this model on one 8X A100 80GB SXM node takes ~90 minutes. For example, on Lambda this node goes for ~$14/hr, so the total cost of reproducing this model today is about $20. You can train the model with a single GPU too, it would just take proportionally longer (e.g. ~4-24 hours depending on the GPU). In addition, llm.c still has a lot of pending optimizations and people haven't tried to tune the training in the style of cramming, so I'd say we're likely to see significant improvements on this number. So here is the run, training the 12-layer, 12-headed, 768-dimension, 124M Transformer on 10 billion tokens of FineWeb.

… Keep in mind that here we trained for 10B tokens, while GPT-3 models were all trained for 300B tokens. [...] GPT-3 actually didn't change too much at all about the model (context size 1024 -> 2048, I think that's it?).

The 350M model I trained last night was 30B tokens, 14 hours, ~$200. Conveniently, 300B is exactly 10X the tokens so ~$2K would be the estimate. You'd have to wait 140 hours on one box though [6 days]. Getting an H100 box instead of A100 will already cut the time latency down probably by a factor of 2-3X, for free, even without going to fp8 (which we do plan to support) [2 days].
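To make the arithmetic behind the ~90 minutes / ~$20 figure explicit, here is a rough back-of-the-envelope sketch (not llm.c's own accounting): it assumes the standard ~6*N*D training-FLOPs estimate, an A100 BF16 peak of 312 TFLOPS, and the ~60% MFU and ~$14/hr node price quoted above.

```python
# Back-of-the-envelope estimate of the GPT-2 (124M) reproduction above.
# Assumptions: ~6*N*D training FLOPs, 312 TFLOPS BF16 peak per A100, 60% MFU, $14/hr node.
params = 124e6                      # GPT-2 small parameter count
tokens = 10e9                       # FineWeb tokens in the quoted run
train_flops = 6 * params * tokens   # standard forward+backward FLOPs estimate

peak_flops_per_gpu = 312e12         # A100 BF16 dense peak
mfu = 0.60                          # ~60% model flops utilization
n_gpus = 8
node_price_per_hour = 14.0          # ~$14/hr for an 8x A100 80GB SXM node

hours = train_flops / (peak_flops_per_gpu * mfu * n_gpus) / 3600
print(f"~{hours * 60:.0f} min, ~${hours * node_price_per_hour:.0f}")
# -> roughly 80-90 minutes and about $20, consistent with the reported run
```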
1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data, 2024 - trains Llama 2 arch on 57B tokens in 9d on 8 A100s [$4k at $2/A100/h]
HN, 2024
If you read around, training a 7B model costs on the order of $85,000
Could you train a ChatGPT-beating model for $85,000 and run it in a browser?
LLaMA-7B was trained on 82,432 hours of A100-80GB GPUs, consuming 36 MWh
a simple rule of thumb for A100 cloud costs is $1/hour, so 82,432 GPU-hours works out to roughly $82K, which is where the ~$85,000 figure comes from.
llama2.c: GPT2 on tinystories
For the sake of examples of smaller, from-scratch models, I trained a small model series on TinyStories. All of these trained in a few hours on my training setup (4X A100 40GB GPUs). The 110M took around 24 hours.
| model | dim | n_layers | n_heads | n_kv_heads | max context length | parameters | val loss | download |
|---|---|---|---|---|---|---|---|---|
| 260K | 64 | 5 | 8 | 4 | 512 | 260K | 1.297 | stories260K |
| OG | 288 | 6 | 6 | 6 | 256 | 15M | 1.072 | stories15M.bin |
| 42M | 512 | 8 | 8 | 8 | 1024 | 42M | 0.847 | stories42M.bin |
| 110M | 768 | 12 | 12 | 12 | 1024 | 110M | 0.760 | stories110M.bin |
nanoGPT
train.py reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in about 4 days
TinyLlama
Thanks to those optimizations, we achieve a throughput of 24k tokens per second per A100-40G GPU, which translates to 56% model flops utilization without activation checkpointing (We expect the MFU to be even higher on A100-80G). It means you can train a chinchilla-optimal TinyLlama (1.1B param, 22B tokens) in 32 hours with 8 A100. Those optimizations also greatly reduce the memory footprint, allowing us to stuff our 1.1B model into 40GB GPU RAM and train with a per-gpu batch size of 16k tokens. You can also pretrain TinyLlama on 3090/4090 GPUs with a smaller per-gpu batch size. Below is a comparison of the training speed of our codebase with that of Pythia and MPT. (source)
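A quick sanity check using only numbers from the quote above: 22B tokens at 24k tokens/s per GPU across 8 GPUs does come out to roughly 32 hours.

```python
# Wall-clock check for the chinchilla-optimal TinyLlama run (values from the quote above).
tokens = 22e9                 # 22B tokens for the 1.1B-parameter model
tok_per_sec_per_gpu = 24_000  # reported throughput on A100-40G
n_gpus = 8

hours = tokens / (tok_per_sec_per_gpu * n_gpus) / 3600
print(f"~{hours:.0f} hours")  # ~32 hours, matching the claim
```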
https://news.ycombinator.com/item?id=37379984
Why would pretraining a 1.1B model for so long make sense? Doesn't it contradict the Chinchilla Scaling Law?
Above is the training loss curve taken from the Llama 2 paper. Here I quote from that paper: "We observe that after pretraining on 2T Tokens, the models still did not show any sign of saturation". That is why we believe pretraining a 1.1B model for 3T tokens is a reasonable thing to do. Even if the loss curve does not go down eventually, we can still study the phenomenon of saturation and learn something from it.
1.3B model on 1B tokens takes 7.1 hr on 1 node (8xA100) ~= $62 (source: MosaicML)
Numbers
Small/fast vision models
moondream
Each training run currently takes 20 hours on 4x4090, but it was 4x that before I wrote a bunch of custom CUDA kernels to speed up training. And it took 172 experiments to get to this point, so I'm about $20K in the hole on this project. 😬 (source)
TinyGPT-V
We use a single 3090 GPU (24G).
Stage 1 takes about 8 hours of training, Stage 2 about 4 hours, Stage 3 about 20 minutes, and Stage 4 about 8 hours or more.
Stage 4 is currently still experimental: although it scores well on the assessment results, it does not actually perform well in practice, so it is recommended to only go up to Stage 3.
(source)
Obsidian
Pretrain takes around 2.5 hours for Obsidian-3B-V0.5 on 4x A100 (80G), at 336px for the vision module. (source)
LLaVA
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions we use in the paper here.
Pretrain takes around 5.5 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px. It takes around 3.5 hours for LLaVA-v1.5-7B.
Training script with DeepSpeed ZeRO-2: pretrain.sh
- --mm_projector_type mlp2x_gelu: the two-layer MLP vision-language connector.
- --vision_tower openai/clip-vit-large-patch14-336: CLIP ViT-L/14 336px.
Pretrain takes around 20 hours for LLaVA-7B on 8x V100 (32G)
(source)
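For concreteness, the mlp2x_gelu projector named above is just a two-layer MLP with a GELU in between that maps CLIP patch features into the language model's embedding space. Below is a minimal PyTorch sketch of that idea, not LLaVA's actual code; the 1024 (CLIP ViT-L/14-336 feature width) and 4096 (7B-class LM hidden size) dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of an mlp2x_gelu-style vision-language connector:
# two linear layers with a GELU, projecting vision features to the LM hidden size.
vision_dim = 1024   # CLIP ViT-L/14-336 feature width (assumed)
lm_dim = 4096       # hidden size of a 7B Llama-class LM (assumed)

mm_projector = nn.Sequential(
    nn.Linear(vision_dim, lm_dim),
    nn.GELU(),
    nn.Linear(lm_dim, lm_dim),
)

# Each image arrives as a sequence of patch features [n_patches, vision_dim];
# the projector maps them to [n_patches, lm_dim] so they can be spliced into
# the LM's token embedding sequence.
patch_features = torch.randn(1, 576, vision_dim)  # 24x24 patches for a 336px image
image_tokens = mm_projector(patch_features)       # -> [1, 576, lm_dim]
print(image_tokens.shape)
```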
MiniGPT-4
We train MiniGPT-4 with two stages. The first traditional pretraining stage is trained using roughly 5 million aligned image-text pairs, taking about 10 hours on 4 A100s. After the first stage, Vicuna is able to understand the image, but its generation ability is heavily impacted.
The second finetuning stage is trained on this dataset in a conversation template to significantly improve its generation reliability and overall usability. To our surprise, this stage is computationally efficient and takes only around 7 minutes with a single A100.
CLIP
Yes, if you run https://github.com/mlfoundations/open_clip#sample-single-process-running-code for 6,000 GPU-hours on 400M samples for 32 epochs you should get good results.
Training CLIP models from scratch, especially ViT ones, is a data- and compute-hungry endeavour if you want decent results. The README has a SLURM example script that'd reproduce B/32 training on LAION-400m, and enough details to change the batch size / # GPUs for the larger models.
There is also an example 4-GPU torchrun command in the README that'd reproduce the 36.5% results on CC12m. For smaller datasets in the 10-40M range you're likely only going to be able to train smaller ResNets well; the ViTs don't perform until trained on much larger datasets.
Models like CLIP only make sense if trained on a large amount of data with many GPUs; after having been trained that way they can perform okay at zero-shot classification and retrieval.
(source)
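As an aside on the "perform okay at zero-shot classification" point above, a minimal open_clip zero-shot sketch looks like the following. The ViT-B-32 / laion400m_e32 tag is one of the LAION-400M checkpoints open_clip publishes; cat.jpg and the prompts are placeholders.

```python
import torch
import open_clip
from PIL import Image

# Load a pretrained B/32 (e.g. a LAION-400M checkpoint like those discussed above).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("cat.jpg")).unsqueeze(0)      # placeholder image
text = tokenizer(["a photo of a cat", "a photo of a dog"])  # placeholder prompts

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # zero-shot probabilities over the text prompts
```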
When I run on a machine with 8xV100 using the settings from the README, it takes about 12 hours to finish training 30 epochs.
Also, 4xV100 (32 GB) was 3 days for cc12m
With twice the training size (20M) and twice the number of GPUs (8 V100s), it takes 2 hours per epoch, so the total is 60 hours, i.e. 2.5 days.
For 8 V100 GPUs (4 nodes), it takes 2 hours per epoch for a training size of 20M using RN50.
(source)
Small/fast video models
Latte
- In my experience, two to three days (8 A100s) will yield acceptable results on FFS. Perhaps training to 150k steps can achieve acceptable results [on UCF]. (source)
Small/fast audio models
Language models
VLM
Video
CogVideo
We use 13×8 A100s to train the model. The two stages were trained for ~100k iterations in total, which took ~20 days. Inference takes around 25GB of GPU memory with batch size = 1 (on our A100). (source)