For comparison, for a 3B-parameter model, like “t5-3b”:
Small/fast language models
llama2.c: Llama 2-architecture models on TinyStories
For the sake of examples of smaller, from-scratch models, I trained a small model series on TinyStories. All of these trained in a few hours on my training setup (4X A100 40GB GPUs). The 110M took around 24 hours.
| model | dim | n_layers | n_heads | n_kv_heads | max context length | parameters | val loss | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 260K | 64 | 5 | 8 | 4 | 512 | 260K | 1.297 | stories260K |
| OG | 288 | 6 | 6 | 6 | 256 | 15M | 1.072 | stories15M.bin |
| 42M | 512 | 8 | 8 | 8 | 1024 | 42M | 0.847 | stories42M.bin |
| 110M | 768 | 12 | 12 | 12 | 1024 | 110M | 0.760 | stories110M.bin |
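As a rough sanity check on the parameter column, here is a minimal sketch of a Llama-style parameter count. Assumptions not stated in the table: a 32,000-token Llama 2 vocabulary, tied input/output embeddings, a SwiGLU FFN with hidden size int(2·4·dim/3) rounded up to a multiple of 32 (llama2.c's training default), and norm weights ignored; the 260K model uses a much smaller custom vocabulary, so it is skipped.

```python
# Rough parameter-count sanity check for the Llama-style TinyStories models above.
# Assumptions: 32,000-token vocab, tied embeddings, SwiGLU hidden size rounded up
# to a multiple of 32 (llama2.c's training default), RMSNorm weights ignored.

def ffn_hidden(dim: int, multiple_of: int = 32) -> int:
    hidden = int(2 * (4 * dim) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

def estimate_params(dim: int, n_layers: int, n_heads: int, n_kv_heads: int,
                    vocab_size: int = 32000) -> int:
    head_dim = dim // n_heads
    # attention: Wq and Wo are dim x dim, Wk/Wv are dim x (n_kv_heads * head_dim)
    attn = 2 * dim * dim + 2 * dim * (n_kv_heads * head_dim)
    # SwiGLU FFN: three weight matrices of shape dim x hidden
    ffn = 3 * dim * ffn_hidden(dim)
    embeddings = vocab_size * dim          # tied with the output head
    return n_layers * (attn + ffn) + embeddings

for name, (dim, n_layers, n_heads, n_kv) in {
    "15M (OG)": (288, 6, 6, 6),
    "42M": (512, 8, 8, 8),
    "110M": (768, 12, 12, 12),
}.items():
    print(name, f"{estimate_params(dim, n_layers, n_heads, n_kv)/1e6:.0f}M")
# prints ~15M, ~42M, ~110M, matching the table
```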
nanoGPT
train.py reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in about 4 days
TinyLlama
Thanks to those optimizations, we achieve a throughput of 24k tokens per second per A100-40G GPU, which translates to 56% model flops utilization without activation checkpointing (We expect the MFU to be even higher on A100-80G). It means you can train a chinchilla-optimal TinyLlama (1.1B param, 22B tokens) in 32 hours with 8 A100. Those optimizations also greatly reduce the memory footprint, allowing us to stuff our 1.1B model into 40GB GPU RAM and train with a per-gpu batch size of 16k tokens. You can also pretrain TinyLlama on 3090/4090 GPUs with a smaller per-gpu batch size. Below is a comparison of the training speed of our codebase with that of Pythia and MPT. (source)
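Those numbers can be cross-checked with a quick back-of-envelope calculation. A sketch assuming the usual ~6·N training FLOPs per token and 312 TFLOPS bf16 peak for an A100; the reported 56% MFU presumably also counts attention FLOPs, which this approximation ignores:

```python
# Back-of-envelope check of the TinyLlama numbers quoted above.
# Assumptions: ~6*N training FLOPs per token (no attention FLOPs),
# A100 bf16 peak of 312 TFLOPS. These are approximations, not TinyLlama's own accounting.

params = 1.1e9                 # TinyLlama parameter count
tok_per_sec_per_gpu = 24_000
n_gpus = 8
hours = 32

tokens = tok_per_sec_per_gpu * n_gpus * hours * 3600
print(f"tokens processed: {tokens/1e9:.1f}B")       # ~22B, i.e. ~20 tokens/param (Chinchilla-ish)

flops_per_sec_per_gpu = 6 * params * tok_per_sec_per_gpu
mfu = flops_per_sec_per_gpu / 312e12
print(f"approx MFU (6N approximation): {mfu:.0%}")  # ~51%, in the same ballpark as the reported 56%
```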
https://news.ycombinator.com/item?id=37379984
Why would pretraining a 1.1B model for so long make sense? Doesn't it contradict the Chinchilla Scaling Law?
Above is the training loss curve taken from the Llama 2 paper. Here I quote from that paper: "We observe that after pretraining on 2T Tokens, the models still did not show any sign of saturation". That is why we believe pretraining a 1.1B model for 3T tokens is a reasonable thing to do. Even if the loss curve does not go down eventually, we can still study the phenomenon of saturation and learn something from it.
1.3B model on 1B tokens takes 7.1 hr on 1 node (8xA100) ~= $62 (source: MosaicML)
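That data point implies roughly $1.1 per A100-hour. A minimal sketch for extrapolating to other token budgets, assuming cost scales linearly with tokens at fixed model size and throughput (which ignores warmup, eval, and data-loading overhead):

```python
# Implied price and linear extrapolation from the MosaicML data point above:
# 1.3B params, 1B tokens, 7.1 hr on one 8xA100 node, ~$62.

gpu_hours = 7.1 * 8
usd_per_gpu_hour = 62 / gpu_hours
print(f"implied price: ${usd_per_gpu_hour:.2f}/A100-hour")     # ~$1.09

def cost_for_tokens(tokens: float, usd_per_billion: float = 62.0) -> float:
    """Rough cost at the same model size and throughput, scaling linearly in tokens."""
    return usd_per_billion * tokens / 1e9

# e.g. a Chinchilla-optimal budget for 1.3B params (~20 tokens/param, ~26B tokens):
print(f"~26B tokens: ${cost_for_tokens(26e9):,.0f}")           # ~$1,600
```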
Numbers
Small/fast vision models
moondream
Each training run currently takes 20 hours on 4x4090, but it was 4x that before I wrote a bunch of custom CUDA kernels to speed up training. And it took 172 experiments to get to this point, so I'm about $20K in the hole on this project. 😬 (source)
TinyGPT-V
We use a single 3090 GPU (24G).
Stage 1: about 8 hours of training; Stage 2: about 4 hours; Stage 3: about 20 minutes; Stage 4: about 8 hours or more.
Stage 4 is currently still in a testing state, as it does not actually perform well in practice (although it scores well on the evaluation results), so it is recommended that you only train up to Stage 3.
(source)
Obsidian
Pretrain takes around 2.5 hours for Obsidian-3B-V0.5 on 4x A100 (80G), at 336px for the vision module. (source)
LLaVA
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions we use in the paper here.
Pretrain takes around 5.5 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px. It takes around 3.5 hours for LLaVA-v1.5-7B.
Training script with DeepSpeed ZeRO-2: pretrain.sh
- --mm_projector_type mlp2x_gelu: the two-layer MLP vision-language connector.
- --vision_tower openai/clip-vit-large-patch14-336: CLIP ViT-L/14 336px.
Pretrain takes around 20 hours for LLaVA-7B on 8x V100 (32G)
(source)
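A quick sanity check on the throughput implied by those times, assuming the pretraining stage is a single epoch over the 558K pairs:

```python
# Rough throughput implied by the quoted LLaVA-v1.5 pretraining times,
# assuming a single pass over the 558K image-text pairs.

samples = 558_000
for name, hours, gpus in [("LLaVA-v1.5-13B", 5.5, 8), ("LLaVA-v1.5-7B", 3.5, 8)]:
    per_sec = samples / (hours * 3600)
    print(f"{name}: {per_sec:.0f} samples/s total, {per_sec / gpus:.1f} samples/s per A100")
# 13B: ~28 samples/s total (~3.5 per GPU); 7B: ~44 samples/s total (~5.5 per GPU)
```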
MiniGPT-4
We train MiniGPT-4 with two stages. The first traditional pretraining stage is trained using roughly 5 million aligned image-text pairs in 10 hours using 4 A100s. After the first stage, Vicuna is able to understand the image. But the generation ability of Vicuna is heavily impacted.
The second finetuning stage is trained on this dataset in a conversation template to significantly improve its generation reliability and overall usability. To our surprise, this stage is computationally efficient and takes only around 7 minutes with a single A100.
CLIP
Yes, if you run https://github.com/mlfoundations/open_clip#sample-single-process-running-code for 6000 GPU hours on 400m samples for 32 epochs you should get good results
Training CLIP models, especially ViT ones, from scratch is a data- and compute-hungry endeavour if you want decent results. The README has a SLURM example script that'd reproduce B/32 training on LAION-400m, and enough details to change the batch size / # GPUs for the larger models.
There is also an example 4-GPU torchrun command in the README that'd reproduce the 36.5% result on CC12m. For smaller datasets in the 10-40M range you're likely only going to be able to train smaller ResNets well; the ViTs don't perform well until trained on much larger datasets.
Models like CLIP only make sense if trained on a large amount of data with many GPUs; after having been trained that way they can perform okay at zero-shot classification and retrieval.
(source)
When I run on a machine with 8xV100 using the settings from the README, it takes about 12 hours to finish training 30 epochs.
Also, 4xV100 (32 GB) was 3 days for cc12m
I have twice the training size (20m) and twice the number of GPUs (8 V100 GPUs); it takes 2 hours/epoch, so the total is 60 hours, which is 2.5 days.
For 8 V100 GPUs (4 nodes), it takes 2 hours per epoch for a training size of 20m using RN50.
(source)
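All of these CLIP data points reduce to (dataset size × epochs) / throughput. A small sketch for converting them into GPU-hours; the per-GPU sample rates below are back-calculated from the quotes above, not measured:

```python
# GPU-hour budgets implied by the OpenCLIP quotes above:
# samples_seen = dataset_size * epochs, gpu_hours = samples_seen / (samples_per_sec_per_gpu * 3600).

def gpu_hours(dataset_size: float, epochs: int, samples_per_sec_per_gpu: float) -> float:
    return dataset_size * epochs / (samples_per_sec_per_gpu * 3600)

# LAION-400m, 32 epochs, "6000 GPU hours" implies roughly 590 samples/s per GPU:
print(400e6 * 32 / (6000 * 3600))          # ~593

# The 20m-sample quote (8 V100s, 2 hr/epoch) implies ~347 samples/s per GPU,
# so 30 epochs would cost roughly:
print(gpu_hours(20e6, 30, 347))            # ~480 GPU-hours, i.e. ~60 hours on 8 GPUs
```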
Small/fast video models
Latte
- In my experience, two to three days (8 A100s) will yield acceptable results on FFS. Perhaps training to 150k steps can achieve acceptable results [on UCF101]. (source)
Language models
VLM
Video
CogVideo
We use 13*8 A100s to train the model. The two stages were trained for ~100k iterations in total, which took ~20 days. It takes around 25GB of GPU memory to run inference with batch size = 1 (on our A100). (source)
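Pulling the numbers on this page together into rough GPU-hour budgets (wall-clock time × GPU count, taken straight from the quotes above; hardware generations differ, so treat these as order-of-magnitude comparisons only):

```python
# Approximate GPU-hours for the training runs quoted on this page
# (wall-clock hours x number of GPUs; mixed A100/V100/3090/4090 hardware).
approx_gpu_hours = {
    "llama2.c 110M on TinyStories (4x A100-40G, ~24 h)":         4 * 24,        # ~100
    "nanoGPT GPT-2 124M on OpenWebText (8x A100-40G, ~4 days)":  8 * 96,        # ~770
    "TinyLlama 1.1B, 22B tokens (8x A100, 32 h)":                8 * 32,        # ~260
    "MosaicML 1.3B, 1B tokens (8x A100, 7.1 h)":                 8 * 7.1,       # ~57
    "moondream, one run (4x 4090, 20 h)":                        4 * 20,        # 80
    "LLaVA-v1.5-13B pretrain (8x A100-80G, 5.5 h)":              8 * 5.5,       # 44
    "MiniGPT-4 stage 1 (4x A100, 10 h)":                         4 * 10,        # 40
    "OpenCLIP ViT-B/32 on LAION-400m":                           6000,
    "Latte on FFS (8x A100, 2-3 days)":                          8 * 60,        # ~480
    "CogVideo (13*8 A100, ~20 days)":                            13 * 8 * 480,  # ~50,000
}
for name, hours in sorted(approx_gpu_hours.items(), key=lambda kv: kv[1]):
    print(f"{hours:>10,.0f}  {name}")
```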