• Language
    • FineWeb (Hugging Face, 2024): 15T tokens of quality-filtered Common Crawl text (streaming example at the end of this list)

    • Common Crawl ~ a snapshot of the internet ~ 1e15 words

    • English Wikipedia ~ 3e9 words

    • Library of Congress ~ 1e7 books × ~1e5 words/book ~ 1e12 words or more (but I think no one has this digitized; available books are maybe 1e11 words?)

    • TinyStories: 2+ GB of GPT-generated short stories

    • Dolma

    • OpenWebText 9B: open reproduction of OpenAI’s WebText; included in The Pile

    • EleutherAI/The Pile:

      Most notably, Pile-CC is a modified version of Common Crawl filtered to keep only text, removing things like HTML formatting and links. Some candidate sub-datasets were excluded for various reasons, e.g. the US Congressional Record was dropped because of its racist content.

    • RedPajama 1.2T: open reproduction of the LLaMA training data

    • RedPajama-Data-V2: 30T deduplicated tokens

    • StarCoder 250B

    • Falcon RefinedWeb: 600B tokens (https://huggingface.co/datasets/tiiuae/falcon-refinedweb)

    • OASST1 (OpenAssistant Conversations)

    • Common Crawl: ~320 TB

    • Colossal Clean Crawled Corpus (C4): 356 billion tokens

    • CommitPack: 4 TB of Git commits across 350 programming languages; benchmarked (in the OctoPack paper) against other natural and synthetic code-instruction datasets such as xP3x, Self-Instruct, and OASST

    • SlimPajama

      • Today we are releasing SlimPajama – the largest extensively deduplicated, multi-corpora, open-source dataset for training large language models. SlimPajama was created by cleaning and deduplicating the 1.2T token RedPajama dataset from Together. By filtering out low quality data and duplicates, we were able to remove 49.6% of bytes, slimming down the dataset from 1210B to 627B tokens. We believe SlimPajama offers the highest quality and most compute efficient data to train on for runs up to 627B tokens. When upsampled, we expect SlimPajama to perform equal to or better than RedPajama-1T when training at trillion token scale.
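
    • Loading sketch: a minimal example (assuming the Hugging Face datasets library) of streaming one of the web-scale corpora above, such as FineWeb, without downloading it in full. The repo id, config name, and column name are taken from the public dataset card and should be treated as assumptions; the same pattern should also work for SlimPajama (repo id cerebras/SlimPajama-627B, if I have it right) and RedPajama-V2.

      ```python
      # Minimal sketch: stream a web-scale pretraining corpus from the Hugging Face Hub.
      # Assumed names: repo id "HuggingFaceFW/fineweb", sample config "sample-10BT",
      # and the "text" column, all per the public dataset card rather than this note.
      from datasets import load_dataset

      fineweb = load_dataset(
          "HuggingFaceFW/fineweb",   # assumed repo id
          name="sample-10BT",        # assumed small sample config; omit for the full dump
          split="train",
          streaming=True,            # iterate lazily instead of downloading terabytes
      )

      for i, example in enumerate(fineweb):
          print(example["text"][:200])   # peek at the first few documents
          if i == 2:
              break
      ```
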
  • Speech

  • Code
    • See the Language section above; several of those corpora include code
    • The Stack v2
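    • Loading sketch: a hedged example of pulling a single language’s files from The Stack with the Hugging Face datasets library. The repo id, the per-language data_dir layout, and the "content" column follow the bigcode dataset card as I remember it, so treat them as assumptions; v1 (deduplicated) is shown because, if I read the v2 card correctly, The Stack v2 hosts file metadata on the Hub while file contents are fetched separately from Software Heritage.

      ```python
      # Sketch: stream only Python files from The Stack (v1, deduplicated).
      # Assumed names: repo id "bigcode/the-stack-dedup" (gated; requires accepting
      # the terms), per-language layout "data/<language>", and the "content" column.
      from datasets import load_dataset

      python_files = load_dataset(
          "bigcode/the-stack-dedup",   # assumed repo id
          data_dir="data/python",      # assumed per-language directory
          split="train",
          streaming=True,
      )

      first = next(iter(python_files))
      print(first["content"][:300])    # raw source text of the first file
      ```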
  • Class labeled images
    • MNIST: handwritten digits, 60,000 training images and 10,000 testing images, 28x28
    • Fashion MNIST: MNIST-like dataset of apparel products
    • ImageNet: more than 14 million images hand-annotated with the objects they contain; bounding boxes are provided for at least one million of them. ImageNet has more than 20,000 categories, and a typical category such as "balloon" or "strawberry" contains several hundred images
    • CIFAR10: The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
    • CIFAR100: This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).
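    • Loading sketch: the class-labeled image datasets above are the easiest to get programmatically; a minimal torchvision example for MNIST and CIFAR-10 (paths and batch size are illustrative):

      ```python
      # Minimal sketch: download and batch MNIST and CIFAR-10 with torchvision.
      from torch.utils.data import DataLoader
      from torchvision import datasets, transforms

      to_tensor = transforms.ToTensor()   # HWC uint8 -> CHW float tensor in [0, 1]

      mnist_train = datasets.MNIST(root="./data", train=True, download=True, transform=to_tensor)
      mnist_test = datasets.MNIST(root="./data", train=False, download=True, transform=to_tensor)
      cifar_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)

      print(len(mnist_train), len(mnist_test))   # 60000 10000, matching the counts above
      print(cifar_train[0][0].shape)             # torch.Size([3, 32, 32])

      loader = DataLoader(cifar_train, batch_size=128, shuffle=True)
      images, labels = next(iter(loader))
      print(images.shape, labels.shape)          # [128, 3, 32, 32] and [128]
      ```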
  • Captioned images
    • LLaVA Visual Instruct Pretrain LCS-558K: a subset of the LAION/CC/SBU data, filtered for a more balanced concept-coverage distribution; each caption is also paired with a BLIP synthetic caption for reference. Built for the feature-alignment pretraining stage of visual instruction tuning, aimed at GPT-4-level vision/language capability.
    • COCO 2017: 118,287 images and 591,753 captions
    • CC12M (Conceptual 12M): ~12M image-text pairs
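    • Loading sketch: COCO 2017 captions can be iterated with torchvision once the images and annotation JSON are downloaded from cocodataset.org (pycocotools required); the local paths below are placeholders:

      ```python
      # Sketch: iterate COCO 2017 image-caption pairs with torchvision.
      # Assumes train2017 images and annotations were downloaded manually;
      # the paths are placeholders for wherever they were unpacked.
      from torchvision import datasets, transforms

      coco = datasets.CocoCaptions(
          root="coco/train2017",                              # placeholder image dir
          annFile="coco/annotations/captions_train2017.json", # placeholder annotation file
          transform=transforms.ToTensor(),
      )

      print(len(coco))            # number of images (118,287 for train2017, per the note above)
      image, captions = coco[0]   # each image is paired with several reference captions
      print(image.shape, captions[:2])
      ```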
  • Visual instruction tuning
    • liuhaotian/LLaVA-Instruct-150K (Hugging Face): 150K GPT-generated multimodal instruction-following examples, constructed for visual instruction tuning toward GPT-4-level vision/language capability
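    • Record-format sketch: as I understand the LLaVA data format, each entry ties an image to a multi-turn conversation; the example below is hypothetical (made-up id, path, and text) and only illustrates the expected keys:

      ```python
      # Hypothetical LLaVA-Instruct-150K-style record: an image reference plus a
      # GPT-generated multi-turn conversation. Keys follow the LLaVA format as I
      # understand it; the concrete values here are invented for illustration.
      import json

      example_record = {
          "id": "000000123456",                        # hypothetical sample id
          "image": "coco/train2017/000000123456.jpg",  # hypothetical image path
          "conversations": [
              {"from": "human", "value": "<image>\nWhat is the man holding?"},
              {"from": "gpt", "value": "He is holding a red umbrella."},
          ],
      }

      def to_chat_turns(record):
          """Flatten a record into (role, text) turns for instruction tuning."""
          return [(turn["from"], turn["value"]) for turn in record["conversations"]]

      print(json.dumps(example_record, indent=2))
      print(to_chat_turns(example_record))
      ```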
  • Class labeled videos
    • YouTube-8M (2019)
    • WebVid-2M (2021)
    • WebVid-10M
    • VidProM: million-scale text-to-video prompt dataset