Common Crawl ~ a snapshot of the internet ~ 1e15 words
English Wikipedia ~ 3e9 words
Library of Congress ~ 1e7 books * 1e5 words/book ~ 1e12 words or more (but I think no one has this; available digitized books maybe 1e11 words?)
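The estimates above can be sanity-checked with back-of-envelope arithmetic. A minimal sketch, assuming ~1e7 books at ~1e5 words each (my rough figures, not from any official catalog):

```python
# Back-of-envelope word counts for the corpora above.
# Assumptions: ~1e7 books in the Library of Congress, ~1e5 words/book.
books = 1e7
words_per_book = 1e5
loc_words = books * words_per_book      # ~1e12 words

wikipedia_words = 3e9                   # English Wikipedia, from above
common_crawl_words = 1e15               # Common Crawl, from above

print(f"LoC ~ {loc_words:.0e} words")
print(f"Common Crawl ~ {common_crawl_words / loc_words:.0f}x LoC")
print(f"LoC ~ {loc_words / wikipedia_words:.0f}x English Wikipedia")
```

So Common Crawl is roughly three orders of magnitude larger than every book in the Library of Congress, which is itself a few hundred times larger than English Wikipedia.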
TinyStories: 2+ GB of synthetic short stories
OpenWebText 9B: repro of OpenAI’s WebText; part of The Pile
EleutherAI/The Pile:
Most notably, Pile-CC is a filtered version of Common Crawl with non-text content, such as HTML formatting and links, removed. Some candidate sub-datasets were excluded for various reasons; e.g., the US Congressional Record was excluded due to its racist content.
RedPajama 1.2T: repro of LLaMA’s training data
RedPajama v2 30T
StarCoder 250B
Falcon RefinedWeb 600B: https://huggingface.co/datasets/tiiuae/falcon-refinedweb
OASST1
CommonCrawl 320TB
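Some entries above are in bytes (TB/GB), others in words or tokens. A rough conversion sketch, assuming ~6 bytes per English word (about 5 characters plus a space; my rule of thumb, not a measured figure):

```python
# Rough bytes -> words conversion for raw text corpora.
# Assumption: ~6 bytes per English word (~5 chars + a space).
BYTES_PER_WORD = 6

def approx_words(num_bytes: float) -> float:
    """Approximate word count of a raw-text corpus of num_bytes bytes."""
    return num_bytes / BYTES_PER_WORD

common_crawl_bytes = 320e12   # 320 TB, figure from above
print(f"320 TB ~ {approx_words(common_crawl_bytes):.1e} words")
```

By this conversion, 320 TB is ~5e13 words, well below the ~1e15 figure at the top; presumably the TB number covers a single crawl while the word estimate is cumulative across crawls.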
Colossal Clean Crawled Corpus (C4) dataset 356 billion
CommitPack: 4 TB of Git commits across 350 programming languages; benchmarked against other natural and synthetic code instruction datasets (xP3x, Self-Instruct, OASST)