Analysis
- (1) started with The Stack (a 3 TB collection of code) and text from StackOverflow
- (2) used an LLM to select 6B "high-quality" tokens from (1)
- (3) used GPT-3.5 to generate 1B tokens of textbook-style text
- (4) trained a small (1.3B parameter) model ("phi-1") on (2) and (3)
- (5) used GPT-3.5 to generate text in the style of textbook exercises
- (6) fine-tuned phi-1 on (5)
- (7) tested phi-1 on HumanEval to evaluate its programming ability (a sketch of such an eval is below)
The results were pretty good: better than models 10x the size trained on 100x the data. So it seems that scaling up isn't the only thing that matters; data quality can matter more than data quantity or parameter count. (You hear that, gwern?)
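For reference, here's roughly what the HumanEval step (7) looks like in practice. This is a minimal sketch, not the paper's actual harness: it assumes OpenAI's human-eval package (`read_problems` / `write_jsonl`) plus a Hugging Face causal-LM checkpoint, and the model name is just a placeholder for whatever you're evaluating. Greedy decoding with one sample per task gives you pass@1.

```python
# Minimal HumanEval pass@1 sketch (assumes: openai/human-eval installed,
# and some causal-LM code checkpoint; the model name is a placeholder).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from human_eval.data import read_problems, write_jsonl

MODEL = "microsoft/phi-1"  # assumption: swap in the checkpoint you want to test
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).to(device).eval()

@torch.no_grad()
def complete(prompt: str) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids.to(device)
    out = model.generate(ids, max_new_tokens=256, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    # return only the generated continuation, not the echoed prompt
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

problems = read_problems()  # the 164 HumanEval tasks
samples = [dict(task_id=tid, completion=complete(p["prompt"]))
           for tid, p in problems.items()]
write_jsonl("samples.jsonl", samples)
# then run:  evaluate_functional_correctness samples.jsonl   -> reports pass@1
```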
Technique
- Use GPT-4 to rate the quality of a small fraction of the raw code data from (1).
- Use a much smaller code-specific model to generate embeddings for every file.
- Train a classifier on those embeddings to predict which files GPT-4 would rate as good content, then filter the full corpus with it (sketched below).
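A minimal sketch of that filter, assuming you already have GPT-4 quality labels (0/1) for a small labelled sample. The embedding checkpoint and the random-forest classifier here are stand-ins, not necessarily the paper's exact choices.

```python
# Sketch of the quality filter: embed code with a small encoder, fit a
# classifier on the GPT-4-labelled subset, then score the whole corpus.
# "microsoft/codebert-base" and the random forest are assumptions.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.ensemble import RandomForestClassifier

EMB_MODEL = "microsoft/codebert-base"  # placeholder small code encoder
tok = AutoTokenizer.from_pretrained(EMB_MODEL)
enc = AutoModel.from_pretrained(EMB_MODEL).eval()

@torch.no_grad()
def embed(code: str) -> np.ndarray:
    ids = tok(code, truncation=True, max_length=512, return_tensors="pt")
    hidden = enc(**ids).last_hidden_state         # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()  # mean-pool to one vector

def train_filter(labelled):  # labelled: [(code_str, gpt4_label in {0, 1})]
    X = np.stack([embed(code) for code, _ in labelled])
    y = np.array([label for _, label in labelled])
    return RandomForestClassifier(n_estimators=200).fit(X, y)

def keep(clf, code: str, threshold: float = 0.5) -> bool:
    """Cheap stand-in for GPT-4, applied to every file in the corpus."""
    return clf.predict_proba(embed(code)[None, :])[0, 1] >= threshold
```

The point of the setup is that GPT-4 only has to look at a small labelled sample; the cheap embeddings-plus-classifier combination does the rest of the 3 TB.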