Frontiers in synthetic data
Interconnects - A podcast by Nathan Lambert
Synthetic data is known to be a super powerful tool at every level of the language modeling stack. It is documented as being used to expand vanilla pretraining data and to create large swaths of fine-tuning data. Many, many more rumors surround its use: Anthropic's pretraining-scale constitutional AI, Mistral AI's first models allegedly being pretrained on OpenAI outputs, hopes that Q* is OpenAI's remaining moat, and much more. The diversity of use cases for synthetic data makes planning around its role in solving specific goals difficult.

This is AI generated audio with Python and 11Labs.

Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/frontiers-in-synthetic-data

00:00 Frontiers in synthetic data
01:14 1. Direct distillation is still king
02:54 2. Are Gemini Flash and Claude Haiku distilled?
04:03 3. Filtering prevents collapse
06:30 4. Synthetic data strategy taxes
07:32 5. Pros and cons of training on multi-output-source synthetic datasets
08:54 6. Structured synthetic data
09:42 7. Weak-to-strong generalization is maybe real
10:27 8. Creating synthetic prompts is overlooked again

Get full access to Interconnects at www.interconnects.ai/subscribe
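As a rough illustration of two of the themes above (direct distillation and filtering to prevent collapse), the basic pipeline can be sketched as: sample completions from a stronger teacher model, filter out degenerate outputs, and keep the survivors as fine-tuning pairs. This is a minimal, hypothetical sketch; `teacher_generate` and `passes_filter` are stand-ins for a real model API and a real quality filter (heuristics, classifiers, or reward models in practice), not anything from the episode.

```python
def teacher_generate(prompt: str) -> str:
    # Hypothetical stand-in for a call to a stronger teacher model's API.
    return f"Answer to: {prompt}"

def passes_filter(prompt: str, completion: str) -> bool:
    # Toy quality filter. Real pipelines discard empty, repetitive, or
    # off-distribution outputs so the student doesn't learn the
    # teacher's failure modes (the "filtering prevents collapse" idea).
    return len(completion.strip()) > 0 and completion != prompt

def build_sft_dataset(prompts: list[str]) -> list[dict]:
    # Direct distillation: pair each prompt with a filtered teacher output.
    dataset = []
    for p in prompts:
        completion = teacher_generate(p)
        if passes_filter(p, completion):
            dataset.append({"prompt": p, "completion": completion})
    return dataset

if __name__ == "__main__":
    pairs = build_sft_dataset(["What is synthetic data?"])
    print(pairs)
```

The resulting prompt/completion pairs would then feed a standard supervised fine-tuning run on the student model.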