Ask HN: Is synthetic data generation practical outside of academia?
I keep seeing synthetic data pipelines powering the latest LLM “breakthroughs”:
• TinyZero’s $30 fine-tuning workflow
• Sky-T1’s $450 reasoning-model build
• Meta AI’s Llama 3 herd (2024 paper detailing their synthetic-data training)
• Berkeley OpenThoughts (“Data Recipes for Reasoning Models”), published yesterday
There are also open-source toolkits you can experiment with (a sketch of the general pattern they automate follows after these links):
https://github.com/meta-llama/synthetic-data-kit
https://github.com/bespokelabsai/curator
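For context, the basic loop these toolkits automate is roughly: prompt a strong “teacher” model to generate task examples, filter them, and fine-tune a smaller model on the output. Here is a minimal sketch of that loop in Python (assumptions on my part: an OpenAI-compatible endpoint via the openai package; the model name, topics, and prompt format are placeholders I made up, not anything taken from the toolkits above):

    import json
    from openai import OpenAI  # assumes the openai Python package is installed

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Hypothetical seed topics for a narrow target task (placeholder values).
    TOPICS = ["refund policies", "shipping delays", "account recovery"]

    def generate_pair(topic: str) -> dict:
        # Ask the teacher model for one instruction/response pair.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder teacher model
            messages=[{
                "role": "user",
                "content": f"Write one customer-support question about {topic}, "
                           f"then an ideal answer. Format: Q: ... A: ...",
            }],
        )
        text = resp.choices[0].message.content
        q, _, a = text.partition("A:")
        return {"prompt": q.removeprefix("Q:").strip(), "completion": a.strip()}

    # Write JSONL in the shape most fine-tuning APIs expect.
    with open("synthetic_train.jsonl", "w") as f:
        for topic in TOPICS:
            f.write(json.dumps(generate_pair(topic)) + "\n")

As far as I can tell, the toolkits above mostly add dedup, quality filtering, and judge-model scoring on top of this basic loop.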
But it still feels very research-oriented. I haven’t found many examples of these pipelines running in real-world products.
I’m curious:
1. Who is using synthetic-data pipelines in production today?
2. What tasks does it actually improve? E.g., fine-tuning smaller models for specific tasks?
Any real-world stories, pointers, or further reading would be hugely appreciated. Thanks!