Ask HN: Is synthetic data generation practical outside academia?

Posted by cpard, 8 months ago
I keep seeing synthetic data pipelines powering the latest LLM "breakthroughs":

• TinyZero's $30 fine-tuning workflow
• Sky-T1's $450 reasoning-model build
• Meta AI's Llama 3 herd (their 2024 paper details the synthetic-data training)
• Berkeley OpenThoughts ("Data Recipes for Reasoning Models"), published yesterday

There are also open-source toolkits you can experiment with:

https://github.com/meta-llama/synthetic-data-kit
https://github.com/bespokelabsai/curator

But it all still feels very research-oriented. I haven't found many examples of these pipelines running in real-world products.

I'm curious:

1. Who is using synthetic-data pipelines in production today?
2. What tasks does it actually improve? E.g., fine-tuning smaller models for specific tasks?

Any real-world stories, pointers, or further reading would be hugely appreciated. Thanks!
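For context on what such a pipeline involves, here is a minimal sketch of the generate → dedupe → serialize shape these toolkits share. The templated generator is a hypothetical stand-in for a teacher-LLM call (the linked toolkits call a real model at that step); everything else is ordinary data plumbing.

```python
# Minimal sketch of a synthetic-data pipeline: generate candidate
# instruction/response pairs, drop duplicates, and serialize to JSONL
# (the common input format for fine-tuning jobs).
import hashlib
import json


def generate_pairs(seed_topics):
    """Stand-in generator: a real pipeline calls a teacher LLM here."""
    for topic in seed_topics:
        yield {
            "instruction": f"Explain {topic} in one paragraph.",
            "response": f"[teacher-model answer about {topic}]",
        }


def dedupe(records):
    """Drop exact duplicates by hashing the instruction text."""
    seen = set()
    for rec in records:
        key = hashlib.sha256(rec["instruction"].encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            yield rec


def to_jsonl(records):
    """One JSON object per line, ready to write to a training file."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Duplicate seed topics show the dedupe step doing its job.
topics = ["tokenization", "tokenization", "LoRA fine-tuning"]
dataset = to_jsonl(dedupe(generate_pairs(topics)))
print(dataset)
```

Production pipelines add a quality-filtering step between generation and dedup (e.g. scoring responses with a judge model), but the overall flow is the same.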