Ask HN: Is synthetic data generation practical outside academia?

Posted by cpard, 8 months ago
I keep seeing synthetic data pipelines powering the latest LLM "breakthroughs":

• TinyZero's $30 fine-tuning workflow
• Sky-T1's $450 reasoning-model build
• Meta AI's Llama 3 herd (their 2024 paper details the synthetic-data training)
• Berkeley OpenThoughts ("Data Recipes for Reasoning Models"), published yesterday

There are also open-source toolkits you can experiment with:

https://github.com/meta-llama/synthetic-data-kit
https://github.com/bespokelabsai/curator

But it all still feels very research-oriented. I haven't found many examples of these pipelines running in real-world products.

I'm curious:

1. Who is using synthetic-data pipelines in production today?
2. What tasks does it actually improve? E.g., fine-tuning smaller models for specific tasks?

Any real-world stories, pointers, or further reading would be hugely appreciated. Thanks!
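For context on what such a pipeline involves, here is a minimal sketch of the generate → dedupe → serialize shape these toolkits share. The templated generator is a hypothetical stand-in for a teacher-LLM call (the linked toolkits call a real model at that step); everything else is ordinary data plumbing.

```python
# Minimal sketch of a synthetic-data pipeline: generate candidate
# instruction/response pairs, drop duplicates, and serialize to JSONL
# (the common input format for fine-tuning jobs).
import hashlib
import json


def generate_pairs(seed_topics):
    """Stand-in generator: a real pipeline calls a teacher LLM here."""
    for topic in seed_topics:
        yield {
            "instruction": f"Explain {topic} in one paragraph.",
            "response": f"[teacher-model answer about {topic}]",
        }


def dedupe(records):
    """Drop exact duplicates by hashing the instruction text."""
    seen = set()
    for rec in records:
        key = hashlib.sha256(rec["instruction"].encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            yield rec


def to_jsonl(records):
    """One JSON object per line, ready to write to a training file."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Duplicate seed topics show the dedupe step doing its job.
topics = ["tokenization", "tokenization", "LoRA fine-tuning"]
dataset = to_jsonl(dedupe(generate_pairs(topics)))
print(dataset)
```

Production pipelines add a quality-filtering step between generation and dedup (e.g. scoring responses with a judge model), but the overall flow is the same.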