Ask HN: Do you A/B test your LLM prompts?
I'm exploring a dev tool idea that helps you A/B test your prompts, but I'm not sure if there's a need for it. You'd be able to write and version your prompts in a web UI, then A/B test them and see results with metrics you define.
So for example, with a bot that writes cold outbound emails, you could verify whether v1 or v2 of your system prompt results in a better reply rate.
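To make that concrete, here's a minimal Python sketch of what such a test might look like under the hood; everything in it (the prompt strings, assign_variant, record_outcome, reply_rate) is hypothetical, not an existing tool's API:

    import hashlib
    from collections import defaultdict

    # Two versions of the system prompt under test (made-up examples).
    VARIANTS = {
        "v1": "You write short, friendly cold outbound emails.",
        "v2": "You write direct, benefit-led cold outbound emails.",
    }

    def assign_variant(lead_id: str) -> str:
        # Stable 50/50 split: the same lead always gets the same version.
        h = int(hashlib.md5(lead_id.encode()).hexdigest(), 16)
        return "v1" if h % 2 == 0 else "v2"

    # Tally sends and replies per variant; reply rate = replies / sends.
    stats = defaultdict(lambda: {"sends": 0, "replies": 0})

    def record_outcome(lead_id: str, got_reply: bool) -> None:
        variant = assign_variant(lead_id)
        stats[variant]["sends"] += 1
        stats[variant]["replies"] += int(got_reply)

    def reply_rate(variant: str) -> float:
        s = stats[variant]
        return s["replies"] / s["sends"] if s["sends"] else 0.0

The tool idea would presumably wrap this kind of loop: versioning the prompts, doing the assignment and logging for you, and reporting the metric per version.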
Does anybody currently do something like this, or want something like this?