我们对人工智能代理的评估是否存在误区?

1作者: imshashank大约 2 个月前原帖
我过去一年一直在构建人工智能代理,发现了一个令人担忧的问题:我与每个人交谈时,他们评估代理的方式都是一样的——只看最终输出,并问“这个结果正确吗?” 但这完全是错误的。 一个代理可能通过错误的路径得出正确答案。它在中间步骤中可能会出现幻觉,但仍然能够得出正确的结论。它可能在技术上达成目标的同时违反约束条件。 传统的机器学习指标(准确率、精确率、召回率)忽视了这一切,因为它们只关注最终输出。 我一直在尝试一种不同的方法:将代理的系统提示作为真实标准,评估整个过程(而不仅仅是最终输出),并使用多维评分(而不仅仅是单一指标)。 结果截然不同。突然间,我能够看到幻觉、约束违反、低效路径和一致性问题,而这些都是传统指标完全忽视的。 我是不是疯了?还是整个行业都在错误地评估代理? 我很想听听其他构建代理的人的看法。你们是如何评估它们的?遇到了什么问题?
查看原文
I&#x27;ve been building AI agents for the past year, and I&#x27;ve noticed something troubling: everyone I talk to is evaluating their agents the same way—by looking at the final output and asking &quot;Is it correct?&quot;<p>But that&#x27;s completely wrong.<p>An agent can get the right answer through the wrong path. It can hallucinate in intermediate steps but still reach the correct conclusion. It can violate constraints while technically achieving the goal.<p>Traditional ML metrics (accuracy, precision, recall) miss all of this because they only look at the final output.<p>I&#x27;ve been experimenting with a different approach: using the agent&#x27;s system prompt as ground truth, evaluating the entire trajectory (not just the final output), and using multi-dimensional scoring (not just a single metric).<p>The results are night and day. Suddenly I can see hallucinations, constraint violations, inefficient paths, and consistency issues that traditional metrics completely missed.<p>Am I crazy? Or is the entire industry evaluating agents wrong?<p>I&#x27;d love to hear from others who are building agents. How are you evaluating them? What problems have you run into?