Testing AI agents
As I work on AI agents, I find myself constantly thinking about how to effectively test them.
As we integrate more knowledge sources and expand our agents' capabilities, testing becomes increasingly complex. As a standard practice, we use evals to ensure quality is maintained. But honestly, I feel like something is missing.
The issue I’m seeing is that we, as engineers, sometimes lack the domain knowledge to assess an agent's responses accurately. At the same time, current tooling makes it hard to bring domain experts into the testing loop: it prioritizes dashboards over the readability of the actual outputs, for example.
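To make that concrete, here is a minimal sketch of the kind of eval I mean (Python; `run_agent` and the sample case are stand-ins, not our actual stack): each case gets an automatic check, but the full question/answer pairs are also written out as a plain Markdown report a domain expert could read and annotate directly.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class EvalCase:
    question: str                  # input sent to the agent
    expected_keywords: list[str]   # terms a domain expert says must appear


def run_agent(question: str) -> str:
    """Placeholder for the real agent call (LLM + tools pipeline)."""
    return f"(stub answer for: {question})"


def evaluate(cases: list[EvalCase], report_path: Path) -> None:
    lines = ["# Agent eval results", ""]
    for case in cases:
        answer = run_agent(case.question)
        missing = [k for k in case.expected_keywords if k.lower() not in answer.lower()]
        verdict = "PASS" if not missing else f"REVIEW (missing: {', '.join(missing)})"
        # Keep question, full answer, and verdict together so a domain expert
        # can judge the actual response, not just an aggregate score.
        lines += [f"## {case.question}", "", answer, "", f"**Verdict:** {verdict}", ""]
    report_path.write_text("\n".join(lines), encoding="utf-8")


if __name__ == "__main__":
    cases = [EvalCase("What is our refund policy for damaged goods?", ["30 days", "refund"])]
    evaluate(cases, Path("eval_report.md"))
```

The keyword check itself isn't the point; the point is that the artifact the expert reviews is the agent's full output in readable form, rather than a dashboard metric.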
This has been my experience so far—I would love to hear your thoughts on this.