Show HN: τ³-Bench released – Can agents handle complex documents and live calls?

5 points | by victorbarres | about 5 hours ago
τ-Bench is an open benchmark for evaluating AI agents on grounded, multi-turn customer service tasks with verifiable outcomes. It's been great to see the community adopt it since launch — this is now the third iteration. With τ³-Bench, we're extending it to two new settings: knowledge-intensive retrieval and full-duplex voice.

τ-Knowledge: agents must navigate ~700 interconnected policy documents to complete multi-step tasks. The best frontier model (GPT-5.2, high reasoning) hits ~25%. The surprising part: even when you hand the model the exact documents it needs, performance only reaches ~40%. We found that the bottleneck isn't retrieval — it's reasoning over complex, interlinked policies and executing the right actions in the right order.

τ-Voice: same grounded tasks, but over live full-duplex voice with realistic audio — accents, background noise, interruptions, compressed phone lines. Voice agents score 31–51% in clean audio conditions and 26–38% in realistic ones. A consistent failure pattern across providers (OpenAI, Gemini, xAI): the agent mishears a name or email during authentication, and everything downstream fails.

We also incorporated 75+ task fixes to the original airline, retail, and telecom domains — many based on community audits and PRs (including contributions from Amazon and Anthropic). We believe a benchmark is only as good as its maintenance, and we're grateful for the community's help improving it.

Code and leaderboard are open — we'd welcome community submissions and feedback.

Blog post (papers, code, leaderboard): https://sierra.ai/blog/bench-advancing-agent-benchmarking-to-knowledge-and-voice
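The oracle-documents finding above follows from a simple ablation: score the same tasks once with retrieved documents and once with the gold documents handed over. The gap between the two conditions measures retrieval errors; whatever shortfall remains under gold documents must come from reasoning and action execution. A minimal sketch of that accounting, with invented toy data (the names, `Result` type, and numbers are illustrative, not the actual τ³-Bench harness):

```python
# Toy ablation: compare task success with retrieved vs. gold (oracle)
# documents. All data here is invented for illustration.
from dataclasses import dataclass

@dataclass
class Result:
    retrieved_ok: bool   # task solved with retrieved documents
    oracle_ok: bool      # task solved with gold documents provided

# Five hypothetical task outcomes.
results = [
    Result(False, True),   # retrieval missed a needed document
    Result(False, False),  # fails even with gold docs -> reasoning error
    Result(True,  True),
    Result(False, False),  # reasoning error again
    Result(False, True),
]

retrieved_acc = sum(r.retrieved_ok for r in results) / len(results)
oracle_acc = sum(r.oracle_ok for r in results) / len(results)

print(f"retrieved: {retrieved_acc:.0%}, oracle: {oracle_acc:.0%}")
```

In this toy run the oracle condition still loses 40% of tasks, and none of that loss can be blamed on retrieval — the same logic behind the ~25% vs. ~40% numbers above.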
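The voice failure cascade is also easy to see in miniature: one ASR substitution during authentication poisons every later tool call, because they all key off the authenticated account. A hypothetical sketch (the account data, the one-character "mishearing", and all function names are invented for illustration):

```python
# Toy cascade: a misheard email fails authentication, and every
# downstream action depends on the authenticated account.
ACCOUNTS = {"ana.perez@example.com": {"id": "A-17", "orders": ["O-304"]}}

def transcribe(spoken: str, noisy: bool) -> str:
    # Stand-in for ASR: under noise, "ana" is misheard as "anna".
    return spoken.replace("ana.", "anna.", 1) if noisy else spoken

def authenticate(email: str):
    return ACCOUNTS.get(email)  # None -> authentication failed

def cancel_order(account, order_id: str) -> str:
    if account is None:
        return "FAILED: not authenticated"
    if order_id not in account["orders"]:
        return "FAILED: unknown order"
    return f"cancelled {order_id}"

for noisy in (False, True):
    heard = transcribe("ana.perez@example.com", noisy)
    print(f"noisy={noisy}: {cancel_order(authenticate(heard), 'O-304')}")
```

The clean-audio run cancels the order; the noisy run fails authentication and therefore every subsequent action, which is the pattern seen across providers above.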