Show HN: I built a triple-agent LLM system that verifies its own work.
Hi HN,
Six months ago, I asked Gemini to "send my weekly report to the team." It replied "Email sent successfully." But the email was never sent. The attachment was wrong. Nobody told me.
That's when I realized: *LLMs lie about their own execution.*
---
*The Problem:*
When you ask an LLM to automate a multi-step task (search file → attach → send), it cheerfully reports success even when:
- The file doesn't exist (it hallucinated the ID)
- The API call failed silently
- Permissions were denied
Single-LLM systems have no incentive to admit failure; they optimize for appearing helpful, not for being correct.
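Here's that failure mode in miniature: a naive single-agent loop that takes the model's success narrative at face value. This is a hypothetical sketch (`llm` and `tools` are illustrative stand-ins, not any real library) just to show the pattern:

```python
# Naive single-agent loop: the orchestrator believes the model's narrative.
# `llm` and `tools` are hypothetical stand-ins, not any real library.
def run_task_naive(llm, tools, task: str) -> str:
    report = ""
    for step in llm.plan(task):        # e.g. ["search file", "attach", "send"]
        result = tools.execute(step)   # may fail silently: bad ID, 403, ...
        report = llm.summarize(step, result)
        # Nothing here checks `result` against reality, so a hallucinated
        # "Email sent successfully" sails through unchallenged.
    return report
```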
---
*My Solution: Don't Let the LLM Grade Its Own Homework*
I built PupiBot with three separate agents, ensuring *the agent that executes a step is NOT the one verifying it succeeded.*
The architecture is simple:
* *CEO Agent (Planner, Gemini Flash):* generates the execution plan (no API access).
* *COO Agent (Executor, Gemini Pro):* executes the steps, calling 81 Google APIs and returning raw API responses.
* *QA Agent (Verifier, Gemini Flash):* *after every critical step, validates success with real, independent API calls.* Triggers a retry if verification fails (rough sketch after this list).
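In code, the separation of duties looks roughly like this. A minimal sketch with hypothetical names (`Verdict`, `run_task`, `make_plan`, `execute`, `verify` are mine for illustration, not the repo's actual API):

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    ok: bool
    revised_step: str | None = None   # e.g. "retry search with fuzzy matching"

def run_task(ceo, coo, qa, task: str, max_retries: int = 2) -> list[dict]:
    # ceo/coo/qa are hypothetical agent objects standing in for the three LLMs.
    plan = ceo.make_plan(task)                   # CEO: plans only, no API access
    results = []
    for step in plan:
        for _ in range(max_retries + 1):
            raw = coo.execute(step)              # COO: performs the real Google API call
            verdict = qa.verify(step, raw)       # QA: checks state with its own API call
            if verdict.ok:
                results.append({"step": step, "response": raw})
                break
            step = verdict.revised_step or step  # QA may propose a corrected step
        else:
            raise RuntimeError(f"Step failed after {max_retries + 1} attempts: {step}")
    return results
```

The point of the structure is that the verifier never sees the executor's narrative, only the step and the raw API response, and it confirms state with its own follow-up call.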
*Real Example (Caught & Fixed):*
<i>User: "Email last month's sales report to Alice"</i>
* Search Drive: <i>not found</i>
* *QA Agent:* "Step failed. Retry with fuzzy search."
* Found: "Q3_Sales_Final_v2.pdf" | *QA Agent:* "File verified. Proceed."
* Email sent | *QA Agent:* "Email delivered. Attachment confirmed."
It's like code review: you don't approve your own PRs.
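For the email step, the independent check can be as simple as reading the message back from Gmail instead of trusting the send call's narrative. A minimal sketch, assuming google-api-python-client and an authenticated Gmail `service` (the function name and parameters are mine for illustration, not PupiBot's code):

```python
# Check a "sent" claim against Gmail itself: fetch the message back and
# inspect its labels and attachment metadata. `service` is an authenticated
# google-api-python-client Gmail resource. Illustrative, not PupiBot's code.
def verify_email_sent(service, message_id: str, expected_filename: str) -> bool:
    msg = service.users().messages().get(userId="me", id=message_id).execute()
    if "SENT" not in msg.get("labelIds", []):
        return False
    parts = msg.get("payload", {}).get("parts", [])
    return any(part.get("filename") == expected_filename for part in parts)
```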
---
*Current Implementation & Transparency:*
* *Open source*: MIT license, Python 3.10+
* *APIs*: Google Workspace (Gmail, Drive, Contacts, Calendar, Docs).
* *Reliability (self-tested)*: baseline (a single Gemini Pro) was ~70% success; PupiBot (triple-agent) achieves *~92% success* on the same tasks.
* *Known limitations*: Google-only, 3x LLM overhead (the tradeoff: reliability over speed), early stage.
---
*Why I'm Sharing This (My Garage Story):*
I'm not a programmer and have no formal CS degree. My development process was simple: I used PupiBot as my daily assistant, manually logged every error, and brought that "bug report" to my AI assistants (Claude, Gemini) to fix.
PupiBot is the 'custom car' I built in my garage, fueled by passion and persistence. I'm finally opening the door and inviting the real mechanics (you, HN) to examine the engine.
*What I'd Love from HN:*
1. *Feedback on the independent QA agent pattern.*
2. *Benchmark suggestions for rigorous evaluation.*
3. *Architectural critiques.* Where's the weak link?
---
*Links:*
- GitHub: <a href="https://github.com/PupiBott/PupiBot1.0" rel="nofollow">https://github.com/PupiBott/PupiBot1.0</a>
- Quick demo (1:44 min): <a href="https://youtube.com/shorts/wykKckwaukY?si=0xdn7rM6B2tMAIPw" rel="nofollow">https://youtube.com/shorts/wykKckwaukY?si=0xdn7rM6B2tMAIPw</a>
- Architecture docs: <a href="https://github.com/PupiBott/PupiBot1.0/blob/main/ARCHITECTURE.md" rel="nofollow">https://github.com/PupiBott/PupiBot1.0/blob/main/ARCHITECTURE.md</a>
<i>Built by a self-taught technology enthusiast in Chile</i>
<i>Special thanks to Claude Sonnet 4.5 for being my coding partner throughout this journey</i>