Show HN: I built a triple-agent LLM system that verifies its own work.
Hi HN,
Six months ago, I asked Gemini to "send my weekly report to the team." It replied "Email sent successfully." But the email was never sent. The attachment was wrong. Nobody told me.
That's when I realized: *LLMs lie about their own execution.*
---
*The Problem:*
When you ask an LLM to automate a multi-step task (search file → attach → send), it cheerfully reports success even when:
- The file doesn't exist (it hallucinated the ID)
- The API call failed silently
- Permissions were denied
Single-LLM systems have no incentive to admit failure; they optimize for appearing helpful, not for being correct.
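Here's that failure mode in miniature: a naive single-agent loop that takes the model's success narrative at face value. This is a hypothetical sketch (`llm` and `tools` are illustrative stand-ins, not any real library) just to show the pattern:

```python
# Naive single-agent loop: the orchestrator believes the model's narrative.
# `llm` and `tools` are hypothetical stand-ins, not any real library.
def run_task_naive(llm, tools, task: str) -> str:
    report = ""
    for step in llm.plan(task):        # e.g. ["search file", "attach", "send"]
        result = tools.execute(step)   # may fail silently: bad ID, 403, ...
        report = llm.summarize(step, result)
        # Nothing here checks `result` against reality, so a hallucinated
        # "Email sent successfully" sails through unchallenged.
    return report
```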
---
*My Solution: Don't Let the LLM Grade Its Own Homework*
I built PupiBot with three separate agents, ensuring *the agent that executes a step is NOT the one verifying it succeeded.*
The architecture is simple:
* *CEO Agent (Planner, Gemini Flash):* generates the execution plan (no API access).
* *COO Agent (Executor, Gemini Pro):* executes the steps, calling 81 Google APIs and returning raw API responses.
* *QA Agent (Verifier, Gemini Flash):* *after every critical step, validates success with real, independent API calls.* Triggers a retry if verification fails (rough sketch after this list).
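In code, the separation of duties looks roughly like this. A minimal sketch with hypothetical names (`Verdict`, `run_task`, `make_plan`, `execute`, `verify` are mine for illustration, not the repo's actual API):

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    ok: bool
    revised_step: str | None = None   # e.g. "retry search with fuzzy matching"

def run_task(ceo, coo, qa, task: str, max_retries: int = 2) -> list[dict]:
    # ceo/coo/qa are hypothetical agent objects standing in for the three LLMs.
    plan = ceo.make_plan(task)                   # CEO: plans only, no API access
    results = []
    for step in plan:
        for _ in range(max_retries + 1):
            raw = coo.execute(step)              # COO: performs the real Google API call
            verdict = qa.verify(step, raw)       # QA: checks state with its own API call
            if verdict.ok:
                results.append({"step": step, "response": raw})
                break
            step = verdict.revised_step or step  # QA may propose a corrected step
        else:
            raise RuntimeError(f"Step failed after {max_retries + 1} attempts: {step}")
    return results
```

The point of the structure is that the verifier never sees the executor's narrative, only the step and the raw API response, and it confirms state with its own follow-up call.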
*Real Example (Caught & Fixed):*
<i>User: "Email last month's sales report to Alice"</i>
* Search Drive: <i>not found</i>
* *QA Agent:* "Step failed. Retry with fuzzy search."
* Found: "Q3_Sales_Final_v2.pdf" | *QA Agent:* "File verified. Proceed."
* Email sent | *QA Agent:* "Email delivered. Attachment confirmed."
It's like code review: you don't approve your own PRs.
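For the email step, the independent check can be as simple as reading the message back from Gmail instead of trusting the send call's narrative. A minimal sketch, assuming google-api-python-client and an authenticated Gmail `service` (the function name and parameters are mine for illustration, not PupiBot's code):

```python
# Check a "sent" claim against Gmail itself: fetch the message back and
# inspect its labels and attachment metadata. `service` is an authenticated
# google-api-python-client Gmail resource. Illustrative, not PupiBot's code.
def verify_email_sent(service, message_id: str, expected_filename: str) -> bool:
    msg = service.users().messages().get(userId="me", id=message_id).execute()
    if "SENT" not in msg.get("labelIds", []):
        return False
    parts = msg.get("payload", {}).get("parts", [])
    return any(part.get("filename") == expected_filename for part in parts)
```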
---
*Current Implementation & Transparency:*
* *Open source*: MIT license, Python 3.10+
* *APIs*: Google Workspace (Gmail, Drive, Contacts, Calendar, Docs).
* *Reliability (self-tested)*: baseline (a single Gemini Pro) was ~70% success; PupiBot (triple-agent) achieves *~92% success* on the same tasks.
* *Known limitations*: Google-only, 3x LLM overhead (the tradeoff: reliability over speed), early stage.
---
*Why I'm Sharing This (My Garage Story):*
I'm not a programmer and have no formal CS degree. My development process was simple: I used PupiBot as my daily assistant, manually logged every error, and brought that "bug report" to my AI assistants (Claude, Gemini) to fix.
PupiBot is the 'custom car' I built in my garage, fueled by passion and persistence. I'm finally opening the door and inviting the real mechanics (you, HN) to examine the engine.
*What I'd Love from HN:*
1. *Feedback on the independent QA agent pattern.*
2. *Benchmark suggestions for rigorous evaluation.*
3. *Architectural critiques.* Where's the weak link?
---
*Links:*
- GitHub: <a href="https://github.com/PupiBott/PupiBot1.0" rel="nofollow">https://github.com/PupiBott/PupiBot1.0</a>
- Quick demo (1:44 min): <a href="https://youtube.com/shorts/wykKckwaukY?si=0xdn7rM6B2tMAIPw" rel="nofollow">https://youtube.com/shorts/wykKckwaukY?si=0xdn7rM6B2tMAIPw</a>
- Architecture docs: <a href="https://github.com/PupiBott/PupiBot1.0/blob/main/ARCHITECTURE.md" rel="nofollow">https://github.com/PupiBott/PupiBot1.0/blob/main/ARCHITECTURE.md</a>
<i>Built by a self-taught technology enthusiast in Chile</i>
<i>Special thanks to Claude Sonnet 4.5 for being my coding partner throughout this journey</i>