展示HN:Flakestorm – 面向AI代理的混沌工程(本地优先,开源)

2作者: frankhumarang大约 1 个月前原帖
大家好, 我正在开发一个名为 Flakestorm 的开源工具,用于在 AI 代理投入生产之前测试其可靠性。 目前,大多数代理测试主要集中在评估分数或理想路径的提示上。但实际上,代理往往在更平常的情况下出现故障:拼写错误、语气变化、上下文过长、输入格式错误或简单的提示注入——尤其是在较小或本地模型上运行时。 Flakestorm 将混沌工程的理念应用于代理测试。它不是测试单一的提示,而是采用一个“黄金提示”,生成对抗性变异(语义变体、噪声、注入、编码边缘案例),并将其应用于你的代理,最终生成一个鲁棒性评分以及一份详细的 HTML 报告,展示出哪些地方出现了问题。 关键点: - 本地优先(使用 Ollama 生成变异) - 已在 Qwen / Gemma / 其他小型模型上进行测试 - 适用于 HTTP 代理、LangChain 链或 Python 可调用对象 - 无需云服务或 API 密钥 这一切始于我在看到自己的代理在真实用户输入下表现不稳定后,想要调试它们的需求。我仍处于早期阶段,正在尝试理解这一工具在我自己的工作流程之外的实用性。 我非常希望能得到以下方面的反馈: - 这是否与您目前测试代理的方式重叠 - 您见过的未被覆盖的失败模式 - “代理的混沌测试”是否是一个有用的框架,或者应该以不同的方式思考 代码库: [https://github.com/flakestorm/flakestorm](https://github.com/flakestorm/flakestorm) 文档确实很长。 感谢您的关注!
查看原文
Hi everyone,<p>I’ve been working on an open-source tool called Flakestorm to test the reliability of AI agents before they hit production.<p>Most agent testing today focuses on eval scores or happy-path prompts. In practice, agents tend to fail in more mundane ways: typos, tone shifts, long context, malformed input, or simple prompt injections — especially when running on smaller or local models. Flakestorm applies chaos-engineering ideas to agents. Instead of testing one prompt, it takes a “golden prompt”, generates adversarial mutations (semantic variations, noise, injections, encoding edge cases), runs them against your agent, and produces a robustness score plus a detailed HTML report showing what broke.<p>Key points: Local-first (uses Ollama for mutation generation)<p>Tested with Qwen &#x2F; Gemma &#x2F; other small models Works against HTTP agents, LangChain chains, or Python callables No cloud or API keys required This started as a way to debug my own agents after seeing them behave unpredictably under real user input. I’m still early and trying to understand how useful this is outside my own workflow.<p>I’d really appreciate feedback on: Whether this overlaps with how you test agents today Failure modes you’ve seen that aren’t covered Whether “chaos testing for agents” is a useful framing, or if this should be thought of differently Repo: <a href="https:&#x2F;&#x2F;github.com&#x2F;flakestorm&#x2F;flakestorm" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;flakestorm&#x2F;flakestorm</a> Docs are admittedly long.<p>Thanks for taking a look.