LLM agent workflows fail silently. The reliability layer we wish existed.
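For the past few months, my co-founder and I have been building complex agentic workflows, and we kept hitting the same reliability issues: inconsistent shared state, silent failures, agents diverging from each other, and no clean way to recover without restarting the entire workflow.
It became clear that most of these failures weren't "LLM problems" but classic distributed-systems problems showing up in multi-agent setups.
Since nothing in the current ecosystem addressed this properly, we started building a reliability layer for agent workflows: something that adds structure, safety, and predictable recovery to multi-agent systems without forcing developers to rewrite their stack.
We're looking to connect with people who have run into similar issues or are building production-grade agent workflows. The goal is to understand how others think about reliability, failure recovery, and workflow consistency in these systems.
If you're working in this space or want to try the early access, here's the link: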
https://tally.so/r/LZDb0j
We would appreciate any thoughts or experiences you have had around agent reliability, especially failure cases or pain points.