Ask HN: What does your autonomous software dark factory look like?

by ElFitz, about 2 hours ago
In some of the comment threads around here, a few of you shared interesting ideas and patterns, enough to convince me that everyone interested in harness engineering is working on some sort of software dark factory or another.

We have OpenAI's Symphony[1], StrongDM's Factory[2], Yegge's GasTown[3], and probably a few others I've missed.

So I'm curious. What have you been working on? What have you learned? What has worked and what has failed? And what do you think comes after?

I'll go first. The first thing I tried that yielded interesting results was, when possible, providing a ground truth or reference for the model to iterate against: screenshots or mockups for UI work, API contracts and unit/integration tests for logic. That's the Ralph Loop we all know and love. A feedback loop.

The second (obvious, I know) was splitting planning and implementation.

Reviews by other models and iterative loops came next, with appreciable results. However, the implementing agent would often wiggle out of them by deferring things into oblivion or declaring genuinely important feedback out of scope. Another feedback loop. I've found that turning those reviews into "hard gates" has its own set of issues: reviewing agents will always find something to nitpick, turning this iterative implementation approach into a near-infinite loop.

Combining these reviews with committing plans alongside the code led to an interesting accident: reviewing agents spontaneously and unexpectedly picked up on the plans and drastically improved their feedback by comparing plan against implementation (this should have been obvious, and you can imagine my surprise the first time GitHub Copilot actually provided useful feedback instead of the usual typo nitpicks).

Then a comment here led me to an adversarial green team / red team process.

A first agent creates a spec (based on StrongDM's NLSpec) from my initial plan, including a detailed API, and gets it reviewed.

A red team agent writes unit and integration tests based on these specs, and gets them reviewed.

Then a green team agent is given those same specs and API, implements the actual feature or fix, and iterates against the tests without any access to the tests themselves: it only learns which tests failed and what they were testing. This prevents it from gaming the tests. (A rough sketch of this loop is at the end of the post.)

Finally, once the tests pass, a reviewing agent reviews the implementation against the specs.

This was nice. It allows mixing and matching models, thinking levels, and providers. But both the green and red teams would sometimes diverge from the initial specs and API, sometimes with good reason.

So another agent was brought in to evaluate those divergences when they occur and, when they are valid improvements, restart the process from the spec generation point with the new insights folded in. Yet another feedback loop. (Also sketched below.)

And finally, integrating logs, OTel traces, and stack traces into the process. These agents seem remarkably capable of sifting through them, and end-to-end observability drastically improved results. Again, a feedback loop.

That's all from me so far. Curious to see what everyone else has to share about this!
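For anyone who wants to try the green team setup, here is a minimal sketch of the pass/fail-only feedback loop. The harness is entirely hypothetical: green_agent() stands in for whatever model call you use, and the descriptions map is something you'd generate from the red team's reviewed tests. Pytest is just one concrete way to get structured pass/fail output.

    import subprocess

    def failing_tests(test_dir: str) -> list[str]:
        # Run the red team's tests. -q and --tb=no suppress tracebacks; the
        # short failure summary (-rf) lists "FAILED path::test_name - msg".
        proc = subprocess.run(
            ["pytest", test_dir, "-q", "--tb=no", "-rf"],
            capture_output=True, text=True,
        )
        # Keep only the test ids; drop the trailing assertion message so no
        # test internals leak back to the implementing agent.
        return [
            line.removeprefix("FAILED ").split(" - ")[0].strip()
            for line in proc.stdout.splitlines()
            if line.startswith("FAILED")
        ]

    def green_team_loop(spec: str, api: str,
                        descriptions: dict[str, str],
                        max_rounds: int = 10) -> bool:
        for _ in range(max_rounds):
            failures = failing_tests("tests/")
            if not failures:
                return True  # all green: hand off to the reviewing agent
            # The agent sees the spec, the API, the failing test ids, and a
            # one-line description of what each test checks -- never test code.
            report = "\n".join(
                f"{t}: {descriptions.get(t, 'no description')}" for t in failures
            )
            green_agent(spec=spec, api=api, failing=report)  # hypothetical model call
        return False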
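And a sketch of the outer loop with the divergence judge restarting the run from spec generation. Again, every agent function here is a placeholder for a model call: this shows the shape of the process, not a working harness.

    def pipeline(plan: str, max_restarts: int = 3) -> str:
        insights: list[str] = []
        for _ in range(max_restarts):
            spec, api = spec_agent(plan, insights)  # NLSpec-style spec + detailed API, reviewed
            tests = red_team(spec, api)             # unit/integration tests, reviewed
            impl = green_team(spec, api, tests)     # iterates on pass/fail only
            # Did either team drift from the spec or the API?
            divergences = divergence_judge(spec, api, tests, impl)
            if divergences:
                insights.extend(divergences)        # keep only the valid improvements
                continue                            # restart from spec generation
            return reviewer(spec, impl)             # final review against the spec
        raise RuntimeError("pipeline did not converge")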
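As for the observability piece, the cheapest version I've found is to hand the agent only the error-status spans from an OTLP JSON export rather than raw logs. The field names below follow the OTLP JSON encoding as I understand it; treat this as an illustration, not a spec-accurate parser.

    import json

    def error_spans(otlp_export_path: str) -> list[str]:
        with open(otlp_export_path) as f:
            data = json.load(f)
        findings = []
        for rs in data.get("resourceSpans", []):
            for ss in rs.get("scopeSpans", []):
                for span in ss.get("spans", []):
                    status = span.get("status", {})
                    # STATUS_CODE_ERROR is 2 in OTLP; some exporters emit the
                    # enum name as a string instead of the number.
                    if status.get("code") in (2, "STATUS_CODE_ERROR"):
                        findings.append(f"{span['name']}: {status.get('message', '')}")
        return findings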