Show HN: Herd – A Go sidecar to prevent stateful Puppeteer/LLM processes from OOMing

Author: sankalpnarula, about 22 hours ago
Hey HN.

I'm an engineering student at Waterloo building stateful AI agents, and I kept hitting the same wall: whenever my Python scripts crashed or dropped a connection, the underlying Puppeteer or Ollama processes would just sit there orphaned, eating RAM until the node OOM-killed itself. Standard load balancers break sticky sessions, and passive HTTP timeouts are too slow for cleanup.

I couldn't find a good local process pool that actually cleaned up dead stateful sessions reliably, so I built Herd in Go.

It uses a persistent stream (gRPC/Unix sockets) strictly as a dead-man's switch. If your client script dies, the stream breaks. Herd registers the EOF and instantly fires a SIGKILL to the worker process (relying on Pdeathsig on Linux). For the actual heavy data, you just blast HTTP traffic through Herd's internal proxy, which routes it directly to the active process port.

My actual goal is to turn this into a multi-node distributed mesh with a Redis registry, where a client can drop off and an edge gateway routes them back to the exact pod holding their stateful memory.

But I know building a distributed mesh on top of a leaky local engine is a death sentence. The single-node cleanup has to be flawless first.

I'd love for you guys to roast the architecture. Specifically: is relying on Pdeathsig actually robust enough for a local dead-man's switch in production, or am I being naive and need to just bite the bullet and wrap everything in cgroups & microvms right now?

Repo link: https://github.com/herd-core/herd