Show HN: Herd – a Go sidecar that keeps stateful Puppeteer/LLM processes from OOMing
Hey HN.

I'm an engineering student at Waterloo building stateful AI agents, and I kept hitting the same wall: whenever my Python scripts crashed or dropped a connection, the underlying Puppeteer or Ollama processes would just sit there orphaned, eating RAM until the node OOM-killed itself. Standard load balancers break sticky sessions, and passive HTTP timeouts are too slow for cleanup.

I couldn't find a local process pool that reliably cleaned up dead stateful sessions, so I built Herd in Go.

It uses a persistent stream (gRPC/Unix sockets) strictly as a dead-man's switch. If your client script dies, the stream breaks. Herd registers the EOF and instantly fires a SIGKILL at the worker process (relying on Pdeathsig on Linux). The actual heavy data bypasses the stream: you blast HTTP traffic through Herd's internal proxy, which routes it directly to the active process's port.

My actual goal is to turn this into a multi-node distributed mesh with a Redis registry, where a client can drop off and an edge gateway routes it back to the exact pod holding its stateful memory.

But I know building a distributed mesh on top of a leaky local engine is a death sentence. The single-node cleanup has to be flawless first.

I'd love for you to roast the architecture. Specifically: is relying on Pdeathsig actually robust enough for a local dead-man's switch in production, or am I being naive and need to bite the bullet and wrap everything in cgroups and microVMs right now?

Repo link: https://github.com/herd-core/herd