Ask HN: How do you shut down a misbehaving AI in production?
If you are running AI workloads/agents or LLM-backed systems in production, how do you actually shut one down when it starts behaving badly?

By “misbehaving” I mean things like:

- runaway spend
- latency issues
- prompt loops
- tool abuse or unexpected external calls
- data leakage risks
- cascading failures across downstream services

In most systems I’ve seen, there is good observability. You can see logs, traces, cost dashboards. But the actual shutdown mechanism often ends up being manual: disable a feature flag, revoke an API key, roll back a deployment, rate limit something upstream.

I am trying to understand what people are doing in practice.

- What is your actual kill mechanism?
- Is it bound to a model endpoint, an agent instance, a workflow, a Kubernetes workload, something else?
- Is shutdown automated under certain conditions, or always human-approved?
- What did you discover only after your first real incident?

Concrete examples would be extremely helpful.
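For reference, here is a minimal sketch of the kind of kill switch described above: a flag checked before every model call, with an automatic trip on runaway spend. All names are hypothetical, the flag is process-local (a real deployment would back it with a feature-flag service or shared config store), and the model call is a stand-in:

```python
# Hypothetical kill-switch sketch: a flag gating every model call,
# tripped manually or automatically when a spend ceiling is crossed.
import threading


class KillSwitch:
    """Process-local flag; in practice this would live in a
    feature-flag service or shared config store."""

    def __init__(self, spend_limit_usd: float):
        self._enabled = True
        self._reason = ""
        self._spend = 0.0
        self._limit = spend_limit_usd
        self._lock = threading.Lock()

    def disable(self, reason: str) -> None:
        # Manual trip: an operator flips the flag.
        with self._lock:
            self._enabled = False
            self._reason = reason

    def record_spend(self, cost_usd: float) -> None:
        # Automatic trip: cumulative spend crosses the ceiling.
        with self._lock:
            self._spend += cost_usd
            if self._spend > self._limit:
                self._enabled = False
                self._reason = f"spend limit ${self._limit} exceeded"

    def check(self) -> None:
        # Refuse to run once tripped, for any reason.
        with self._lock:
            if not self._enabled:
                raise RuntimeError(f"agent disabled: {self._reason}")


def call_model(switch: KillSwitch, prompt: str) -> str:
    switch.check()             # gate every call on the flag
    reply = f"echo: {prompt}"  # stand-in for the real LLM call
    switch.record_spend(0.01)  # per-request cost accounting
    return reply
```

The same `check()` gate can sit in front of tool invocations and external calls, so one tripped flag stops the whole agent rather than a single endpoint.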