Ask HN: How do you shut down a misbehaving AI in production?

Author: nordic_lion, about 1 month ago
If you are running AI workloads/agents or LLM-backed systems in production, how do you actually shut one down when it starts behaving badly?

By "misbehaving" I mean things like:

- runaway spend
- latency issues
- prompt loops
- tool abuse or unexpected external calls
- data leakage risks
- cascading failures across downstream services

In most systems I've seen, observability is good: you can see logs, traces, and cost dashboards. But the actual shutdown mechanism often ends up being manual: disable a feature flag, revoke an API key, roll back a deployment, rate limit something upstream.

I am trying to understand what people are doing in practice.

- What is your actual kill mechanism?
- Is it bound to a model endpoint, an agent instance, a workflow, a Kubernetes workload, or something else?
- Is shutdown automated under certain conditions, or always human-approved?
- What did you discover only after your first real incident?

Concrete examples would be extremely helpful.
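To make the question concrete: one common shape for an automated kill mechanism is a circuit breaker wrapped around every model call that trips on runaway spend, excessive latency, or repeated errors, after which all further calls are refused until a human resets it. A minimal Python sketch, purely illustrative; the class name, thresholds, and the `(reply, cost)` return shape of `call_model` are hypothetical, not any specific vendor API:

```python
import time


class KillSwitchTripped(RuntimeError):
    """Raised when a call is attempted after the breaker has tripped."""


class LLMKillSwitch:
    """Circuit breaker around an LLM call.

    Trips permanently (until manually reset) on cumulative spend over
    budget, a single call exceeding the latency limit, or too many
    consecutive errors. All thresholds here are made-up defaults.
    """

    def __init__(self, max_total_cost=50.0, max_latency_s=30.0,
                 max_consecutive_errors=3):
        self.max_total_cost = max_total_cost
        self.max_latency_s = max_latency_s
        self.max_consecutive_errors = max_consecutive_errors
        self.total_cost = 0.0
        self.consecutive_errors = 0
        self.tripped = False

    def guard(self, call_model, prompt):
        if self.tripped:
            raise KillSwitchTripped("kill switch tripped; human reset required")
        start = time.monotonic()
        try:
            # call_model is assumed to return (reply_text, dollar_cost)
            reply, cost = call_model(prompt)
        except Exception:
            self.consecutive_errors += 1
            if self.consecutive_errors >= self.max_consecutive_errors:
                self.tripped = True
            raise
        self.consecutive_errors = 0
        self.total_cost += cost
        latency = time.monotonic() - start
        if self.total_cost > self.max_total_cost or latency > self.max_latency_s:
            # Hard stop: this reply is returned, but every later call refuses.
            self.tripped = True
        return reply
```

The design choice this sketch illustrates is that the trip is sticky: once any threshold is crossed, the breaker stays open until someone deliberately resets it, which matches the "automated shutdown, human-approved restart" split several of the questions above are probing at.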