Ask HN: How do you shut down a misbehaving AI in production?
If you are running AI workloads/agents or LLM-backed systems in production, how do you actually shut one down when it starts behaving badly?

By “misbehaving” I mean things like:

- runaway spend
- latency issues
- prompt loops
- tool abuse or unexpected external calls
- data leakage risks
- cascading failures across downstream services

In most systems I’ve seen, there is good observability. You can see logs, traces, cost dashboards. But the actual shutdown mechanism often ends up being manual: disable a feature flag, revoke an API key, roll back a deployment, rate limit something upstream.

I am trying to understand what people are doing in practice.

- What is your actual kill mechanism?
- Is it bound to a model endpoint, an agent instance, a workflow, a Kubernetes workload, something else?
- Is shutdown automated under certain conditions, or always human-approved?
- What did you discover only after your first real incident?

Concrete examples would be extremely helpful.
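For reference, here is a minimal sketch of the kind of kill switch described above: a flag checked before every model call, with an automatic trip on runaway spend. All names are hypothetical, the flag is process-local (a real deployment would back it with a feature-flag service or shared config store), and the model call is a stand-in:

```python
# Hypothetical kill-switch sketch: a flag gating every model call,
# tripped manually or automatically when a spend ceiling is crossed.
import threading


class KillSwitch:
    """Process-local flag; in practice this would live in a
    feature-flag service or shared config store."""

    def __init__(self, spend_limit_usd: float):
        self._enabled = True
        self._reason = ""
        self._spend = 0.0
        self._limit = spend_limit_usd
        self._lock = threading.Lock()

    def disable(self, reason: str) -> None:
        # Manual trip: an operator flips the flag.
        with self._lock:
            self._enabled = False
            self._reason = reason

    def record_spend(self, cost_usd: float) -> None:
        # Automatic trip: cumulative spend crosses the ceiling.
        with self._lock:
            self._spend += cost_usd
            if self._spend > self._limit:
                self._enabled = False
                self._reason = f"spend limit ${self._limit} exceeded"

    def check(self) -> None:
        # Refuse to run once tripped, for any reason.
        with self._lock:
            if not self._enabled:
                raise RuntimeError(f"agent disabled: {self._reason}")


def call_model(switch: KillSwitch, prompt: str) -> str:
    switch.check()             # gate every call on the flag
    reply = f"echo: {prompt}"  # stand-in for the real LLM call
    switch.record_spend(0.01)  # per-request cost accounting
    return reply
```

The same `check()` gate can sit in front of tool invocations and external calls, so one tripped flag stops the whole agent rather than a single endpoint.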