Ask HN: How do you reliably scale AI agents in production?
I’m looking to learn from people running agents beyond demos. If you have a production setup, would you share what works and what broke?
What I’m most curious about:
- Orchestrator choice and why: LangGraph, Temporal, Airflow, Prefect, custom queues.
- State and checkpointing: where do you persist steps, how do you replay, how do you handle schema changes. (A rough sketch of our pattern follows this list.)
- Concurrency control: parallel tool calls, backpressure, timeouts, idempotency for retries. (See the asyncio sketch after the context paragraph.)
- Autoscaling and cost: policies that kept latency and spend sane, spot vs on-demand, GPU sharing.
- Memory and retrieval: vector DB vs KV store, eviction policies, preventing stale context.
- Observability: tracing, metrics, evals that actually predicted incidents.
- Safety and isolation: sandboxing tools, rate limits, abuse filters, PII handling.
- A war story: the incident that taught you a lesson and the fix.
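To make the checkpointing and idempotency questions concrete, here is roughly the pattern we've converged on, reduced to a sketch (not our production code; assumes pymongo and redis-py, and `fn` stands in for whatever actually executes the step):

    # Sketch: checkpoint each step in MongoDB and make retries idempotent.
    import hashlib
    import json

    from pymongo import MongoClient
    from redis import Redis

    mongo = MongoClient("mongodb://localhost:27017")
    steps = mongo.agents.steps   # one document per (run_id, step, payload)
    redis = Redis()

    def idempotency_key(run_id: str, step: str, payload: dict) -> str:
        blob = json.dumps(payload, sort_keys=True).encode()
        return f"{run_id}:{step}:{hashlib.sha256(blob).hexdigest()}"

    def run_step(run_id: str, step: str, payload: dict, fn):
        key = idempotency_key(run_id, step, payload)

        # Fast path: a retry of a completed step replays the stored result.
        cached = steps.find_one({"_id": key})
        if cached is not None:
            return cached["result"]

        # Short-lived lock so two workers can't execute the same step at once;
        # the TTL releases it if a worker dies mid-step.
        if not redis.set(f"lock:{key}", "1", nx=True, ex=300):
            raise RuntimeError(f"step {step} already in flight for run {run_id}")

        try:
            result = fn(payload)
            # Persist the checkpoint before acking the queue message, so a
            # crash after this point replays from Mongo, not the tool.
            steps.replace_one(
                {"_id": key},
                {"_id": key, "run_id": run_id, "step": step, "result": result},
                upsert=True,
            )
            return result
        finally:
            redis.delete(f"lock:{key}")

The detail that matters most for us is writing the checkpoint before acking the queue message; curious whether people doing this in Temporal or LangGraph checkpointers handle replay differently.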
Context (so it’s not a drive-by): small team, Python, k8s, MongoDB for state, Redis for queues, everything custom, experimenting with LangGraph and Temporal. Happy to share configs and trade notes in the comments.
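Since I offered to share: for the concurrency bullet above, the shape of our custom wrapper is roughly this (stdlib asyncio only; `call_tool` is a stand-in for your real tool invocation, and the limits are illustrative):

    # Sketch: bounded parallel tool calls with per-call timeouts.
    import asyncio

    MAX_PARALLEL_TOOLS = 8   # backpressure: cap concurrent tool calls
    TOOL_TIMEOUT_S = 30      # per-call timeout

    sem = asyncio.Semaphore(MAX_PARALLEL_TOOLS)

    async def call_tool(name: str, args: dict) -> dict:
        await asyncio.sleep(0.1)   # stand-in for a real HTTP/tool call
        return {"tool": name, "ok": True}

    async def bounded_call(name: str, args: dict) -> dict:
        async with sem:            # callers queue up instead of piling on providers
            return await asyncio.wait_for(call_tool(name, args), TOOL_TIMEOUT_S)

    async def main():
        calls = [("search", {"q": "agents"}), ("fetch", {"url": "https://example.com"})]
        # return_exceptions=True so one timeout doesn't cancel sibling calls
        results = await asyncio.gather(
            *(bounded_call(n, a) for n, a in calls), return_exceptions=True
        )
        print(results)

    asyncio.run(main())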
Answer any subset. Even a quick sketch of your stack and one gotcha would help others reading this. Thanks!