Ask HN: How do you reliably scale AI agents in production?
I’m looking to learn from people running agents beyond demos. If you have a production setup, would you share what works and what broke?
What I’m most curious about:
- Orchestrator choice and why: LangGraph, Temporal, Airflow, Prefect, custom queues.
- State and checkpointing: where do you persist steps, how do you replay, how do you handle schema changes. (A rough sketch of our pattern follows this list.)
- Concurrency control: parallel tool calls, backpressure, timeouts, idempotency for retries. (See the asyncio sketch after the context paragraph.)
- Autoscaling and cost: policies that kept latency and spend sane, spot vs on-demand, GPU sharing.
- Memory and retrieval: vector DB vs KV store, eviction policies, preventing stale context.
- Observability: tracing, metrics, evals that actually predicted incidents.
- Safety and isolation: sandboxing tools, rate limits, abuse filters, PII handling.
- A war story: the incident that taught you a lesson and the fix.
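To make the checkpointing and idempotency questions concrete, here is roughly the pattern we've converged on, reduced to a sketch (not our production code; assumes pymongo and redis-py, and `fn` stands in for whatever actually executes the step):

    # Sketch: checkpoint each step in MongoDB and make retries idempotent.
    import hashlib
    import json

    from pymongo import MongoClient
    from redis import Redis

    mongo = MongoClient("mongodb://localhost:27017")
    steps = mongo.agents.steps   # one document per (run_id, step, payload)
    redis = Redis()

    def idempotency_key(run_id: str, step: str, payload: dict) -> str:
        blob = json.dumps(payload, sort_keys=True).encode()
        return f"{run_id}:{step}:{hashlib.sha256(blob).hexdigest()}"

    def run_step(run_id: str, step: str, payload: dict, fn):
        key = idempotency_key(run_id, step, payload)

        # Fast path: a retry of a completed step replays the stored result.
        cached = steps.find_one({"_id": key})
        if cached is not None:
            return cached["result"]

        # Short-lived lock so two workers can't execute the same step at once;
        # the TTL releases it if a worker dies mid-step.
        if not redis.set(f"lock:{key}", "1", nx=True, ex=300):
            raise RuntimeError(f"step {step} already in flight for run {run_id}")

        try:
            result = fn(payload)
            # Persist the checkpoint before acking the queue message, so a
            # crash after this point replays from Mongo, not the tool.
            steps.replace_one(
                {"_id": key},
                {"_id": key, "run_id": run_id, "step": step, "result": result},
                upsert=True,
            )
            return result
        finally:
            redis.delete(f"lock:{key}")

The detail that matters most for us is writing the checkpoint before acking the queue message; curious whether people doing this in Temporal or LangGraph checkpointers handle replay differently.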
Context (so it’s not a drive-by): small team, Python, k8s, MongoDB for state, Redis for queues, everything custom, experimenting with LangGraph and Temporal. Happy to share configs and trade notes in the comments.
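Since I offered to share: for the concurrency bullet above, the shape of our custom wrapper is roughly this (stdlib asyncio only; `call_tool` is a stand-in for your real tool invocation, and the limits are illustrative):

    # Sketch: bounded parallel tool calls with per-call timeouts.
    import asyncio

    MAX_PARALLEL_TOOLS = 8   # backpressure: cap concurrent tool calls
    TOOL_TIMEOUT_S = 30      # per-call timeout

    sem = asyncio.Semaphore(MAX_PARALLEL_TOOLS)

    async def call_tool(name: str, args: dict) -> dict:
        await asyncio.sleep(0.1)   # stand-in for a real HTTP/tool call
        return {"tool": name, "ok": True}

    async def bounded_call(name: str, args: dict) -> dict:
        async with sem:            # callers queue up instead of piling on providers
            return await asyncio.wait_for(call_tool(name, args), TOOL_TIMEOUT_S)

    async def main():
        calls = [("search", {"q": "agents"}), ("fetch", {"url": "https://example.com"})]
        # return_exceptions=True so one timeout doesn't cancel sibling calls
        results = await asyncio.gather(
            *(bounded_call(n, a) for n, a in calls), return_exceptions=True
        )
        print(results)

    asyncio.run(main())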
Answer any subset. Even a quick sketch of your stack and one gotcha would help others reading this. Thanks!