


We’ve been working on a GPU-first inference platform focused on predictable latency and cost control for production AI workloads.

Some of the engineering problems we ran into:

- GPU cold starts and queue scheduling
- Multi-tenant isolation without wasting VRAM
- Model loading vs. container loading tradeoffs
- Batch vs. real-time inference routing
- Handling burst workloads without long-term GPU reservation
- Cost predictability vs. autoscaling behavior

We wrote up the architecture decisions, what failed, and what worked.

Happy to answer technical questions, especially around GPU scheduling, inference optimization, and workload isolation.

We built a serverless GPU inference platform with predictable latency.