We're running 50 LLMs on 2 GPUs, with no cold starts and no overprovisioning.

Author: pveldandi · about 1 month ago
We built InferX, a model runtime that snapshots the full GPU execution state (weights, memory layout, KV cache) and resumes any model in ~2 seconds. No reinitialization, no weight reloading, no containers.

With this, we're running 50+ LLMs on just 2 A1000 GPUs, with cold starts eliminated and memory orchestrated like threads. Traditionally, this would take 70+ GPUs if you pinned each model to its own GPU.

We're not doing speculative batching or model merging; this is native orchestration at the runtime layer.

This was built to support:

• agent stacks (each agent using its own model)
• tenant-specific fine-tunes
• long-tail workloads where models don't get constant traffic

Would love to hear how others are solving this or what you're seeing in the multi-model inference space. Happy to go into technical detail on snapshotting, memory management, or orchestration strategy.

Ask me anything.
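To make "memory orchestrated like threads" concrete, here is a minimal, hypothetical Python sketch of the kind of scheduling loop the post describes: models are registered as snapshots, restored onto the GPU on demand, and parked back off-device under LRU pressure against a fixed memory budget. InferX's actual API is not shown in the post, so every name below (`SnapshotOrchestrator`, `ModelSnapshot`, the memory sizes, and the print-stub snapshot/restore steps) is an illustrative assumption, not the real implementation.

```python
# Hypothetical sketch of snapshot-based multi-model orchestration.
# The InferX API is not public; snapshot()/restore() are stand-ins here.

import time
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class ModelSnapshot:
    """Captured GPU execution state: weights, memory layout, KV cache."""
    name: str
    gpu_bytes: int          # device memory the model occupies when resident
    blob: bytes = b""       # stand-in for the serialized state on host/NVMe


class SnapshotOrchestrator:
    """Keeps hot models resident on the GPU and swaps cold ones via snapshots.

    Eviction is LRU over a fixed device-memory budget, so many models can
    share a couple of GPUs as long as only a few are active at once.
    """

    def __init__(self, gpu_budget_bytes: int):
        self.gpu_budget = gpu_budget_bytes
        self.resident: "OrderedDict[str, ModelSnapshot]" = OrderedDict()
        self.parked: dict = {}  # snapshots currently off-GPU, keyed by name

    def register(self, snap: ModelSnapshot) -> None:
        self.parked[snap.name] = snap

    def _used(self) -> int:
        return sum(m.gpu_bytes for m in self.resident.values())

    def _evict_until_fits(self, needed: int) -> None:
        # Snapshot (not tear down) least-recently-used models until it fits.
        while self.resident and self._used() + needed > self.gpu_budget:
            name, snap = self.resident.popitem(last=False)
            self.parked[name] = snap          # state preserved, no reload later
            print(f"snapshot -> host: {name}")

    def acquire(self, name: str) -> ModelSnapshot:
        """Make `name` resident, restoring from its snapshot if needed."""
        if name in self.resident:
            self.resident.move_to_end(name)   # mark as most recently used
            return self.resident[name]
        snap = self.parked.pop(name)
        self._evict_until_fits(snap.gpu_bytes)
        t0 = time.perf_counter()
        # A real restore would map the saved weights/KV cache back onto the
        # GPU; here it is a no-op placeholder.
        self.resident[name] = snap
        print(f"restore <- host: {name} ({time.perf_counter() - t0:.3f}s)")
        return snap


if __name__ == "__main__":
    GiB = 1 << 30
    orch = SnapshotOrchestrator(gpu_budget_bytes=40 * GiB)  # e.g. two small GPUs
    for i in range(6):
        orch.register(ModelSnapshot(name=f"model-{i}", gpu_bytes=12 * GiB))
    # Long-tail traffic: requests bounce between models, triggering swaps.
    for name in ["model-0", "model-1", "model-2", "model-3", "model-0", "model-5"]:
        orch.acquire(name)
```

The point the sketch tries to capture is that eviction preserves the captured state, so bringing a model back is a restore rather than a full reinitialization, which is what would remove the cold start.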