Show HN: ElasticMM – 4.2× Faster Multimodal LLM Serving (NeurIPS 2025 Oral)

Posted by PaperWeekly about 2 months ago
ElasticMM is a newly released open-source serving system designed for modern multimodal large language models (MLLMs). The work was selected as an Oral presentation at NeurIPS 2025.

Unlike existing serving stacks such as vLLM, which are primarily optimized for text-only workloads, ElasticMM introduces Elastic Multimodal Parallelism (EMP), a new execution paradigm that adapts parallelism across different inference stages and modalities (a rough sketch of the idea is at the end of this post).

Key findings from the paper:

- Up to 4.2× reduction in TTFT (time to first token)
- 3.2×–4.5× higher throughput under mixed multimodal workloads
- Modality-aware scheduling, elastic stage partitioning, unified prefix caching, and non-blocking encoding

Paper (OpenReview PDF): [https://openreview.net/pdf?id=Zd6VyjmN1S](https://openreview.net/pdf?id=Zd6VyjmN1S)

GitHub repo: [https://github.com/hpdps-group/ElasticMM](https://github.com/hpdps-group/ElasticMM)

Curious to hear what the HN community thinks, especially from those building LLM/MLLM inference stacks or dealing with multimodal serving in production.
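To give a concrete feel for what "elastic" means here, below is a minimal Python sketch of modality-aware scheduling with elastic instance reallocation. To be clear, this is an illustration, not ElasticMM's actual API: every name in it (`Request`, `InstanceGroup`, `ElasticScheduler`, `rebalance`) is hypothetical, and the real system operates at the level of GPU parallelism and stage partitioning rather than a toy queue. See the repo for the real interfaces.

```python
# Hypothetical sketch of modality-aware scheduling with elastic
# instance reallocation -- NOT the ElasticMM API.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    prompt: str
    images: List[bytes] = field(default_factory=list)

    @property
    def is_multimodal(self) -> bool:
        return len(self.images) > 0


@dataclass
class InstanceGroup:
    """A pool of model replicas dedicated to one modality group."""
    name: str
    instances: int
    queue: List[Request] = field(default_factory=list)

    def load(self) -> float:
        # Queued requests per instance: a crude pressure signal.
        return len(self.queue) / max(self.instances, 1)


class ElasticScheduler:
    """Route requests by modality, then elastically shift capacity
    between groups -- loosely in the spirit of EMP."""

    def __init__(self, total_instances: int = 8):
        half = total_instances // 2
        self.text = InstanceGroup("text", half)
        self.mm = InstanceGroup("multimodal", total_instances - half)

    def submit(self, req: Request) -> None:
        # Modality-aware scheduling: keep multimodal requests off the
        # text path so image encoding never blocks text-only decoding.
        (self.mm if req.is_multimodal else self.text).queue.append(req)

    def rebalance(self) -> None:
        # Elastic step: move one instance toward the more loaded group
        # when its per-instance pressure is well above the other's.
        hot, cold = sorted([self.text, self.mm],
                           key=InstanceGroup.load, reverse=True)
        if cold.instances > 1 and hot.load() > 2 * cold.load():
            cold.instances -= 1
            hot.instances += 1


if __name__ == "__main__":
    sched = ElasticScheduler(total_instances=8)
    for i in range(10):
        imgs = [b"img"] if i % 3 == 0 else []
        sched.submit(Request(prompt=f"q{i}", images=imgs))
    sched.rebalance()
    print(sched.text.instances, "text instances;",
          sched.mm.instances, "multimodal instances")
```

The point of the sketch is just the shape of the decision: classify by modality first, then treat the split of hardware between the two paths as a tunable knob instead of a fixed deployment choice.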