Show HN: ElasticMM – 4.2× faster multimodal LLM serving (NeurIPS 2025 Oral)
ElasticMM is a newly released open-source serving system designed for modern multimodal large language models (MLLMs). The work was selected as an Oral presentation at NeurIPS 2025.
Unlike existing serving stacks such as vLLM, which are primarily optimized for text-only workloads, ElasticMM introduces Elastic Multimodal Parallelism (EMP), a new execution paradigm that adapts parallelism across different inference stages and modalities.
Key findings from the paper:
- Up to 4.2× reduction in TTFT (time to first token)
- 3.2× to 4.5× higher throughput under mixed multimodal workloads
- Modality-aware scheduling, elastic stage partitioning, unified prefix caching, and non-blocking encoding (see the sketch after this list)
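To make the EMP idea concrete, here is a minimal Python sketch of what modality-aware scheduling, elastic repartitioning, and non-blocking encoding could look like. This is not ElasticMM's actual code or API; every name here (`ElasticScheduler`, `route`, `rebalance`, the encode/prefill stubs) is hypothetical, and the real system works over GPU worker groups rather than toy in-process queues.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Request:
    req_id: str
    has_image: bool            # modality flag used for routing
    prompt_tokens: int = 0

@dataclass
class WorkerGroup:
    name: str
    gpus: int                  # GPUs currently assigned to this group
    queue: asyncio.Queue = field(default_factory=asyncio.Queue)

class ElasticScheduler:
    """Toy modality-aware scheduler: text-only requests bypass the
    vision-encoder group, and GPUs shift elastically toward whichever
    group has the longer backlog. Purely illustrative."""

    def __init__(self, total_gpus: int = 8):
        half = total_gpus // 2
        self.text_group = WorkerGroup("text", gpus=half)
        self.mm_group = WorkerGroup("multimodal", gpus=total_gpus - half)

    def route(self, req: Request) -> WorkerGroup:
        # Modality-aware scheduling: only requests carrying images go
        # to the group that runs the vision encoder.
        group = self.mm_group if req.has_image else self.text_group
        group.queue.put_nowait(req)
        return group

    def rebalance(self) -> None:
        # Crude stand-in for elastic stage partitioning: move one GPU
        # toward the group whose backlog is at least twice as long.
        t, m = self.text_group, self.mm_group
        if m.queue.qsize() > 2 * t.queue.qsize() and t.gpus > 1:
            t.gpus, m.gpus = t.gpus - 1, m.gpus + 1
        elif t.queue.qsize() > 2 * m.queue.qsize() and m.gpus > 1:
            m.gpus, t.gpus = m.gpus - 1, t.gpus + 1

async def encode_image(req: Request) -> str:
    await asyncio.sleep(0.05)  # stand-in for vision-encoder latency
    return f"vision-embeds({req.req_id})"

async def prefill_text(req: Request) -> str:
    await asyncio.sleep(0.02)  # stand-in for LLM prefill of text tokens
    return f"text-kv({req.req_id})"

async def serve(req: Request) -> None:
    # Non-blocking encoding: image encoding overlaps with text prefill
    # instead of serializing the two stages.
    embeds, kv = await asyncio.gather(encode_image(req), prefill_text(req))
    print(req.req_id, embeds, kv)

if __name__ == "__main__":
    sched = ElasticScheduler(total_gpus=8)
    sched.route(Request("r1", has_image=True))
    sched.route(Request("r2", has_image=False))
    sched.rebalance()
    asyncio.run(serve(Request("r1", has_image=True)))
```

The sketch only shows the shape of the control flow; for how the paper actually partitions stages and sizes worker groups, see the PDF below.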
Paper (OpenReview PDF):
[https://openreview.net/pdf?id=Zd6VyjmN1S](https://openreview.net/pdf?id=Zd6VyjmN1S)
GitHub repo:
[https://github.com/hpdps-group/ElasticMM](https://github.com/hpdps-group/ElasticMM)
Curious to hear what the HN community thinks, especially from those building LLM/MLLM inference stacks or dealing with multimodal serving in production.