Show HN: Arch-Router – a 1.5B-parameter LLM routing model that routes by preferences, not benchmarks
Hi HN — we're the team behind Arch (<a href="https://github.com/katanemo/archgw">https://github.com/katanemo/archgw</a>), an open-source proxy for LLMs written in Rust. Today we're releasing Arch-Router (<a href="https://huggingface.co/katanemo/Arch-Router-1.5B" rel="nofollow">https://huggingface.co/katanemo/Arch-Router-1.5B</a>), a 1.5B router model for preference-based routing, now integrated into the proxy. As teams integrate multiple LLMs — each with different strengths, styles, or cost/latency profiles — routing the right prompt to the right model becomes a critical part of application design. But it's still an open problem. Most routing systems fall into two camps:

- Embedding-based routers use intent classifiers — label a prompt as “support,” “SQL,” or “math,” then route to a matching model. This works for simple tasks but breaks down in real conversations. Users shift topics mid-conversation, task boundaries blur, and product changes require retraining classifiers.

- Performance-based routers pick models based on benchmarks like MMLU or MT-Bench, or on latency and cost curves. But benchmarks often miss what matters in production: domain-specific quality or subjective preferences like “Will legal accept this clause?”

Arch-Router takes a different approach: route by preferences written in plain language. You write rules like “contract clauses → GPT-4o” or “quick travel tips → Gemini Flash.” The router maps the prompt (and conversation context) to those rules using a lightweight 1.5B autoregressive model. No retraining, no fragile if/else chains. We built this with input from teams at Twilio and Atlassian. It handles intent drift, supports multi-turn conversations, and lets you swap models in or out with a one-line change to the routing policy.
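To make the routing-policy idea concrete, here is a minimal sketch in Python. The route names, policy structure, and dispatch helper are illustrative assumptions for this post, not archgw's actual config format or API; in the real system the 1.5B router model maps the prompt and conversation context to one of the plain-language route descriptions.

```python
# Illustrative sketch of a preference-based routing policy (hypothetical
# names/structure, not archgw's real config). Each route pairs a
# plain-language preference with a target model; the router model's job
# is to pick the best-matching route name for a given conversation.
ROUTING_POLICY = {
    "contract_clauses":  ("drafting or reviewing contract clauses", "gpt-4o"),
    "quick_travel_tips": ("quick travel tips and itineraries", "gemini-flash"),
    "default":           ("anything else", "gpt-4o-mini"),
}

def dispatch(route_name: str) -> str:
    """Return the target model for the route the router selected.

    Unknown route names fall back to the default route, so swapping a
    model in or out is a one-line change to ROUTING_POLICY.
    """
    _, model = ROUTING_POLICY.get(route_name, ROUTING_POLICY["default"])
    return model

print(dispatch("contract_clauses"))  # gpt-4o
print(dispatch("unknown_route"))     # gpt-4o-mini
```

The point of the indirection is that the policy, not application code, owns the prompt-to-model mapping: retargeting “contract clauses” from GPT-4o to another model touches only the policy entry.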
Full details are in our paper (<a href="https://arxiv.org/abs/2506.16655" rel="nofollow">https://arxiv.org/abs/2506.16655</a>), but here's a snapshot:

Specs:

- 1.5B params — runs on a single GPU (or on CPU for testing)

- No retraining needed — point it at any mix of LLMs

- Cost and latency aware — route heavy tasks to expensive models, light tasks to faster/cheaper ones

- Outperforms larger closed models on our conversational routing benchmarks (details in the paper)

Links:

- Arch Proxy (open source): <a href="https://github.com/katanemo/archgw">https://github.com/katanemo/archgw</a>

- Model + code: <a href="https://huggingface.co/katanemo/Arch-Router-1.5B" rel="nofollow">https://huggingface.co/katanemo/Arch-Router-1.5B</a>

- Paper: <a href="https://arxiv.org/abs/2506.16655" rel="nofollow">https://arxiv.org/abs/2506.16655</a>