Show HN: AutoThink – Boosts local LLM performance by 43% with adaptive reasoning

2 points by codelion, about 1 month ago
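I built AutoThink, a technique that makes local LLMs reason more efficiently by adaptively allocating computational resources based on query complexity.

The core idea: instead of giving every query the same "thinking time," classify each query as HIGH or LOW complexity and allocate thinking tokens accordingly. High-complexity queries get 70-90% of the thinking-token budget, while simple queries get 20-40%.

I also implemented steering vectors derived from Pivotal Token Search (originally from Microsoft's Phi-4 paper) that guide the model's reasoning patterns during generation. These vectors encourage behaviors like numerical accuracy, self-correction, and thorough exploration.

Results on DeepSeek-R1-Distill-Qwen-1.5B:

- GPQA-Diamond: 31.06% vs. 21.72% baseline (+43% relative improvement)
- MMLU-Pro: 26.38% vs. 25.58% baseline
- Uses fewer tokens than baseline approaches

Works with any local reasoning model: DeepSeek, Qwen, custom fine-tuned models. No API dependencies.

The technique builds on two things I developed: an adaptive classification framework that can learn new complexity categories without retraining, and an open source implementation of Pivotal Token Search.

Technical paper: [https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5253327](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5253327)

Code and examples: [https://github.com/codelion/optillm/tree/main/optillm/autothink](https://github.com/codelion/optillm/tree/main/optillm/autothink)

PTS implementation: [https://github.com/codelion/pts](https://github.com/codelion/pts)

I'm curious about your thoughts on adaptive resource allocation for AI reasoning. Have you tried similar approaches with your local models?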
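The allocation idea in the post can be illustrated with a short sketch. This is not the AutoThink implementation: the keyword-based `classify_complexity` heuristic, the `COMPLEX_CUES` word list, the `max_thinking_tokens` parameter, and the choice of the midpoint of each range are all assumptions made for illustration. Only the HIGH/LOW labels and the 70-90% and 20-40% ranges come from the post.

```python
import re
from typing import Tuple

# Budget ranges from the post, as integer percentages of the thinking-token budget.
BUDGET_RANGES_PCT = {"HIGH": (70, 90), "LOW": (20, 40)}

# Hypothetical cue words; a toy stand-in for a learned complexity classifier.
COMPLEX_CUES = {"prove", "derive", "optimize", "explain"}


def classify_complexity(query: str) -> str:
    """Toy heuristic (assumption): long queries or ones with a cue word are HIGH."""
    words = re.findall(r"[a-z]+", query.lower())
    if len(words) > 30 or COMPLEX_CUES.intersection(words):
        return "HIGH"
    return "LOW"


def thinking_budget(query: str, max_thinking_tokens: int) -> Tuple[str, int]:
    """Return the complexity label and a budget at the midpoint of its range."""
    label = classify_complexity(query)
    low_pct, high_pct = BUDGET_RANGES_PCT[label]
    midpoint_pct = (low_pct + high_pct) // 2
    # Integer arithmetic avoids floating-point rounding surprises.
    return label, max_thinking_tokens * midpoint_pct // 100


print(thinking_budget("Prove that the square root of 2 is irrational.", 1000))
# ('HIGH', 800)
print(thinking_budget("What is the capital of France?", 1000))
# ('LOW', 300)
```

In practice the resulting number would presumably cap how many tokens the model may spend on reasoning before it has to answer. How AutoThink enforces that cap is not described in the post.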
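The "+43% relative improvement" reported for GPQA-Diamond is a relative gain over the baseline score, not a difference in percentage points. The short check below reproduces it from the two accuracies given in the post and applies the same formula to the MMLU-Pro numbers. The function name `relative_improvement` is just a label chosen for this example.

```python
def relative_improvement(score: float, baseline: float) -> float:
    """Relative gain over the baseline, in percent."""
    return (score - baseline) / baseline * 100.0


# (AutoThink accuracy, baseline accuracy) in percent, as reported in the post.
results = {
    "GPQA-Diamond": (31.06, 21.72),
    "MMLU-Pro": (26.38, 25.58),
}

for name, (score, baseline) in results.items():
    absolute = score - baseline
    relative = relative_improvement(score, baseline)
    print(f"{name}: +{absolute:.2f} points absolute, +{relative:.1f}% relative")

# GPQA-Diamond: +9.34 points absolute, +43.0% relative
# MMLU-Pro: +0.80 points absolute, +3.1% relative
```

So the headline figure corresponds to a gain of 9.34 percentage points on GPQA-Diamond, while the MMLU-Pro gain is much smaller at 0.80 points.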