Local SLM as a compression layer for cloud API calls

Posted by asong56 · 3 days ago
Found caveman a while back (https://github.com/JuliusBrussee/caveman) and got kind of obsessed with it. Dense outputs, but you can't show them to a real user. So I've been trying to hide that in the middle -- local SLM compresses the input, cloud model reasons in caveman-style, local SLM expands it back. User never sees the compressed parts.

Running Phi-3 via Candle for compression. Fast enough. Cloud calls are shorter. Haven't done real token counting yet.

The expansion step is the problem. Re-hydrating caveman output into readable text is harder than compressing the input, and the local model makes more mistakes there. Not sure if that's a prompting issue or just a ceiling for a model this size.

Also not sure this makes sense at low API volumes. The added complexity might not be worth it.
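For readers who want the shape of the idea rather than the full stack: a minimal sketch of the compress → reason → expand sandwich. Everything here is a placeholder -- `compress` and `expand` are rule-based stand-ins for the local SLM calls (the post uses Phi-3 via Candle), `cloud_reason` stands in for the cloud API, and word count is a crude proxy for real token counting, which the post notes hasn't been done yet.

```python
# Sketch of the pipeline: local model compresses, cloud model reasons on the
# dense form, local model expands the answer back. All three stages below are
# illustrative placeholders, not the actual models.

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "it", "that"}

def compress(text: str) -> str:
    """Stand-in for the local SLM: strip filler words, caveman-style."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

def cloud_reason(dense_prompt: str) -> str:
    """Stand-in for the cloud API call; it only ever sees the dense form."""
    return f"answer: {dense_prompt}"

def expand(dense: str) -> str:
    """Stand-in for re-hydration -- the step the post finds hardest."""
    return dense.capitalize() + "."

user_input = "What is the capital of France and why is it important?"
dense = compress(user_input)          # user never sees this intermediate
reply = expand(cloud_reason(dense))   # user sees only the expanded reply

# Rough saving estimate, using word count as a proxy for tokens.
saving = 1 - len(dense.split()) / len(user_input.split())
```

The point of the sketch is the data flow: the compressed string exists only between the two local calls, so the cloud bill scales with `dense`, not `user_input` -- whether that saving survives real tokenization is exactly the open question in the post.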