Local SLM as a compression layer for cloud API calls
Found caveman a while back (https://github.com/JuliusBrussee/caveman) and got kind of obsessed with it. The outputs are dense, but you can't show them to a real user. So I've been trying to hide that in the middle: a local SLM (small language model) compresses the input, the cloud model reasons in caveman style, and the local SLM expands it back. The user never sees the compressed parts.
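Roughly the shape of the pipeline, as a sketch. The trait names and the stubbed compress/expand logic are placeholders, not the real Phi-3/Candle code or any particular cloud API:

    // Three-stage pipeline: local compress -> cloud reason -> local expand.
    // All names and the stub logic are placeholders; the real versions would
    // wrap Phi-3 via Candle and an actual cloud chat-completion call.
    trait LocalSlm {
        fn compress(&self, text: &str) -> String;
        fn expand(&self, caveman: &str) -> String;
    }

    trait CloudModel {
        fn reason(&self, compressed_prompt: &str) -> String;
    }

    fn answer(local: &impl LocalSlm, cloud: &impl CloudModel, user_input: &str) -> String {
        let compressed = local.compress(user_input);   // user never sees this
        let dense_answer = cloud.reason(&compressed);  // cloud works in caveman style
        local.expand(&dense_answer)                    // re-hydrated for the user
    }

    // Stubs so the sketch compiles and runs on its own.
    struct StubSlm;
    impl LocalSlm for StubSlm {
        fn compress(&self, text: &str) -> String {
            // Placeholder heuristic, nothing like the real model's compression.
            text.split_whitespace()
                .filter(|w| w.len() > 3)
                .collect::<Vec<_>>()
                .join(" ")
        }
        fn expand(&self, caveman: &str) -> String {
            format!("(expanded) {caveman}")
        }
    }

    struct StubCloud;
    impl CloudModel for StubCloud {
        fn reason(&self, compressed_prompt: &str) -> String {
            format!("answer for: {compressed_prompt}")
        }
    }

    fn main() {
        let reply = answer(&StubSlm, &StubCloud, "please explain what this request is actually asking for");
        println!("{reply}");
    }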
I'm running Phi-3 via Candle for compression. It's fast enough, and the cloud calls are shorter. I haven't done real token counting yet, though.
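Even before real token counting, a crude chars/4 estimate would show whether compression is paying off. The 4-characters-per-token ratio is just the usual rule of thumb, not a real tokenizer:

    // Crude token estimate: ~4 characters per token, not a real tokenizer,
    // but enough to see roughly how much the compression step saves.
    fn approx_tokens(text: &str) -> usize {
        (text.chars().count() + 3) / 4
    }

    fn report_savings(raw: &str, compressed: &str) {
        let before = approx_tokens(raw);
        let after = approx_tokens(compressed);
        let saved = before.saturating_sub(after);
        println!(
            "raw ~{before} tok, compressed ~{after} tok, ~{:.0}% saved",
            100.0 * saved as f64 / before.max(1) as f64
        );
    }

    fn main() {
        let raw = "Could you please take a look at this and explain in detail what the request is asking for?";
        let compressed = "explain request detail";
        report_savings(raw, compressed);
    }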
The expansion step is the problem. Re-hydrating caveman output into readable text is harder than compressing the input, and the local model makes more mistakes there. I'm not sure whether that's a prompting issue or just a ceiling for a model this size.
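One cheap way to separate the two might be to few-shot the expansion step: hand-write two or three caveman-to-prose pairs, put them in the prompt, and see whether the error rate moves. The example pairs below are made up for illustration, not real caveman output:

    // Build an expansion prompt with a few hand-written dense -> prose pairs.
    // The pairs are invented for illustration; the point is only to test
    // whether few-shot examples move the local model's error rate.
    const EXPANSION_EXAMPLES: &[(&str, &str)] = &[
        (
            "cache miss high. fix: bigger cache or better key.",
            "The cache miss rate is high; either increase the cache size or choose a better cache key.",
        ),
        (
            "user want csv export. add button settings page.",
            "Users are asking for CSV export, so add an export button on the settings page.",
        ),
    ];

    fn build_expansion_prompt(dense_answer: &str) -> String {
        let mut prompt = String::from("Rewrite the compressed note into clear, complete sentences. Do not add new facts.\n\n");
        for (dense, prose) in EXPANSION_EXAMPLES {
            prompt.push_str(&format!("Compressed: {dense}\nRewritten: {prose}\n\n"));
        }
        prompt.push_str(&format!("Compressed: {dense_answer}\nRewritten:"));
        prompt
    }

    fn main() {
        println!("{}", build_expansion_prompt("expansion step weak. maybe prompt, maybe model small."));
    }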
I'm also not sure this makes sense at low API volumes. The added complexity might not be worth it.
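The back-of-the-envelope version of the volume question, with every number a placeholder to swap for real measurements and pricing: savings per month are roughly calls × input tokens × fraction saved × price per token, which at low volume comes out to pocket change:

    // Back-of-the-envelope check on whether the pipeline pays for itself.
    // Every number here is a placeholder; plug in real measurements and pricing.
    fn main() {
        let calls_per_month: f64 = 2_000.0;            // assumed low volume
        let input_tokens_per_call: f64 = 1_500.0;      // assumed average prompt size
        let fraction_saved: f64 = 0.5;                 // assumed compression ratio
        let price_per_million_input_tokens: f64 = 3.0; // assumed $/1M input tokens

        let tokens_saved = calls_per_month * input_tokens_per_call * fraction_saved;
        let dollars_saved = tokens_saved / 1_000_000.0 * price_per_million_input_tokens;

        println!("~{tokens_saved:.0} input tokens saved per month, ~${dollars_saved:.2} per month");
    }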