Show HN: Stateful LLM inference (zero input-token cost, no prompt caching)

Author: arkonrad, 6 days ago
Hi HN,

I've been frustrated for a while with how LLM inference works in the cloud today. Every API call starts from scratch: you resend your entire prompt plus conversation history, and you're charged for every input token, even if the model has already "seen" that context before.

This leads to two big problems:

1. Performance and cost: constantly resending input tokens is wasteful. In a k-turn conversation where each turn adds roughly t tokens, you resend about t·k(k+1)/2 input tokens in total, so the input bill grows quadratically with conversation length.

2. Quality loss: because the state is rebuilt on a new GPU each time, the model loses a lot of internal context beyond just your text.

Most "optimizations" offered in the industry are really just prompt caching. That's useful for cutting repeated input costs, but we've all seen the side effects: outputs that don't match subtle variations in the prompt, or the model confidently "jumping" to the wrong cached response because it judged your query to be a near-duplicate.

We're taking a different approach with ark-labs.cloud:

1. True stateful inference: when you start a session, all requests are processed on the same set of GPUs, and the full internal state of the model (prompt, history, reasoning traces) is preserved between calls.

2. Zero input-token cost: the model doesn't need you to resend your input on each request, so you pay only for generated output.

3. Better responses, not just cheaper ones: maintaining the internal state can improve consistency and reasoning quality, not just save money.

From a developer's perspective it's simple: enable cookies, and the API keeps a session alive (ark_session_id). No SDK magic, no hacks. Sessions expire after inactivity to free resources, but while a session is active you're talking to a model that actually remembers internally, not just through string concatenation of prompts. A minimal client sketch follows at the end of this post.

Docs: https://ark-labs.cloud/documentation/

We'd love your thoughts, especially from anyone who has wrestled with the "why am I paying 10x for tokens I already sent" problem, or who has hit caching systems that mismatched prompts to outputs. Does this approach make sense to you?
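To make the session mechanics concrete, here is a minimal sketch of what a client could look like. Only the cookie-based ark_session_id behavior comes from the post above; the endpoint URL and request/response shape are assumptions for illustration, so check the linked docs for the real API.

    import requests

    # Hypothetical endpoint and payload shape, for illustration only;
    # the actual API is described at https://ark-labs.cloud/documentation/.
    API_URL = "https://api.ark-labs.cloud/v1/chat"

    # requests.Session persists cookies across calls, which is all that
    # "enable cookies" amounts to: the server sets an ark_session_id
    # cookie on the first response and uses it to route later requests
    # to the same set of GPUs.
    session = requests.Session()

    # First call: send the prompt once.
    r1 = session.post(API_URL, json={"input": "Summarize the history of HTTP."})
    print(r1.json())
    print("session id:", session.cookies.get("ark_session_id"))

    # Follow-up call: no history is resent; the model's internal state
    # from the first call is still live on the same GPUs.
    r2 = session.post(API_URL, json={"input": "Now focus on HTTP/2 only."})
    print(r2.json())

Any HTTP client with a cookie jar would behave the same way; nothing here depends on an SDK, which matches the post's "no SDK magic" claim.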