Show HN: Stateful LLM inference (zero input-token cost, no prompt caching)

Author: arkonrad, 6 days ago
Hi HN,

I've been frustrated for a while with how LLM inference works in the cloud today. Every API call starts from scratch: you resend your entire prompt plus conversation history, and you're charged for every input token, even if the model has already "seen" that context before.

This leads to two big problems:

1. Performance and cost: constantly resending input tokens is wasteful. In a k-turn conversation where each turn adds roughly t tokens, you resend about t·k(k+1)/2 input tokens in total, so the input bill grows quadratically with conversation length.

2. Quality loss: because the state is rebuilt on a new GPU each time, the model loses a lot of internal context beyond just your text.

Most "optimizations" offered in the industry are really just prompt caching. That's useful for cutting repeated input costs, but we've all seen the side effects: outputs that don't match subtle variations in the prompt, or the model confidently "jumping" to the wrong cached response because it judged your query to be a near-duplicate.

We're taking a different approach with ark-labs.cloud:

1. True stateful inference: when you start a session, all requests are processed on the same set of GPUs, and the full internal state of the model (prompt, history, reasoning traces) is preserved between calls.

2. Zero input-token cost: the model doesn't need you to resend your input on each request, so you pay only for generated output.

3. Better responses, not just cheaper ones: maintaining the internal state can improve consistency and reasoning quality, not just save money.

From a developer's perspective it's simple: enable cookies, and the API keeps a session alive (ark_session_id). No SDK magic, no hacks. Sessions expire after inactivity to free resources, but while a session is active you're talking to a model that actually remembers internally, not just through string concatenation of prompts. A minimal client sketch follows at the end of this post.

Docs: https://ark-labs.cloud/documentation/

We'd love your thoughts, especially from anyone who has wrestled with the "why am I paying 10x for tokens I already sent" problem, or who has hit caching systems that mismatched prompts to outputs. Does this approach make sense to you?
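To make the session mechanics concrete, here is a minimal sketch of what a client could look like. Only the cookie-based ark_session_id behavior comes from the post above; the endpoint URL and request/response shape are assumptions for illustration, so check the linked docs for the real API.

    import requests

    # Hypothetical endpoint and payload shape, for illustration only;
    # the actual API is described at https://ark-labs.cloud/documentation/.
    API_URL = "https://api.ark-labs.cloud/v1/chat"

    # requests.Session persists cookies across calls, which is all that
    # "enable cookies" amounts to: the server sets an ark_session_id
    # cookie on the first response and uses it to route later requests
    # to the same set of GPUs.
    session = requests.Session()

    # First call: send the prompt once.
    r1 = session.post(API_URL, json={"input": "Summarize the history of HTTP."})
    print(r1.json())
    print("session id:", session.cookies.get("ark_session_id"))

    # Follow-up call: no history is resent; the model's internal state
    # from the first call is still live on the same GPUs.
    r2 = session.post(API_URL, json={"input": "Now focus on HTTP/2 only."})
    print(r2.json())

Any HTTP client with a cookie jar would behave the same way; nothing here depends on an SDK, which matches the post's "no SDK magic" claim.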