2026年将是设备内智能助手的时代。
我已经构建了一个本地的人工智能记忆层一段时间,每次尝试让助手具备状态感时,都会出现同样的问题。<p>代理在当下表现得令人印象深刻,但随后它就会忘记。或者它记住了错误的信息,并将其固化为一种永久的信念。一句偶然的评论变成了身份特征。一句话变成了持久的特征。这不是模型质量的问题,而是状态管理的问题。<p>大多数人谈论记忆时会提到“更多上下文”。更大的窗口、更多的检索、更多的提示填充。这对于聊天机器人来说是可以的,但代理是不同的。代理需要计划、执行、更新信念,并在明天回来。一旦你跨越了那条界限,记忆就不再是一个特性,而是基础设施。<p>我反复思考的心理模型是操作系统。<p>1. 什么被存储
2. 什么被压缩
3. 什么从“可能”提升为“真实”
4. 什么会衰退
5. 什么会被删除
6. 什么根本不应该成为持久记忆<p>如果你看看现在大多数记忆堆栈的工作方式,管道基本上是相同的。<p>捕捉交互。总结或提取。嵌入。存储向量和元数据。检索。注入到提示中。写回新记忆。<p>这个循环本身并没有错。更大的问题在于这个循环运行的位置。在许多实际部署中,最敏感的部分发生在用户环境之外。原始交互在你最小化或编辑任何内容之前,就被提前发送出去,并且在你决定什么应该成为持久记忆之前。<p>当记忆优先考虑云时,安全模型会以一种非常特定的方式变得混乱。记忆往往会在系统之间成倍增加。一条交互变成原始片段、摘要、嵌入、元数据和检索痕迹。即使每个单独的工件看起来无害,组合在一起的系统也能以令人不安的精确度重建一个人的历史。<p>然后是信任边界问题。如果检索到的记忆被视为可信的上下文,检索就成为了提示注入和污染可以持续存在的地方。写入记忆中的错误指令不仅影响一个响应。除非你有像验证、隔离、删除和审计这样的治理,否则它可能会在后续中不断以“真相”的形式浮现。<p>集中式记忆也成为了高价值目标。这不仅仅是用户数据,而是有组织的意图和偏好,已被索引以便搜索。这正是攻击者所想要的。<p>即使你忽略安全问题,云也引入了延迟耦合。如果你的代理不断读取和写入记忆,你就要为系统中最频繁的操作支付网络税。<p>这就是为什么我认为边缘并不是一个限制,而是关键。如果记忆就是身份,身份就不应该默认离开设备。<p>随着代理变得更加持久,硬件角度也很重要。CXL在这里很有趣,因为它支持内存池化。每台机器不再是一个孤岛,内存可以被分解并作为共享资源分配。这并不会神奇地创造出无限的上下文,但确实推动了堆栈将代理状态视为一个真正的管理底层,而不仅仅是令牌。<p>我对2026年的预测很简单。获胜的代理架构将把认知与维护分开。使用较小的本地模型处理重复的记忆工作,如总结、提取、标记、冗余检查和提升决策。将更大的模型保留用于需要重度推理的稀有时刻。将持久状态保存在磁盘上,以便在重启后仍然存在,可以被检查,并且可以实际删除。<p>我很好奇其他人看到的情况。对于构建代理的人来说,今天在本地运行记忆的最大障碍是什么:模型质量、工具、部署、评估,还是其他什么?
查看原文
I have been building a local AI memory layer for a while, and the same problem shows up every time you try to make an assistant feel stateful.<p>The agent is impressive in the moment, then it forgets. Or it remembers the wrong thing and hardens it into a permanent belief. A one off comment becomes identity. A stray sentence becomes a durable trait. That is not a model quality issue. It is a state management issue.<p>Most people talk about memory as “more context.” Bigger windows, more retrieval, more prompt stuffing. That is fine for chatbots. Agents are different. Agents plan, execute, update beliefs, and come back tomorrow. Once you cross that line, memory stops being a feature and becomes infrastructure.<p>The mental model I keep coming back to is an operating system.<p>1.What gets stored
2.What gets compressed
3.What gets promoted from “maybe” to “true”
4.What decays
5.What gets deleted
6.What should never become durable memory in the first place<p>If you look at what most memory stacks do today, the pipeline is basically the same everywhere.<p>Capture the interaction. Summarize or extract. Embed. Store vectors and metadata. Retrieve. Inject into the prompt. Write back new memories.<p>That loop is not inherently wrong. The bigger issue is where the loop runs. In a lot of real deployments, the most sensitive parts happen outside the user’s environment. Raw interactions get shipped out early, before you have minimized or redacted anything, and before you have decided what should become durable.<p>When memory goes cloud first, the security model gets messy in a very specific way. Memory tends to multiply across systems. One interaction becomes raw snippets, summaries, embeddings, metadata, and retrieval traces. Even if each artifact feels harmless alone, the combined system can reconstruct a person’s history with uncomfortable fidelity.<p>Then there is the trust boundary problem. If retrieved memories are treated as trusted context, retrieval becomes a place where prompt injection and poisoning can persist. A bad instruction that gets written into memory does not just affect one response. It can keep resurfacing later as “truth” unless you have governance that looks like validation, quarantine, deletion, and audit.<p>Centralized memory also becomes a high value target. It is not just user data, it is organized intent and preference, indexed for search. That is exactly what attackers want.<p>And even if you ignore security, cloud introduces latency coupling. If your agent reads and writes memory constantly, you are paying a network tax on the most frequent operations in the system.<p>This is why I think the edge is not a constraint. It is the point. If memory is identity, identity should not default to leaving the device.<p>There is also a hardware angle that matters as agents become more persistent. CXL is interesting here because it enables memory pooling. Instead of each machine being an island, memory can be disaggregated and allocated as a shared resource. That does not magically create infinite context, but it does push the stack toward treating agent state as a real managed substrate, not just tokens.<p>My bet for 2026 is simple. The winning agent architectures will separate cognition from maintenance. Use smaller local models for the repetitive memory work like summarization, extraction, tagging, redundancy checks, and promotion decisions. Reserve larger models for the rare moments that need heavy reasoning. Keep durable state on disk so it survives restarts, can be inspected, and can actually be deleted.<p>Curious what others are seeing. For people building agents, what is the biggest blocker to running memory locally today: model quality, tooling, deployment, evaluation, or something else?