Show HN: Time Machine – Debug AI agents by forking and replaying from any step
Hey HN! We're building Time Machine, a debugging and replay platform for AI agents. We'd love your feedback.
Here's a demo: [https://youtu.be/KyOP9BY0WiY](https://youtu.be/KyOP9BY0WiY)
Website: [https://timemachinesdk.dev/](https://timemachinesdk.dev/)
Here's the initial problem we're trying to solve: imagine an agent is on step 9 of 10 when it hallucinates a tool call, writes garbage to your database, and crashes. You fix the prompt and re-run. $1.50 gone. This happens six more times before lunch. Teams burning $100+ per day on re-runs is normal once you're running non-trivial workflows in production.
We built Time Machine around one idea: when an agent fails at step 9, you should be able to fork from step 8 and replay only what's downstream.
How it works: drop in the TypeScript SDK (or the LangChain callback adapter for zero-code integration) and every step gets recorded (inputs, outputs, LLM calls, tool invocations, full state) and persisted to PostgreSQL. The dashboard gives you a timeline and DAG of the execution. At any point you can fork, change something (swap a model, edit a prompt, tweak an input), replay only the downstream steps, and diff the two runs side by side.
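To make the fork/replay model concrete, here's a minimal, self-contained sketch of the semantics described above. All names (`runAndRecord`, `fork`, `diff`, `Execution`) are hypothetical, an in-memory array stands in for the PostgreSQL-backed step log, and the real SDK's API will differ:

```typescript
// Hypothetical sketch of the checkpoint/fork/replay model. Not the real SDK.

type Step = { index: number; name: string; input: unknown; output: unknown };

type Execution = {
  id: string;
  parent?: { id: string; forkedAt: number }; // Git-style lineage
  steps: Step[];
};

// Record a run: each step's input and output are checkpointed as they happen.
// A non-empty `seed` means we resume from already-recorded steps.
function runAndRecord(
  id: string,
  stepFns: Array<(input: unknown) => unknown>,
  initialInput: unknown,
  seed: Step[] = [],
  parent?: { id: string; forkedAt: number }
): Execution {
  const steps = [...seed];
  let input = steps.length > 0 ? steps[steps.length - 1].output : initialInput;
  for (let i = steps.length; i < stepFns.length; i++) {
    const output = stepFns[i](input);
    steps.push({ index: i, name: `step-${i + 1}`, input, output });
    input = output;
  }
  return { id, parent, steps };
}

// Fork at step k: reuse the checkpointed steps 0..k, replay only downstream.
function fork(
  source: Execution,
  forkedAt: number, // 0-based index of the last step to keep
  stepFns: Array<(input: unknown) => unknown>
): Execution {
  const seed = source.steps.slice(0, forkedAt + 1);
  return runAndRecord(`${source.id}-fork`, stepFns, undefined, seed, {
    id: source.id,
    forkedAt,
  });
}

// Diff two runs: report the step indices whose outputs differ.
function diff(a: Execution, b: Execution): number[] {
  const changed: number[] = [];
  const n = Math.max(a.steps.length, b.steps.length);
  for (let i = 0; i < n; i++) {
    if (JSON.stringify(a.steps[i]?.output) !== JSON.stringify(b.steps[i]?.output)) {
      changed.push(i);
    }
  }
  return changed;
}

// A toy 3-step "agent" whose last step misbehaves.
const badSteps = [(x: any) => x + 1, (x: any) => x * 2, (_: any) => "garbage"];
const run1 = runAndRecord("run-1", badSteps, 1);

// Fix only the last step, fork from step 2, and replay just the downstream part.
const fixedSteps = [...badSteps.slice(0, 2), (x: any) => x + 10];
const run2 = fork(run1, 1, fixedSteps);

console.log(run2.steps.map((s) => s.output)); // first two outputs reused, not re-run
console.log(diff(run1, run2)); // only the final step differs
```

The point of the sketch is the cost model: the two upstream steps (and their LLM calls) are never re-executed, which is where the re-run savings come from.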
The internal framing we keep coming back to: Git for agent execution. Checkpoint, branch, diff, replay.
Tools with some overlap that we already see out there: LangSmith, Helicone, and LangFuse. They're good tools, but mainly loggers. Observability is necessary but not sufficient when what you actually need is to change something and see what happens, which is what we make easy.
We also ship a native Claude Code integration. Install the hook bridge once, and every Claude Code session is automatically captured as a Time Machine execution: tool calls, token counts, file edits, git context, subagent trees. You get full observability over your Claude Code workflows in the same dashboard, with the same timeline and fork tooling, without any manual instrumentation. We're also actively working on enabling Time Machine directly from your terminal, so you can ask Claude Code to pull a failed run, inspect the trace, and suggest a fix without leaving your editor. The intent is that the debugging loop stays where the development loop already lives.
We're also building an eval platform on the same infrastructure. Production runs become test cases automatically. You can run assertions (contains, regex, cosine similarity, LLM-as-judge, latency and cost constraints) against replayed outputs and plug it into CI/CD so prompt changes get tested before they ship.
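As an illustration, here is what a suite of the simpler assertion kinds could look like against a replayed output. Every name here is hypothetical (the real platform's API may look nothing like this), and the cosine-similarity and LLM-as-judge checks are omitted since they need an embedding model or judge model behind them:

```typescript
// Illustrative sketch of eval assertions over a replayed run. Hypothetical API.

type ReplayedOutput = {
  text: string;
  latencyMs: number;
  costUsd: number;
};

// Discriminated union over the assertion kinds mentioned in the post
// (minus the model-backed ones).
type Assertion =
  | { kind: "contains"; value: string }
  | { kind: "regex"; pattern: string }
  | { kind: "maxLatencyMs"; value: number }
  | { kind: "maxCostUsd"; value: number };

function check(output: ReplayedOutput, a: Assertion): boolean {
  switch (a.kind) {
    case "contains":
      return output.text.includes(a.value);
    case "regex":
      return new RegExp(a.pattern).test(output.text);
    case "maxLatencyMs":
      return output.latencyMs <= a.value;
    case "maxCostUsd":
      return output.costUsd <= a.value;
  }
}

// Run a suite the way a CI gate would: every assertion must pass.
function runSuite(output: ReplayedOutput, suite: Assertion[]): boolean {
  return suite.every((a) => check(output, a));
}

// Example: a production run turned into a test case.
const replayed: ReplayedOutput = {
  text: "Order #123 refunded successfully",
  latencyMs: 850,
  costUsd: 0.02,
};

const suite: Assertion[] = [
  { kind: "contains", value: "refunded" },
  { kind: "regex", pattern: "Order #\\d+" },
  { kind: "maxLatencyMs", value: 2000 },
  { kind: "maxCostUsd", value: 0.05 },
];

console.log(runSuite(replayed, suite)); // true: the prompt change is safe to ship
```

In a CI/CD setup, `runSuite` returning false on any replayed test case would fail the pipeline, blocking the prompt change before it ships.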
Current status:
The MVP is live: execution capture, session replay, fork/replay, and the Claude Code integration. The eval platform is shipping now. The SDK is zero-dependency.
We're looking for teams actively debugging production agents who want to be early design partners; happy to go deeper if this is a problem you're dealing with at scale. We'd love for people to get hands-on, test it against real agent runs, and tell us what would actually take the manual infra and agent-management overhead off your hands, so you can focus on iterating and getting to value quickly.