Launch HN: Lucidic (YC W25) – Debug, test, and evaluate AI agents in production

Hi HN, we're Abhinav, Andy, and Jeremy, and we're building Lucidic AI ([https://dashboard.lucidic.ai](https://dashboard.lucidic.ai)), an AI agent interpretability tool that helps you observe and debug AI agents.

Here is a demo: [https://youtu.be/Zvoh1QUMhXQ](https://youtu.be/Zvoh1QUMhXQ).

Getting started is easy, with just one line of code: call `lai.init()` in your agent code and log into the dashboard. You can see traces of each run, cumulative trends across sessions, built-in or custom evals, and grouped failure modes. Call `lai.create_step()` with whatever metadata you want (memory snapshots, tool outputs, stateful info) and we'll index it for debugging.
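Roughly, instrumenting an agent looks like the sketch below. The package import and the keyword arguments passed to `lai.create_step()` are illustrative assumptions, not the SDK's exact signature; the quickstart docs have the real API.

```python
# Hypothetical instrumentation sketch -- the import name and create_step arguments
# are illustrative assumptions, not the SDK's documented signature.
import lucidicai as lai  # assumed package name

lai.init()  # one-line setup; assumes the API key is picked up from the environment

def log_checkout_step(cart: list, tool_result: dict) -> None:
    # Log one step with whatever metadata helps debugging: memory snapshots,
    # tool outputs, and other stateful info get indexed on the dashboard.
    lai.create_step(
        state="checkout",                  # hypothetical field: where the agent thinks it is
        memory={"cart": cart},             # hypothetical field: snapshot of agent memory
        tool_outputs={"pay": tool_result}  # hypothetical field: raw tool results
    )
```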
We did NLP research at the Stanford AI Lab (SAIL), where we worked on creating an AI agent (with fine-tuned models and DSPy) to solve math olympiad problems (focusing on AIME/USAMO), and we realized debugging these agents was hard. The last straw was an e-commerce agent we built that could buy items online. It kept failing at checkout, and every one-line change (tweaking a prompt, switching to Llama, adjusting tool logic) meant another 10-minute rerun just to see if we even reached the same checkout page.

At that point we were all thinking, this sucks, so we set out to improve agent interpretability with better debugging, monitoring, and evals.

We started by listening to users, who told us traditional LLM observability platforms don't capture the complexity of agents: agents have tools, memories, and events, not just input/output pairs. So we automatically transform OTel (and/or regular) agent logs into interactive graph visualizations that cluster similar states based on memory and action patterns.

People also wanted to test small changes directly from those graphs, so we built "time traveling": you can modify any state (memory contents, tool outputs, context), then re-simulate it 30–40 times to see the distribution of outcomes. We embed the responses, cluster them by similarity, and show which modifications lead to stable versus divergent behavior.

Then we saw people running their agent 10 times on the same task, watching each run individually, and wasting hours looking at mostly repeated states. So we built trajectory clustering on similar state embeddings (similar tools or memories, for example) to surface behavioral patterns across mass simulations.

We use that to create a force-directed layout that automatically groups the similar paths your agent took, displaying states as nodes, actions as edges, and failure probability as color intensity. The clusters make failure patterns obvious: you see trends across hundreds of runs, not individual traces.
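If you want intuition for the mechanics, here is a simplified, self-contained sketch of the general idea (toy data and off-the-shelf libraries, not our actual implementation): embed each run's trajectory, cluster the embeddings, then draw a force-directed state graph colored by failure rate.

```python
# Simplified sketch of trajectory clustering + a force-directed failure view.
# Toy data and off-the-shelf libraries; not the production implementation.
from collections import defaultdict

import matplotlib.pyplot as plt
import networkx as nx
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Each run is a sequence of (state, action) steps plus a failure flag.
runs = [
    {"steps": [("search", "click_result"), ("cart", "checkout"), ("confirmation", "done")], "failed": False},
    {"steps": [("search", "click_result"), ("cart", "checkout"), ("payment_error", "retry")], "failed": True},
    {"steps": [("search", "refine_query"), ("cart", "checkout"), ("confirmation", "done")], "failed": False},
    {"steps": [("search", "click_result"), ("cart", "checkout"), ("payment_error", "abort")], "failed": True},
]

# 1) Cluster whole trajectories by similarity of their state/action sequences.
docs = [" ".join(f"{state}:{action}" for state, action in run["steps"]) for run in runs]
embeddings = TfidfVectorizer().fit_transform(docs)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print("trajectory clusters:", clusters)

# 2) Build a state graph: states as nodes, actions as edges, failure rate as color.
visits, failures = defaultdict(int), defaultdict(int)
graph = nx.DiGraph()
for run in runs:
    steps = run["steps"]
    for i, (state, action) in enumerate(steps):
        visits[state] += 1
        failures[state] += int(run["failed"])
        if i + 1 < len(steps):
            graph.add_edge(state, steps[i + 1][0], action=action)
        else:
            graph.add_node(state)

failure_rate = [failures[n] / visits[n] for n in graph.nodes]
pos = nx.spring_layout(graph, seed=0)  # force-directed layout
nx.draw(graph, pos, with_labels=True, node_color=failure_rate,
        cmap=plt.cm.Reds, vmin=0.0, vmax=1.0, node_size=1500)
plt.show()
```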
Finally, when people saw our observability features, they naturally wanted evaluation capabilities. So we built a way for people to define their own evals, called "rubrics": you define specific criteria, assign a weight to each criterion, and set score definitions, which gives you a structured way to measure agent performance against your exact requirements.
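As a toy example of how the weighted scoring combines (the criteria, weights, and 0–1 scores below are made up for illustration, not Lucidic's actual schema):

```python
# Toy rubric: criterion -> weight (weights sum to 1.0). All values are illustrative.
rubric = {
    "reaches_checkout": 0.5,
    "correct_item_in_cart": 0.3,
    "no_redundant_tool_calls": 0.2,
}

# Per-criterion scores (0-1) that an evaluator assigned to one run.
scores = {
    "reaches_checkout": 1.0,
    "correct_item_in_cart": 1.0,
    "no_redundant_tool_calls": 0.4,
}

overall = sum(weight * scores[criterion] for criterion, weight in rubric.items())
print(f"overall rubric score: {overall:.2f}")  # 0.5*1.0 + 0.3*1.0 + 0.2*0.4 = 0.88
```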
To evaluate these criteria, we used our own platform to build an investigator agent that reviews your criteria and evaluates performance much more effectively than traditional LLM-as-a-judge approaches.

To get started, visit [dashboard.lucidic.ai](https://dashboard.lucidic.ai) and [https://docs.lucidic.ai/getting-started/quickstart](https://docs.lucidic.ai/getting-started/quickstart). You can use it for free for 1,000 event and step creations.

We look forward to your thoughts! And don't hesitate to reach out at team@lucidic.ai.