Show HN: Semantic geometry visual grounding for AI web agents (Amazon demo)

Author: tonyww · about 7 hours ago
Hi HN,

I'm a solo founder working on SentienceAPI, a perception & execution layer that helps LLM agents act reliably on real websites.

LLMs are good at planning steps, but they fail a lot when actually interacting with the web. Vision-only agents are expensive and unstable, and DOM-based automation breaks easily on modern pages with overlays, dynamic layouts, and lots of noise.

My approach is semantic geometry-based visual grounding.

Instead of giving the model raw HTML (huge context) or a screenshot (imprecise) and asking it to guess, the API first reduces a webpage into a small, grounded action space made only of elements that are actually visible and interactable. Each element includes geometry plus lightweight visual cues, so the model can decide what to do without guessing.

I built a reference app called MotionDocs on top of this. The demo below shows the system navigating Amazon Best Sellers, opening a product, and clicking "Add to cart" using grounded coordinates (no scripted clicks).

Demo video (Add to Cart): https://youtu.be/1DlIeHvhOg4

How the agent sees the page (map mode wireframe): https://sentience-screenshots.sfo3.cdn.digitaloceanspaces.com/hn_wireframe.png

This wireframe shows the reduced action space surfaced to the LLM. Each box corresponds to a visible, interactable element.

Code excerpt (simplified):

```python
from sentienceapi_sdk import SentienceApiClient
from motiondocs import generate_video

video = generate_video(
    url="https://www.amazon.com/gp/bestsellers/",
    instructions="Open a product and add it to cart",
    sentience_client=SentienceApiClient(api_key="your-api-key-here"),
)
video.save("demo.mp4")
```

How it works (high level):

The execution layer treats the browser as a black box and exposes three modes:

* Map: identify interactable elements with geometry and visual cues
* Visual: align geometry with screenshots for grounding
* Read: extract clean, LLM-ready text

The key insight is visual cues, especially a simple is_primary signal. Humans don't read every pixel — we scan for visual hierarchy. Encoding that directly lets the agent prioritize the right actions without processing raw pixels or noisy DOM.

Why this matters:

* smaller action space → fewer hallucinations
* deterministic geometry → reproducible execution
* cheaper than vision-only approaches

TL;DR: I'm building a semantic geometry grounding layer that turns web pages into a compact, visually grounded action space for LLM agents. It gives the model a cheat sheet instead of asking it to solve a vision puzzle.

This is early work, not launched yet. I'd love feedback or skepticism, especially from people building agents, RPA, QA automation, or dev tools.

— Tony W
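
For a concrete picture of the idea described in the post, here is a minimal, self-contained sketch: a page reduced to a few visible elements carrying geometry and an is_primary cue, plus a toy selection step that turns a goal into a grounded (x, y) click target. The UIElement class, its field names, and pick_action are illustrative assumptions for exposition, not the actual SentienceAPI schema or SDK.

```python
# Illustrative sketch only: field names (label, role, bbox, is_primary) and the
# selection policy are assumptions, not the real SentienceAPI output format.
from dataclasses import dataclass

@dataclass
class UIElement:
    label: str                        # visible text or accessible name
    role: str                         # e.g. "button", "link"
    bbox: tuple[int, int, int, int]   # x, y, width, height in page pixels
    is_primary: bool                  # visual-hierarchy cue: looks like the main action

def center(el: UIElement) -> tuple[int, int]:
    """Grounded click target: geometric center of the element's bounding box."""
    x, y, w, h = el.bbox
    return x + w // 2, y + h // 2

def pick_action(elements: list[UIElement], goal_keyword: str) -> tuple[int, int]:
    """Toy policy standing in for the LLM: prefer primary elements whose label
    matches the goal, then fall back to any other label match."""
    matches = [el for el in elements if goal_keyword.lower() in el.label.lower()]
    matches.sort(key=lambda el: el.is_primary, reverse=True)
    if not matches:
        raise LookupError(f"no interactable element matches {goal_keyword!r}")
    return center(matches[0])

# Hand-written example of what a "map mode" action space for a product page
# might contain; the real API would produce this from the live page.
action_space = [
    UIElement("Add to Cart", "button", (980, 410, 220, 44), is_primary=True),
    UIElement("Add to List", "button", (980, 470, 220, 36), is_primary=False),
    UIElement("See all reviews", "link", (120, 900, 160, 20), is_primary=False),
]

print(pick_action(action_space, "add to cart"))  # -> (1090, 432)
```

The point of the sketch is the shape of the problem: once the page is a short list like action_space, the model's job shrinks to choosing one element and the click itself is just a coordinate, which is what makes the execution reproducible.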