Show HN: Sentience – Semantic visual grounding for AI agents (WASM and ONNX)

Author: tonyww · about 1 month ago
Hi HN, I'm the solo founder behind SentienceAPI. I spent this past December building a browser automation runtime designed specifically for LLM agents.

The Problem: Building reliable web agents is painful. You essentially have two bad choices:

- Raw DOM: Dumping document.body.innerHTML is cheap and fast, but it overwhelms the context window (100k+ tokens) and lacks spatial context (agents try to click hidden or off-screen elements).

- Vision Models (GPT-4o): Sending screenshots is robust but slow (3–10s latency) and expensive (~$0.01/step). Worse, they often hallucinate coordinates, missing buttons by 10 pixels.

The Solution: Semantic Geometry. Sentience is a "visual cortex" for agents. It sits between the browser and your LLM, turning noisy websites into clean, ranked, coordinate-aware JSON.

How it works (The Stack):

- Client (WASM): A Chrome extension injects a Rust/WASM module that prunes 95% of the DOM (scripts, tracking pixels, invisible wrappers) directly in the browser process. It handles Shadow DOM, nested iframes ("frame stitching"), and computed styles (visibility/z-index) in <50ms.

- Gateway (Rust/Axum): The pruned tree is sent to a Rust gateway that applies heuristic importance scoring using simple visual cues (e.g., is_primary).

- Brain (ONNX): A server-side ML layer (running ms-marco-MiniLM via ort) semantically re-ranks the elements against the user's goal (e.g., "Search for shoes").

Result: Your agent gets a list of the top 50 most relevant interactable elements, each with exact (x, y) coordinates, an importance score, and visual cues to help the LLM make decisions.

Performance:

- Cost: ~$0.001 per step (vs. $0.01+ for vision)
- Latency: ~400ms (vs. 5s+ for vision)
- Payload: ~1,400 tokens (vs. 100k for raw HTML)

Developer Experience (the "cool" stuff): I hated debugging text logs, so I built Sentience Studio, a "time-travel debugger." It records every step (DOM snapshot + screenshot) into a .jsonl trace. You can scrub through the timeline like a video editor to see exactly what the agent saw vs. what it hallucinated.

Links:

- Docs & SDK: https://www.sentienceapi.com/docs
- Python SDK: https://github.com/SentienceAPI/sentience-python
- TypeScript SDK: https://github.com/SentienceAPI/sentience-ts
- Studio demo: https://www.sentienceapi.com/docs/studio
- Build a web agent: https://www.sentienceapi.com/docs/sdk/agent-quick-start
- Screenshot with importance labels (gold stars): https://sentience-screenshots.sfo3.cdn.digitaloceanspaces.com/Screenshot 2026-01-06 at 7.19.41 AM.png

I'm handling the backend in Rust and the SDKs in Python/TypeScript. The project is now in beta, and I'd love feedback on the architecture or the ranking logic!
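To make the "ranked, coordinate-aware JSON" idea concrete, here is a minimal Python sketch of how an agent loop might consume such an element list. The `Element` fields and the `pick_target` helper are hypothetical illustrations of the shape described in the post (coordinates, importance score, `is_primary` cue), not the actual sentience-python API:

```python
# Hypothetical sketch: choosing a click target from a ranked element list.
# Field names (importance, is_primary, x, y) are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Element:
    id: int
    role: str          # e.g., "button", "link", "input"
    text: str
    x: float           # element center, CSS pixels
    y: float
    importance: float  # heuristic score after semantic re-ranking
    is_primary: bool   # visual cue from the gateway's heuristic pass

def pick_target(elements, goal_keywords):
    """Pick the highest-ranked element whose text matches the goal.
    (A trivial stand-in for the LLM's actual decision step.)"""
    candidates = [
        e for e in elements
        if any(k.lower() in e.text.lower() for k in goal_keywords)
    ]
    if not candidates:
        return None
    # Prefer semantic importance; break ties with the is_primary cue.
    return max(candidates, key=lambda e: (e.importance, e.is_primary))

elements = [
    Element(1, "link", "Home", 40, 20, 0.12, False),
    Element(2, "input", "Search products", 300, 60, 0.71, False),
    Element(3, "button", "Search", 420, 60, 0.88, True),
]
target = pick_target(elements, ["search"])
print(target.id, (target.x, target.y))  # → 3 (420.0, 60.0)
```

Because the payload is already pruned and ranked, the LLM only has to choose among a handful of pre-scored candidates rather than parse raw HTML or guess pixel coordinates from a screenshot.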