Show HN: Sentience – Semantic visual grounding for AI agents (WASM and ONNX)
Hi HN, I'm the solo founder behind SentienceAPI. I spent this past December building a browser automation runtime designed specifically for LLM agents.
The Problem: Building reliable web agents is painful. You essentially have two bad options:
Raw DOM: Dumping document.body.innerHTML is cheap and fast, but it overwhelms the context window (100k+ tokens) and lacks spatial context (agents try to click hidden or off-screen elements).
Vision models (GPT-4o): Sending screenshots is robust but slow (3-10s latency) and expensive (~$0.01/step). Worse, these models often hallucinate coordinates, missing buttons by 10 pixels.
The Solution: Semantic Geometry. Sentience is a "visual cortex" for agents: it sits between the browser and your LLM, turning noisy websites into clean, ranked, coordinate-aware JSON.
How it works (the stack):
Client (WASM): A Chrome extension injects a Rust/WASM module that prunes ~95% of the DOM (scripts, tracking pixels, invisible wrappers) directly in the browser process. It handles Shadow DOM, nested iframes ("frame stitching"), and computed styles (visibility/z-index) in under 50ms.
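The filter itself runs in Rust/WASM against live computed styles, but the core idea is a visibility predicate. A minimal Python sketch of that kind of filter (the element fields here are hypothetical, not the actual wire format):

```python
# Illustrative sketch only: the real pruning runs in Rust/WASM against
# live computed styles; these field names are hypothetical.
def is_visible(el: dict) -> bool:
    style = el["computed_style"]
    if style.get("display") == "none" or style.get("visibility") == "hidden":
        return False
    if float(style.get("opacity", "1")) == 0.0:
        return False
    rect = el["rect"]
    return rect["width"] > 0 and rect["height"] > 0  # drop zero-sized wrappers

def prune(elements: list[dict]) -> list[dict]:
    # Drop scripts/styles, then anything the user can't actually see.
    return [el for el in elements
            if el["tag"] not in {"script", "style"} and is_visible(el)]
```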
Gateway (Rust/Axum): The pruned tree is sent to a Rust gateway that applies heuristic importance scoring based on simple visual cues (e.g., is_primary).
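The exact heuristics aren't spelled out here, so this is only a guess at the shape of that scoring step; the cue weights, and every cue name other than is_primary, are assumptions:

```python
# Hypothetical cue weights: the real gateway heuristics aren't public.
CUE_WEIGHTS = {
    "is_primary": 3.0,      # styled call-to-action, e.g. a checkout button
    "in_viewport": 2.0,     # visible without scrolling
    "is_interactive": 1.5,  # <a>, <button>, <input>, role="button", ...
}

def importance(el: dict) -> float:
    score = sum(w for cue, w in CUE_WEIGHTS.items() if el.get(cue))
    # Bigger click targets tend to matter more; cap so banners don't dominate.
    area = el["rect"]["width"] * el["rect"]["height"]
    return score + min(area / 50_000, 1.0)
```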
Brain (ONNX): A server-side ML layer (running ms-marco-MiniLM via ort) semantically re-ranks the elements against the user's goal (e.g., "Search for shoes").
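The production path runs this in Rust via ort; the same re-ranking step can be sketched in Python with the sentence-transformers release of that model family (the element dicts are the hypothetical ones from the sketches above):

```python
from sentence_transformers import CrossEncoder

# Cross-encoder from the same ms-marco-MiniLM family the gateway runs via ort.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(goal: str, elements: list[dict], top_k: int = 50) -> list[dict]:
    # Score every (goal, element text) pair, then keep the best matches.
    scores = reranker.predict([(goal, el["text"]) for el in elements])
    ranked = sorted(zip(scores, elements), key=lambda pair: pair[0], reverse=True)
    return [el for _, el in ranked[:top_k]]

# e.g. rerank("Search for shoes", pruned_elements)
```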
Result: Your agent gets a list of the top 50 most relevant interactable elements, each with exact (x, y) coordinates, an importance score, and visual cues, helping the LLM decide its next action.
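Concretely, a returned element might look something like this; the field names below are my sketch of the shape, not the exact schema:

```python
# Hypothetical element record, shown as a Python dict for illustration.
element = {
    "tag": "button",
    "text": "Add to cart",
    "rect": {"x": 842, "y": 310, "width": 120, "height": 40},
    "center": (902, 330),  # the point the agent would click
    "importance": 0.93,    # combined heuristic + semantic score
    "cues": {"is_primary": True, "in_viewport": True},
}
```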
Performance:
Cost: ~$0.001 per step (vs. $0.01+ for vision models)
Latency: ~400ms (vs. 5s+ for vision models)
Payload: ~1,400 tokens (vs. 100k+ for raw HTML)
Developer experience (the "cool" stuff): I hated debugging text logs, so I built Sentience Studio, a "time-travel debugger." It records every step (DOM snapshot + screenshot) into a .jsonl trace file. You can scrub through the timeline like a video editor to see exactly what the agent saw versus what it hallucinated.
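Because each trace line is one JSON object per step, it's easy to build your own tooling over it. A minimal reader, with step fields that are assumptions about the trace schema:

```python
import json

def load_trace(path: str):
    # One JSON object per line; skip blank lines.
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Field names are hypothetical; the actual trace schema may differ.
for step in load_trace("run.jsonl"):
    print(step.get("step"), step.get("action"), step.get("url"))
```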
Links:
Docs & SDK: [https://www.sentienceapi.com/docs](https://www.sentienceapi.com/docs)
GitHub (SDKs):
Python SDK: [https://github.com/SentienceAPI/sentience-python](https://github.com/SentienceAPI/sentience-python)
TypeScript SDK: [https://github.com/SentienceAPI/sentience-ts](https://github.com/SentienceAPI/sentience-ts)
Studio demo: [https://www.sentienceapi.com/docs/studio](https://www.sentienceapi.com/docs/studio)
Build a web agent: [https://www.sentienceapi.com/docs/sdk/agent-quick-start](https://www.sentienceapi.com/docs/sdk/agent-quick-start)
Screenshots with importance labels (gold stars):
[https://sentience-screenshots.sfo3.cdn.digitaloceanspaces.com/Screenshot](https://sentience-screenshots.sfo3.cdn.digitaloceanspaces.com/Screenshot) 2026-01-06 at 7.19.41 AM.png
I'm handling the backend in Rust and the SDKs in Python/TypeScript. The project is now in beta, and I'd love feedback on the architecture or the ranking logic!