Show HN: Sentience – Semantic visual grounding for AI agents (WASM and ONNX)

Author: tonyww · about 1 month ago
Hi HN, I'm the solo founder behind SentienceAPI. I spent this past December building a browser automation runtime designed specifically for LLM agents.

The Problem: Building reliable web agents is painful. You essentially have two bad choices:

- Raw DOM: Dumping document.body.innerHTML is cheap and fast, but it overwhelms the context window (100k+ tokens) and lacks spatial context (agents try to click hidden or off-screen elements).

- Vision Models (GPT-4o): Sending screenshots is robust but slow (3–10s latency) and expensive (~$0.01/step). Worse, they often hallucinate coordinates, missing buttons by 10 pixels.

The Solution: Semantic Geometry. Sentience is a "visual cortex" for agents. It sits between the browser and your LLM, turning noisy websites into clean, ranked, coordinate-aware JSON.

How it works (The Stack):

- Client (WASM): A Chrome extension injects a Rust/WASM module that prunes 95% of the DOM (scripts, tracking pixels, invisible wrappers) directly in the browser process. It handles Shadow DOM, nested iframes ("frame stitching"), and computed styles (visibility/z-index) in <50ms.

- Gateway (Rust/Axum): The pruned tree is sent to a Rust gateway that applies heuristic importance scoring using simple visual cues (e.g., is_primary).

- Brain (ONNX): A server-side ML layer (running ms-marco-MiniLM via ort) semantically re-ranks the elements against the user's goal (e.g., "Search for shoes").

Result: Your agent gets a list of the top 50 most relevant interactable elements, each with exact (x, y) coordinates, an importance score, and visual cues to help the LLM make decisions.

Performance:

- Cost: ~$0.001 per step (vs. $0.01+ for vision)
- Latency: ~400ms (vs. 5s+ for vision)
- Payload: ~1,400 tokens (vs. 100k for raw HTML)

Developer Experience (the "cool" stuff): I hated debugging text logs, so I built Sentience Studio, a "time-travel debugger." It records every step (DOM snapshot + screenshot) into a .jsonl trace. You can scrub through the timeline like a video editor to see exactly what the agent saw vs. what it hallucinated.

Links:

- Docs & SDK: https://www.sentienceapi.com/docs
- Python SDK: https://github.com/SentienceAPI/sentience-python
- TypeScript SDK: https://github.com/SentienceAPI/sentience-ts
- Studio demo: https://www.sentienceapi.com/docs/studio
- Build a web agent: https://www.sentienceapi.com/docs/sdk/agent-quick-start
- Screenshot with importance labels (gold stars): https://sentience-screenshots.sfo3.cdn.digitaloceanspaces.com/Screenshot 2026-01-06 at 7.19.41 AM.png

I'm handling the backend in Rust and the SDKs in Python/TypeScript. The project is now in beta, and I'd love feedback on the architecture or the ranking logic!
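To make the "ranked, coordinate-aware JSON" idea concrete, here is a minimal Python sketch of how an agent loop might consume such an element list. The `Element` fields and the `pick_target` helper are hypothetical illustrations of the shape described in the post (coordinates, importance score, `is_primary` cue), not the actual sentience-python API:

```python
# Hypothetical sketch: choosing a click target from a ranked element list.
# Field names (importance, is_primary, x, y) are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Element:
    id: int
    role: str          # e.g., "button", "link", "input"
    text: str
    x: float           # element center, CSS pixels
    y: float
    importance: float  # heuristic score after semantic re-ranking
    is_primary: bool   # visual cue from the gateway's heuristic pass

def pick_target(elements, goal_keywords):
    """Pick the highest-ranked element whose text matches the goal.
    (A trivial stand-in for the LLM's actual decision step.)"""
    candidates = [
        e for e in elements
        if any(k.lower() in e.text.lower() for k in goal_keywords)
    ]
    if not candidates:
        return None
    # Prefer semantic importance; break ties with the is_primary cue.
    return max(candidates, key=lambda e: (e.importance, e.is_primary))

elements = [
    Element(1, "link", "Home", 40, 20, 0.12, False),
    Element(2, "input", "Search products", 300, 60, 0.71, False),
    Element(3, "button", "Search", 420, 60, 0.88, True),
]
target = pick_target(elements, ["search"])
print(target.id, (target.x, target.y))  # → 3 (420.0, 60.0)
```

Because the payload is already pruned and ranked, the LLM only has to choose among a handful of pre-scored candidates rather than parse raw HTML or guess pixel coordinates from a screenshot.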