Show HN: Droidrun – an LLM agent for Android

Posted by nodueck 6 days ago
Hi HN,

I'm Nikolai, a software engineer and co-founder at DroidRun. We built DroidRun, an LLM-based agent that leverages the Android Accessibility Tree for precise control and understanding of UI elements. It works on real phones and emulators, and it's open source.

**How it started:**

Our co-founder Niels Schmidt (you'll see him in the demos) coded a prototype and shared a quick video. It went viral, picking up about 50k views on X in under two hours. That moment pushed us to go all-in on DroidRun, and soon after we open-sourced it.

**How it works:**

Most agents rely on screenshots alone for context. We do that too, but we also feed the Accessibility Tree into the LLM. That gives the model structural, hierarchical, and spatial metadata about the UI elements.

Here's an example. Screenshot of a real UI: [https://imgur.com/a/ePRLpyv](https://imgur.com/a/ePRLpyv)

And a matching accessibility JSON snippet:

```json
{
  "index": 3,
  "resourceId": "com.android.settings:id/search_action_bar",
  "className": "LinearLayout",
  "text": "search_action_bar",
  "bounds": "42, 149, 1038, 338",
  "children": [
    {
      "index": 4,
      "resourceId": "com.android.settings:id/search_bar_title",
      "className": "TextView",
      "text": "In Einstellungen suchen",
      "bounds": "189, 205, 768, 282",
      "children": []
    }
  ]
}
```
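To make that concrete, here's a minimal sketch of how a tree like the one above could be flattened into compact, indexed lines for the model's context. This is an illustration only, not our actual serializer; `flatten_tree` and the exact line format are made up for this example.

```python
# Minimal sketch: flatten the accessibility JSON into indexed lines an LLM
# can reference by number. Illustrative only, not DroidRun's real serializer.

def flatten_tree(node: dict, depth: int = 0, lines: list | None = None) -> list:
    """Depth-first walk that emits one indented line per UI element."""
    if lines is None:
        lines = []
    lines.append(
        f"{'  ' * depth}[{node['index']}] {node['className']} "
        f"text={node.get('text', '')!r} bounds=({node['bounds']})"
    )
    for child in node.get("children", []):
        flatten_tree(child, depth + 1, lines)
    return lines

# Using the snippet from above (resourceId fields omitted for brevity):
tree = {
    "index": 3, "className": "LinearLayout",
    "text": "search_action_bar", "bounds": "42, 149, 1038, 338",
    "children": [{
        "index": 4, "className": "TextView",
        "text": "In Einstellungen suchen", "bounds": "189, 205, 768, 282",
        "children": [],
    }],
}
print("\n".join(flatten_tree(tree)))
# [3] LinearLayout text='search_action_bar' bounds=(42, 149, 1038, 338)
#   [4] TextView text='In Einstellungen suchen' bounds=(189, 205, 768, 282)
```

The indices are the important part: the model can refer to an element by number instead of guessing pixel coordinates.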
We also annotate UI regions in the screenshots with numbers, then match them against the tree. This structure gives the agent a deep understanding of what's on screen, even across different device types like tablets.
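Here's a similarly rough sketch of that overlay step, assuming Pillow and the node shape from the JSON above; `parse_bounds` and `annotate` are hypothetical helpers, not our actual code.

```python
# Rough illustration of the annotation step: draw each node's index onto the
# screenshot at its "bounds", so image regions line up with tree nodes.
# Assumes Pillow is installed; not DroidRun's actual implementation.
from PIL import Image, ImageDraw

def parse_bounds(bounds: str) -> tuple:
    """'42, 149, 1038, 338' -> (left, top, right, bottom) pixel box."""
    left, top, right, bottom = (int(v) for v in bounds.split(","))
    return (left, top, right, bottom)

def annotate(screenshot_path: str, nodes: list, out_path: str) -> None:
    img = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for node in nodes:
        box = parse_bounds(node["bounds"])
        draw.rectangle(box, outline="red", width=4)
        # Label the box with the same index the LLM sees in the tree.
        draw.text((box[0] + 8, box[1] + 8), str(node["index"]), fill="red")
    img.save(out_path)

# annotate("screen.png", [{"index": 3, "bounds": "42, 149, 1038, 338"}], "out.png")
```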
This combination allows for better generalization across devices and screen sizes: the agent can act with greater confidence and fewer hallucinations.

**Current status:**

- Ranked #1 on AndroidWorld until recently (it has become highly competitive)
- Supports real devices and emulators
- Strong performance on both simple and complex UI tasks
- Gemini 2.5 Pro works best so far, but we're iterating fast

**What's next:**

We're working on a cloud platform where you can run prompts on Android devices without any setup. Think of an LLM controlling a phone in the cloud, ready to test your automations.

**Looking for:**

- Feedback from HN
- Collaborators who love Android, LLMs, and agents
- OSS contributors