Launch HN: Halluminate (YC S25) – Simulating the internet to train computer use
Hi everyone, Jerry and Wyatt here from Halluminate (<a href="https://halluminate.ai">https://halluminate.ai</a>). We help AI labs train computer use agents with high-quality data and RL environments.
Training AI agents to use computers, browsers, and software is one of the highest-potential opportunities in AI. To date, however, this capability is still unreliable. The emerging method for improving it is Reinforcement Learning with Verifiable Rewards (RLVR), but researchers are currently bottlenecked by a lack of high-quality simulators and task + verifier pairs.
To solve this problem, we're building Westworld, a fully simulated internet made up of synthetic versions of the most common consumer and enterprise apps. Agents use Westworld to learn how to do economically valuable tasks.
For example, AI agents can practice planning vacations on a simulated flight booking site (<a href="https://flights.halluminate.ai">https://flights.halluminate.ai</a>), learn how to reorganize outdated information in a sales platform, or train to do financial modeling directly in a spreadsheet.
Here's a demo showing our flight booking simulation: <a href="https://www.loom.com/share/74a3b28067e24c1b886054ba90a90aa5" rel="nofollow">https://www.loom.com/share/74a3b28067e24c1b886054ba90a90aa5</a>.
How it works: AI agents access our environment and are given a task + verifier. A task is an objective for the agent to achieve, for example "Book me a flight from SF to NYC on this date with x, y, z filters." A verifier is a programmatic way to determine whether the task was successfully completed; in this case, it might be a JSON spec that checks whether the final flight data matches expectations. These signals can then be used to calculate a reward in RL.
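To make the task + verifier pair concrete, here's a minimal sketch of what a verifier could look like for the flight example above, assuming a simple exact-match check against the booking's final state. The field names and the binary reward scheme are illustrative assumptions, not our actual schema:

    # Hypothetical verifier spec: the expected final state for the task
    # "Book me a flight from SF to NYC on this date with x, y, z filters."
    EXPECTED = {
        "origin": "SFO",
        "destination": "JFK",
        "date": "2025-09-01",   # placeholder date
        "stops": 0,             # e.g. a "nonstop only" filter
    }

    def verify(final_state: dict, expected: dict = EXPECTED) -> float:
        """Return a binary RL reward: 1.0 if every expected field matches
        the flight the agent actually booked, else 0.0."""
        matched = all(final_state.get(k) == v for k, v in expected.items())
        return 1.0 if matched else 0.0

    # An agent that booked the right nonstop flight earns reward 1.0:
    print(verify({"origin": "SFO", "destination": "JFK",
                  "date": "2025-09-01", "stops": 0}))

In practice a verifier can be richer (partial credit, intermediate checkpoints), but even a binary signal like this is enough to drive the reward calculation mentioned above.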
The more simulators we build, the more AI labs can improve the capabilities that computer use agents are currently weak at. One of our customers saw a ~20% improvement in date-picking performance after training on our flight booking simulator.
Two things have made this hard so far:
(1) The simulations have to be realistic. You can't get away with a vibe-coded "80% solution" because even small divergences hurt performance. Generating simulated data is even harder; for example, massaging flight data to look realistic took a lot of trial and experimentation (see the sketch after this list).
(2) The tasks you train agents on have to be well chosen. They are only valuable if they reflect work that people actually want solved. We need a lot of feedback from domain experts to get this right.
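On point (1), to give a flavor of what "massaging flight data to look realistic" means: fares and durations need the correlations people expect, e.g. price tracks distance, nonstops carry a premium, and numbers are noisy rather than uniform. Here is a toy sketch under those assumptions; the routes, coefficients, and fields are invented for illustration and are not our actual pipeline:

    import random

    # Hypothetical route table: (origin, destination, approx. miles).
    ROUTES = [("SFO", "JFK", 2586), ("SFO", "LAX", 337), ("JFK", "ORD", 740)]

    def synth_flight(rng: random.Random) -> dict:
        origin, dest, miles = rng.choice(ROUTES)
        stops = rng.choices([0, 1, 2], weights=[5, 3, 1])[0]
        # Price loosely tracks distance; nonstops carry a premium,
        # plus noise so fares don't look machine-generated.
        base = 50 + 0.11 * miles
        price = base * (1.25 if stops == 0 else 1.0) * rng.uniform(0.85, 1.3)
        # Duration: rough cruise time plus a layover penalty per stop.
        minutes = int(miles / 8 + 40 + stops * rng.randint(45, 120))
        return {"origin": origin, "destination": dest, "stops": stops,
                "price": round(price, 2), "duration_min": minutes}

    rng = random.Random(7)
    for _ in range(3):
        print(synth_flight(rng))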
That said, we find this work incredibly interesting and are excited to tackle these issues. A few things we're pumped to ship in the near term:

- Long-horizon tasks: the ability to train on extended workflows by stringing multiple simulators together;

- Procedural data generation: instead of synthetically generating all the data upfront, modeling data generation so that our simulators are populated procedurally as agents explore, think Minecraft (see the sketch after this list);

- Open source! We plan to release our environments to the public so developers and researchers can hack on them for their own experimentation.
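For the procedural generation point, here's a rough mental model as a toy Python sketch. It is purely illustrative, assuming a Minecraft-style scheme where page content is derived deterministically from a world seed the first time an agent visits a URL; the function names and fields are invented for this post, not our implementation:

    import hashlib

    WORLD_SEED = 42   # hypothetical per-environment seed
    _cache = {}       # url -> generated page

    def page_for(url: str) -> dict:
        """Generate a page lazily the first time an agent navigates to it.
        Deterministic in (WORLD_SEED, url), so replays see the same world."""
        if url not in _cache:
            digest = hashlib.sha256(f"{WORLD_SEED}:{url}".encode()).hexdigest()
            # Derive stable pseudo-random content from the digest.
            _cache[url] = {
                "url": url,
                "num_results": 5 + int(digest[:2], 16) % 20,
                "has_promo_banner": int(digest[2:4], 16) % 4 == 0,
            }
        return _cache[url]

    # The same URL always yields the same page, even across episodes:
    assert page_for("/flights?from=SFO&to=JFK") == page_for("/flights?from=SFO&to=JFK")

The appeal of this kind of scheme is an effectively unbounded world with a small storage footprint, plus reproducibility when debugging agent trajectories.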
RL simulators are just one part of our business. The other part is human data creation (think Scale AI, but for computer use). We produce off-the-shelf pre-training/fine-tuning datasets, expert human evaluation/error analysis, or any other data our customers need. There are also a lot of exciting overlaps between the two, for example using human experts to help create our simulators and tasks. Happy to go into more detail, but we thought simulators would make for the more interesting HackerNews post :)
Finally, about us: Wyatt and I met while studying CS at Cornell and have been living and working together for over 7 years. I previously led product/research at Capital One Labs, where I launched one of the first AI agents in banking. Wyatt was previously a Cornell Milstein scholar and did large-scale data engineering for two early-stage startups in NYC. We left our jobs last year and faced these problems first-hand while building evals for our customers, browser/computer use agent companies.
If anyone has any questions, feedback, or thoughts, please let us know! Looking forward to your comments.