Launch HN: Design Arena (YC S25) - Head-to-head AI benchmark for aesthetics
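Hi HN, I'm Grace from Design Arena (<a href="https://www.designarena.ai">https://www.designarena.ai</a>) - we're building a crowdsourced benchmark for AI-generated visuals (websites, images, video, and more). We put AI models and builder tools in head-to-head comparisons that get voted on by real users from around the world. Think "Hot or Not" for the AI era :)
(Btw, when we say real users we mean <i>real</i> users, so you may get a captcha on the site. Sorry, but we have to use every bot protection available! We only want human ratings, for obvious reasons.)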
Here's a demo video: <a href="https://www.youtube.com/watch?v=vPyEQnuVgeI" rel="nofollow">https://www.youtube.com/watch?v=vPyEQnuVgeI</a>
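We didn't set out to build this - we were actually working on an AI game engine. But we found that models sucked at look-and-feel. Even when the output code was functional, most of the visuals lacked the soul that makes great graphics feel alive.
So we built a this-or-that game, just for ourselves, to figure out which generated outputs had the best graphics. To our surprise, that turned out to be more exciting than the original idea, because this is a widespread problem! We did a Show HN a month ago (<a href="https://news.ycombinator.com/item?id=44542578">https://news.ycombinator.com/item?id=44542578</a>) and that was partly what convinced us to make this benchmark our actual product.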
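State-of-the-art models might be winning IMO gold, but they are still putting white text on a white background. There needs to be <i>some</i> measurement of what's good and what isn't (yes, there is such a thing as good design!), and it sure isn't going to come from LLMs.
We come from engineering backgrounds (Apple and Nvidia) with a love for design; we know when we like or dislike something, even when we can't say why. This-or-that / hot-or-not games are made for domains like this: Design Arena's goal is to make everything stupidly simple so humans can just do the easy part: like vs. dislike. That also turns out to be the valuable part, because what's easiest for humans is exactly the part that AIs can't currently do.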
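To make the this-or-that idea concrete: the post does not say how votes are turned into rankings, but one common way to aggregate pairwise votes is an Elo-style rating. The following is a minimal sketch under that assumption, not a description of Design Arena's actual method. The function name `elo_leaderboard`, the `k` and `base` parameters, and the model names are all hypothetical.

```python
from collections import defaultdict


def elo_leaderboard(votes, k=32.0, base=1000.0):
    """Rank contestants from (winner, loser) votes, processed in order.

    Assumption: a plain Elo update; the real site may aggregate differently.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability the winner was expected to win, per the Elo logistic curve.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An upset (low expected score) moves the ratings more than a predictable win.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    # Highest rating first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes: each tuple is one human "I like this one better" click.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
board = elo_leaderboard(votes)
for name, rating in board:
    print(f"{name}: {rating:.1f}")
```

Each vote only needs a like-vs.-dislike decision from the human; the arithmetic does the rest. Because every update adds to one rating exactly what it subtracts from the other, the ratings always sum to the same total.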
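Since our Show HN, we've extended our initial set of ~25 LLMs to 54 LLMs, 12 image models, 4 video models, 22 audio models, and 22 vibe-coding tools (like Lovable, Bolt, v0, Firebase Studio, and more). In this last category, we've been surprised to find that agentic tools that were not specifically marketed as vibe-coders, like Devin, performed exceedingly well in the builder category, outperforming dedicated builder tools like Lovable, v0, and Bolt.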
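Our users are mostly devs who want to spin up a frontend, or designers who want to spin up design variants faster. In both cases, Design Arena provides a quick way to find out which options are better than others. The dev or designer still needs to make the final call, because there's no substitute for good judgment. But this kind of format can really help.
We plan to make money by offering version testing as a service to companies that need to quantify improvements in their product between builds.
This is the first time we've ever worked on something like this! We'd love to learn from you all and look forward to your feedback.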