Ask HN: What benchmarks do you use to evaluate AI models?

2 points · by cowpig · 9 months ago
There are so many models, and so many new ones being released all the time, that I have a hard time knowing which ones to prioritize testing anecdotally. What benchmarks have you found to be especially indicative of real-world performance?

I use:

* Aider's Polyglot benchmark seems to be a decent indicator of which models are going to be good at coding: https://aider.chat/docs/leaderboards/

* I generally assume OpenRouter usage to be an indicator of a model's popularity, and by proxy, utility: https://openrouter.ai/rankings

* LLM-Stats has a lot of charts of benchmarks that I look at: https://llm-stats.com/
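For the "testing anecdotally" part, a small personal harness can complement public leaderboards: run a fixed prompt set through each candidate model and score the replies with simple checkers. The sketch below is a minimal, hypothetical version; the model callable, the `Task` shape, and the two example tasks are assumptions for illustration. In practice the callable would wrap an API client (OpenRouter exposes an OpenAI-compatible chat completions endpoint, so any model on its rankings page can be plugged in behind the same function signature).

```python
# Minimal sketch of an anecdotal eval harness (assumed design, not a
# standard benchmark): each task pairs a prompt with a checker function
# that decides whether the model's reply passes.
from typing import Callable, List, Tuple

Task = Tuple[str, Callable[[str], bool]]  # (prompt, checker for the reply)

def run_eval(model: Callable[[str], str], tasks: List[Task]) -> float:
    """Return the fraction of tasks whose reply passes its checker."""
    passed = 0
    for prompt, check in tasks:
        try:
            passed += bool(check(model(prompt)))
        except Exception:
            pass  # a crashing model call or checker counts as a failure
    return passed / len(tasks) if tasks else 0.0

# Tiny illustrative task set (made up for this sketch; a real suite
# would use your own domain-specific prompts):
TASKS: List[Task] = [
    ("What is 2 + 2? Reply with just the number.", lambda r: "4" in r),
    ("Name the capital of France in one word.", lambda r: "paris" in r.lower()),
]
```

Swapping in a real client (e.g. one thin wrapper per OpenRouter model ID) and comparing `run_eval` scores side by side gives a quick, repeatable sanity check that a leaderboard number actually transfers to your own workload.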