Ask HN: What benchmarks do you use to evaluate AI models?

2 points · by cowpig · 9 months ago
There are so many models, and so many new ones being released all the time, that I have a hard time knowing which ones to prioritize testing anecdotally. What benchmarks have you found to be especially indicative of real-world performance?

I use:

* Aider's Polyglot benchmark seems to be a decent indicator of which models are going to be good at coding: https://aider.chat/docs/leaderboards/

* I generally assume OpenRouter usage to be an indicator of a model's popularity, and by proxy, utility: https://openrouter.ai/rankings

* LLM-Stats has a lot of charts of benchmarks that I look at: https://llm-stats.com/
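For the "testing anecdotally" part, a small personal harness can complement public leaderboards: run a fixed prompt set through each candidate model and score the replies with simple checkers. The sketch below is a minimal, hypothetical version; the model callable, the `Task` shape, and the two example tasks are assumptions for illustration. In practice the callable would wrap an API client (OpenRouter exposes an OpenAI-compatible chat completions endpoint, so any model on its rankings page can be plugged in behind the same function signature).

```python
# Minimal sketch of an anecdotal eval harness (assumed design, not a
# standard benchmark): each task pairs a prompt with a checker function
# that decides whether the model's reply passes.
from typing import Callable, List, Tuple

Task = Tuple[str, Callable[[str], bool]]  # (prompt, checker for the reply)

def run_eval(model: Callable[[str], str], tasks: List[Task]) -> float:
    """Return the fraction of tasks whose reply passes its checker."""
    passed = 0
    for prompt, check in tasks:
        try:
            passed += bool(check(model(prompt)))
        except Exception:
            pass  # a crashing model call or checker counts as a failure
    return passed / len(tasks) if tasks else 0.0

# Tiny illustrative task set (made up for this sketch; a real suite
# would use your own domain-specific prompts):
TASKS: List[Task] = [
    ("What is 2 + 2? Reply with just the number.", lambda r: "4" in r),
    ("Name the capital of France in one word.", lambda r: "paris" in r.lower()),
]
```

Swapping in a real client (e.g. one thin wrapper per OpenRouter model ID) and comparing `run_eval` scores side by side gives a quick, repeatable sanity check that a leaderboard number actually transfers to your own workload.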