Ask HN: What benchmarks do you use to evaluate AI models?
There are so many models, and so many new ones being released all the time, that I have a hard time knowing which ones to prioritize testing anecdotally. What benchmarks have you found to be especially indicative of real-world performance?
I use:
* Aider's Polyglot benchmark seems to be a decent indicator of which models are going to be good at coding:
https://aider.chat/docs/leaderboards/
* I generally assume OpenRouter usage to be an indicator of a model's popularity, and by proxy, utility:
https://openrouter.ai/rankings
* LLM-Stats has a lot of charts of benchmarks that I look at:
https://llm-stats.com/