In stat-dense domains such as cricket, LLMs (GPT-5) hallucinate at a high rate

2 | by sp1982 | 10 days ago
Disclaimer: I am not an ML researcher, so the terminology here is informal/wonky. Apologies!

I'm running a small experiment to see whether models "know when they know," using T20 international cricket scorecards (cricsheet.com as the source). The idea is to test models on publicly available data they likely saw during training and see whether they hallucinate or admit they don't know.

Setup: each question comes from a single T20 match. The model must return an answer (a number or a choice from the given options) or `no_answer`.

Results (N=100 per model):

- gpt-4o-search-preview • Answer rate: 0.96 • Accuracy: 0.88 • Accuracy (answered): 0.91 • Hallucination (answered): 0.09 • Wrong/100: 9
- gpt-5 • Answer rate: 0.35 • Accuracy: 0.27 • Accuracy (answered): 0.77 • Hallucination (answered): 0.23 • Wrong/100: 8
- gpt-4o-mini • Answer rate: 0.37 • Accuracy: 0.14 • Accuracy (answered): 0.38 • Hallucination (answered): 0.62 • Wrong/100: 23
- gpt-5-mini • Answer rate: 0.05 • Accuracy: 0.02 • Accuracy (answered): 0.40 • Hallucination (answered): 0.60 • Wrong/100: 3

Note: most of the remaining "errors" with search are obscure or disputed cases where public sources disagree.

It seems that in domains where models may have seen *some* of the data, relying on abstention + RAG works better than a larger model with broader coverage but a worse hallucination rate.

Code/Data: https://github.com/jobswithgpt/llmcriceval
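For concreteness, here is a rough sketch of the question/answer format described in the setup. The field names and prompt wording are my own illustration, not taken from the llmcriceval repo.

```python
# Illustrative sketch of one eval item and abstention handling.
# Field names and prompt text are hypothetical, not from the llmcriceval repo.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CricketQuestion:
    match_id: str                 # one T20I match per question (cricsheet match ID)
    question: str                 # e.g. "How many runs did the side batting first score?"
    options: Optional[list[str]]  # present only for multiple-choice questions
    gold: str                     # correct answer taken from the scorecard

SYSTEM_PROMPT = (
    "Answer the question about the given T20 international match. "
    "Reply with a number, one of the listed options, or the literal string "
    "no_answer if you are not sure."
)

def is_abstention(reply: str) -> bool:
    """A reply counts as an abstention only if it is exactly the sentinel."""
    return reply.strip().lower() == "no_answer"
```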
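The reported metrics fit together simply: overall accuracy ≈ answer rate × accuracy (answered), and wrong/100 is the count of answered-but-wrong questions, e.g. for gpt-5, 0.35 × 0.77 ≈ 0.27 and 0.35 × 0.23 × 100 ≈ 8. A minimal reconstruction of the scoring, assuming one record per question with `answered` and `correct` flags (my own sketch, not the repo's code):

```python
# Reconstruction of the reported metrics from per-question results.
# Assumes each record is a dict with boolean "answered" and "correct" fields.
def score(results: list[dict]) -> dict:
    n = len(results)
    answered = [r for r in results if r["answered"]]
    correct = [r for r in answered if r["correct"]]
    acc_answered = len(correct) / len(answered) if answered else 0.0
    return {
        "answer_rate": len(answered) / n,              # e.g. 0.35 for gpt-5
        "accuracy": len(correct) / n,                  # correct over all N questions
        "accuracy_answered": acc_answered,             # correct, given the model answered
        "hallucination_answered": 1.0 - acc_answered,  # wrong, given the model answered
        "wrong_per_100": 100.0 * (len(answered) - len(correct)) / n,
    }
```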
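The takeaway about abstention + RAG suggests a simple routing pattern: query the model directly, and fall back to retrieval only when it abstains. A minimal sketch of that idea with hypothetical stub functions (this is not how the linked repo is implemented):

```python
# Sketch of the "abstention + RAG" routing suggested above.
# ask_model() and retrieve_scorecard() are hypothetical stubs, not real APIs.
from typing import Optional

def ask_model(question: str, context: Optional[str] = None) -> str:
    """Stub: call your LLM here; the prompt should permit 'no_answer'."""
    return "no_answer"  # placeholder so the sketch runs as written

def retrieve_scorecard(question: str) -> str:
    """Stub: fetch the relevant cricsheet scorecard for the match in question."""
    return "ball-by-ball scorecard text ..."

def answer_with_fallback(question: str) -> str:
    reply = ask_model(question)
    if reply.strip().lower() != "no_answer":
        return reply  # the model was confident enough to answer directly
    # The model abstained: retry with retrieved scorecard data as grounding.
    return ask_model(question, context=retrieve_scorecard(question))
```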