Tell HN: Gemini 3.5 Flash fails in stupid ways
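I thought I was going crazy: I was using Gemini 3.5 Flash to rate some answers, and it kept giving 7 instead of 10 for correct answers.
Apparently, once you add a "Grading criteria" text to the prompt, the model collapses into a "compressed toward the center of the scale" hallucination (or it is training-set overfitting).
Someone on X asked me to try to reproduce it, and I actually got it on the first try in their Gemini Chat: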
https://x.com/XCSme/status/2057613611959279988
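For anyone who wants to check this on their own prompts rather than in the chat UI, a rough A/B harness might look like the sketch below. It grades the same known-correct answers with and without a "Grading criteria" line and compares the score distributions. Everything here is an assumption for illustration: `grade` is a hypothetical function you would wire to whatever model client you use, and the `CRITERIA` wording and prompt layout are made up, not the prompt from the post.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical criteria text; the wording used in the original report is not known.
CRITERIA = "Grading criteria: 10 = fully correct, 5 = partially correct, 0 = wrong."


def build_prompt(question: str, answer: str, with_criteria: bool) -> str:
    """Build a grading prompt, optionally including the criteria line."""
    parts = [f"Question: {question}", f"Answer: {answer}"]
    if with_criteria:
        parts.append(CRITERIA)
    parts.append("Rate the answer from 0 to 10. Reply with the number only.")
    return "\n".join(parts)


def compare(
    grade: Callable[[str], int],
    items: List[Tuple[str, str]],
    runs: int = 5,
) -> Dict[str, dict]:
    """Grade every (question, answer) pair `runs` times per prompt variant.

    `grade` is a placeholder: it takes a prompt string and returns the
    integer score parsed from the model's reply.
    """
    results = {}
    for label, with_criteria in (("plain", False), ("with_criteria", True)):
        scores = [
            grade(build_prompt(q, a, with_criteria))
            for q, a in items
            for _ in range(runs)
        ]
        results[label] = {"mean": mean(scores), "counts": Counter(scores)}
    return results
```

If the items are all known-correct answers, a "plain" mean near 10 next to a "with_criteria" mean stuck around 7 would match the behavior described above. A handful of runs per item helps separate a consistent shift from sampling noise.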
I'm not sure what to make of this model (or most SOTA models). They got a lot smarter at coding and tool usage, but a lot dumber in other ways...