Ask HN: Is GPT-5 a regression, or is it just me?

Posted by technocratius, 5 months ago
Context: I have been using GPT-5 since its release over a month ago, within my Plus subscription. Before this release, I heavily relied on o3 for most complex tasks, with 4o for simple questions. I use it for a mix of scientific-literature web search, e.g. for understanding health-related topics, the occasional coding assistance, and help with *nix sysadmin tasks. Note that I have not used its API or any IDE integration.

Based on a month of GPT-5 usage, this model feels primarily like a regression:

1. It's slow: thinking mode can take ages, and sometimes gets downright stuck. Its auto-assessment of whether or not it needs to think feels poorly tuned to most tasks and defaults too easily to deep reasoning mode.

2. Hallucinations are in overdrive: I would assess that in 7/10 tasks, hallucinations continuously clutter the responses and warrant corrections, careful monitoring, and steering back. It hallucinates list items that weren't in your prompt, software package functionalities/capabilities, CLI parameters, etc. Even thorough prompting with explicit links to sources, e.g. within deep research, frequently goes off the rails.

3. Not self-critical: even in thinking mode, it frequently spews out incorrect material that a blunt "this is not correct, check your answer" can directly correct.

Note: I am not a super-advanced prompt engineer, and the assessment above is mainly relative to the previous generation of models. I would expect that as model capabilities progress, the need for users to apply careful prompt engineering goes down, not up.

I am very curious to hear your experiences.