Hacker News (Chinese edition)

I spent some time today comparing Opus 4.6 and 4.7 using my own usage data to see how they actually behave side by side. It's still pretty early for 4.7, but a few things surprised me.

In my sessions, 4.7 gets things right on the first try less often than 4.6. The one-shot rate sits around 74.5% vs 83.8%, and I am seeing roughly double the retries per edit (0.46 vs 0.22).

It also produces a lot more output per call, about 800 tokens vs 372 on 4.6, which makes it noticeably more expensive. Cost per call is $0.185 vs $0.112.

When I broke it down by task type, coding and debugging both looked weaker on 4.7. Coding one-shot dropped from 84.7% to 75.4%, debugging from 85.3% to 76.5%. Feature work was slightly better on 4.7 (75% vs 71.4%), but the sample is small. Delegation showed a big gap (100% vs 33.3%), though that one only has 3 samples on the 4.7 side, so I wouldn't read much into it yet.

4.7 also uses fewer tools per turn (1.83 vs 2.77) and barely delegates to subagents (0.6% vs 3.1%). Not sure yet if that's a style difference or just the smaller sample.

A couple of caveats. This is about 3 days of 4.7 data (3,592 calls) vs 8 days of 4.6 (8,020 calls). Some categories only have a handful of examples. These numbers will shift with more usage, and your results will probably look different depending on what kind of work you do.

npx codeburn compare
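To put the figures above in perspective, here is a minimal Python sketch (not part of the original post or of the codeburn tool) that computes the relative differences from the quoted per-call numbers and shows how wide a 95% Wilson score interval is for a category with only 3 samples. The helper names `relative_change` and `wilson_interval` are hypothetical, and since the post does not give raw success counts, the loop simply walks through every possible outcome for n = 3.

```python
import math
from typing import Tuple


def relative_change(new: float, old: float) -> float:
    """Percent change of `new` relative to `old`."""
    return (new - old) / old * 100.0


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Figures quoted in the post (4.7 first, 4.6 second).
    print(f"cost per call:  {relative_change(0.185, 0.112):+.1f}%")
    print(f"output tokens:  {relative_change(800, 372):+.1f}%")
    print(f"retries/edit:   {0.46 / 0.22:.2f}x")

    # Raw counts are not given in the post, so show every outcome for n = 3.
    for k in range(4):
        lo, hi = wilson_interval(k, 3)
        print(f"{k}/3 successes: 95% interval {lo:.1%} to {hi:.1%}")
```

With only 3 samples, the interval spans a large part of the 0 to 100% range whatever the observed rate, which is consistent with the caution about the delegation numbers.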

Opus 4.7 vs. 4.6: a comparison after 3 days of side-by-side use in my real coding sessions.