Opus 4.7 vs. 4.6 after 3 days of side-by-side use in my real coding sessions
I spent some time today comparing Opus 4.6 and 4.7 using my own usage data to see how they actually behave side by side.<p>It's still pretty early for 4.7, but a few things surprised me.<p>In my sessions, 4.7 gets things right on the first try less often than 4.6. The one-shot rate sits around 74.5% on 4.7 vs 83.8% on 4.6, and I am seeing roughly double the retries per edit (0.46 vs 0.22).<p>It also produces a lot more output per call, about 800 tokens vs 372 on 4.6, which makes it noticeably more expensive. Cost per call is $0.185 vs $0.112.<p>When I broke it down by task type, coding and debugging both looked weaker on 4.7. Coding one-shot dropped from 84.7% to 75.4%, debugging from 85.3% to 76.5%. Feature work was slightly better on 4.7 (75% vs 71.4%), but the sample is small. Delegation showed a big gap (100% vs 33.3%), though that one only has 3 samples on the 4.7 side, so I wouldn't read much into it yet.<p>4.7 also uses fewer tools per turn (1.83 vs 2.77) and barely delegates to subagents (0.6% vs 3.1%). Not sure yet if that's a style difference or just the smaller sample.<p>A couple of caveats. This is about 3 days of 4.7 data (3,592 calls) vs 8 days of 4.6 (8,020 calls). Some categories only have a handful of examples. These numbers will shift with more usage, and your results will probably look different depending on what kind of work you do.<p>npx codeburn compare
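To put rough numbers on the "small sample" caveat, here is a minimal Python sketch, not anything codeburn itself produces. It computes the ratios implied by the per-call figures above and a 95% Wilson score interval for a 3-sample category. The `wilson_interval` helper is my own, and the 1-of-3 and 3-of-3 counts are assumptions read off the 33.3% and 100% rates, since the post only gives the sample size.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Ratios implied by the figures quoted in the post (4.7 relative to 4.6).
output_ratio = 800 / 372      # output tokens per call
cost_ratio = 0.185 / 0.112    # dollars per call
retry_ratio = 0.46 / 0.22     # retries per edit

# Assumed counts: a 33.3% rate on 3 samples read as 1 of 3,
# and a 100% rate on 3 samples read as 3 of 3.
low_1_of_3, high_1_of_3 = wilson_interval(1, 3)
low_3_of_3, high_3_of_3 = wilson_interval(3, 3)

if __name__ == "__main__":
    print(f"output per call: {output_ratio:.2f}x")
    print(f"cost per call:   {cost_ratio:.2f}x")
    print(f"retries/edit:    {retry_ratio:.2f}x")
    print(f"1 of 3: {low_1_of_3:.1%} to {high_1_of_3:.1%}")
    print(f"3 of 3: {low_3_of_3:.1%} to {high_3_of_3:.1%}")
```

With only 3 samples the interval for 1 of 3 runs from roughly 6% to 79%, which overlaps almost any plausible rate for the other model. That is the arithmetic behind not reading much into the delegation gap yet.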