Ask HN: Why is Apple's voice transcription so bad?

Posted by keepamovin, about 1 month ago
Why is Apple’s voice transcription so hilariously bad?

Even 2–3 years ago, OpenAI’s Whisper models delivered better, near-instant voice transcription offline, and the model was only about 500 MB. With that context, it’s hard to understand how Apple’s transcription, which runs online on powerful servers, performs so poorly today.

Here are real examples from using the iOS native app just now:

- “BigQuery update” → “bakery update”
- “GitHub” → “get her”
- “CI build” → “CI bill”
- “GitHub support” → “get her support”

These aren’t obscure terms; they’re extremely common words in software, spoken clearly in casual contexts. The accuracy gap feels especially stark compared to what was already possible years ago, even fully offline.

Is this primarily a model-quality issue, a streaming/segmentation problem, aggressive post-processing, or something architectural in Apple’s speech stack? What are the real technical limitations, and why hasn’t it improved despite modern hardware and cloud processing?
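For anyone who wants to reproduce the comparison, below is a minimal sketch of running Whisper offline on a recording of the same phrases. It assumes the open-source openai-whisper Python package, ffmpeg on the PATH, and a placeholder file name memo.m4a; the "small" checkpoint is roughly the ~500 MB class of model mentioned above.

```python
# Minimal offline transcription sketch using the open-source openai-whisper package.
# Install with: pip install openai-whisper   (ffmpeg must be available on the PATH)
import whisper

# "small" downloads a checkpoint roughly in the ~500 MB range mentioned above;
# "tiny"/"base" are smaller and faster, "medium"/"large" are larger and more accurate.
model = whisper.load_model("small")

# "memo.m4a" is a placeholder: record the phrases above (e.g. in Voice Memos) and export it.
result = model.transcribe("memo.m4a")
print(result["text"])
```

Dictating the same phrases into the iOS keyboard and then running this script on the recording makes the gap easy to compare side by side.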