Show HN: A 17MB pronunciation scorer that beats human experts at the phoneme level
I built an English pronunciation assessment engine that fits in 17MB and runs in under 300ms on CPU.

Architecture: CTC forced alignment + GOP scoring + ensemble heads (MLP + XGBoost). No wav2vec2 or large self-supervised models — the entire pipeline uses a quantized NeMo Citrinet-256 as the acoustic backbone.

Benchmarked on speechocean762 (standard academic benchmark, 2500 utterances):
- Phone-level PCC (Pearson correlation with human scores): 0.580 — exceeds human inter-annotator agreement (0.555)
- Sentence accuracy: 0.710 — exceeds human agreement (0.675)
- Model is 70x smaller than wav2vec2-based SOTA

Trade-off: we're ~10-15% below SOTA on raw accuracy. But for real-time feedback in language-learning apps, the latency/size trade-off is worth it.

Available as a REST API, an MCP server (for AI agents), and on Azure Marketplace.

Demo: https://huggingface.co/spaces/fabiosuizu/pronunciation-assessment

Interested in feedback on the scoring approach and on use cases people would find valuable.
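For readers unfamiliar with GOP (Goodness of Pronunciation) scoring, here is a minimal sketch of the classic formulation: given frame-level phoneme log-posteriors from an acoustic model and a forced alignment mapping each expected phone to a frame span, the GOP of a phone is the average log-ratio between the expected phone's posterior and the best-scoring phone's posterior. This is an illustrative sketch, not the engine's actual implementation — the function names, shapes, and alignment format are assumptions, and the real pipeline adds ensemble heads on top of features like these.

```python
import numpy as np

def gop_scores(log_posteriors, alignment):
    """Classic GOP per aligned phone.

    log_posteriors: (T, P) array of per-frame log P(phone | frame).
    alignment: list of (phone_id, start_frame, end_frame) tuples
               produced by CTC forced alignment.
    Returns a list with one GOP score per phone; 0 means the expected
    phone was the top hypothesis in every frame, more negative means
    a likely mispronunciation.
    """
    scores = []
    for phone_id, start, end in alignment:
        frames = log_posteriors[start:end]      # (n_frames, P)
        target = frames[:, phone_id]            # log-posterior of the expected phone
        best = frames.max(axis=1)               # log-posterior of the best competing phone
        scores.append(float(np.mean(target - best)))
    return scores

# Toy example: 6 frames over a 3-phone inventory, two aligned phones.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 3))
# Normalize logits into proper log-posteriors per frame.
logp = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
print(gop_scores(logp, [(0, 0, 3), (2, 3, 6)]))
```

Because `best` includes the target phone itself, every GOP score is at most 0; a threshold (or, as in this engine, learned regression heads over such features) then maps scores to human-style ratings.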