Show HN: A 17MB pronunciation scorer that beats human experts at the phoneme level
I built an English pronunciation assessment engine that fits in 17MB and runs in under 300ms on CPU.

Architecture: CTC forced alignment + GOP scoring + ensemble heads (MLP + XGBoost). No wav2vec2 or large self-supervised models — the entire pipeline uses a quantized NeMo Citrinet-256 as the acoustic backbone.

Benchmarked on speechocean762 (standard academic benchmark, 2500 utterances):
- Phone-level PCC (Pearson correlation with human scores): 0.580 — exceeds human inter-annotator agreement (0.555)
- Sentence accuracy: 0.710 — exceeds human agreement (0.675)
- Model is 70x smaller than wav2vec2-based SOTA

Trade-off: we're ~10-15% below SOTA on raw accuracy. But for real-time feedback in language-learning apps, the latency/size trade-off is worth it.

Available as a REST API, an MCP server (for AI agents), and on Azure Marketplace.

Demo: https://huggingface.co/spaces/fabiosuizu/pronunciation-assessment

Interested in feedback on the scoring approach and on use cases people would find valuable.
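For readers unfamiliar with GOP (Goodness of Pronunciation) scoring, here is a minimal sketch of the classic formulation: given frame-level phoneme log-posteriors from an acoustic model and a forced alignment mapping each expected phone to a frame span, the GOP of a phone is the average log-ratio between the expected phone's posterior and the best-scoring phone's posterior. This is an illustrative sketch, not the engine's actual implementation — the function names, shapes, and alignment format are assumptions, and the real pipeline adds ensemble heads on top of features like these.

```python
import numpy as np

def gop_scores(log_posteriors, alignment):
    """Classic GOP per aligned phone.

    log_posteriors: (T, P) array of per-frame log P(phone | frame).
    alignment: list of (phone_id, start_frame, end_frame) tuples
               produced by CTC forced alignment.
    Returns a list with one GOP score per phone; 0 means the expected
    phone was the top hypothesis in every frame, more negative means
    a likely mispronunciation.
    """
    scores = []
    for phone_id, start, end in alignment:
        frames = log_posteriors[start:end]      # (n_frames, P)
        target = frames[:, phone_id]            # log-posterior of the expected phone
        best = frames.max(axis=1)               # log-posterior of the best competing phone
        scores.append(float(np.mean(target - best)))
    return scores

# Toy example: 6 frames over a 3-phone inventory, two aligned phones.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 3))
# Normalize logits into proper log-posteriors per frame.
logp = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
print(gop_scores(logp, [(0, 0, 3), (2, 3, 6)]))
```

Because `best` includes the target phone itself, every GOP score is at most 0; a threshold (or, as in this engine, learned regression heads over such features) then maps scores to human-style ratings.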