Show HN: I enhanced Soundex to correctly handle multi-word strings
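Hello HN.
I built Flookup Data Wrangler, a powerful Google Sheets add-on for data cleaning without writing a single line of code.
Traditional Soundex is designed for single words like "John" and "Jonny" (both encode to J500), which makes comparing such strings straightforward. However, typical Soundex output cannot handle multi-word or reordered string comparisons like "John Doe" vs "Doe Jonny", because encoding each string as a whole produces inaccurate results.
To address this, I modified the Soundex algorithm to support multi-word and reordered strings by adding a helper function that re-encodes the output into a format usable for accurate text-to-text comparisons. The extra step adds minimal overhead, so the impact on performance is negligible.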
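For readers who want something concrete, here is a minimal sketch of one way such a re-encoding could work. The post does not describe the helper function, so `phrase_key` below is a hypothetical stand-in, not Flookup's actual implementation: it assumes each word is encoded separately with classic American Soundex and the codes are then sorted so that word order no longer matters.

```python
import re

# Classic Soundex letter groups; vowels, H, W and Y have no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(text):
    """Classic 4-character American Soundex over the letters A-Z in text."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        # H and W do not separate letters with the same digit; vowels do.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


def phrase_key(text):
    """Hypothetical helper: encode each word, then sort the codes."""
    words = re.findall(r"[A-Za-z]+", text)
    return " ".join(sorted(soundex(w) for w in words))


# Encoding whole strings: the two names do not match.
print(soundex("John Doe"), soundex("Doe Jonny"))        # J530 D250
# Encoding per word and sorting: they match.
print(phrase_key("John Doe"), phrase_key("Doe Jonny"))  # D000 J500 D000 J500
```

Sorting is only one possible choice. Comparing the two sets of codes, or scoring their overlap, would be alternatives with different trade-offs for repeated words.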
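With this enhancement, Flookup users can do the following:
+ Fuzzy matching and merging
+ Duplicate highlighting and removal
+ Extracting a list of unique values
... all based on how the strings, or parts of them, sound when pronounced in English.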
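To illustrate how duplicate detection and unique-value extraction can fall out of a sound-based key, here is a rough sketch. Everything in it is an assumption made for illustration: `word_code`, `phonetic_key` and `group_by_sound` are hypothetical names, the key is plain per-word Soundex with the codes sorted, and none of it reflects how Flookup works internally.

```python
import re
from collections import defaultdict

# Classic Soundex digit for each coded consonant.
DIGITS = {c: d
          for d, group in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1)
          for c in group}


def word_code(word):
    """4-character Soundex code for a single alphabetic word."""
    word = word.upper()
    code, prev = word[0], DIGITS.get(word[0])
    for ch in word[1:]:
        digit = DIGITS.get(ch)
        if digit and digit != prev:
            code += str(digit)
        if ch not in "HW":  # H and W do not reset the previous digit
            prev = digit
    return code.ljust(4, "0")[:4]


def phonetic_key(text):
    """Order-independent key: sorted per-word codes (an assumed scheme)."""
    return " ".join(sorted(word_code(w) for w in re.findall(r"[A-Za-z]+", text)))


def group_by_sound(values):
    """Bucket values that share the same phonetic key."""
    groups = defaultdict(list)
    for value in values:
        groups[phonetic_key(value)].append(value)
    return groups


names = ["John Doe", "Doe Jonny", "Jane Smith", "Smyth Jane", "Alice Brown"]
groups = group_by_sound(names)
duplicates = [g for g in groups.values() if len(g) > 1]
unique = [g[0] for g in groups.values()]
print(duplicates)  # [['John Doe', 'Doe Jonny'], ['Jane Smith', 'Smyth Jane']]
print(unique)      # ['John Doe', 'Jane Smith', 'Alice Brown']
```

One caveat with any Soundex-style key is that it is coarse: "John" and "Jane" both encode to J500, so a grouping like this would likely need a second similarity check before merging or deleting anything.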
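I would love feedback, especially from those into data cleaning (which I'm guessing is everyone).
If you are curious to give it a try, here is a quick start guide: [https://www.getflookup.com/get-started](https://www.getflookup.com/get-started)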