Hacker News (Chinese edition)

Hello HN.

I built Flookup Data Wrangler, a powerful Google Sheets add-on for data cleaning without writing a single line of code.

Traditional Soundex is designed for single words like "John" and "Jonny", which makes comparisons between such strings straightforward. However, typical Soundex output cannot handle multi-word or reordered string comparisons like "John Doe" vs "Doe Jonny", because it would produce inaccurate results.

To address this, I modified the Soundex algorithm to support multi-word and reordered strings by adding a helper function that re-encodes the output into a format that can be used for accurate text-to-text comparisons. The optimisation keeps overhead minimal, ensuring negligible impact on performance.

By leveraging this enhancement, Flookup users can do the following:

+ Fuzzy matching and merging
+ Duplicate highlighting and removal
+ Extracting a list of unique values

... all based on the sound the strings or parts of the strings make (as pronounced in English).

I would love feedback, especially from those into data cleaning (which I'm guessing is everyone).

If you are curious to give it a try, here is a quick start guide: https://www.getflookup.com/get-started
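The post does not say how the helper function re-encodes Soundex output, so the following is only a rough sketch of one possible approach, not Flookup's actual implementation. It assumes standard American Soundex applied to each word, with the resulting codes sorted so that word order no longer matters. The names `soundex` and `phonetic_key` are hypothetical and chosen for illustration.

```python
import re

# Standard American Soundex digit groups; vowels, h, w and y carry no digit.
_CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        _CODES[ch] = str(digit)


def soundex(word: str) -> str:
    """Return the 4-character Soundex code for a single word ('' if no letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        # Append a digit only when it differs from the previous letter's digit.
        if code and code != prev:
            encoded += code
        # h and w do not separate repeated digits; vowels do.
        if ch not in "hw":
            prev = code
    return (encoded + "000")[:4]


def phonetic_key(text: str) -> str:
    """Hypothetical helper: encode each word, then sort the codes so that
    reordered multi-word strings produce the same key."""
    codes = sorted(soundex(token) for token in re.findall(r"[A-Za-z]+", text))
    return " ".join(codes)


if __name__ == "__main__":
    print(phonetic_key("John Doe"))
    print(phonetic_key("Doe Jonny"))
    print(phonetic_key("John Smith"))
```

Under these assumptions, "John Doe" and "Doe Jonny" map to the same key, while "John Smith" maps to a different one. Sorting the per-word codes is one simple way to get order independence. Plain Soundex still collides on some distinct names (for example, "John" and "Jane" share a code), so a real tool would likely need further refinement.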

Show HN: I enhanced Soundex to correctly handle multi-word strings