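Show HN: I scraped Reddit to find the most controversial chef's knives
I wanted to quantify the endless "which knife should I buy" debates on r/chefknives, so I built a data analysis pipeline to get some real answers.
The project is a 5-phase system built with Node.js. It first uses Fuse.js for fast, typo-tolerant fuzzy matching against ~450 known brands and ~8,700 models. The remaining text is then passed to an LLM (via OpenRouter) to discover new, unknown entities and to run sentiment analysis on every mention. I ran it on over 1,000 threads, totaling more than 25,000 comments.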
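A rough illustration of the typo-tolerant matching phase, not the project's actual code. The pipeline uses Fuse.js; this sketch swaps in plain Levenshtein distance so it runs without installing anything. The brand sample, the `matchBrands` helper, and the 0.25 threshold are all assumptions made for the example.

```javascript
// Dependency-free stand-in for the fuzzy-matching phase. The post uses Fuse.js;
// this sketch swaps in plain Levenshtein distance so it runs without installs.

// Hypothetical sample of the known-brand list (the real one has ~450 entries).
const KNOWN_BRANDS = ["Tojiro", "Shun", "Dalstrong", "Victorinox"];

// Classic dynamic-programming edit distance, keeping a single row in memory.
function levenshtein(a, b) {
  const prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = prev[0]; // value diagonally up-left of the current cell
    prev[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = prev[j];
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      prev[j] = Math.min(prev[j] + 1, prev[j - 1] + 1, diag + cost);
      diag = tmp;
    }
  }
  return prev[b.length];
}

// Return every known brand that some word in the comment nearly spells.
// maxRelativeDistance is an assumed cutoff: edits allowed per brand character.
function matchBrands(comment, maxRelativeDistance = 0.25) {
  const words = comment.toLowerCase().match(/[a-z]+/g) || [];
  const found = new Set();
  for (const word of words) {
    for (const brand of KNOWN_BRANDS) {
      const target = brand.toLowerCase();
      if (levenshtein(word, target) / target.length <= maxRelativeDistance) {
        found.add(brand);
      }
    }
  }
  return [...found];
}

console.log(matchBrands("My Tojirro beats the Dalstrong I returned"));
// → [ 'Tojiro', 'Dalstrong' ]
```

Scaling the allowed edits by brand length keeps short names like "Shun" from matching unrelated four-letter words, while longer names still tolerate a typo or two.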
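A few interesting findings:
- The Underdog: Budget-friendly Tojiro has a massive 27-to-1 positive-to-negative mention ratio.
- The Controversy King: Shun is by far the most polarizing brand, sparking strong love/hate discussions (59 positive vs. 24 negative mentions).
- The Unloved: Dalstrong was one of the few brands to receive more negative mentions than positive.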
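For reference, the arithmetic behind figures like these can be sketched as below. The only real counts used are Shun's 59 and 24 from the list above. The `sentimentSummary` helper and the "negative share" measure are assumptions for illustration, since the post does not say how polarization was scored.

```javascript
// Positive-to-negative ratio and negative share for one brand's mention counts.
// Hypothetical helper: the post does not describe its exact scoring method.
function sentimentSummary(positive, negative) {
  const total = positive + negative;
  return {
    ratio: negative === 0 ? Infinity : positive / negative,
    negativeShare: total === 0 ? 0 : negative / total,
  };
}

// Shun's counts are taken from the post: 59 positive vs. 24 negative.
const shun = sentimentSummary(59, 24);
console.log(shun.ratio.toFixed(2));         // 2.46
console.log(shun.negativeShare.toFixed(2)); // 0.29
```

By this measure Shun sits at roughly 2.5-to-1, far below Tojiro's 27-to-1, and a brand with more negative than positive mentions would have a ratio below 1.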
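The system isn't perfect: the write-up openly documents a critical entity aggregation bug. The full technical architecture, results, and raw data are available.
I'm here to answer any questions!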
Blog Post (full story & visualizations): [https://new.knife.day/blog/we-analyzed-25000-reddit-comments-to-find-most-loved-and-hated-chef-knives](https://new.knife.day/blog/we-analyzed-25000-reddit-comments-to-find-most-loved-and-hated-chef-knives)
GitHub (technical breakdown & raw data): [https://github.com/pvijeh/reddit-named-entity-recognition/blob/main/chefknives-brands.md](https://github.com/pvijeh/reddit-named-entity-recognition/blob/main/chefknives-brands.md)
Original Reddit Discussion: [https://www.reddit.com/r/chefknives/comments/1o2p363/i_analyzed_over_1000_posts_on_rchefknives_heres/](https://www.reddit.com/r/chefknives/comments/1o2p363/i_analyzed_over_1000_posts_on_rchefknives_heres/)