Ask HN: How do you scale a targeted web crawler past 500M pages/day?
I've been reading up on crawler architecture. The two most useful sources I've found are the blog post "Crawling a billion web pages in just over 24 hours, in 2025" and the Mercator paper ("Mercator: A Scalable, Extensible Web Crawler").
Both of these, and most other material I've come across, focus on crawling the broad open web rather than a targeted set of domains. For product prices it's the latter that matters. Mercator calls out DNS resolution as a major bottleneck, for example, but when you're only hitting a few hundred domains that isn't really a concern.
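When the domain set is only a few hundred entries, the hard scheduling problem shifts from DNS to per-domain politeness: all per-domain state fits in memory, and the frontier just has to hand out URLs without hammering any one host. Here is a minimal sketch of that idea (class and method names are mine, not from either source), using a heap keyed by each domain's next-allowed request time:

```python
import heapq

class DomainFrontier:
    """URL frontier for a small, fixed set of target domains.

    With only a few hundred domains, per-domain state (crawl delay,
    queued URLs) fits comfortably in memory, so a single heap keyed
    by next-allowed-time per domain is enough.
    """

    def __init__(self, crawl_delays):
        # crawl_delays: {domain: minimum seconds between requests}
        self.delays = crawl_delays
        self.queues = {d: [] for d in crawl_delays}    # pending URLs per domain
        self.ready = [(0.0, d) for d in crawl_delays]  # (next_ok_ts, domain)
        heapq.heapify(self.ready)

    def add(self, domain, url):
        self.queues[domain].append(url)

    def next_url(self, now):
        """Return (domain, url) honoring per-domain delays, or None if
        no domain is eligible at time `now`."""
        while self.ready:
            ts, domain = self.ready[0]
            if ts > now:
                return None  # everything is still in its politeness window
            heapq.heappop(self.ready)
            # Re-arm the domain's timer whether or not it had work queued
            # (an idle domain just gets re-checked one delay later).
            heapq.heappush(self.ready, (now + self.delays[domain], domain))
            if self.queues[domain]:
                return domain, self.queues[domain].pop(0)
        return None
```

A real frontier would persist queues and use deques or disk-backed spools, but the heap-of-next-allowed-times structure is the core of per-domain rate limiting.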
The other gap is that both assume static HTML. For our use case we need a headless browser, and we also have to deal with Cloudflare and similar anti-bot systems.
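Headless rendering at this scale usually means pooling browser sessions rather than launching one per page, and recycling each session after some number of pages to bound memory growth and refresh its fingerprint. A rough sketch of that pattern, with a stub standing in for a real Playwright/Puppeteer session (all names here are hypothetical, not from any real library):

```python
import queue

class StubBrowser:
    """Stand-in for a headless browser session (Playwright/Puppeteer
    in practice). Sessions are expensive to start, so they are pooled
    and reused across pages."""

    def __init__(self):
        self.pages_rendered = 0

    def render(self, url):
        self.pages_rendered += 1
        return f"<html>rendered {url}</html>"  # placeholder for the real DOM

class BrowserPool:
    """Fixed-size pool of browser sessions, each recycled after
    max_pages_per_browser renders."""

    def __init__(self, size, max_pages_per_browser=100):
        self.max_pages = max_pages_per_browser
        self.pool = queue.Queue()
        for _ in range(size):
            self.pool.put(StubBrowser())

    def fetch(self, url):
        browser = self.pool.get()  # blocks until a session is free
        try:
            html = browser.render(url)
        finally:
            if browser.pages_rendered >= self.max_pages:
                browser = StubBrowser()  # recycle: fresh session, fresh state
            self.pool.put(browser)
        return html
```

In a real system the recycle step would close and relaunch an actual browser process; the pool-plus-recycle shape is what keeps a long-running fleet from leaking memory or accumulating a stale anti-bot profile.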
For product prices specifically, a lot of sites publish price feeds, which simplifies things, but plenty don't, and getting good coverage still requires scraping. Our current system does about 500M pages/day and we're looking to improve its performance.
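For a sense of scale, 500M pages/day works out to roughly 5,800 pages per second sustained; the concurrent-session figure below assumes a hypothetical 2-second average render time, purely for illustration:

```python
pages_per_day = 500_000_000
seconds_per_day = 86_400

pages_per_second = pages_per_day / seconds_per_day
# ~5787 pages/s sustained, around the clock

# If each headless-browser render averages ~2 s (an assumption, not a
# measured number), Little's law gives the concurrent sessions needed:
avg_render_seconds = 2.0
concurrent_sessions = pages_per_second * avg_render_seconds

print(round(pages_per_second), round(concurrent_sessions))
```

That is on the order of ten thousand simultaneously rendering browser sessions, which is why per-session overhead dominates the cost at this scale.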
Does anyone here have experience in this space, or know of articles or blog posts on scaling targeted (rather than broad) crawlers with headless browsers? Any pointers appreciated.