Ask HN: How do you scale a targeted web crawler past 500M pages/day?
I've been reading up on crawler architecture. The two most useful sources I've found are the blog post "Crawling a billion web pages in just over 24 hours, in 2025" and the Mercator paper ("Mercator: A Scalable, Extensible Web Crawler").
Both of these, and most other material I've come across, focus on crawling the broad open web rather than a targeted set of domains. For product prices it's the latter that matters. Mercator calls out DNS resolution as a major bottleneck, for example, but when you're only hitting a few hundred domains that isn't really a concern.
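When the domain set is only a few hundred entries, the hard scheduling problem shifts from DNS to per-domain politeness: all per-domain state fits in memory, and the frontier just has to hand out URLs without hammering any one host. Here is a minimal sketch of that idea (class and method names are mine, not from either source), using a heap keyed by each domain's next-allowed request time:

```python
import heapq

class DomainFrontier:
    """URL frontier for a small, fixed set of target domains.

    With only a few hundred domains, per-domain state (crawl delay,
    queued URLs) fits comfortably in memory, so a single heap keyed
    by next-allowed-time per domain is enough.
    """

    def __init__(self, crawl_delays):
        # crawl_delays: {domain: minimum seconds between requests}
        self.delays = crawl_delays
        self.queues = {d: [] for d in crawl_delays}    # pending URLs per domain
        self.ready = [(0.0, d) for d in crawl_delays]  # (next_ok_ts, domain)
        heapq.heapify(self.ready)

    def add(self, domain, url):
        self.queues[domain].append(url)

    def next_url(self, now):
        """Return (domain, url) honoring per-domain delays, or None if
        no domain is eligible at time `now`."""
        while self.ready:
            ts, domain = self.ready[0]
            if ts > now:
                return None  # everything is still in its politeness window
            heapq.heappop(self.ready)
            # Re-arm the domain's timer whether or not it had work queued
            # (an idle domain just gets re-checked one delay later).
            heapq.heappush(self.ready, (now + self.delays[domain], domain))
            if self.queues[domain]:
                return domain, self.queues[domain].pop(0)
        return None
```

A real frontier would persist queues and use deques or disk-backed spools, but the heap-of-next-allowed-times structure is the core of per-domain rate limiting.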
The other gap is that both assume static HTML. For our use case we need a headless browser, and we also have to deal with Cloudflare and similar anti-bot systems.
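Headless rendering at this scale usually means pooling browser sessions rather than launching one per page, and recycling each session after some number of pages to bound memory growth and refresh its fingerprint. A rough sketch of that pattern, with a stub standing in for a real Playwright/Puppeteer session (all names here are hypothetical, not from any real library):

```python
import queue

class StubBrowser:
    """Stand-in for a headless browser session (Playwright/Puppeteer
    in practice). Sessions are expensive to start, so they are pooled
    and reused across pages."""

    def __init__(self):
        self.pages_rendered = 0

    def render(self, url):
        self.pages_rendered += 1
        return f"<html>rendered {url}</html>"  # placeholder for the real DOM

class BrowserPool:
    """Fixed-size pool of browser sessions, each recycled after
    max_pages_per_browser renders."""

    def __init__(self, size, max_pages_per_browser=100):
        self.max_pages = max_pages_per_browser
        self.pool = queue.Queue()
        for _ in range(size):
            self.pool.put(StubBrowser())

    def fetch(self, url):
        browser = self.pool.get()  # blocks until a session is free
        try:
            html = browser.render(url)
        finally:
            if browser.pages_rendered >= self.max_pages:
                browser = StubBrowser()  # recycle: fresh session, fresh state
            self.pool.put(browser)
        return html
```

In a real system the recycle step would close and relaunch an actual browser process; the pool-plus-recycle shape is what keeps a long-running fleet from leaking memory or accumulating a stale anti-bot profile.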
For product prices specifically, a lot of sites publish price feeds, which simplifies things, but plenty don't, and getting good coverage still requires scraping. Our current system does about 500M pages/day and we're looking to improve its performance.
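For a sense of scale, 500M pages/day works out to roughly 5,800 pages per second sustained; the concurrent-session figure below assumes a hypothetical 2-second average render time, purely for illustration:

```python
pages_per_day = 500_000_000
seconds_per_day = 86_400

pages_per_second = pages_per_day / seconds_per_day
# ~5787 pages/s sustained, around the clock

# If each headless-browser render averages ~2 s (an assumption, not a
# measured number), Little's law gives the concurrent sessions needed:
avg_render_seconds = 2.0
concurrent_sessions = pages_per_second * avg_render_seconds

print(round(pages_per_second), round(concurrent_sessions))
```

That is on the order of ten thousand simultaneously rendering browser sessions, which is why per-session overhead dominates the cost at this scale.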
Does anyone here have experience in this space, or know of articles or blog posts on scaling targeted (rather than broad) crawlers with headless browsers? Any pointers appreciated.