Show HN: WhiskeySour – a 10x faster BeautifulSoup replacement

3 points by ayas_behera 2 days ago
The Problem

I've been using BeautifulSoup for some time. It's the standard for ease of use in Python scraping, but it almost always becomes the performance bottleneck when processing large-scale datasets.

Parsing complex or massive HTML trees in Python typically suffers from high memory allocation costs and the overhead of the Python object model during tree traversal. In my production scraping workloads, the parser was consuming more CPU cycles than the network I/O. lxml is fast, but it also uses a lot of memory on large documents and can have trouble with malformed HTML.

The Solution

I wanted to keep the API compatibility that makes BS4 great but eliminate the overhead that slows down high-volume pipelines. That's why I built WhiskeySour. It uses html5ever for parsing. And yes… I *vibe coded the whole thing*.

WhiskeySour is a drop-in replacement. You should be able to swap "from bs4 import BeautifulSoup" for "from whiskeysour import WhiskeySour" and see immediate speedups. Workflows that used to take more than 30 minutes might now take less than 5.

I have shared the detailed architecture of the library here: https://the-pro.github.io/whiskeySour/architecture/

Here is the benchmark report against bs4 with html.parser: https://the-pro.github.io/whiskeySour/bench-report/

Here is the link to the repo: https://github.com/the-pro/WhiskeySour

Why I'm sharing this

I'm looking for feedback from the community on two fronts:

1. Edge cases: If you have particularly messy or malformed HTML that BS4 handles well, I'd love to know if WhiskeySour encounters any regressions.

2. Benchmarks: If you are running high-volume parsers, I'd appreciate it if you could run a test on your own datasets and share the results.
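
For anyone who wants to try both requests at once, here is a rough sketch of a timing harness. It is not taken from the WhiskeySour repo. It assumes that `WhiskeySour` accepts the same `(markup, "html.parser")` constructor arguments as `BeautifulSoup` and exposes `find_all`, which is what "drop-in" suggests but which I have not confirmed. The helper names (`count_links_stdlib`, `make_soup_counter`, `best_time`, `run_benchmark`) are made up for illustration. Libraries that are not installed are skipped, and the link count per parser is reported next to the time so that differences on messy HTML show up as mismatched counts.

```python
import importlib
import time
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Stdlib-only baseline: counts <a> start tags while parsing."""

    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1


def count_links_stdlib(html):
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    return parser.links


def make_soup_counter(module_name, class_name):
    """Return a link-counting function for a BS4-style class, or None if unavailable."""
    try:
        cls = getattr(importlib.import_module(module_name), class_name)
    except (ImportError, AttributeError):
        return None
    # Assumption: the class takes (markup, "html.parser") and has find_all,
    # as bs4.BeautifulSoup does. Unverified for WhiskeySour.
    return lambda html: len(cls(html, "html.parser").find_all("a"))


def best_time(func, html, repeats=5):
    """Run func(html) several times; return (fastest wall time in seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(html)
        best = min(best, time.perf_counter() - start)
    return best, result


def run_benchmark(html, repeats=5):
    candidates = {
        "html.parser (stdlib)": count_links_stdlib,
        "bs4.BeautifulSoup": make_soup_counter("bs4", "BeautifulSoup"),
        "whiskeysour.WhiskeySour": make_soup_counter("whiskeysour", "WhiskeySour"),
    }
    report = {}
    for name, func in candidates.items():
        if func is None:  # library not installed, skip it
            continue
        seconds, links = best_time(func, html, repeats)
        report[name] = {"seconds": seconds, "links": links}
    return report


if __name__ == "__main__":
    # Synthetic sample with an unclosed <p>; replace with your own documents.
    sample = "<div><p>row <a href='/x'>x</a><p>unclosed <a href='/y'>y</a></div>" * 500
    for name, row in run_benchmark(sample).items():
        print(f"{name:26s} {row['seconds'] * 1000:8.2f} ms  links={row['links']}")
```

The synthetic sample is only a placeholder. Timings on real pages from your own pipeline will be far more useful, and a link count that differs between bs4 and WhiskeySour on the same input is exactly the kind of edge case worth reporting.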