Tadpole, a language for scraping, 0.2.0: complex control flow, stealth, and more

4 points by zachperkitny, about 1 month ago
Hello,

I posted a few weeks ago about my custom scraping language. It definitely got some traction, which was very exciting to see.

GitHub repo: [https://github.com/tadpolehq/tadpole](https://github.com/tadpolehq/tadpole)
Docs: [https://tadpolehq.com/](https://tadpolehq.com/)

Over the past two weeks, I've focused on introducing specific stealth actions, more complex control flow actions, and a variety of evaluators for cleaning data.

Here is an example that scrapes `books.toscrape.com`:

```plaintext
main {
  new_page {
    goto "https://books.toscrape.com/"
    loop {
      do {
        // One "books[]" entry per product card on the current page
        $$ article.product_pod {
          extract "books[]" {
            title { $ "h3 a"; attr title }
            rating { $ ".star-rating"; attr "class"; extract "star-rating (One|Two|Three|Four|Five)" caseInsensitive=#true; func "(v) => ({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}[v.toLowerCase()] || null)" }
            price { $ "p.price_color"; text; as_float }
            in_stock { $ "p.availability"; text; matches "In stock" caseInsensitive=#true }
          }
        }
      }
      // Keep looping while the page has a "next" link
      while { $ "li.next" }
      // Move on to the next page
      next {
        $ "li.next a" { click }
        wait_until
      }
    }
  }
}
```

I've introduced actions like `apply_identity` to override User-Agent headers and User-Agent metadata. Here is an example module that defines several identities to choose from:

```plaintext
module stealth {
  // Apple M2 Pro
  action apply_apple_m2 {
    apply_identity mac
    set_webgl_vendor "Apple Inc." "Apple M2"
    set_device_memory 16
    set_hardware_concurrency 8
    set_viewport 1440 900 deviceScaleFactor=2
  }

  // Windows Desktop
  action apply_windows_16_8 {
    apply_identity windows
    set_webgl_vendor "Google Inc. (Intel)" "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0)"
    set_device_memory 16
    set_hardware_concurrency 8
    set_viewport 1920 1080
  }

  // Windows Budget Laptop
  action apply_windows_8_4 {
    apply_identity windows
    set_webgl_vendor "Google Inc. (Intel)" "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0)"
    set_device_memory 8
    set_hardware_concurrency 4
    set_viewport 1366 768
  }
}
```

The full release changelog is available here: [https://github.com/tadpolehq/tadpole/releases/](https://github.com/tadpolehq/tadpole/releases/)

My goals for the next release, 0.3.0, are to focus heavily on plugins, distributed execution through message queues, Redis support for crawling, and static parsing as opposed to running exclusively over CDP/Chrome.

I will keep trying to hold my release cadence at every two weeks!
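For readers who don't know the DSL, here is a rough Python sketch of what the `rating` and `in_stock` evaluator chains in the books example appear to compute for a single product. It is only an illustration of the data cleaning, not Tadpole's implementation. The function names `parse_rating` and `parse_in_stock` are made up for this sketch, and returning `None` when the class string does not match is an assumption.

```python
import re
from typing import Optional

# Hypothetical stand-ins for the evaluator chain in the example above.
RATING_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
RATING_RE = re.compile(r"star-rating (One|Two|Three|Four|Five)", re.IGNORECASE)


def parse_rating(class_attr: str) -> Optional[int]:
    """Turn a class string such as 'star-rating Three' into 3."""
    match = RATING_RE.search(class_attr)
    if match is None:
        # Assumption: an unmatched class string yields no rating.
        return None
    return RATING_WORDS.get(match.group(1).lower())


def parse_in_stock(availability_text: str) -> bool:
    """Case-insensitive check for 'In stock' anywhere in the text."""
    return re.search(r"In stock", availability_text, re.IGNORECASE) is not None
```

Calling `parse_rating("star-rating Three")` gives `3`, and `parse_in_stock("Out of stock")` gives `False`.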