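Tadpole scraping language 0.2.0: complex control flow, stealth, and more
Hello,
I posted a few weeks ago about my custom scraping language. It got some traction, which was very exciting to see.
GitHub repo: [https://github.com/tadpolehq/tadpole](https://github.com/tadpolehq/tadpole)
Docs: [https://tadpolehq.com/](https://tadpolehq.com/)
Over the past two weeks, I've focused on introducing specific stealth actions, more complex control flow actions, and a variety of evaluators for cleaning data.
Here is an example that scrapes `books.toscrape.com`: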
```plaintext
main {
  new_page {
    goto "https://books.toscrape.com/"
    loop {
      do {
        $$ article.product_pod {
          extract "books[]" {
            title { $ "h3 a"; attr title }
            rating {
              $ ".star-rating";
              attr "class";
              extract "star-rating (One|Two|Three|Four|Five)" caseInsensitive=#true;
              func "(v) => ({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}[v.toLowerCase()] || null)"
            }
            price { $ "p.price_color"; text; as_float }
            in_stock { $ "p.availability"; text; matches "In stock" caseInsensitive=#true }
          }
        }
      }
      while { $ "li.next" }
      next {
        $ "li.next a" { click }
        wait_until
      }
    }
  }
}
```
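To make the data flow concrete, here is a rough Python sketch of what the example appears to compute. It is not how Tadpole is implemented. The function names, the `pages` structure, and the exact behaviour of `as_float` and `matches` are assumptions made for illustration; only the regex and the word-to-number mapping come from the example itself. The loop is modelled as a do-while: extract the current page, check for a next page, then advance.

```python
import re
from typing import Dict, List, Optional

# Word-to-number mapping copied from the `func` evaluator in the example.
RATING_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def parse_rating(class_attr: str) -> Optional[int]:
    """Mirror the `rating` chain: regex capture, then lookup by lowercase word.

    Returning None when the regex does not match is an assumption.
    """
    match = re.search(r"star-rating (One|Two|Three|Four|Five)", class_attr, re.IGNORECASE)
    if match is None:
        return None
    return RATING_WORDS.get(match.group(1).lower())


def parse_price(text: str) -> Optional[float]:
    """Assumed behaviour for `as_float`: use the first decimal number in the text."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group(0)) if match else None


def parse_in_stock(text: str) -> bool:
    """Assumed behaviour for `matches`: a case-insensitive substring search."""
    return re.search(r"in stock", text, re.IGNORECASE) is not None


def scrape(pages: List[List[Dict[str, str]]]) -> List[dict]:
    """Hypothetical stand-in for `loop { do ... while ... next ... }`.

    `pages` must be non-empty; each page is a list of raw product cards.
    """
    books: List[dict] = []
    index = 0
    while True:
        # do: extract every product card on the current page
        for card in pages[index]:
            books.append({
                "title": card["title"],
                "rating": parse_rating(card["class"]),
                "price": parse_price(card["price"]),
                "in_stock": parse_in_stock(card["availability"]),
            })
        # while: stop when there is no next page (no `li.next` element)
        if index + 1 >= len(pages):
            break
        # next: move to the following page (click plus wait in the DSL)
        index += 1
    return books
```

Calling `scrape` with a list of pages, each holding dictionaries with `title`, `class`, `price`, and `availability` strings, returns one cleaned record per product card, in page order.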
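I've introduced actions like `apply_identity` to override the User-Agent header and User-Agent metadata. Here is an example module that selectively creates different identities: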
```plaintext
module stealth {
  // Apple M2 Pro
  action apply_apple_m2 {
    apply_identity mac
    set_webgl_vendor "Apple Inc." "Apple M2"
    set_device_memory 16
    set_hardware_concurrency 8
    set_viewport 1440 900 deviceScaleFactor=2
  }
  // Windows Desktop
  action apply_windows_16_8 {
    apply_identity windows
    set_webgl_vendor "Google Inc. (Intel)" "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0)"
    set_device_memory 16
    set_hardware_concurrency 8
    set_viewport 1920 1080
  }
  // Windows Budget Laptop
  action apply_windows_8_4 {
    apply_identity windows
    set_webgl_vendor "Google Inc. (Intel)" "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0)"
    set_device_memory 8
    set_hardware_concurrency 4
    set_viewport 1366 768
  }
}
```
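One way to think about the module is as a small table of internally consistent fingerprint profiles, one of which gets chosen per session. The Python sketch below models that selection outside the DSL. The `Identity` class, its field names, and `pick_identity` are hypothetical, and the post does not show how Tadpole itself invokes these actions; only the values are taken from the module.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple

# Renderer string shared by both Windows profiles in the module.
INTEL_RENDERER = "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0)"


@dataclass(frozen=True)
class Identity:
    """One fingerprint profile; field names are illustrative only."""
    action: str                    # action name in the `stealth` module
    platform: str                  # argument given to apply_identity
    webgl_vendor: str
    webgl_renderer: str
    device_memory: int
    hardware_concurrency: int
    viewport: Tuple[int, int]
    device_scale_factor: int = 1   # assumption: 1 when the module does not set it


PROFILES = [
    Identity("apply_apple_m2", "mac", "Apple Inc.", "Apple M2", 16, 8, (1440, 900), 2),
    Identity("apply_windows_16_8", "windows", "Google Inc. (Intel)", INTEL_RENDERER, 16, 8, (1920, 1080)),
    Identity("apply_windows_8_4", "windows", "Google Inc. (Intel)", INTEL_RENDERER, 8, 4, (1366, 768)),
]


def pick_identity(rng: random.Random, platform: Optional[str] = None) -> Identity:
    """Pick one profile at random, optionally restricted to a platform."""
    candidates = [p for p in PROFILES if platform is None or p.platform == platform]
    if not candidates:
        raise ValueError(f"no profile for platform {platform!r}")
    return rng.choice(candidates)
```

Keeping each profile as a single unit means the platform, WebGL strings, memory, core count, and viewport are always applied together, which is the point of grouping them into one action per identity.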
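The full release changelog is available here: [https://github.com/tadpolehq/tadpole/releases/](https://github.com/tadpolehq/tadpole/releases/)
My goal for the next release, 0.3.0, is to focus heavily on plugins, distributed execution through message queues, Redis support for crawling, and static parsing as opposed to working exclusively over CDP/Chrome.
I will try to keep my release cadence at every two weeks!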