LLM botnets: Are companies using botnets to scrape content?

Posted by flyriver 9 months ago
I have a website with several million pages of articles, generated by Llama, GPT, and Gemini. As you can imagine, there is a lot of scraping happening. Generally speaking, I allow the crawlers that respect robots.txt and identify themselves as bots to go wild; I figure the site might get more exposure if it is "in" the LLMs. Otherwise, I try to block them.

Over time, and especially recently, I have seen thousands of diverse IP addresses scraping the site. They use random or varying user-agents. I originally blocked Brazilian /16s, since most of the traffic appeared to come from there, but over the past few weeks the IPs have come from everywhere. Each IP makes only a few requests, trying to stay under the radar. Right now, I have some scripts set up to block and log the IPs as they come in.

I am blocking between 50 and 100 unique IP addresses per minute, and this is after I already blocked the main Chinese LLM scrapers and several /16s. Few of the IPs belong to obvious providers; many just seem to be home users. Many are from countries that do not have the money to build LLMs, and there are even wireless phone company IPs.

None of the requests are particularly malicious. They are just downloading pages.

Am I missing something? Is there a new botnet scraping the web? A quick grep through my logs shows I have blocked 15,000 requests in the past 90 minutes, but only 1,300 of them are repeats from IPs already on my block list. Yesterday, I blocked 220,000 requests, and only 13,000 of them were repeats.
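For illustration, checking an address against blocked /16-style ranges is only a few lines with Python's ipaddress module. The networks below are reserved documentation ranges standing in for the real ones, not the networks I actually blocked:

```python
# Check a client IP against a list of blocked CIDR ranges.
# The ranges here are reserved documentation networks (placeholders).
import ipaddress

BLOCKED_NETS = [
    ipaddress.ip_network("198.51.100.0/24"),  # placeholder, not a real /16
    ipaddress.ip_network("203.0.113.0/24"),   # placeholder, not a real /16
]

def in_blocked_range(ip: str) -> bool:
    """True if the IP falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETS)
```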
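And a minimal sketch of the kind of block-and-log script I mean, not my actual one: follow the access log, let crawlers that identify themselves through, and feed everything else to a firewall set. The log path, the ipset name (which would need to be created beforehand), and the short crawler token list are all placeholder assumptions:

```python
# Sketch of a block-and-log loop. Assumptions: nginx combined log format,
# a pre-created ipset named "llm_blocklist", and an illustrative (far from
# complete) list of self-identifying crawler tokens.
import subprocess
import time

LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path
KNOWN_BOT_TOKENS = ("Googlebot", "bingbot", "GPTBot", "ClaudeBot")

already_blocked: set[str] = set()

def block_and_log(ip: str) -> None:
    # "-exist" makes re-adding an existing entry a no-op instead of an error.
    subprocess.run(["ipset", "add", "llm_blocklist", ip, "-exist"], check=False)
    already_blocked.add(ip)
    print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} blocked {ip}")

with open(LOG_PATH) as log:
    log.seek(0, 2)  # start at the end of the file, like tail -f
    while True:
        line = log.readline()
        if not line:
            time.sleep(0.5)
            continue
        fields = line.split()
        if not fields:
            continue
        ip = fields[0]  # client IP is the first field in combined format
        quoted = line.split('"')
        ua = quoted[5] if len(quoted) > 5 else ""  # user-agent field
        if ip not in already_blocked and not any(t in ua for t in KNOWN_BOT_TOKENS):
            block_and_log(ip)
```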
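The "quick grep" arithmetic, reconstructed as a sketch: count the blocked requests and how many of them are repeats from IPs already seen, assuming a hypothetical blocked-requests log with the client IP as the first whitespace-separated field:

```python
# Count total blocked requests and repeats (any request beyond an IP's
# first appearance), one blocked request per log line, IP first.
from collections import Counter

def repeat_stats(path: str) -> tuple[int, int]:
    hits = Counter()
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields:
                hits[fields[0]] += 1
    total = sum(hits.values())
    repeats = total - len(hits)  # everything after each IP's first request
    return total, repeats

total, repeats = repeat_stats("/var/log/blocked.log")  # hypothetical path
print(f"{total} blocked requests, {repeats} from repeat IPs")
```

On the numbers above, that ratio (15,000 blocked but only 1,300 repeats, i.e. roughly 13,700 distinct IPs in 90 minutes) is what makes this look like a botnet rather than a handful of data centers.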