Butlerian Jihad
This page collects my blog posts on the topic of fighting off spam bots, search engine spiders and other non-humans wasting the precious resources we have on Earth.
The Wikimedia Foundation, stewards of the finest projects on the web, have written about the hammering their servers are taking from the scraping bots that feed large language models.
Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.
Drew DeVault puts it more bluntly in Please stop externalizing your costs directly into my face:
Over the past few months, instead of working on our priorities at SourceHut, I have spent anywhere from 20-100% of my time in any given week mitigating hyper-aggressive LLM crawlers at scale.
And no, a robots.txt file doesn’t help.
If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned.
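To be clear, expressing that refusal couldn't be simpler. Something like this in a site's robots.txt, listing the documented user agent strings for a few of the big AI crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended), is all the standard asks for, but it depends entirely on the crawlers choosing to honour it:

    User-agent: GPTBot
    User-agent: ClaudeBot
    User-agent: CCBot
    User-agent: Google-Extended
    Disallow: /

That's the whole protocol: a plain text file and a request for good faith. There's no enforcement mechanism, which is exactly the problem Drew describes.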
Free and open source projects are particularly vulnerable. FOSS infrastructure is under attack by AI companies:
LLM scrapers are taking down FOSS projects’ infrastructure, and it’s getting worse.
You try to do the right thing by making knowledge and tools freely available. This is how you get repaid. AI bots are destroying Open Access:
There’s a war going on on the Internet. AI companies with billions to burn are hard at work destroying the websites of libraries, archives, non-profit organizations, and scholarly publishers, anyone who is working to make quality information universally available on the internet.
My own experience with The Session bears this out.
Ars Technica has a piece on this: Open source devs say AI crawlers dominate traffic, forcing blocks on entire countries.
So does MIT Technology Review: AI crawler wars threaten to make the web more closed for everyone.
When we talk about the unfair practices and harm done by training large language models, we usually talk about it in the past tense: how they were trained on other people’s creative work without permission. But this is an ongoing problem that’s just getting worse.
The worst of the internet is continuously attacking the best of the internet. This is a distributed denial of service attack on the good parts of the World Wide Web.
If you’re using the products powered by these attacks, you’re part of the problem. Don’t pretend it’s cute to ask ChatGPT for something. Don’t pretend it’s somehow being technologically open-minded to continuously search for nails to hit with the latest “AI” hammers.
If you’re going to use generative tools powered by large language models, don’t pretend you don’t know how your sausage is made.
As it currently stands, both the rapid growth of AI-generated content overwhelming online spaces and aggressive web-crawling practices by AI firms threaten the sustainability of essential online resources. The current approach taken by some large AI companies—extracting vast amounts of data from open-source projects without clear consent or compensation—risks severely damaging the very digital ecosystem on which these AI models depend.
AI companies with billions to burn are hard at work destroying the websites of libraries, archives, non-profit organizations, and scholarly publishers, anyone who is working to make quality information universally available on the internet.
More on how large language model bots are DDOSing the web:
LLM scrapers are taking down FOSS projects’ infrastructure, and it’s getting worse.
Over the past few months, instead of working on our priorities at SourceHut, I have spent anywhere from 20-100% of my time in any given week mitigating hyper-aggressive LLM crawlers at scale.
This matches my experience with The Session. In fact, while I had this article open in a tab, I had to go deal with a tsunami of large language model bots. It’s really fucking depressing.
Please stop legitimizing LLMs or AI image generators or GitHub Copilot or any of this garbage. I am begging you to stop using them, stop talking about them, stop making new ones, just stop. If blasting CO2 into the air and ruining all of our freshwater and traumatizing cheap laborers and making every sysadmin you know miserable and ripping off code and books and art at scale and ruining our fucking democracy isn’t enough for you to leave this shit alone, what is?