AI bots and data hoarding
With a majority of online content being produced by AI, where does that leave digital preservation?
Tags: ai archival
Posted on: 2026-06-04
For the first time in history, bots outnumber humans online. Cloudflare CEO Matthew Prince confirmed it just this week: agentic traffic grew so fast that the crossover he had predicted for 2027 arrived over a year early. Depending on which method you use, bots now account for up to 57% of all web traffic, with AI-driven crawlers alone growing 187% in 2025 while human traffic grew just 3%.
The common reaction to this is alarm. The less common one is more interesting: so what if we just accept it? Do we have a choice?
Let’s be clear about what the web looks like in 2026. AI is just the latest of many changes that have shaped how web content is produced over the past few decades. First it was monetization, then SEO content, then algorithmic feeds, and now AI agents are simply automating the whole process. A typical user sees a bleak picture of the web today: search results are increasingly AI-generated summaries that never link back to sources, social feeds are curated by algorithms designed to maximize engagement metrics rather than inform people, and most of the content produced daily duplicates what others have already posted, with the sole goal of making money through ads.
I can easily see a world where most of the online traffic is AI agents producing content that ends up being consumed by other AI agents, with no human in the loop. But what does it mean for digital preservation?
Archivists have always relied on proxies for cultural significance: citation counts, inbound links, traffic volume. Those signals are now being poisoned. When a web page receives 10,000 visits in a month, how many were humans? When an article gets linked across 50 other sites, how many of those sites were written by people? The metadata we use to decide what matters, what gets archived, and what gets prioritized is becoming unreliable in ways that will only get worse.
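To make the "how many were humans?" question concrete, here is a rough Python sketch that estimates the human share of a batch of visits from their user-agent strings. The marker list and function names are my own illustration, not an established method, and user agents are trivially spoofed, so treat the result as an optimistic upper bound on human traffic at best.

```python
# Hypothetical substrings that often appear in self-identified bot user agents.
# Real bot detection needs far more than this (behavior, IP reputation, etc.).
BOT_MARKERS = ("bot", "crawler", "spider")


def is_probable_bot(user_agent: str) -> bool:
    """Return True if the user agent openly identifies itself as a bot."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def human_share(user_agents: list[str]) -> float:
    """Fraction of visits whose user agent does not look like a bot."""
    if not user_agents:
        return 0.0
    humans = sum(1 for ua in user_agents if not is_probable_bot(ua))
    return humans / len(user_agents)


# Made-up sample of three visits: one browser, two self-identified bots.
sample_visits = [
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0",
    "GPTBot/1.0",
    "Mozilla/5.0 (compatible; Googlebot/2.1)",
]
share = human_share(sample_visits)
```

Even this toy version shows the problem: only bots that announce themselves get counted, so the real human share of those 10,000 visits is likely lower than any log-based estimate suggests.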
There is also a content provenance problem that will haunt future historians. The Wayback Machine and similar services crawl without filtering for authenticity. The web being archived today contains an unknown and growing percentage of AI-generated text, hallucinated statistics, and synthetic events. Someone researching the early 2020s fifty years from now will face a signal-to-noise problem unlike anything historians have previously had to manage.
Then there is the economic collapse of human content creation. If your articles are ingested by AI crawlers rather than read by paying humans, the referral traffic disappears. Ad revenue collapses. Writers, journalists, and independent researchers stop producing. The archive of genuine human cultural output may have a surprisingly sharp inflection point somewhere in the late 2020s, after which the record becomes dominated by synthetic content. That is a preservation crisis happening in slow motion.
Finally, there is the walled garden problem: sites are being hit by AI crawlers harder than ever before, often with no respect for robots.txt or the site's terms of service. Sites are responding by walling themselves off behind paid accounts or Cloudflare challenges. Content disappearing behind authentication is content that archivists cannot reach. The Internet Archive spoke about this problem just a few weeks ago. The very tools that threaten creators are driving actual human content into places where preservation becomes impossible.
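For readers who haven't looked at robots.txt up close, here is a small sketch of what respecting it looks like, using Python's standard `urllib.robotparser`. The robots.txt content, the bot names `ExampleAIBot` and `ArchiveBot`, and the example.org URLs are all made up for illustration. Compliance is entirely voluntary: nothing in the protocol stops a crawler from skipping this check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler everywhere,
# and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved crawler asks before every fetch.
ai_bot_allowed = parser.can_fetch("ExampleAIBot", "https://example.org/articles/1")
archiver_allowed = parser.can_fetch("ArchiveBot", "https://example.org/articles/1")
archiver_private = parser.can_fetch("ArchiveBot", "https://example.org/private/x")
```

Under these rules the AI bot is refused everywhere, while the archiver may fetch the article but not the private path. A crawler that ignores the file leaves site owners with little choice but the blunter tools described above.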
The bottom line is that the bots are not going away. Agent-to-agent traffic will continue to grow. The web will increasingly look like infrastructure rather than culture: APIs calling APIs, structured data flowing between systems, with human participation as a niche use case rather than the primary one. AI can be a wonderful tool, but when it's abused for greed and profit, it causes a lot of collateral problems. One result we've recently seen is the rise of the Small Web, something I've spoken about before and recommend you check out if you enjoy the old-style web environment.
I don't have a perfect solution, especially when it comes to preservation of online content. I think the community by and large has decided that AI content is for the most part not worth archiving, and focusing on high-quality human content will become more of a challenge as time goes on. When deciding which archives to add to DataHoarding.org, for example, I focus on what seems to have value, and that tends to be human-focused collections.
The internet may increasingly be by machines, for machines, but I think it just means that curated human archives will become more worthwhile for future generations.