Why I became a digital archivist in 2025
Some thoughts about the current state of things and what led me to create DataHoarding.org
Posted on: 2025-07-14

I've always enjoyed libraries, museums and art galleries. I'm one of those people who consider an afternoon spent in an ancient ruin to be a good time. But tech has always been my main passion and professional career, so I certainly did not see myself becoming an amateur digital archivist this year. Yet I did, and so I thought I would write a quick post on what happened and why.
I've always been well aware of the fragility of digital data. I've been on the Internet since the mid-1990s, I've been a Wikipedia editor since the early 2000s, and I've hosted countless web sites and blogs and participated in forums, most of which no longer exist. There have been many studies showing how quickly online content disappears, such as the Pew Research study that found 38% of web pages that existed in 2013 were no longer accessible a decade later.
So when it comes to my personal data, I've always been quite diligent with backups and archival. And I've been pleased to see the many online efforts to archive public online content, from Wikipedia, to the Internet Archive, and others. But ever since the new US administration came into power, I've really felt that digital data isn't just a passive victim of link rot anymore, but the target of active attempts by people in power to systematically erase history. From the de-funding of crucial scientific institutions, to the blatant erasure of web sites and datasets, to pressure being put on US companies to follow the party line, I felt like the US was acting more like China with its Great Firewall than the beacon of democracy it always claims to be.
So I decided that I needed to do more, and back in January I started to investigate what those efforts could be. I quickly learned about digital archiving through many great organizations like the Digital Preservation Coalition, the International Council on Archives and the International Internet Preservation Consortium. I went through a lot of learning material, and explored a lot of existing archives.
Through all my research, I came to 2 conclusions:
- First, there are many archival sites out there, and while I was concerned about data erasure, so were many other people. A lot of individuals and groups stood up and helped archive at-risk datasets, but much of that important data is being stored in the US. The Internet Archive in particular, which is to this day the largest and most used online archive by far, is based in the US, and so are many other sites. So I thought it was crucial that more emphasis be put on international efforts.
- Second, while a lot of people are downloading data, very little is being done when it comes to curation and indexing. Almost every day I go through the /r/DataHoarder subreddit and see people mention a specific site or dataset that's at risk or about to disappear, only for one of two things to happen. Either someone suggests that a copy be dumped into the endless void that is the Internet Archive, or someone mentions having backed the site up to their own personal archive. Then everyone is satisfied, and the world keeps turning. But what happens next? How are people going to access this data in a few years? There's a serious lack of indexing of resources and archives, and finding specific types of data can be a huge challenge, even on the Internet Archive, never mind across all the other archives around the web.
That's when the idea for DataHoarding.org became clear. I decided to build the world's largest index of resources and archives related to data hoarding, web archival and digital preservation. The goal is not only to hoard data, but to curate and index it as well. This is where my tech background came into play, since I knew exactly what would be needed to build this type of site from a tech standpoint.
The site contains 2 main indexes:
- Resources - This is a list of data hoarding resources if you want to get started and help archival teams, or simply back up web content for your own personal use.
- Archives - On this page you will find links to data archives from various countries. These archives contain data that was gathered and saved for the public good.
The deployment stack is not too complicated. The site is completely static: plain HTML pages served from several web servers, running on a Proxmox cluster behind the Cloudflare CDN. The pages are generated automatically by a preprocessing pipeline from content managed in a content management system called Directus. All of which is to say that it works well for my use case.
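For readers curious about what "static pages generated from a CMS" can look like in practice, here is a minimal sketch in Python. It is not the site's actual pipeline: the sample records, the field names (name, url, country) and the archives.html file name are all made up for illustration, and a real build would export the entries from the CMS instead of hard-coding them.

```python
from html import escape
from pathlib import Path

# Hypothetical sample records; a real pipeline would export these from the CMS.
ARCHIVES = [
    {"name": "Maps & Charts", "url": "https://maps.example/?a=1&b=2", "country": "France"},
    {"name": "Example Archive", "url": "https://archive.example/", "country": "Canada"},
]


def render_index(title, entries):
    """Return a complete static HTML page listing the entries, sorted by name."""
    ordered = sorted(entries, key=lambda e: e["name"].lower())
    items = "\n".join(
        '    <li><a href="{url}">{name}</a> ({country})</li>'.format(
            # Escape every field so stray characters cannot break the markup.
            url=escape(e["url"]),
            name=escape(e["name"]),
            country=escape(e["country"]),
        )
        for e in ordered
    )
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        f"<head><meta charset=\"utf-8\"><title>{escape(title)}</title></head>\n"
        "<body>\n"
        f"  <h1>{escape(title)}</h1>\n"
        "  <ul>\n"
        f"{items}\n"
        "  </ul>\n"
        "</body>\n"
        "</html>\n"
    )


def build(out_dir, entries):
    """Write the rendered page into out_dir and return the path of the file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    page = out / "archives.html"
    page.write_text(render_index("Archives", entries), encoding="utf-8")
    return page
```

Calling build("public", ARCHIVES) would write public/archives.html, which any plain web server can then serve as-is. The appeal of this kind of setup is that nothing dynamic runs at request time, so the pages are cheap to host and easy to mirror.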
Finding which sites to add proved to be a bigger challenge. Before collecting anything, I had to decide on the criteria for inclusion. First, the sites must have a significant collection of items. Second, these items must be available to the public without having to jump through significant hoops (e.g. needing a local library card) or pay a subscription fee.
This is a volunteer effort, so compiling all these resources and archives is a slow and steady process. There are 3 main ways I find items to add to the indexes:
- One way is through word of mouth, by simply watching content from various similar organizations, or reading subreddits and forums. I find lots of useful archives that way.
- A second way is through email tips from users, sent to the site's email address.
- And finally, I also built an AI agent that scours the Internet for relevant archives. It's built on the RelevanceAI platform and every day I instruct it to find additional archives about music, software, climate change, etc.
As of today, the site receives over 500 unique visitors every day, and I have managed to collect over 200 resources and archives. And this is just the beginning. I'm hoping to keep expanding it, a few sites each day, and spread the word to others who believe like I do that digital preservation is important, and who think that important data about science, climate change, or even which games we're allowed to play, shouldn't be in the hands of a few rich politicians and corporations.