Why I became a digital archivist in 2025

Some thoughts about the current state of things and what led me to create DataHoarding.org

Posted on: 2025-07-14

I've always enjoyed libraries, museums and art galleries. I'm one of those people who consider an afternoon spent in an ancient ruin to be a good time. But tech has always been my main passion and professional career, so I certainly did not see myself becoming an amateur digital archivist this year. Yet I did, and so I thought I would write a quick post on what happened and why.

I've always been well aware of the fragility of digital data. I've been on the Internet since the mid-1990s, I've been a Wikipedia editor since the early 2000s, and I've hosted countless web sites and blogs and participated in many forums, most of which no longer exist. There have been many studies showing how quickly online content disappears, such as the Pew Research study that found 38% of web pages that existed in 2013 were no longer accessible a decade later.

So when it comes to my personal data, I've always been quite diligent with backups and archival. And I've been pleased to see the many online efforts to archive public online content, from Wikipedia to the Internet Archive and others. But ever since the new US administration came into power, I've felt that digital data is no longer just a passive victim of link rot, but the target of active attempts by people in power to systematically erase history. From the de-funding of crucial scientific institutions, to the blatant erasure of web sites and datasets, to pressure being put on US companies to follow the party line, I felt like the US was acting more like China with its Great Firewall than the beacon of democracy it always claims to be.

So I decided that I needed to do more, and back in January I started to investigate what I could actually do. I quickly learned about digital archiving through many great organizations like the Digital Preservation Coalition, the International Council on Archives and the International Internet Preservation Consortium. I went through a lot of learning material, and explored a lot of existing archives.

Through all my research, I came to 2 conclusions:

That's when the idea for DataHoarding.org became clear. I decided to build the world's largest index of resources and archives related to data hoarding, web archival and digital preservation. The goal is not only to hoard data, but to curate and index it as well. This is where my tech background came into play, since I knew exactly what would be needed to build this type of site from a tech standpoint.

The site contains 2 main indexes:

The deployment stack is not too complicated. The site is completely static, using HTML pages on several web servers, running on a Proxmox cluster behind a Cloudflare CDN. The pages are created automatically through a content management system called Directus using a preprocessing pipeline. All of which is to say that it works well for my use case.
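
For the curious, here is a rough idea of what the "records in, static HTML out" step of such a pipeline can look like. This is only a minimal sketch in Python, not the actual DataHoarding.org code: the page template, the record fields (name, url) and the function names render_index and build_site are all made up for illustration, and it assumes the records have already been exported from the CMS as plain dictionaries.

```python
import html
from pathlib import Path
from string import Template

# Hypothetical page template; the real site's markup is not shown in this post.
PAGE = Template("""<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8"><title>$title</title></head>
<body>
<h1>$title</h1>
<ul>
$items
</ul>
</body>
</html>
""")


def render_index(title, records):
    """Render one static index page from a list of record dicts.

    Each record is assumed to have a "name" and a "url" key.
    """
    items = "\n".join(
        '<li><a href="{url}">{name}</a></li>'.format(
            # Escape everything that came from the CMS before it hits the page.
            url=html.escape(r["url"], quote=True),
            name=html.escape(r["name"]),
        )
        # Sort case-insensitively so the index reads alphabetically.
        for r in sorted(records, key=lambda r: r["name"].lower())
    )
    return PAGE.substitute(title=html.escape(title), items=items)


def build_site(out_dir, indexes):
    """Write one HTML file per index; `indexes` maps a slug to (title, records)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for slug, (title, records) in indexes.items():
        path = out / f"{slug}.html"
        path.write_text(render_index(title, records), encoding="utf-8")
        written.append(path)
    return written
```

Calling build_site("public", {"archives": ("Archives", records)}) would write public/archives.html, which any plain web server can then serve as-is. The appeal of this approach is that nothing dynamic runs at request time, so the servers behind the CDN only ever hand out files.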

Finding which sites to add proved to be a bigger challenge. Before collecting anything, I had to decide on the criteria for inclusion. First, a site must have a significant collection of items. Second, these items must be available to the public without jumping through significant hoops (e.g. needing a local library card) or paying a subscription fee.

This is a volunteer effort, so compiling all these resources and archives is slow and steady work. There are 3 main ways I find items to add to the indexes:

As of today, the site receives over 500 unique visitors every day, and I've managed to collect over 200 resources and archives. And this is just the beginning. I'm hoping to keep expanding it, a few sites each day, and to spread the word to others who believe, like I do, that digital preservation is important, and who think that important data about science, climate change, or even which games we're allowed to play shouldn't be in the hands of a few rich politicians and corporations.