How to forget everything

11 May 2021

Recently I was studying the list of file formats recommended by the Library of Congress for digital data: that is, formats that are likely (though not guaranteed) to be well supported in the future.

One of the many lurking horrors of the digital age: though it’s become much easier to publish and retrieve information in the present—e.g., we can put material on a website instead of printing & shipping a book—the long-term archiving of information has arguably become more difficult. That printed book, you can put on a shelf and it will probably last hundreds of years. The data, you can … do what, exactly?

Turning data into physical artifacts is possible. But other complications arise. Like many who owned a computer in the ’90s, I archived a lot of data using CD-R discs. At the time, these were projected to last 100–200 years. It’s turned out, however, that many of these discs are going bad after about 10 years. This means that many CD-R data archives will annihilate the information they were entrusted to protect. (I was able to recover all the data from my own pile of CD-Rs because I made multiple copies. But indeed, a number of individual discs had already failed.)

One major advantage of printed material is that it’s self-contained: you take a book from the shelf, and it works the way the creator intended. Not so for software. Any digital file needs an app to read it & display it; the app needs a certain operating system; the operating system needs certain hardware; the hardware is complicated and always physically deteriorating.

The Internet Archive has been taking this problem seriously for 20+ years. It’s maybe best known for the Wayback Machine, its historical archive of web pages. But it’s also been figuring out how to preserve other software. The best idea is to at least get rid of the hardware, and run the old operating system as a “virtual machine” on the current system (the Internet Arcade being one example). In this way, everything that is needed to run a file—the OS, the app, etc.—is packaged together, and thus, like a book, becomes something closer to self-contained.

As a software developer, I use virtual machines like these to test my work on older systems. I don’t want to keep an actual Windows XP machine in my office. But I can run it anytime from a hard drive attached to my Mac. It’s easy and it works.

One of the less-noted side effects of the cloud-computing era is that this form of archivability is starting to disappear. When you run a cloud-connected piece of software, it’s dependent as usual on a certain operating system and certain hardware. But also a new ingredient: a server elsewhere on the internet that determines whether the program can run. (Usually, based on whether you have a subscription, etc.)

This outside server makes it impossible to archive the software. Why? First, the dependency on the outside server can’t be encapsulated within a virtual machine. Second, there’s no guarantee that the outside server will keep operating. (On the contrary: every server gets unplugged, eventually.) And because the software environment cannot be archived, the data files that rely on that environment also cannot be archived.
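
To make the dependency concrete, here is a toy sketch in Python. Everything in it is hypothetical: `open_in_vm`, `live_server`, and `unplugged_server` are names invented for illustration, not any real product's licensing API. It only models the shape of the problem: the app and its data file can be frozen inside a virtual machine, but the license check reaches outside it.

```python
from typing import Callable, Optional


class ServerUnreachable(Exception):
    """Raised when the (hypothetical) license server no longer answers."""


def live_server() -> bool:
    # Stand-in for a license server that is still running and approves us.
    return True


def unplugged_server() -> bool:
    # Stand-in for the same server after it has been shut down for good.
    raise ServerUnreachable("license server no longer exists")


def open_in_vm(filename: str,
               license_check: Optional[Callable[[], bool]] = None) -> str:
    """Pretend to open a file with an app archived inside a virtual machine.

    license_check is None for self-contained software. For cloud-connected
    software it is a call that leaves the VM, so it cannot be archived.
    """
    if license_check is not None:
        try:
            allowed = license_check()
        except ServerUnreachable:
            allowed = False
        if not allowed:
            return f"cannot open {filename}: license server unavailable"
    return f"opened {filename}"


print(open_in_vm("essay.doc"))                    # self-contained app
print(open_in_vm("essay.doc", live_server))       # cloud app, server alive
print(open_in_vm("essay.doc", unplugged_server))  # cloud app, server gone
```

In the third call, the virtual machine is as intact as in the first two. The file is unreadable only because something outside the archive went away.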

What does this mean for the culture and knowledge that’s captured in digital data? I’m far from a software-freedom absolutist. But I think the rise of subscription-based software is terrible for the longer arc of computing, because it forces so much to become temporary.

update, 344 days later

The idea of software longevity through minimalistic use of computing resources is starting to percolate upward. See, e.g., Collapse OS, permacomputing, and the Uxn stack.