Six Months to Save the Internet: The Manic, Wget-Fueled Race to Pull 38 Million GeoCities Pages From Yahoo's Trash Compactor
Yahoo's announcement landed on April 23, 2009, in the tone of a company closing a branch office nobody visited anymore. GeoCities — the free web hosting service that had, at its peak in the late nineties, been the third most visited destination on the entire internet — would be shut down in the fall. The press release was bloodless and corporate. The reaction from a certain corner of the internet was anything but.
Within 48 hours, an IRC channel that would become the nerve center of the largest amateur archival effort in internet history had more people in it than most channels see on a good day.
What Yahoo Was Actually Throwing Away
To understand why people lost their minds, you have to understand what GeoCities actually was. Not the punchline. Not the animated GIF jokes. The actual thing.
From roughly 1995 to 2001, GeoCities was where the internet went to exist. Before social media, before blogging platforms, before any of the infrastructure that now makes publishing on the web trivially easy, GeoCities was the place where ordinary people — not developers, not corporations, not media companies — made their first marks on the digital world. You picked a neighborhood. You got a subdirectory. You learned just enough HTML to be dangerous.
The result was 38 million pages of everything. Fan shrines to TV shows that had been canceled for fifteen years. Personal homepages for people who'd since died. Detailed hobbyist documentation for activities so niche that no institution had ever thought to record them. Political content from the late nineties that had nowhere else to live. Original MIDI compositions. Web rings. Guestbooks full of messages from people who'd visited in 1998 and never came back. The complete, unfiltered record of what humans wanted to put on the internet when nobody was curating it and nobody was monetizing it.
Yahoo was going to delete all of it. Not archive it. Not donate it. Delete it.
The Channel
Jason Scott — archivist, filmmaker, professional internet historian, and a man constitutionally incapable of watching cultural material get destroyed without doing something about it — had been thinking about large-scale web archiving for years. When the GeoCities announcement dropped, he did what anyone who came up through the IRC era does when they need to organize people fast: he opened a channel.
The IRC channel that coalesced around the GeoCities rescue effort was chaotic in the way that all productive IRC channels are chaotic — simultaneously a planning meeting, a technical support desk, an argument about methodology, and a place where someone was always pasting a wget command that may or may not have been tested. The Archive Team, which would eventually formalize into an actual organization with a name and a logo and a Wikipedia page, was at this point just a collection of people in a channel who all agreed that what Yahoo was doing was wrong and that hard drives were cheap enough now that there was no good reason to let it happen.
The technical challenge was not trivial. GeoCities wasn't a neat database you could export. It was nearly fifteen years of accumulated chaos: inconsistent directory structures, broken links, pages that only rendered correctly in Netscape 4, embedded content hosted on servers that had already been dead for years. Crawling it comprehensively required not just bandwidth and storage but judgment calls about what counted as part of the site versus what was already gone.
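To make the "part of the site versus already gone" judgment concrete, here is a minimal Python sketch of a scope rule a crawler might apply to the links on a page. It is an illustration, not a record of what anyone in the channel actually ran: the host list, the function name classify_links, and the sample URLs are all assumptions.

```python
from urllib.parse import urljoin, urlparse

# Assumption: only URLs on these hosts count as "part of the site".
IN_SCOPE_HOSTS = {"www.geocities.com", "geocities.com"}


def classify_links(page_url, hrefs):
    """Split a page's links into in-scope URLs and external ones.

    Hypothetical helper: resolves each href against the page URL and
    sorts it by hostname. Non-HTTP links (mailto:, javascript:) are dropped.
    """
    in_scope, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # not something a web crawler can fetch
        host = (parts.hostname or "").lower()
        if host in IN_SCOPE_HOSTS:
            in_scope.append(absolute)
        else:
            external.append(absolute)
    return in_scope, external


if __name__ == "__main__":
    page = "http://www.geocities.com/Area51/1234/index.html"  # made-up page
    links = ["pics/ufo.gif", "../5678/", "http://example.com/counter.cgi"]
    crawl, skip = classify_links(page, links)
    print("crawl:", crawl)
    print("external:", skip)
```

A rule this simple is exactly where the judgment calls come in: the hit counter on someone else's server is "external" by the hostname test, yet the page may look broken without it.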
People brought their own machines. They brought their own bandwidth, which in 2009 was still not infinite for most residential connections. They coordinated through the IRC channel, through a wiki that got stood up in the first week, and through the kind of informal consensus-building that the old internet ran on before everything had to have a product manager.
The Arguments Nobody Talks About
Here's the part of the story that gets sanitized in the retrospective coverage: there were real, heated disagreements about what was worth saving.
The obvious answer is "everything," and that was the position the Archive Team ultimately took, more or less. But in the early weeks of the effort, when it wasn't clear whether they'd have the capacity to get everything, the arguments got philosophical fast. Was a blank page worth crawling? What about pages that were just the default GeoCities template with no user content? What about pages that were clearly spam, or that contained content that was illegal in various jurisdictions, or that were just someone's placeholder from 1997 that said "UNDER CONSTRUCTION" with a GIF of a guy in a hard hat?
The answer the archivists eventually landed on was that they were not the right people to make those calls, and that any decision to filter the archive would introduce their own biases into what survived. The historian's instinct is always to save more rather than less. You can always discard later. You cannot un-delete.
This was not an obvious conclusion. It was argued out in an IRC channel by people who were simultaneously running crawl jobs and arguing about epistemology, which is a very specific kind of multitasking that the nineties internet produced in abundance.
What the Crawlers Found
The people running the crawlers started reporting back things that reframed the whole project. A complete archive of a regional BBS that had shut down in 1999 and existed nowhere else. Original music from bands that had never recorded anywhere else, posted by the musicians themselves in 1997 and forgotten. Detailed documentation of Hurricane Mitch's impact on Central American communities, written by people who were there, in 1998, before any NGO had thought to collect survivor accounts digitally. Medical information communities for conditions so rare that the GeoCities pages were the only place patients had ever found each other.
This was not nostalgia. This was primary source material. This was history that existed nowhere else and was about to not exist anywhere.
The Internet Archive had been crawling the web since 1996, but the GeoCities coverage in its Wayback Machine was incomplete: it had captured snapshots of some pages, but not comprehensively, and not with the associated assets (images, background tiles, embedded audio) that made pages actually render correctly. The Archive Team effort was specifically filling in the gaps that institutional archiving had missed because institutional archiving moves slowly and GeoCities moved toward deletion quickly.
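The "associated assets" problem is easy to show in miniature. The Python sketch below lists the files a page needs in order to render and reports which ones are absent from a set of already-captured URLs. It is a rough illustration under stated assumptions, not the method the Internet Archive or the Archive Team used: the AssetCollector class, the missing_assets function, and the tag-to-attribute table are all hypothetical, and the table covers only a handful of tags common on pages of that era.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect URLs of files a page needs to render (hypothetical helper)."""

    # Assumption: a small, incomplete map of tag -> attribute holding an asset URL.
    ASSET_ATTRS = {
        "img": "src",
        "body": "background",  # tiled background images
        "embed": "src",        # embedded MIDI and similar media
        "bgsound": "src",
        "script": "src",
    }

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        wanted = self.ASSET_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.assets.append(urljoin(self.base_url, value))


def missing_assets(base_url, html, captured):
    """Return asset URLs referenced by the page but not in `captured`."""
    collector = AssetCollector(base_url)
    collector.feed(html)
    return [url for url in collector.assets if url not in captured]


if __name__ == "__main__":
    page_url = "http://www.geocities.com/Area51/1234/index.html"  # made-up page
    page = '<BODY BACKGROUND="stars.gif"><IMG SRC="logo.gif"></BODY>'
    already_have = {"http://www.geocities.com/Area51/1234/logo.gif"}
    for url in missing_assets(page_url, page, already_have):
        print("still needed:", url)
```

A snapshot that holds the HTML but not the files this reports is the kind of partial capture described above: the text survives, the page does not look like itself.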
October 26, 2009
Yahoo pulled the plug on schedule. The GeoCities servers went dark. And the Archive Team had approximately 650 gigabytes of compressed data representing what they'd managed to get — later estimates put the final rescued total at around a terabyte of unique content, covering the substantial majority of what had been there.
It wasn't everything. There were pages that got missed, content that had already been deleted by users before the crawl, embedded media that was hosted elsewhere and already gone. The archive has gaps. But it exists, and the alternative was that none of it would.
The full GeoCities archive is now distributed across Archive.org and various mirrors maintained by people who were in that IRC channel in 2009 and never quite got over the experience. Researchers use it. Journalists use it. People find their own old pages in it and have the particular kind of emotional experience that only the internet can produce — confronting a version of yourself from fifteen years ago who thought that background tile was a good idea.
What It Actually Meant
The Library of Congress had a web archiving program in 2009. It was focused on government websites and selected cultural material. The concept of systematically preserving the vernacular web (the stuff ordinary people made, not the stuff institutions produced) was not really on its radar.
The GeoCities rescue effort happened before anyone had a framework for thinking about it. The people doing it were operating on instinct, on the hacker ethic that information wants to be free and that destroying it is the worst thing you can do, and on the practical knowledge that wget existed and hard drives were fifty bucks at Best Buy.
Every subsequent conversation about web preservation (about what platforms owe their users when they shut down, about what cultural institutions should be doing with born-digital material, about who gets to decide what the internet remembers) traces back, in some way, to the six months in 2009 when a bunch of people in an IRC channel decided Yahoo was wrong.
They were right. The animated GIFs and all.