“Publicly available web data” is a phrase intended to conjure the idea of “data made publicly available on the web by the author,” because many web pages are exactly that. But in the context of automated scraping of AI training data, it means something simpler and grubbier: every byte that can be accessed via the public web.
The problem? The public web contains a lot of material that wasn’t put there by its author. For instance, Anna’s Archive provides links to a huge number of copyrighted works (including several scrapeable copies of my book). The same is true of visual art, books, magazines, fonts, software, music, TV shows, movies—you name it. It’s all out there for whoever wants to scrape the public web hard enough.
It’s often been suggested to me that it should be acceptable for LLMs to “learn” from data on the web but not to “plagiarize.” As a technical and legal matter, it’s still an open question where LLMs fall on this spectrum. (In the meantime, adherents of the “learning” metaphor should carefully consider John Searle’s Chinese Room Argument.)
But this dichotomy sidesteps a knottier issue, which is that many human writers on the web—including me—benefit from our writing in ways that depend entirely on human readers (impolitely aka “traffic”). For instance, it’s easy to imagine an LLM-powered website that “learned” from my work sufficiently well to deliver the informational value of my writing, yet in a form sufficiently entropized that it would not appear to be plagiarism. So what? It will still negatively affect my business. To be fair, if a human reader copied my writing to deliberately create a market substitute, I wouldn’t consider that “learning” either. I would consider that a ripoff—financially, morally, legally. (And yes, it has happened.)
Certain AI companies that are scraping “publicly available web data” to train models are also saying that there is a simple solution: the robots.txt file. Well, hold on. The idea of the robots.txt file arose around 1994, when automated web scraping was becoming widespread. It’s a small text file that resides at the top level of a website and indicates which web scrapers, if any, the website owner wishes to exclude. In the AI age, the idea is that AI companies would identify their training-data scrapers, and web publishers could use their robots.txt to exclude those scrapers too.
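For instance, a minimal robots.txt that asks OpenAI’s training scraper to skip an entire site looks like this (GPTBot is the user-agent token OpenAI has published; the file sits at the site root, e.g. example.com/robots.txt):

```
# https://example.com/robots.txt
# Ask OpenAI's training scraper to stay out of the whole site
User-agent: GPTBot
Disallow: /
```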
As a person who publishes on the web, I foresee major problems with hanging the future of web-hosted intellectual property on this thin reed:
Compliance by the company doing the scraping is entirely voluntary. The robots.txt file is not a technical protection, like a password; it is merely a statement of preference. (There are signs that the détente is already crumbling.)
Reliance on robots.txt creates a permanent, escalating burden on web writers, because every AI scraper identifies itself differently (see the sample file after this list).
Sites like Anna’s Archive don’t typically exclude anyone using robots.txt, so my work will still leak out to scrapers that way.
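To make the second point concrete: a robots.txt that tries to keep pace with AI training crawlers has to name each one individually. The tokens below are ones the companies have published at this writing, but the list is illustrative and grows constantly:

```
# Each AI company's crawler must be excluded by name.
User-agent: GPTBot            # OpenAI
Disallow: /

User-agent: ClaudeBot         # Anthropic
Disallow: /

User-agent: Google-Extended   # Google AI training
Disallow: /

User-agent: CCBot             # Common Crawl
Disallow: /

# ...and so on, updated forever, with no guarantee of compliance
```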
My nerdy-lawyer constituency may be wondering “doesn’t this robots.txt file form a contractual obligation?” Let’s generously suppose it does. If you sued such a web scraper in your local or state court under a contract-law theory for ignoring the robots.txt file, I expect the first thing they’d argue is that your claim is preempted by federal copyright law. Case dismissed—unless you want to file a copyright-infringement action in federal court, which, for various administrative and rational-actor reasons, the average web writer is very unlikely to do.
So in practice, I expect that the robots.txt maneuver will be a feel-good theatrical gesture with no practical or legal effect on these scrapers doing whatever the hell they want. Indeed, I wouldn’t be surprised if the AI companies promoting robots.txt have made exactly the same calculation.
(Though I am a lawyer, I am not your lawyer. Nothing in this message is offered as legal advice.)
Generative AI companies don’t seem to notice that their extractive strategy will only work once. We’re at an unusual moment in human history where we have this thing—the internet—that contains the biggest set of human-created works ever. A lot of dreck, too. But for now, the good largely outweighs the bad.
That will soon stop being true, however, as generative AI floods the internet with toxic AI sludge. For their part, generative-AI companies are highly averse to training models on AI-generated data, because doing so leads to model collapse: a degenerate statistical condition that arises when models are trained on the output of other models. Probabilistically, an AI model tends to produce output near the median of its training data. So training another model on that output shrinks the diversity of its training dataset, much as a population of animals that inbreeds over generations loses genetic diversity. In the NYT today: an excellent explanation & visualization of model collapse.
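As a toy illustration of that narrowing (my sketch, not the NYT’s): suppose each “model” is nothing more than a Gaussian fitted to its training data, and suppose each generation samples slightly closer to its median than the true spread (the 0.9 “temperature” below is an assumption standing in for that median-seeking bias). The measured diversity then decays generation after generation:

```python
import random
import statistics

# Toy model-collapse sketch. Each "model" is just a Gaussian fit to its
# training data; TEMPERATURE < 1 stands in for a model's tendency to
# produce output near the median of what it was trained on.
TEMPERATURE = 0.9
mean, stdev = 0.0, 1.0   # generation 0: the human-written data

for gen in range(1, 11):
    # "Generate" a new training set from the previous generation's model.
    samples = [random.gauss(mean, stdev * TEMPERATURE) for _ in range(1000)]
    # "Train" the next model by fitting it to those samples.
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)
    print(f"generation {gen}: stdev = {stdev:.3f}")

# stdev decays toward zero: each model sees a narrower slice of the
# distribution than the last -- the statistical analogue of inbreeding.
```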
Thus, by flooding the web with AI slop, generative-AI companies are polluting the very resource they need for survival. (Sound familiar, humans?) Before too long, scraping the internet will no longer be a viable way of gathering datasets, because the internet will have become irreversibly polluted with AI-generated material.