AI scraping & “publicly available web data”

“Publicly available web data” is a phrase intended to conjure the idea of “data made publicly available on the web by the author,” because many web pages are exactly that. But in the context of automated scraping of AI training data, it means something simpler and grubbier: every byte that can be accessed via the public web.

The problem? The public web contains a lot of material that wasn’t put there by the author. For instance, Anna’s Archive provides links to a huge number of copyrighted works (including several scrapeable copies of my book). The same is true of visual art, books, magazines, fonts, software, music, TV shows, movies: you name it. It’s all out there for whoever wants to scrape the public web hard enough.

Learning vs. plagiarizing

It’s often been suggested to me that it should be accept­able for LLMs to “learn” from data on the web but not to “plagia­rize.” As a tech­nical and legal matter, it’s still an open ques­tion where LLMs fall on this spec­trum. (In the mean­time, adher­ents of the “learning” metaphor should care­fully consider John Searle’s Chinese Room Argu­ment.)

But this dichotomy side­steps a knot­tier issue, which is that many human writers on the web—including me—benefit from our writing in ways that depend entirely on human readers (impo­litely aka “traffic”). For instance, it’s easy to imagine an LLM-powered website that “learned” from my work suffi­ciently well to deliver the infor­ma­tional value of my writing, yet in a form suffi­ciently entropized that it would not appear to be plagia­rism. So what? It will still nega­tively affect my busi­ness. To be fair, if a human reader copied my writing to delib­er­ately create a market substi­tute, I wouldn’t consider that “learning” either. I would consider that a ripoff—finan­cially, morally, legally. (And yes, it has happened.)

The limits of robots.txt

Certain AI companies that are scraping “publicly available web data” to train models are also saying that there is a simple solution: the robots.txt file. Well, hold on. The idea of the robots.txt file arose around 1994, when automated web scraping was becoming widespread. It is a small text file that resides at the top level of a website and indicates which web scrapers, if any, the website owner wishes to exclude. In the AI age, the idea is that AI companies would identify their training-data scrapers, and web publishers could use their robots.txt to exclude those scrapers too.
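
To make the mechanism concrete, here is a minimal sketch in Python of how a cooperative scraper would consult a robots.txt file, using the standard library's urllib.robotparser. The user-agent names (ExampleAIBot, SomeOtherBot) and the example.com URL are hypothetical placeholders I made up for illustration, not real scrapers or a real policy. The point to notice is that the check runs entirely on the scraper's side.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: exclude one named scraper, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The named scraper is asked to stay out; an unnamed one is not.
listed = is_allowed(ROBOTS_TXT, "ExampleAIBot", "https://example.com/essay.html")
unlisted = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/essay.html")
print(listed, unlisted)
```

Nothing here is enforced by the web server. A scraper that never calls anything like is_allowed simply fetches the page.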

As a person who publishes on the web, I foresee major prob­lems with hanging the future of web-hosted intel­lec­tual prop­erty on this thin reed:

  1. Compli­ance by the company doing the scraping is entirely volun­tary. The robots.txt file is not a tech­nical protec­tion, like a pass­word; it is merely a state­ment of pref­er­ence. (There are signs that the détente is already crum­bling.)

  2. Reliance on robots.txt creates a permanent, escalating burden on web writers, because every AI scraper identifies itself differently, and each new scraper has to be added to the file before it can be excluded.

  3. Sites like Anna’s Archive don’t typi­cally exclude anyone using robots.txt, so my work will still leak out to scrapers that way.

  4. My nerdy-lawyer constituency may be wondering “doesn’t this robots.txt file form a contrac­tual oblig­a­tion?” Let’s gener­ously suppose it does. If you sued such a web scraper in your local or state court under a contract-law theory for ignoring the robots.txt file, I expect the first thing they’d argue is that your claim is preempted by federal copy­right law. Case dismissed—unless you want to file a copy­right-infringe­ment action in federal court, which for various admin­is­tra­tive and rational-actor reasons, the average web writer is very unlikely to do.
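
The maintenance burden in item 2 can be sketched in a few lines of Python. This is an illustration under assumptions: the scraper names are hypothetical, and the helper functions build_robots_txt and unblocked are my own inventions, not part of any standard tool. It builds a robots.txt from a list of known scrapers, then checks which scrapers a cooperative parser would still let through.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_agents: list[str]) -> str:
    """Emit one exclusion block per named scraper; all others stay allowed."""
    blocks = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked_agents]
    return "\n".join(blocks)


def unblocked(robots_txt: str, agents: list[str], url: str) -> list[str]:
    """Return the agents that robots_txt does not exclude from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if parser.can_fetch(agent, url)]


# Hypothetical scrapers the publisher knew about when writing the file.
known = ["HypotheticalBotA", "HypotheticalBotB"]
robots_txt = build_robots_txt(known)

# A hypothetical scraper that appears later is not covered by the file.
seen_later = known + ["HypotheticalBotC"]
still_allowed = unblocked(robots_txt, seen_later, "https://example.com/essay.html")
print(still_allowed)
```

Every scraper that launches after the file was written lands in still_allowed until the publisher notices it and adds another block. And that is the best case, where every scraper bothers to check.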

So in prac­tice, I expect that the robots.txt maneuver will merely be a feel-good theatrical gesture that will have no prac­tical or legal impact on these scrapers doing what­ever the hell they want. Indeed, I wouldn’t be surprised if the AI compa­nies promoting robots.txt have made exactly the same calcu­la­tion.

