This copilot is stupid and wants to kill me

This week, Microsoft released an AI-based tool for writing software called GitHub Copilot. As a lawyer and 20+ year participant in the world of open-source software, I agree with those who consider Copilot to be primarily an engine for violating open-source licenses.

Still, I’m not worried about its effects on open source. Why? Because as a matter of basic legal hygiene, I expect that organizations that create software assets will have to forbid the use of Copilot and other AI-assisted tools, lest they unwittingly contaminate those software assets with license violations and intellectual-property infringements.

(Before we go further: I am not your lawyer, nor anyone’s lawyer, and you should not take anything on this page as legal advice.)

It’s licenses all the way down

Those versed in open-source history might recognize my argument as similar to the one Microsoft pushed for many years to deter organizations from adopting open source at all. “How can you trust that the code doesn’t contain IP violations?”, they asked. This was often derided as pure FUD (= the marketing tactic of spreading “fear, uncertainty, and doubt” about a competitor). But as a legal matter, it’s a fair question to ask of any new technology that by design contains portions of other people’s work.

As applied to open source, what made the question unfair was its implication that the open-source world is some kind of sloppy mosh pit of IP rights, carelessly remixed. On the contrary, the growth of open source over 20+ years has only been possible because of its fastidious insistence on explicit, enforceable software licenses.

For instance, as an author of open-source software—including unfashionable AI software—I don’t just dump my code on some website and invite others to guess what they can do with it. Rather, everything I make is accompanied by a simple but explicit license—I’ve always preferred the one known as the MIT license—that invites certain kinds of uses, forbids others, and sets the conditions.

Likewise, though open-source licenses allow redistribution and modification of code, I still have to honor the specific terms of other open-source software that I use in my projects. Again, it’s not a mosh pit. There are many open-source licenses in use, and not all of them are compatible. For instance, the GNU General Public License (GPL) is much more restrictive than the MIT license. Thus, I can’t embed GPL-licensed software within my MIT-licensed projects, because that would amount to relicensing the software under the MIT license, which would violate the GPL. This shouldn’t seem arcane: as a logical and legal matter, I can’t pass along to others permissions I never had in the first place.

GitHub Copilot is different

GitHub Copilot is an AI-based system that has been trained, according to Microsoft, on “billions of lines of public code”. In use, it acts as a fancy autocomplete. As you type into a code editor, Copilot reads what you’ve written—say, the name of a function—and suggests code that implements it.

For instance, I started a Copilot session to write a JavaScript function for calculating the nth prime number. So I typed this:

function calculateNthPrime(n) {

}

After a pause, Copilot filled in the rest:

function calculateNthPrime(n) {
   var primes = [2];
   var i = 3;
   while (primes.length < n) {
       if (isPrime(i)) {
          primes.push(i);
       }
       i += 2;
   }
   return primes[n - 1];
}

This code doesn’t work, however, because it relies on a function called isPrime that doesn’t yet exist. Being a good sport, I forgave Copilot and typed that wish next:

function isPrime(n) {

}

Once again, Copilot produced the rest:

function isPrime(n) {
   for (var i = 2; i < n; i++) {
       if (n % i === 0) {
           return false;
       }
   }
   return true;
}

This is the code I would expect from a talented 12-year-old who learned about JavaScript yesterday and prime numbers today. Does it work? Uh—maybe? Notably, Microsoft doesn’t claim that any of the code Copilot produces is correct. That’s still your problem. Thus, Copilot essentially tasks you with correcting a 12-year-old’s homework, over and over. (I have no idea how this is preferable to just doing the homework yourself.)
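
In fairness to the 12-year-old, the two generated functions do seem to work together for positive n (though slowly, since isPrime tries every possible divisor up to n). But the only way to find that out is to do the verification yourself. A minimal sketch of that chore, assuming both Copilot-generated functions above are loaded:

function testNthPrime() {
    // The first ten primes, which the generated code should reproduce.
    var expected = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29];
    for (var i = 0; i < expected.length; i++) {
        var actual = calculateNthPrime(i + 1);
        console.assert(actual === expected[i],
            "calculateNthPrime(" + (i + 1) + ") returned " + actual);
    }
    // An edge case the generated code ignores: n = 0 never enters the
    // while loop, so the function returns primes[-1], i.e. undefined.
    console.log(calculateNthPrime(0)); // undefined
}
testNthPrime();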

Some have suggested that I’m doing it wrong: that the point of Copilot is merely to automatically finish the rest of a single line, which amounts to predictable boilerplate once the first part is typed. Yes, I agree that per-line usage surely produces fewer errors and probably no license violations worth worrying about. But that’s not at all how Copilot is marketed. The first demo on the Copilot landing page shows Copilot filling in function definitions in three languages. In fact, all 14 examples on that page show Copilot generating blocks of code; none of the examples show this supposedly idiomatic per-line usage.

Speaking of prime numbers, one of the Copilot marketing demos proposes this code as an IsPrimeTest in Java. I wouldn’t even accept this from the 12-year-old:

import static org.junit.Assert.*;
import org.junit.Test;

public class IsPrimeTest {
    // Math.isPrime(int) returns whether the given number is prime or not
    @Test
    public void testIsPrime() {
        assertTrue(Math.isPrime(2));
        assertTrue(Math.isPrime(3));
        assertTrue(Math.isPrime(5));
        assertTrue(Math.isPrime(7));
        assertTrue(Math.isPrime(11));
        assertTrue(Math.isPrime(13));
        assertTrue(Math.isPrime(17));
        assertTrue(Math.isPrime(19));
        assertTrue(Math.isPrime(23));
        assertTrue(Math.isPrime(29));
    }
}

The big nowhere

But this generated code raises an even more vexing question: if Copilot was trained on software code that was subject to an open-source license, what license might apply to the code produced by Copilot? MIT? GPL? Something else? No license—in the sense of public domain? No license—in the sense that the underlying pieces are under incompatible licenses and there’s no way to combine them?

Microsoft makes no claims about this either. Rather, it explicitly passes the risk to users, who must carry the entire burden of license compliance:

We recommend you take the same precautions when using code generated by GitHub Copilot that you would when using any code you didn’t write yourself. These precautions include rigorous testing, IP scanning …

By IP scanning I assume Microsoft is speaking of intellectual-property scanning, meaning the process of verifying that the code doesn’t contain IP violations. (Unfortunately the phrase IP scanning is also commonly used to mean IP-address scanning in the network sense.)

On the one hand, we can’t expect Microsoft to offer legal advice to its zillions of users, or a blanket indemnification. On the other hand, Microsoft isn’t sharing any of the information users would need to make these determinations. On the contrary—Copilot completely severs the connection between its inputs (= code under various open-source licenses) and its outputs (= code algorithmically produced by Copilot). Thus, after 20+ years, Microsoft has finally produced the very thing it falsely accused open source of being: a black hole of IP rights.

Copilot is malware

CTOs and general counsels of organizations that generate software IP assets now have an urgent problem: how to prevent the contamination of those assets with code generated by Copilot (and similar AI tools that will certainly emerge).

Let’s be very clear—this has not been a practical problem for open-source software over the last 20+ years. Why? Because open source was designed around license-based accountability. Have there been instances where open-source software has violated IP rights? Sure, just as there have been instances where proprietary software has done so. The point of open source was never to create a regime of software licensing that was impervious to IP litigation. Rather, it was to show that sharing and modification of source code could become part of the software industry without collapsing the existing regime. Open-source software has successfully coexisted with proprietary software because it plays by the same legal rules.

Copilot does not. Whereas open source strives for clarity around licensing, Copilot creates nothing but fog. Microsoft has imposed upon users the responsibility for determining the IP status of the code that Copilot emits, but provides none of the data they would need to do so.

The task, therefore, is impossible. For this reason, one must further conclude that any code generated by Copilot may contain lurking license or IP violations. In that case, the only prudent position is to reject Copilot—and other AI assistants trained on external code—entirely. I imagine this will quickly be adopted as the official policy of software organizations. Because what other position could be defensible? “We put our enterprise codebase at risk to spare our highly paid programmers the indignity of writing a program to calculate the nth prime number”?

Still, I’m sure some organizations will try to find a middle path with Copilot, on the (misguided) principle of developer productivity and general AI maximalism. Before too long, someone at one of these organizations will find a giant license violation in some Copilot-generated code, and the experiment will quietly end. More broadly, it’s still unclear how the chaotic nature of AI can be squared with the virtue of predictability that is foundational to many business organizations.

(Another troublesome aspect of Copilot is that it operates as a keylogger within your code editor. Regardless of whether you use it to complete partial lines or whole blocks, it’s still sending everything you type back to Microsoft for processing. Sure, you can switch it on and off. But it still represents a risk to privacy, IP, and trade secrets that’s difficult to control. As above, the only prudent policy will be to keep it away from developer machines entirely.)

Can Copilot be fixed?

Maybe—if instead of fog, Copilot were to offer sunshine. Rather than conceal the licenses of the underlying open-source code it relies on, it could in principle keep this information attached to each chunk of code as it wends its way through the model. Then, on the output side, it would be possible for a user to inspect the generated code and see where every part came from and what license is attached to it.

Keeping license terms attached to code would also allow users to shape the output of Copilot by license. For instance: “generate an nth-prime function using only MIT-licensed source material”. As the end user, I’d still be responsible for verifying those terms. But at least I’d have the information I’d need to do so. As it stands, the task is hopeless.
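
To make that concrete: no such Copilot API exists, but here is a purely hypothetical sketch of what a provenance-preserving completion could look like. Every field name below (snippet, source, license) is invented for illustration, as is the origin repo:

// Hypothetical output of a provenance-preserving code assistant.
// Copilot offers nothing like this today.
var completion = {
    prompt: "calculateNthPrime",
    allowedLicenses: ["MIT"], // license filter requested by the user
    chunks: [{
        snippet: "function calculateNthPrime(n) { /* ... */ }",
        source: "github.com/example/primes", // hypothetical origin repo
        license: "MIT"
    }]
};

// With a license attached to every chunk, compliance review becomes a
// mechanical filter rather than an impossible forensic investigation.
var violations = completion.chunks.filter(function (chunk) {
    return completion.allowedLicenses.indexOf(chunk.license) === -1;
});
console.log(violations.length === 0 ? "all chunks clear" : "needs review");

As the sketch suggests, the hard part isn’t the filtering; it’s getting vendors to preserve the provenance in the first place.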

In the law, this concept is critical, and known as chain of custody: the idea that the reliability of certain material depends on verifying where it came from. For instance, without recording the chain of custody, you could never introduce documents into evidence at trial, because you’d have no way of confirming that the documents were authentic and trustworthy.

“But that’s not how AI models work—there’s no way of preserving license information.” I don’t assume that the limitations of today’s systems will necessarily persist. Without license auditability, however, few users will conclude that the benefits of these systems outweigh the risks. If AI vendors stay on that path, they will relegate these systems to the status of low-cost toys whose main purpose is developer surveillance rather than code synthesis.

What Copilot means for open source

If Copilot is vigorously violating open-source licenses, what should open-source authors do about it?

In the large, I don’t think the problems open-source authors have with AI training are that different from the problems everyone will have. We’re just encountering them sooner.

Most importantly, I don’t think we should let the arrival of a new obstacle compromise the spirit of open source. For instance, some have suggested creating an open-source license that forbids AI training. But this kind of usage-based restriction has never been part of the open-source ethos. Furthermore, it’s overinclusive: we can imagine (as I have above) AI systems that behave more responsibly and ethically than the first generation will. It would be self-defeating for open-source authors to set themselves athwart technological progress, since enabling that progress is one of the main goals of open-sourcing code in the first place.

By the same token, it doesn’t make sense to hold AI systems to a different standard than we would hold human users. Widespread open-source license violations shouldn’t be shrugged off as an unavoidable cost. Suppose we accept that AI training falls under the US copyright notion of fair use (though the question is far from settled). If so, the fair-use exception would supersede the license terms. But even if the input to the AI system qualifies as fair use, the output of that system may not. Microsoft has not made this claim about GitHub Copilot—and never will, because no one can guarantee the behavior of a nondeterministic system.

We are at the beginning of the era of practical, widespread AI systems. It’s inevitable that there will be litigation and regulation about the behavior of these systems. It’s also inevitable that the nondeterminism of these systems will be used as a defense of their misbehavior—“we don’t really know how it works either, so we all just have to accept it”.

I think that regulations mandating the auditability of AI systems—showing the connection between inputs and outputs, akin to a chain of custody—are very likely, probably in the EU before the US. This is the only way to ensure that AI systems are not being used to launder materials that are otherwise unethical or illegal. In the US, I think it’s possible AI may end up provoking an amendment to the US constitution—but that’s a topic for another day.

In the interim, I think the most important thing open-source authors can do is continue to bring attention to certain facts about Copilot that Microsoft would prefer to leave buried in the fine print. For now, Copilot’s greatest enemy is itself.

update, 6 days later

In light of Copilot and other insults to open source, Software Freedom Conservancy is now calling on open-source authors to give up GitHub.

I agree with SFC. I’m now in the midst of moving my projects off GitHub. To be fair, in June 2018, when the acquisition was announced, I foresaw that this day would come:

I hope that all open-source maintainers also eventually move their projects off GitHub … It has nothing to do with [Microsoft] per se. Rather: [GitHub] rose to the position it has by holding itself out as a supporter of open source. They ran their business consistently with those principles (more or less). That will continue to be true for a little while. But then it will disappear.

“What took you so long, coward?” At the time, I looked into the alternatives, like Bitbucket and GitLab. But that just seemed like trading one set of corporate jackasses for another. GitLab, for instance, seems to be slouching toward AI as well. Unfortunately, open-source authors have never really had many good options for project hosting—either you had to do it yourself, or (more likely) pick the least noxious corporate-subsidized service, because it’s not easy and it’s not free. (I had hoped that an open-source-oriented nonprofit like SFC would launch a source-hosting service, but I can also see why that would be off mission.)

On the basis of multiple recommendations, I’ve recently gotten a paid subscription at Sourcehut. We’ll see how it goes. Maybe the fate of source-hosting services, like social networks and other empires, is to rise and fall. Onward we paddle.

update, 10 days later

Sourcehut’s reliance on the Git-over-HTTP protocol means it is apparently not compatible with the core Racket package server that my Pollen software relies on. I am instead trying out Codeberg.

update, 32 days later

Codeberg runs an instance of open-source Git software called Gitea. Rather than rely on the good graces of Codeberg, I thought: how hard could it be to put up my own Gitea server? These questions sometimes lead to dark places. For now, however, the answer seems to be: not hard at all. I used a Digital Ocean one-click installer to make a Gitea server. Then I followed these tips from Ryan Fleury to enable HTTPS using Caddy. Cleverly, Gitea has a migration tool that will vacuum up repos from GitHub, including issues and pull-request history. With that, git.matthewbutterick.com is now my GitHub replacement.
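
For anyone tempted to do the same, the Caddy side is pleasantly small. What follows is a sketch rather than Fleury’s exact configuration: it assumes Gitea is listening on its default port (3000) and that you swap in your own domain. Caddy provisions the HTTPS certificate automatically:

git.example.com {
    # Forward all traffic to the local Gitea instance.
    reverse_proxy localhost:3000
}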

update, 114 days later

I met some excellent class-action lawyers at the Joseph Saveri Law Firm who see the same problems with Copilot. We have teamed up to investigate GitHub Copilot for a potential lawsuit. Click here to see how you can help the investigation.

update, 306 days later

Consistent with my prediction that the EU would outpace the US on AI regulation, the EU is now proposing mandatory AI dataset transparency.