This week, Microsoft released an AI-based tool for writing software called GitHub Copilot. As a lawyer and 20+ year participant in the world of open-source software, I agree with those who consider Copilot to be primarily an engine for violating open-source licenses.
Still, I’m not worried about its effects on open source. Why? Because as a matter of basic legal hygiene, I expect that organizations that create software assets will have to forbid the use of Copilot and other AI-assisted tools, lest they unwittingly contaminate those software assets with license violations and intellectual-property infringements.
(Before we go further: I am not your lawyer, nor anyone’s lawyer, and you should not take anything on this page as legal advice.)
Those versed in open-source history might recognize my argument as similar to the one Microsoft pushed for many years to deter organizations from adopting open source at all. “How can you trust that the code doesn’t contain IP violations?” they asked. This was often derided as pure FUD (= the marketing tactic of spreading “fear, uncertainty, and doubt” about a competitor). But as a legal matter, it’s a fair question to ask of any new technology that by design contains portions of other people’s work.
As applied to open source, what made the question unfair was its implication that the open-source world is some kind of sloppy mosh pit of IP rights, carelessly remixed. On the contrary, the growth of open source over 20+ years has only been possible because of its fastidious insistence on explicit, enforceable software licenses.
For instance, as an author of open-source software—including unfashionable AI software—I don’t just dump my code on some website and invite others to guess what they can do with it. Rather, everything I make is accompanied by a simple but explicit license—I’ve always preferred the one known as the MIT license—that invites certain kinds of uses, forbids others, and sets the conditions.
Likewise, though open-source licenses allow redistribution and modification of code, I still have to honor the specific terms of other open-source software that I use in my projects. Again, it’s not a mosh pit. There are many open-source licenses in use. Not all of them are compatible. For instance, the GNU General Public License (GPL) is much more restrictive than the MIT license. Thus, I can’t embed GPL-licensed software within my MIT-licensed projects, because it would amount to relicensing that software under an MIT license, which would violate the GPL. This shouldn’t seem arcane: as a logical and legal matter, I can’t pass along to others permissions I never had in the first place.
GitHub Copilot is an AI-based system that has been trained, according to Microsoft, on “billions of lines of public code”. In use, it acts as a fancy autocomplete. As you type into a code editor, Copilot reads what you’ve written—say, the name of a function—and suggests code that implements it.
For instance, I started a Copilot session to write a JavaScript function for calculating the nth prime number. So I typed this:
function calculateNthPrime(n) {
}
After a pause, Copilot filled in the rest:
function calculateNthPrime(n) {
  var primes = [2];
  var i = 3;
  while (primes.length < n) {
    if (isPrime(i)) {
      primes.push(i);
    }
    i += 2;
  }
  return primes[n - 1];
}
This code doesn’t work, however, because it relies on a function called isPrime that doesn’t yet exist. Being a good sport, I forgave Copilot and typed that wish next:
function isPrime(n) {
}
Once again, Copilot produced the rest:
function isPrime(n) {
  for (var i = 2; i < n; i++) {
    if (n % i === 0) {
      return false;
    }
  }
  return true;
}
This is the code I would expect from a talented 12-year-old who learned about JavaScript yesterday and prime numbers today. Does it work? Uh—maybe? Notably, Microsoft doesn’t claim that any of the code Copilot produces is correct. That’s still your problem. Thus, Copilot essentially tasks you with correcting a 12-year-old’s homework, over and over. (I have no idea how this is preferable to just doing the homework yourself.)
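For what it’s worth, here’s the kind of quick check that’s now my job rather than Copilot’s. This is my own sanity test, not Copilot output, and it assumes the two generated functions above are pasted into the same file:
// My own sanity checks (not Copilot output); assumes calculateNthPrime() and isPrime() are defined above.
console.log(calculateNthPrime(1));  // 2 (correct)
console.log(calculateNthPrime(10)); // 29 (correct)
console.log(isPrime(11));           // true (correct)
console.log(isPrime(1));            // true, which is wrong: 1 is not prime
So the happy path works, but the edge cases remain, as Microsoft says, your problem.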
Some have suggested that I’m doing it wrong: that the point of Copilot is merely to automatically finish the rest of a single line, which amounts to predictable boilerplate once the first part is typed. Yes, I agree that per-line usage surely produces fewer errors and probably no license violations worth worrying about. But that’s not at all how Copilot is marketed. The first demo on the Copilot landing page shows Copilot filling in function definitions in three languages. In fact, all 14 examples on that page show Copilot generating blocks of code; none of the examples show this supposedly idiomatic per-line usage.
Speaking of prime numbers, one of the Copilot marketing demos proposes this code as an IsPrimeTest in Java. I wouldn’t even accept this from the 12-year-old, since the test never checks that a single composite number fails (an isPrime that always returned true would pass):
import static org.junit.Assert.*;
import org.junit.Test;

public class IsPrimeTest {
  // Math.isPrime(int) returns whether the given number is prime or not
  @Test
  public void testIsPrime() {
    assertTrue(Math.isPrime(2));
    assertTrue(Math.isPrime(3));
    assertTrue(Math.isPrime(5));
    assertTrue(Math.isPrime(7));
    assertTrue(Math.isPrime(11));
    assertTrue(Math.isPrime(13));
    assertTrue(Math.isPrime(17));
    assertTrue(Math.isPrime(19));
    assertTrue(Math.isPrime(23));
    assertTrue(Math.isPrime(29));
  }
}
But this generated code raises an even more vexing question: if Copilot was trained on software code that was subject to an open-source license, what license might apply to the code produced by Copilot? MIT? GPL? Something else? No license—in the sense of public domain? No license—in the sense that the underlying pieces are under incompatible licenses and there’s no way to combine them?
Microsoft makes no claims about this either. Rather, it explicitly passes the risk to users, who must carry the entire burden of license compliance (emphasis added below):
We recommend you take the same precautions when using code generated by GitHub Copilot that you would when using any code you didn’t write yourself. These precautions include rigorous testing, IP scanning …
By IP scanning I assume Microsoft is speaking of intellectual-property scanning, meaning the process of verifying that the code doesn’t contain IP violations. (Unfortunately the phrase IP scanning is also commonly used to mean IP-address scanning in the network sense.)
On the one hand, we can’t expect Microsoft to offer legal advice to its zillions of users or a blanket indemnification. On the other hand, Microsoft isn’t sharing any of the information users would need to make these determinations. On the contrary—Copilot completely severs the connection between its inputs (= code under various open-source licenses) and its outputs (= code algorithmically produced by Copilot). Thus, after 20+ years, Microsoft has finally produced the very thing it falsely accused open source of being: a black hole of IP rights.
CTOs and general counsels of organizations that generate software IP assets now have an urgent problem: how to prevent the contamination of those assets with code generated by Copilot (and similar AI tools that will certainly emerge).
Let’s be very clear—this has not been a practical problem for open-source software over the last 20+ years. Why? Because open source was designed around license-based accountability. Have there been instances where open-source software has violated IP rights? Sure. Just as there have been instances where proprietary software has done so. The point of open source was never to create a regime of software licensing that was impervious to IP litigation. Rather, it was to show that sharing and modification of source code could become part of the software industry without collapsing the existing regime. Open-source software has successfully coexisted with proprietary software because it plays by the same legal rules.
Copilot does not. Whereas open source strives for clarity around licensing, Copilot creates nothing but fog. Microsoft has imposed upon users the responsibility for determining the IP status of the code that Copilot emits, but provides none of the data they would need to do so.
The task, therefore, is impossible. For this reason, one must further conclude that any code generated by Copilot may contain lurking license or IP violations. In that case, the only prudent position is to reject Copilot—and other AI assistants trained on external code—entirely. I imagine this will quickly be adopted as the official policy of software organizations. Because what other position could be defensible? “We put our enterprise codebase at risk to spare our highly paid programmers the indignity of writing a program to calculate the nth prime number”?
Still, I’m sure some organizations will try to find a middle path with Copilot on the (misguided) principle of developer productivity and general AI maximalism. Before too long, someone at these organizations will find a giant license violation in some Copilot-generated code, and the experiment will quietly end. More broadly, it’s still unclear how the chaotic nature of AI can be squared with the virtue of predictability that is foundational to many business organizations.
(Another troublesome aspect of Copilot is that it operates as a keylogger within your code editor. Regardless of whether you use it to complete partial lines or whole blocks, it’s still sending everything you type back to Microsoft for processing. Sure, you can switch it on and off. But it still represents a risk to privacy, IP, and trade secrets that’s difficult to control. As above, the only prudent policy will be to keep it away from developer machines entirely.)
Could Copilot be fixed? Maybe—if instead of fog, Copilot were to offer sunshine. Rather than conceal the licenses of the underlying open-source code it relies on, it could in principle keep this information attached to each chunk of code as it wends its way through the model. Then, on the output side, it would be possible for a user to inspect the generated code and see where every part came from and what license is attached to it.
Keeping license terms attached to code would also allow users to shape the output of Copilot by license. For instance, generate an nth-prime function using only MIT-licensed source material. As the end user, this wouldn’t eliminate my responsibility to verify these terms. But at least I’d have the information I’d need to do so. As it stands, the task is hopeless.
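To make this concrete, here’s a hypothetical sketch of what license-aware output could look like. Nothing like this exists in Copilot today; the structure and field names are invented purely for illustration:
// Hypothetical: a suggestion that carries provenance for each chunk it was derived from.
const suggestion = {
  code: "function isPrime(n) { /* ... */ }",
  provenance: [
    { repo: "example/primes-js", license: "MIT" },
    { repo: "example/number-theory", license: "GPL-3.0" },
  ],
};

// An editor plugin could then accept only suggestions whose licenses
// are compatible with the project's own license.
const allowed = ["MIT", "BSD-2-Clause", "Apache-2.0"];
const ok = suggestion.provenance.every((src) => allowed.includes(src.license));
console.log(ok ? "acceptable" : "reject: incompatible license");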
In the law, this concept is critical and is known as chain of custody: the idea that the reliability of certain material depends on verifying where it came from. For instance, without recording the chain of custody, you could never introduce documents into evidence at trial, because you’d have no way of confirming that the documents were authentic and trustworthy.
“But that’s not how AI models work—there’s no way of preserving license information.” I don’t assume that the limitations of today’s systems will necessarily persist. Without license auditability, however, few users will conclude that the benefits of these systems outweigh the risks. If AI vendors stay on that path, they will relegate these systems to the status of low-cost toys whose main purpose is developer surveillance rather than code synthesis.
If Copilot is vigorously violating open-source licenses, what should open-source authors do about it?
In the large, I don’t think the problems open-source authors have with AI training are that different from the problems everyone will have. We’re just encountering them sooner.
Most importantly, I don’t think we should let the arrival of a new obstacle compromise the spirit of open source. For instance, some have suggested creating an open-source license that forbids AI training. But this kind of usage-based restriction has never been part of the open-source ethos. Furthermore, it’s overinclusive: we can imagine (as I have above) AI systems that behave more responsibly and ethically than the first generation will. It would be self-defeating for open-source authors to set themselves athwart technological progress, since that’s one of the main goals of open-sourcing code in the first place.
By the same token, it doesn’t make sense to hold AI systems to a different standard than we would hold human users. Widespread open-source license violations shouldn’t be shrugged off as an unavoidable cost. Suppose we accept that AI training falls under the US copyright notion of fair use. (Though the question is far from settled.) If so, then the fair-use exception would supersede the license terms. But even if the input to the AI system qualifies as fair use, the output of that system may not. Microsoft has made no such claim about the output of GitHub Copilot—and never will, because no one can guarantee the behavior of a nondeterministic system.
We are at the beginning of the era of practical, widespread AI systems. It’s inevitable that there will be litigation and regulation about the behavior of these systems. It’s also inevitable that the nondeterminism of these systems will be used as a defense of their misbehavior—“we don’t really know how it works either, so we all just have to accept it”.
I think that regulations mandating the auditability of AI systems by showing the connection between inputs and outputs—akin to a chain of custody—are very likely, probably in the EU before the US. This is the only way to ensure that AI systems are not being used to launder materials that are otherwise unethical or illegal. In the US, I think it’s possible AI may end up provoking an amendment to the US Constitution—but that’s a topic for another day.
In the interim, I think the most important thing open-source authors can do is continue to bring attention to certain facts about Copilot that Microsoft would prefer to leave buried in the fine print. For now, Copilot’s greatest enemy is itself.
Further reading:
If Software is My Copilot, Who Programmed My Software? by Bradley Kuhn (Software Freedom Conservancy)
Is GitHub Copilot a blessing, or a curse? by Jeremy Howard (fast.ai)
Evaluating Large Language Models Trained on Code by Mark Chen et al. (an introduction to OpenAI Codex, the model underlying Copilot)
The Gullible Software Altruist by Ryan Fleury
In light of Copilot and other insults to open source, Software Freedom Conservancy is now calling on open-source authors to give up GitHub.
I agree with SFC. I’m now in the midst of moving my projects off GitHub. To be fair, in June 2018, when Microsoft’s acquisition of GitHub was announced, I foresaw that this day would come:
I hope that all open-source maintainers also eventually move their projects off GitHub … It has nothing to do with [Microsoft] per se. Rather: [GitHub] rose to the position it has by holding itself out as a supporter of open source. They ran their business consistently with those principles (more or less). That will continue to be true for a little while. But then it will disappear.
“What took you so long, coward?” At the time, I looked into the alternatives, like Bitbucket and GitLab. But that just seemed like trading one set of corporate jackasses for another. GitLab, for instance, seems to be slouching toward AI as well. Unfortunately, open-source authors have never really had many good options for project hosting—either you had to do it yourself, or (more likely) pick the least noxious corporate-subsidized service, because it’s not easy and it’s not free. (I had hoped that an open-source-oriented nonprofit like SFC would launch a source-hosting service, but I can also see why that would be off mission.)
On the basis of multiple recommendations, I’ve recently gotten a paid subscription at Sourcehut. We’ll see how it goes. Maybe the fate of source-hosting services, like social networks and other empires, is to rise and fall. Onward we paddle.
Sourcehut’s reliance on the Git-over-HTTP protocol means it is apparently not compatible with the core Racket package server that my Pollen software relies on. I am instead trying out Codeberg.
Codeberg runs an instance of open-source Git software called Gitea. Rather than rely on the good graces of Codeberg, I thought: how hard could it be to put up my own Gitea server? These questions sometimes lead to dark places. For now, however, the answer seems to be: not hard at all. I used a Digital Ocean one-click installer to make a Gitea server. Then I followed these tips from Ryan Fleury to enable HTTPS using Caddy. Cleverly, Gitea has a migration tool that will vacuum up repos from GitHub, including issues and pull-request history. With that, git.matthewbutterick.com is now my GitHub replacement.
I met some excellent class-action lawyers at the Joseph Saveri Law Firm who see the same problems with Copilot. We have teamed up to investigate GitHub Copilot for a potential lawsuit. Click here to see how you can help the investigation.
The Joseph Saveri Law Firm and I have filed a class-action lawsuit challenging the legality of GitHub Copilot.
Consistent with my prediction that the EU would outpace the US on AI regulation, the EU is now proposing mandatory AI dataset transparency.