Recently, we’ve been working to make our CI experience better by leveraging the newly released GitHub feature, Actions larger runners, to run our CI.
GitHub Copilot is trained on billions of lines of public code. The suggestions it makes to you are adapted to your code, but the processing behind it is ultimately informed by code written by others.
How direct is the relationship between the suggested code and the code that informed it? In a recent thought-provoking paper1, Bender, Gebru et al. coined the phrase “stochastic parrots” for artificial intelligence systems, like the ones that power GitHub Copilot. Or, as a fellow machine learning engineer at GitHub2 remarked during a water cooler chat: these systems can feel like “a toddler with a photographic memory.”
These are deliberate oversimplifications. Many GitHub Copilot suggestions feel specifically tailored to the particular code base the user is working on. Often, it looks less like a parrot and more like a crow building novel tools out of small blocks3. Yet there’s no denying that GitHub Copilot has an impressive memory:
Here, I intentionally directed4 GitHub Copilot to recite a well-known text it obviously knows by heart. I, too, know a couple of texts by heart. For example, I still remember some poems I learned in school. Yet no matter the topic, not once have I been tempted to derail a conversation by falling into iambic tetrameter and waxing about daffodils.
So, is that (or rather the coding equivalent of it) something GitHub Copilot is prone to doing? How many of its suggestions are unique, and how often does it just parrot some likely looking code it has seen during training?
During GitHub Copilot’s early development, nearly 300 employees used it in their daily work as part of an internal trial. This trial provided a good dataset to test for recitation. I wanted to find out how often GitHub Copilot gave them a suggestion that was quoted from something it had seen before.
I limited the investigation to Python suggestions with a cutoff on May 7, 2021 (the day we started extracting that data). That left 453,780 suggestions spread out over 396 “user weeks”, that is, calendar weeks during which a user actively used GitHub Copilot on Python code.
Though 453,780 suggestions are a lot, many of them can be dismissed immediately. To get to the interesting cases, consider sequences of “words” that occur in the suggestion in the same order as in the code GitHub Copilot has been trained on. In this context, punctuation, brackets, or other special characters all count as “words,” while tabs, spaces, or even line breaks are ignored completely. After all, a quote is still a quote, whether it’s indented by one tab or eight spaces.
For example, one of GitHub Copilot’s suggestions was the following regex for numbers separated by whitespace:
This would be exactly 100 “words” in the sense above, but it’s a particularly dense example. The average non-empty line of code has only 10 “words.” I’ve restricted this investigation to cases where the overlap with the code GitHub Copilot was trained on contains at least 60 such “words”. We must set the cut somewhere, and I think it’s rather rare that shorter sequences would be of great interest. In fact, most of the interesting cases identified later are well clear of that threshold of 60.
If the overlap extends to what the user has already written, that also counts for the length. After all, the user may have written that context with the help of GitHub Copilot as well!
In the following example, the user has started writing a very common snippet. GitHub Copilot completes it. Even though the completion itself is rather short, together with the already existing code, it clears the threshold and is retained.
This procedure is permissive enough to let many relatively “boring” examples through, like the two above. But it’s still effective at dialing in the human analysis to the interesting cases, sorting out more than 99% of GitHub Copilot suggestions.
After filtering, there were 473 suggestions left. However, they came in very different forms:
- Some were basically just repeats of another case that passed filtering. For example, sometimes GitHub Copilot makes a suggestion, the developer types a comment line, and GitHub Copilot offers a very similar suggestion again. I removed these cases from the analysis as duplicates.
- Some were long, repetitive sequences. Take the following example, where the repeated blocks of
‘<p>’are, of course, found somewhere in the training set:
Such suggestions can be helpful (test cases, regular expressions) or not helpful (like this case, I suspect). In any case, they do not fit the idea of rote learning I had in mind when I started this investigation.
- Some were standard inventories, like the natural numbers, the prime numbers, stock market tickers, or the Greek alphabet:
- Some were common, straightforward ways, perhaps even universal ways, of doing things with very few natural degrees of freedom. For example, the middle part of the following strikes me as very much the standard way of using the BeautifulSoup package to parse a Wikipedia list. In fact, the best matching snippet found in GitHub Copilot’s training data5 uses such code to parse a different article and goes on to do different things with the results.
This doesn’t fit my idea of a quote either. It’s a bit like when someone says “I’m taking out the trash. I’ll be back soon.” That’s a matter-of-fact statement, not a quote, even though that particular phrase has been uttered many times before.
- Then there are all other cases. Those with at least some specific overlap in either code or comments. These are what interest me the most, and what I’m going to concentrate on moving forward.
This bucketing necessarily has some edge cases6, and your mileage may vary in how you think they should be classified. Maybe you even disagree with the whole set of buckets in the first place.
That’s why we’ve open sourced that dataset7. So, if you feel a bit differently about the bucketing, or if you’re interested in other aspects of GitHub Copilot parroting its training set, you’re very welcome to ignore my next section and draw your own conclusions.
For most of GitHub Copilot’s suggestions, our automatic filter didn’t find any significant overlap with the code used for training. Yet it did bring 473 cases to our attention. Removing the first bucket (cases that look very similar to other cases) left me with 185 suggestions. Of these suggestions, 144 got sorted out in buckets 2 – 4. This left 41 cases in the last bucket, the “recitations,” in the meaning of the term I have in mind.
That corresponds to one recitation event every 10 user weeks (95% confidence interval: 7 – 13 weeks, using a Poisson test).
Naturally, this was measured by the GitHub and Microsoft developers who tried out GitHub Copilot. If your coding behavior is very different from theirs, your results might differ. Some of these developers are only working part-time on Python projects. I could not distinguish that and therefore counted everyone who writes some Python in a given week as a user.
One event in 10 weeks doesn’t sound like a lot, but it’s not 0 either. Also, I found three things that struck me.
If I want to learn the lyrics to a song, I must listen to it many times. GitHub Copilot is no different: To learn a snippet of code by heart, it must see that snippet a lot. Each file is only shown to GitHub Copilot once, so the snippet needs to exist in many different files in public code.
Of the 41 main cases we singled out during manual labelling, none appear in less than 10 different files. Most (35 cases) appear more than a hundred times. In one instance, GitHub Copilot suggested starting an empty file with something it had even seen more than a whopping 700,000 different times during training–that was the GNU General Public License.
The following plot shows the number of matched files of the results in bucket 5 (one red mark on the bottom for each result), versus buckets 2-4. I left out bucket 1, which is really just a mix of duplicates of bucket 2-4 cases and duplicates of bucket 5 cases. The inferred distribution is displayed as a red line. It peaks between 100 and 1000 matches.
As time goes on, each file becomes unique. Yet GitHub Copilot doesn’t wait for that8. It will offer its solutions while your file is still extremely generic. And in the absence of anything specific to go on, it’s much more likely to quote from somewhere else than it would be otherwise.
Of course, software developers spend most of their time deep inside the files, where the context is unique enough that GitHub Copilot will offer unique suggestions. In contrast, the suggestions at the beginning are rather hit-and-miss, since GitHub Copilot cannot know what the program will be. Yet sometimes, especially in toy projects or standalone scripts, a modest amount of context can be enough to hazard a reasonable guess of what the user wanted to do. Sometimes it’s also still generic enough so that GitHub Copilot thinks one of the solutions it knows by heart looks promising:
This is all but directly taken from coursework for a robotics class uploaded in different variations9.
In its current form, the filter will turn up a good number of uninteresting cases when applied broadly. Yet it still should not be too much noise. For the internal users in the experiment, it would have been a bit more than one find per week on average (albeit likely in bursts!). Of these finds, roughly 17% (95% confidence interval using a binomial test: 14%-21%) would be in the fifth bucket.
Nothing is ever foolproof, of course, so this too can be tricked. Some cases are rather hard to detect by the tool we’re building, but still have an obvious source. To return to the Zen of Python:
This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, yet it rarely does so, and when it does, it mostly quotes code that everybody quotes, typically at the beginning of a file, as if to break the ice.
However, there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and when to include credit where credit is due.
The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution, or decide against using that code altogether.
This duplication search is not yet integrated into the technical preview, but we plan to do so. We will both continue to work on decreasing rates of recitation, as well as making its detection more precise.
3: see von Bayern et al. about the creative wisdom of crows: Compound tool construction by New Caledonian crows ^
4: see Carlini et al. about deliberately triggering the recall of training data: Extracting Training Data from Large Language Models ^
6: Probably not too many though. I’ve asked some developers to help me label the cases, and everyone was prompted to flag up any uncertainty with their judgement. That happened in only 34 cases, i.e. less than 10%. ^
7: In the public dataset, I list the part of Copilot’s suggestion that was also found in the training set, how often it was found, and a link to an example where it occurs in public code. For privacy reasons, I don’t include the not-matched part of the completion or the code context the user had typed (only an indication of its length). ^
8: In fact, since this experiment has been made, GitHub Copilot has changed to require a minimum file content. So some of the suggestions flagged here would not have been shown by the current version. ^