How we built the good first issues feature

We’ve recently launched good first issues recommendations to help new contributors find easy gateways into open source projects. Read about the machine learning engine behind these recommendations.

Tiferet Gazit·@tiferet

January 22, 2020 | Updated April 16, 2020 | 5 minutes

GitHub is leveraging machine learning (ML) to help more people contribute to open source. We’ve launched the good first issues feature, powered by deep learning, to help new contributors find easy issues they can tackle in projects that fit their interests.

If you want to start contributing, or to attract new contributors to a project you maintain, get started with our overview of the good first issues feature. Read on to learn about how we leverage machine learning to detect easy issues.

In May 2019, our initial launch surfaced recommendations based on labels that were applied to issues by project maintainers. Some analysis of our data, together with manual curation, led to a list of about 300 label names used by popular open source repositories—all synonyms for either “good first issue” or “documentation”. Some examples include “beginner friendly”, “easy bug fix”, and “low-hanging-fruit”. We include documentation issues because they’re often a good way for new people to begin contributing, but rank them lower relative to issues specifically tagged as easy for beginners.

Relying on these labels, however, means that only about 40 percent of the repositories we recommend have easy issues we can surface. Moreover, it leaves maintainers with the burden of triaging and labeling issues. Instead of relying on maintainers to manually label their issues, we wanted to use machine learning to broaden the set of issues we could surface.

Last month, we shipped an updated version that includes both label-based and ML-based issue recommendations. With this new version, we’re able to surface issues in about 70 percent of repositories we recommend to users. There is a tradeoff between coverage and accuracy: the familiar precision and recall tradeoff found in any ML product. To prevent the feed from being swamped with false positive detections, we aim for extremely high precision at the cost of recall. This is necessary because only a tiny minority of all issues are good first issues, so even a small false positive rate would let bad recommendations outnumber good ones. The exact numbers are constantly evolving, both as we improve our training data and modeling, and as we adjust the precision and recall tradeoff based on feedback and user behavior.
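To make the precision and recall tradeoff concrete, here is a minimal sketch of how a decision threshold could be chosen to meet a precision target on a held-out set. The function name, the toy scores, and the 0.9 target are illustrative assumptions, not the values or code used in production.

```python
from typing import List, Optional, Tuple


def threshold_for_precision(
    scores: List[float], labels: List[int], target_precision: float
) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for the lowest score threshold
    whose precision still meets the target, or None if no threshold does.

    An issue is predicted positive when its score is >= threshold.
    """
    total_positives = sum(labels)
    # Walk candidates from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = None
    true_positives = 0
    for i, (score, label) in enumerate(ranked, start=1):
        true_positives += label
        # Only evaluate at the end of a group of tied scores.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        precision = true_positives / i
        recall = true_positives / total_positives if total_positives else 0.0
        if precision >= target_precision:
            # Later matches have lower thresholds and therefore higher recall.
            best = (score, precision, recall)
    return best


# Toy validation data (hypothetical): classifier scores and true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
print(threshold_for_precision(scores, labels, target_precision=0.9))
```

On this toy data, demanding 90 percent precision means recommending only the two highest-scoring issues and missing half of the true positives, which is the kind of trade described above.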

How it works

As with any supervised machine learning project, the first challenge is building a labeled training set. In this case, manual labeling is a difficult task that requires domain-area expertise in a range of topics, projects, and programming languages. Instead, we’ve opted for a weakly-supervised approach, inferring labels for hundreds of thousands of candidate samples automatically.

To detect positive training samples, we begin with issues that have any of the roughly 300 labels in our curated list, but this training set is not sufficiently large or diverse for our needs. We supplement it with a few sets of issues that are also likely to be beginner-friendly. This includes issues that were closed by a pull request from a user who had never previously contributed to the repository, and issues that were closed by a pull request that touched only a few lines in a single file. Because we would rather miss some good issues than swamp the feed with false positives, we define all issues that were not explicitly detected as positive samples to be negative samples in our training set. Naturally, this gives us a hugely imbalanced set, so we subsample the negative set in addition to weighting the loss function. We detect and remove near-duplicate issues, and separate the training, validation, and test sets across repositories to prevent data leakage from similar content.
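The labeling heuristics, negative subsampling, loss weighting, and repository-level split described above can be sketched roughly as follows. This is an illustration under stated assumptions: the `Issue` fields, the four-label subset, the 10-line cutoff, the inverse-frequency weighting scheme, and the split fractions are all hypothetical stand-ins, and near-duplicate removal is omitted.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Set

# Hypothetical subset of the curated list of roughly 300 label names.
CURATED_LABELS = {"good first issue", "beginner friendly", "easy bug fix", "low-hanging-fruit"}
MAX_LINES_CHANGED = 10  # assumed cutoff for "only a few lines"


@dataclass
class Issue:
    repo: str
    labels: Set[str]
    closed_by_first_time_contributor: bool = False
    closing_pr_files_changed: int = 0  # 0 means no closing pull request
    closing_pr_lines_changed: int = 0


def weak_label(issue: Issue) -> int:
    """Infer a training label: 1 if any heuristic fires, otherwise 0."""
    if {name.lower() for name in issue.labels} & CURATED_LABELS:
        return 1
    if issue.closed_by_first_time_contributor:
        return 1
    if issue.closing_pr_files_changed == 1 and issue.closing_pr_lines_changed <= MAX_LINES_CHANGED:
        return 1
    return 0  # everything not detected as positive is treated as negative


def subsample_negatives(issues: List[Issue], ratio: float, seed: int = 0) -> List[Issue]:
    """Keep all positives and at most ratio * (number of positives) negatives."""
    positives = [i for i in issues if weak_label(i) == 1]
    negatives = [i for i in issues if weak_label(i) == 0]
    keep = min(len(negatives), int(ratio * len(positives)))
    return positives + random.Random(seed).sample(negatives, keep)


def class_weights(issues: List[Issue]) -> Dict[int, float]:
    """Inverse-frequency loss weights (one common scheme, assumed here)."""
    counts = {0: 0, 1: 0}
    for issue in issues:
        counts[weak_label(issue)] += 1
    return {c: len(issues) / (2 * n) for c, n in counts.items() if n}


def split_by_repo(issues: List[Issue], val_frac: float = 0.1,
                  test_frac: float = 0.1, seed: int = 0) -> Dict[str, List[Issue]]:
    """Assign whole repositories to one split so similar content cannot leak."""
    repos = sorted({issue.repo for issue in issues})
    random.Random(seed).shuffle(repos)
    n_val = int(len(repos) * val_frac)
    n_test = int(len(repos) * test_frac)
    val_repos = set(repos[:n_val])
    test_repos = set(repos[n_val:n_val + n_test])
    splits: Dict[str, List[Issue]] = {"train": [], "val": [], "test": []}
    for issue in issues:
        if issue.repo in val_repos:
            splits["val"].append(issue)
        elif issue.repo in test_repos:
            splits["test"].append(issue)
        else:
            splits["train"].append(issue)
    return splits
```

Splitting by repository rather than by issue matters because issues in the same project tend to share vocabulary and phrasing, so a per-issue split would overstate how well a model generalizes to unseen projects.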

So that our classifiers can detect good issues as soon as they’re opened, we train them using only issue titles and bodies, and avoid relying on additional signals such as conversations. The titles and bodies undergo some preprocessing and denoising, such as removing parts of the body that are likely to come from an issue template, since they carry no signal for our problem.

We’ve experimented with a range of model variants and architectures, including both classical ML methods, such as random forests using tf-idf input vectors, and neural network models, such as 1D convolutional neural networks and recurrent neural networks. The deep learning models are implemented in TensorFlow. They take one-hot encodings of issue titles and bodies, fed into trainable embedding layers, as separate inputs, with features from the two inputs concatenated towards the top of the network.

Unsurprisingly, these networks generally outperform the classical methods using tf-idf, because they can use the full information contained in word ordering, context, and sentence structure, rather than just information on the unordered relative count of words in the vocabulary. Because both the training and inference happen offline, the added cost of deep learning methods is not prohibitive in this case. However, given the limited size of the positive training set, we’ve found various textual data augmentation techniques to be crucial to the success of these networks, in addition to regularization and early stopping.
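To illustrate the point about word order, the toy example below shows two issue titles that contain exactly the same words and so produce identical bag-of-words counts (and therefore identical tf-idf vectors), while the token sequences a convolutional or recurrent network would consume remain different. The titles and helper names are made up for illustration, and whitespace tokenization is an assumption.

```python
from collections import Counter
from typing import Dict, List


def bag_of_words(text: str) -> Counter:
    """Unordered word counts, the information a tf-idf vector is built from."""
    return Counter(text.lower().split())


def token_ids(text: str, vocab: Dict[str, int]) -> List[int]:
    """Ordered token ids, the kind of sequence fed to an embedding layer."""
    return [vocab[token] for token in text.lower().split()]


title_a = "docs build breaks the release"
title_b = "the release breaks docs build"

# Shared vocabulary mapping each word to an integer id.
words = sorted(set(title_a.split()) | set(title_b.split()))
vocab = {word: index for index, word in enumerate(words)}

same_counts = bag_of_words(title_a) == bag_of_words(title_b)
same_sequence = token_ids(title_a, vocab) == token_ids(title_b, vocab)
print(same_counts, same_sequence)
```

The two titles describe opposite cause and effect, yet a model that sees only word counts cannot tell them apart.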

To surface issue recommendations given a trained classifier, we run inference on all qualifying open issues from non-archived public repositories. Each issue for which the classifier predicts a probability above the required threshold is slated for recommendation, with a confidence score equal to its predicted probability. We also detect all open issues from non-archived public repositories that have at least one of the labels from the curated label list. These issues are given a confidence score based on the relevance of their labels, with synonyms of “good first issue” given higher confidence than synonyms of “documentation”. In general, label-based detections are given higher confidence than ML-based detections. Within each repository, all detected issues are then ranked primarily based on their confidence score, along with a penalty on issue age.
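A minimal sketch of the per-repository ranking described above follows. The threshold, the confidence assigned to each label family, and the linear age penalty are assumed values chosen only to make the example concrete; the post does not state the real ones, nor the exact form of the penalty.

```python
from dataclasses import dataclass
from typing import List, Optional

ML_THRESHOLD = 0.9  # assumed probability cutoff for ML-based detections
LABEL_CONFIDENCE = {  # assumed confidences for the two label families
    "good first issue": 1.0,
    "documentation": 0.95,
}
AGE_PENALTY_PER_DAY = 0.001  # assumed linear penalty on issue age


@dataclass
class Candidate:
    number: int
    age_days: int
    label_kind: Optional[str] = None  # "good first issue", "documentation", or None
    ml_probability: Optional[float] = None


def confidence(candidate: Candidate) -> Optional[float]:
    """Return the confidence score, or None if the issue is not recommended."""
    if candidate.label_kind is not None:
        return LABEL_CONFIDENCE[candidate.label_kind]
    if candidate.ml_probability is not None and candidate.ml_probability >= ML_THRESHOLD:
        # ML-based detections use the predicted probability as confidence.
        return candidate.ml_probability
    return None


def rank_repo(candidates: List[Candidate]) -> List[Candidate]:
    """Rank one repository's detected issues by confidence minus an age penalty."""
    scored = []
    for candidate in candidates:
        conf = confidence(candidate)
        if conf is None:
            continue
        scored.append((conf - AGE_PENALTY_PER_DAY * candidate.age_days, candidate))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored]


issues = [
    Candidate(number=1, age_days=10, label_kind="good first issue"),
    Candidate(number=2, age_days=0, label_kind="documentation"),
    Candidate(number=3, age_days=5, ml_probability=0.93),
    Candidate(number=4, age_days=1, ml_probability=0.50),  # below threshold
    Candidate(number=5, age_days=100, label_kind="good first issue"),
]
print([c.number for c in rank_repo(issues)])
```

With these assumed constants, a fresh labeled issue ranks first, the low-probability ML candidate is dropped, and a 100-day-old labeled issue falls below newer detections because of the age penalty.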

Our data acquisition, training, and inference pipelines run daily, using scheduled Argo workflows, to ensure our results remain fresh and relevant. As the first deep-learning-enabled product to launch on GitHub.com, this feature required careful design to ensure that the infrastructure would generalize to future projects.

What’s next

We continue to iterate on the training data, training pipeline, and classifier models to improve the surfaced issue recommendations. In parallel, we’re adding better signals to our repository recommendations to help users find and get involved with the best projects related to their interests. We also plan to add a mechanism for maintainers and triagers to approve or remove ML-based recommendations in their repositories. Finally, we plan on extending issue recommendations to offer personalized suggestions on next issues to tackle for anyone who has already made contributions to a project.

We hope you find these issue recommendations useful.

Learn more about good first issues
