Introducing the CodeSearchNet challenge
We’re announcing the CodeSearchNet Challenge and releasing a large dataset for natural language processing and machine learning.
Searching for code to reuse, call into, or to see how others handle a problem is one of the most common tasks in a software developer’s day. However, unlike regular web search engines, search engines for code are often frustrating and never fully understand what we want. We started using modern machine learning techniques to improve code search but quickly realized that we were unable to measure our progress. Unlike natural language processing, where the GLUE benchmark provides a standard evaluation, there is no standard dataset suitable for evaluating code search.
With our partners from Weights & Biases, today we’re announcing the CodeSearchNet Challenge evaluation environment and leaderboard. We’re also releasing a large dataset to help data scientists build models for this task, as well as several baseline models showing the current state of the art. Our leaderboard uses an annotated dataset of queries to evaluate the quality of code search tools.
Learn more from our technical report
The CodeSearchNet Corpus and models
We collected a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. We used our Tree-sitter infrastructure for this effort, and we’re also releasing our data preprocessing pipeline for others to use as a starting point in applying machine learning to code. While this data is not directly related to code search, its pairing of code with related natural language descriptions makes it suitable for training models for this task. Its substantial size also makes it possible to apply high-capacity models based on modern Transformer architectures.
Our fully preprocessed CodeSearchNet Corpus is available for download on Amazon S3 (a short loading sketch follows the list below), including:
- Six million methods overall
- Two million of which have associated documentation (docstrings, JavaDoc, and more)
- Metadata that indicates the original location (repository and line number, for example) where the data was found
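The corpus ships as gzipped JSON Lines files, one function record per line. Here’s a minimal loading sketch; the filename and field names (`code`, `docstring`, and so on) are illustrative, so adjust them to match the files you actually download:

```python
import gzip
import json

def load_corpus_file(path):
    """Yield one function record per line from a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example: count how many functions in one shard carry documentation.
# The filename below is a placeholder for a downloaded corpus file.
total = documented = 0
for record in load_corpus_file("python_train_0.jsonl.gz"):
    total += 1
    if record.get("docstring"):
        documented += 1
print(f"{documented}/{total} functions have documentation")
```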
Building on our earlier efforts in semantic code search, we’re also releasing a collection of baseline models leveraging modern techniques in learning from sequences (including a BERT-like self-attentional model) to help data scientists get started on code search.
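The common idea behind these baselines is a joint embedding: one encoder maps natural language queries and another maps code into the same vector space, trained so that matching pairs land close together. Below is a minimal sketch of that setup, written here in PyTorch for brevity (the released baselines may use a different framework, and the class and function names are ours, not the repository’s):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BagOfWordsEncoder(nn.Module):
    """Mean-pooled token embeddings: the simplest encoder family.
    The released baselines also cover RNN, CNN, and self-attention encoders."""
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)

    def forward(self, token_ids):                     # (batch, seq_len)
        emb = self.embed(token_ids)                   # (batch, seq_len, dim)
        mask = (token_ids != 0).unsqueeze(-1).float() # ignore padding
        return (emb * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

def in_batch_softmax_loss(query_vecs, code_vecs):
    """Each query's own code snippet is the positive; every other
    snippet in the batch serves as a negative example."""
    logits = query_vecs @ code_vecs.T                 # (batch, batch)
    targets = torch.arange(logits.size(0))
    return F.cross_entropy(logits, targets)

# Toy usage: random token ids standing in for tokenized docstrings and code.
query_enc, code_enc = BagOfWordsEncoder(10_000), BagOfWordsEncoder(10_000)
queries = torch.randint(1, 10_000, (8, 20))   # batch of 8 queries
code = torch.randint(1, 10_000, (8, 120))     # the 8 matching snippets
loss = in_batch_softmax_loss(query_enc(queries), code_enc(code))
loss.backward()
```

At search time, code vectors can be precomputed for the whole corpus, so answering a query reduces to one encoder pass plus a nearest-neighbor lookup.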
The CodeSearchNet Challenge
To evaluate code search models, we collected an initial set of code search queries and had programmers annotate the relevance of potential results. We started by collecting common search queries from Bing that had high click-through rates to code and combined these with queries from StaQC, yielding 99 queries for concepts related to code (i.e., we removed everything that was just an API documentation lookup).
We then used a standard Elasticsearch installation and our baseline models to obtain 10 likely results per query from our CodeSearchNet Corpus. Finally, we asked programmers, data scientists, and machine learning researchers to annotate the proposed results for relevance to the query on a scale from zero (“totally irrelevant”) to three (“exact match”). See our technical report for an in-depth explanation of the annotation process and data.
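Graded 0–3 annotations like these lend themselves to ranking metrics such as normalized discounted cumulative gain (NDCG). The sketch below shows one common formulation; see the technical report for the exact scoring the leaderboard uses:

```python
import math

def ndcg(relevances, k=10):
    """NDCG over the top-k results. `relevances` lists the 0-3 annotation
    for each returned result in ranked order; a perfect ranking scores 1.0."""
    def dcg(scores):
        return sum((2 ** s - 1) / math.log2(i + 2)
                   for i, s in enumerate(scores[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Example: annotations for the ten results returned for one query.
print(ndcg([3, 2, 0, 1, 0, 0, 2, 0, 0, 0]))  # ~0.95
```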
We want to expand our evaluation dataset to include more languages, queries, and annotations. As we add more over the next few months, we aim to include an extended dataset in the next version of the CodeSearchNet Challenge.
Other use cases
We anticipate other use cases for this dataset beyond code search and are presenting code search as one possible task that leverages learned representations of natural language and code. We’re excited to see what the community builds next.
Special thanks
The CodeSearchNet Challenge would not be possible without the Microsoft Research team and core contributors from GitHub, including Marc Brockschmidt, Miltos Allamanis, Ho-Hsiang Wu, Hamel Husain, and Tiferet Gazit.
We’re also thankful for all of the contributors from the community who helped put this project together:
@nbardy, @raubitsj, @staceysv, @cvphelps, @tejaskannan, @s-zanella, @AntonioND, @goutham7r, @campoy, @cal58, @febuiles, @letmaik, @sebastiandziadzio, @panthap2, @CoderPat.
Learn more about the CodeSearchNet Challenge