CodeGen: Semantic’s improved language support system

The Semantic Code team shipped a massive improvement to the language support system that powers code navigation. Code navigation features only scratch the surface of possibilities that open up when we combine Semantic’s program analysis capabilities with GitHub’s scale.



GitHub is home to over 50 million developers worldwide. Our team’s mission is to analyze code on our platform and surface insights that empower users to be more productive. Ideally, this analysis should work for all users, regardless of which programming language they use. Until now, however, we’ve only been able to support a handful of languages due to the high cost of adding and maintaining them. Our new language support system, CodeGen, cuts that cost dramatically by automating a significant portion of the pipeline and making it more resilient. The result is that it is now easier than ever to add new programming languages for our library to parse.

Language Support is mission-critical

Before Semantic can compute diffs or perform abstract interpretation, we need to transform code into a representation that is amenable to our analysis goals. For this reason, a significant portion of our pipeline deals with processing source code from files on GitHub.com into an appropriate representation—we call this our “language support system”.

Diagram showing semantic architecture

Adding languages has been difficult

Zooming into the part of Semantic that achieves language support, we see that it involved several development phases, including two parsing steps that required writing and maintaining two separate grammars per language.

Diagram showing language support pipeline

Reading the diagram from left to right, our historic language support pipeline:

  1. Parsed source code into ASTs. A grammar was hand-written for each language using tree-sitter, an incremental GLR parsing library for programming tools.
  2. Read tree-sitter ASTs into Semantic. Connecting Semantic to tree-sitter’s C library requires providing an interface to the C source. We achieved this through our haskell-tree-sitter library, which provides Haskell bindings to tree-sitter.
  3. Parsed ASTs into a generalized representation of syntax. For these ASTs to be consumable by our Haskell project, we had to translate the tree-sitter parse trees into an appropriate representation. This required:
    • À la carte syntax types: generalization across programming languages. Many constructs, such as If statements, occur in several languages. Instead of having different representations of If statements for each language, could we reduce duplication by creating a generalized representation of syntax that could be shared across languages, such as a datatype modeling the semantics of conditional logic? This was the reasoning behind creating our hand-written, generalized à la carte syntaxes based on Wouter Swierstra’s Data types à la carte approach, allowing us to represent those shared semantics across languages. For example, this file captures a subset of à la carte datatypes that model expression syntaxes across languages.
    • Assignment: turning tree-sitter’s representation into a Haskell datatype representation. We had to translate tree-sitter AST nodes to be represented by the new generalized à la carte syntax. To do this, a second grammar was written in Haskell to assign the nodes of the tree-sitter ASTs onto a generalized representation of syntax modeled by the à la carte datatypes. As an example, here is the Assignment grammar written for Ruby.
  4. Performed Evaluation. Next, we captured what it meant to interpret the syntax datatypes. To do so, we wrote a polymorphic type class called Evaluatable, which defined the necessary interface for a term to be evaluated. Evaluatable instances were added for each of the à la carte syntaxes.
  5. Handled Effects. In addition to evaluation semantics, describing the control flow of a given program also necessitates modeling effects. This helps ensure we can represent things like the file system, state, non-determinism, and other effectful computations.
  6. Validated via tests. Tests for diffing, tagging, graphing, and evaluating source code written in that language were added along the process.
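The à la carte and Evaluatable machinery in steps 3–5 can be sketched roughly as follows. This is a hypothetical, heavily simplified illustration rather than Semantic’s actual API: the `If` functor, the `:+:` sum, the `Value` domain, and the `Evaluatable` class below are all stand-ins.

```haskell
{-# LANGUAGE DeriveFunctor, TypeOperators #-}

-- A generalized conditional shared across languages.
data If a = If { condition :: a, consequent :: a, alternative :: a }
  deriving (Functor, Show)

-- Open sum of two syntax functors ("data types à la carte"): a term's
-- syntax is drawn from whichever alternatives the sum composes.
data (f :+: g) a = InL (f a) | InR (g a)
  deriving (Functor, Show)

-- A toy value domain standing in for Semantic's abstract values.
data Value = VBool Bool | VInt Int
  deriving (Eq, Show)

-- Simplified stand-in for the Evaluatable class: each syntax functor
-- describes how to evaluate a node whose children are already Values.
class Evaluatable syntax where
  eval :: syntax Value -> Value

instance Evaluatable If where
  eval (If (VBool True) t _) = t
  eval (If _            _ e) = e

-- Sums evaluate by delegating to whichever alternative is present.
instance (Evaluatable f, Evaluatable g) => Evaluatable (f :+: g) where
  eval (InL f) = eval f
  eval (InR g) = eval g
```

Because the children here are fully polymorphic (`a`), nothing stops an ill-formed tree from being built; that looseness is exactly the weakness discussed below.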

Challenges posed by the system

The process described above was not only technically involved; it also had several limitations.

  1. The system was brittle. Each language’s Assignment code was tightly coupled to the language’s tree-sitter grammar. If the structure of the grammar changed, Assignment could break at runtime without any compile-time error. Preventing such errors required tracking ongoing changes in tree-sitter, which was tedious, manual, and error-prone. Each time a grammar changed, Assignment had to be updated to accommodate the new tree structures, such as nodes that changed names or shifted positions. Because improvements to the underlying grammars required changes to Assignment that were costly in time and risked introducing bugs, our system had inadvertently become incentivized against iterative improvement.
  2. There were no named child nodes. tree-sitter’s syntax nodes didn’t provide us with named child nodes. Instead, child nodes were structured as ordered lists, without any name indicating the role of each child. This didn’t match Semantic’s internal representation of syntax nodes, where each type of node has a specific set of named children, so extra Assignment work was necessary to compensate for the discrepancy. One concern, for example, was how we represented comments, which could be any arbitrary node attached to any part of the AST. Named child nodes would have let us associate comments with their parent nodes (for example, a comment appearing in an if statement could be the first child of that if statement node). The same applies to any other node that can appear anywhere within the tree, such as Ruby heredocs.
  3. Evaluation and à la carte sum types were sub-optimal. Taking a step back to examine language support also gave us an opportunity to rethink our à la carte datatypes and the evaluation machinery. À la carte syntax types were motivated by a desire to share tooling for evaluating common fragments of languages. However, the introduction of these syntax types (and the design of the Evaluatable typeclass) did not allow our analyses to be sensitive to minor linguistic differences, or to relate different pieces of syntax to one another. We could have overcome this by adding language-specific syntax datatypes to be used with Assignment, along with their accompanying Evaluatable instances, but this would defeat the purpose of a generalized representation. The underlying problem was that à la carte syntax was essentially untyped; it enforced only a minimal structure on the tree. As a result, any given subterm could be any element of the syntax, rather than some limited subset. This meant that many Evaluatable instances had to deal with error conditions that in practice can’t occur. To make this concrete, consider a before-and-after comparison of the same syntax type:
    -- former system: à la carte syntax
    
    data Op a = Op { ann :: a, left :: a, right :: a }
    -- new system: auto-generated, precisely typed syntax
    
    data Op a = Op { ann :: a, left :: Expression a, right :: Expression a }

    The shape of a syntax type in our à la carte paradigm has fully polymorphic children, whereas the children of our new “precisely typed” syntax are monomorphic, giving better guarantees about what we can expect.

  4. The time and effort required was infeasible. The two-step parsing process required writing two separate language-specific grammars by hand. This was time-consuming, engineering-intensive, error-prone, and tedious. The Assignment grammar used parser combinators in Haskell that mirrored the tree-sitter grammar specification, which amounted to a lot of duplicated work. The mechanical nature of this work long raised the question of whether we could automate parts of it. And although we’ve open-sourced Semantic, leveraging community support for adding languages has been difficult because, until recently, it was backed by such a grueling process.

Designing a new system

To address these challenges, we introduced a few changes:

  1. Add named child nodes. To address the lack of named child nodes, we modified the tree-sitter library by adding a new function, field, to the grammar API, and then updated every language grammar accordingly. When parsing, you can now retrieve a node’s children by field name. Here is an example of what a Python if_statement looks like in the old and new tree-sitter grammar APIs:
    Screenshot of a diff highlighting an example of what a Python if_statement looks like in the old and new tree-sitter grammar APIs
  2. Generate a Node Interface File. Once a grammar has this way of naming child references, the parser generation code also produces a node-types.json file that indicates which child references to expect for each node type. This JSON file provides static information about each node’s fields, derived from the grammar. Using it, applications like ours can use metaprogramming to generate specific datatypes for each kind of node. Here is an example of the JSON generated from the grammar definition of an if statement. This file provides a schema for a language’s ASTs and enabled additional improvements, such as the way we specify highlighting.
  3. Auto-generate language-specific syntax datatypes. Using the structure provided by the node-types.json file, we can auto-generate syntax types instead of writing them by hand. First, we deserialize the JSON file to capture the structure we want to represent in the desired shape of datatypes. Specifically, the nodes in our node-types.json file take on four distinct shapes: sums, products, named leaves, and anonymous leaves. We then use Template Haskell to generate syntax datatypes for each of the language constructs represented in the Node Interface File. This means our hand-written à la carte syntax types are replaced with auto-generated language-specific types, saving all of the developer time historically spent writing them. Here is an example of an auto-generated datatype representing a Python if statement (structurally a product type), derived from the JSON snippet provided above.
  4. Build ASTs generically. Once we have an exhaustive set of language-specific datatypes, we need a mechanism that can map the appropriate auto-generated datatypes onto the ASTs representing the source code being parsed. Historically, this was accomplished by manually writing an Assignment grammar. To obviate the need for a second grammar, we created an API that uses Haskell’s generic metaprogramming facilities to unmarshal tree-sitter’s parse trees automatically. We iterate over tree-sitter’s parse trees using its tree cursor API and produce Haskell ASTs, where each node is represented by the Template Haskell generated datatype described in the previous step. This allows us to parse a particular set of nodes according to their structure, and to return an AST with metadata such as range and span information. Here is an example of the AST generated when the Python source code is simply 1:
    Screenshot of CodeGen language support pipeline
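To give a rough feel for step 3, here is what such a generated product type could look like for a Python if_statement. This is a hypothetical simplification, not Semantic’s actual generated code: the stub child types and field names below are illustrative, loosely following the named fields (condition, consequence, alternative) a grammar might declare.

```haskell
-- Stub child types standing in for other generated datatypes.
data Expression a = Expression a deriving (Eq, Show)
data Block      a = Block      a deriving (Eq, Show)
data ElseClause a = ElseClause a deriving (Eq, Show)

-- Hypothetical generated product type: one record field per named
-- grammar field, each with the precise type the grammar permits.
data IfStatement a = IfStatement
  { ann         :: a                      -- annotation, e.g. range/span info
  , condition   :: Expression a           -- the "condition" field
  , consequence :: Block a                -- the "consequence" field
  , alternative :: Maybe (ElseClause a)   -- the optional "alternative" field
  }
  deriving (Eq, Show)
```

Note how optionality recorded in the JSON (a field that may be absent) surfaces directly in the type as `Maybe`, rather than as a runtime error path.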

The final result is a set of language-specific, strongly-typed, TH-generated datatypes represented as the sum of syntax possible at a given point in the grammar. Strongly-typed trees give us the ability to indicate only the subset of the syntax that can occur at a given position. For example, a function’s name would be strongly typed as an identifier; a switch statement would contain case statements; and so on. This provides better guarantees about where syntax can occur, and strong compile-time guarantees about both correctness and completeness.
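A minimal sketch of that idea, with all names hypothetical: each position in the tree only admits the datatypes that can legally occur there, so the type checker rejects ill-formed trees.

```haskell
-- Illustrative stubs for individual statement forms.
data PassStatement   a = PassStatement   a deriving Show
data ReturnStatement a = ReturnStatement a deriving Show

-- A function's name can only be an identifier, never an arbitrary node.
data Identifier a = Identifier { idAnn :: a, idText :: String } deriving Show

-- The sum of syntax permitted in statement position.
data Statement a
  = StmtPass   (PassStatement a)
  | StmtReturn (ReturnStatement a)
  deriving Show

data FunctionDefinition a = FunctionDefinition
  { funAnn  :: a
  , funName :: Identifier a    -- strongly typed: identifiers only
  , funBody :: [Statement a]   -- only statement alternatives fit here
  }
  deriving Show
```

Attempting to place, say, an `Expression` in `funBody` is now a compile-time type error rather than a runtime failure in an Evaluatable instance.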

The new system bypasses a significant part of the engineering effort historically required; it cuts code from our pipeline in addition to addressing the technical limitations described above. The diagram below provides a visual “diff” of the old and new systems.

Diagram showing language support pipeline

A big testament to our approach’s success was that we were able to remove our à la carte syntaxes completely. In addition, we shipped two new languages, Java and CodeQL, using precise ASTs generated by the new system.

Contributions welcome!

To learn more about how you can help, check out our documentation here.
