Improvements to CodeQL’s data flow library for C++

These changes will improve the experience for custom query authors and enable better precision in some of our standard queries. Learn how to enable them for your custom queries.

Robert Marsh·@rdmarsh2

March 30, 2023 | Updated April 20, 2023 | 5 minutes

We’ve recently made some changes to CodeQL’s data flow and taint tracking libraries for C++, which will improve the experience for custom query authors and enable better precision in some of our standard queries. While these changes are included in the standard queries already, you can also enable them for custom queries. We’ll show you how in this blog post.

CodeQL

CodeQL is the static analysis engine behind code scanning. CodeQL works by constructing a database of your code, and then running queries against that database. These queries depend on a variety of shared libraries that perform specific analyses, such as taint tracking and range analysis.

Dataflow

CodeQL’s dataflow library analyzes which expressions may have their value copied to which other expressions, across the entire program. The taint tracking library generalizes this to which expressions may influence the value of which other expressions. Because this relation is potentially very large, we use a query-specific configuration to restrict the analysis to a set of interesting sources and sinks for a given query before performing any interprocedural analysis. The same configuration can also include query-specific sanitizers and guards that prevent dangerous data from flowing any further.

Def-use vs use-use

Historically, the C++ dataflow library has followed a “def-use” pattern, where reads from a variable are modeled as steps going from a “definition” (an assignment to the variable) to a “use” (a read from that variable), and two subsequent uses of the same definition aren’t directly connected. Our analyses for other languages follow a “use-use” pattern, where reads from a variable are modeled as steps from the previous read (or the definition, if there is no previous read).
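
As a rough illustration of the two patterns, the snippet below has one definition of a variable followed by three reads, with comments sketching the steps each pattern would produce. The functions source and use are hypothetical stand-ins, and the edge lists in the comments are a simplified reading of the description above, not output from CodeQL.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helpers: `use` only records the order in which `x` is read.
static std::vector<std::string> reads;

int source() { return 42; }

void use(const char *label, int value) {
    (void)value;
    reads.push_back(label);
}

void example() {
    int x = source();   // definition D
    use("U1", x);       // use U1
    use("U2", x);       // use U2
    use("U3", x);       // use U3

    // Def-use pattern: every read is fed directly by the definition.
    //   D -> U1,  D -> U2,  D -> U3
    //
    // Use-use pattern: each read is fed by the previous read
    // (or by the definition, if there is no previous read).
    //   D -> U1,  U1 -> U2,  U2 -> U3
}
```

Under the use-use pattern, anything that stops flow at U1 also stops it from reaching U2 and U3, because the later reads are only reachable through the earlier one.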

The major advantage of the use-use pattern for query authors is that it’s much simpler to implement a query that takes conditional sanitizers into account. Consider the following code:

char *str = source();
if (isSafe(str)) {
    sink(str);
}

Here, the call to sink is only reached when isSafe(str) holds, so a query should not report it. Previously, in order to exclude this result, a query would need to use a separate control-flow analysis to show that the check occurs on every path to the dangerous use of the value. With use-use flow, that analysis is baked into the dataflow graph, simplifying queries for the end user.
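
To make the contrast concrete, here is a small self-contained sketch with one guarded and one unguarded call to sink. The bodies of source, isSafe, and sink are invented for illustration (the original example leaves them undefined), and the comments describe the intended query behavior rather than actual CodeQL results.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical implementations, only so the example can run.
static std::vector<std::string> sunk;

const char *source() { return "a;b"; }  // pretend this is user input
bool isSafe(const char *s) { return std::strchr(s, ';') == nullptr; }
void sink(const char *s) { sunk.push_back(s); }

void guarded() {
    const char *str = source();  // definition
    if (isSafe(str)) {           // first read: the check
        sink(str);               // second read: under use-use flow it is
                                 // reached from the read in the check
    }
}

void unguarded() {
    const char *str = source();  // definition
    sink(str);                   // no check on the path: should be reported
}
```

A query that treats isSafe as a sanitizer would be expected to flag the call in unguarded but not the one in guarded.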

Pointers and indirections

We now separate the value of a pointer from the value it points to in our analysis, and can include multiple levels of pointers between the value being passed at a function boundary and the tainted data. This enables more precise tracking of certain flows, especially for string values.

#include <stdio.h>

int main(int argc, char **argv) {
    if (argc >= 2) {
        fopen(argv[1], "r");
    }
}

In the above example, the user-controlled value is not argv itself, but the value that it points to after being dereferenced twice. Previously, in queries that consider command-line input dangerous, we’d mark argv as a tainted value, and then any access to it would also be considered tainted. This would mean that dereferencing argv was considered dangerous in some queries, even though the value of the pointer is safe. In the new system, we’re able to express that the tainted data is only accessed by dereferencing argv twice, so those queries no longer treat the first dereference as being unsafe.
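
The levels of indirection can be spelled out in plain C++. The sketch below is only an illustration of the terminology: the helper first_char_of_arg1 and the names level0, level1, and level2 are made up for this example and are not part of CodeQL.

```cpp
#include <cassert>

// Returns the first character of the first real argument, or '\0' if absent.
char first_char_of_arg1(int argc, char **argv) {
    if (argc < 2) {
        return '\0';
    }
    char **level0 = argv;       // the pointer itself: not user-controlled
    char *level1 = level0[1];   // one dereference: pointer to the argument's characters
    char level2 = level1[0];    // two dereferences: the user-controlled character data
    return level2;
}
```

Only the data at level2 comes from the user, which is why the new library distinguishes it from the value of argv itself.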

How to adopt it for custom queries

While the new library is now used in our standard queries, it is opt-in for custom queries because some of the changes to indirection handling may require changes to the query. Generally, the new library behaves much like the old one. The differences come from the new library handling the distinction between a pointer and its indirection more precisely. For instance, in the previous example, argv is a pointer to a pointer to a series of characters, and the first argument to fopen is a pointer to a series of characters. To find the flow from argv to that argument, we’d previously write a query like this:

import cpp
import semmle.code.cpp.dataflow.TaintTracking

class ArgvTaintedFopenConfig extends TaintTracking::Configuration {
  ArgvTaintedFopenConfig() { this = "ArgvTaintedFopenConfig" }

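  // Source: the argv parameter of main, treated as tainted as a whole.
  override predicate isSource(DataFlow::Node node) {
    exists(Parameter argv |
      node.asParameter() = argv and
      argv.hasName("argv") and
      argv.getFunction().hasGlobalName("main")
    )
  }

  // Sink: the first argument (the path) of a call to fopen.
  override predicate isSink(DataFlow::Node node) {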
    exists(FunctionCall fopenCall |
      node.asExpr() = fopenCall.getArgument(0) and
      fopenCall.getTarget().hasGlobalOrStdName("fopen")
    )
  }
}

In the new system, we can instead use node.asParameter(2) to specify that rather than the value of argv itself, what we’re interested in is the series of characters it points to after being dereferenced twice. Similarly, we can use node.asIndirectArgument(1) to specify that the potentially dangerous data going into fopen isn’t the pointer, but instead the value it points to after one dereference.

import cpp
import semmle.code.cpp.dataflow.new.TaintTracking

class ArgvTaintedFopenConfig extends TaintTracking::Configuration {
  ArgvTaintedFopenConfig() { this = "ArgvTaintedFopenConfig" }

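  // Source: the character data reached by dereferencing main's argv twice.
  override predicate isSource(DataFlow::Node node) {
    exists(Parameter argv |
      node.asParameter(2) = argv and
      argv.hasName("argv") and
      argv.getFunction().hasGlobalName("main")
    )
  }

  // Sink: the data that fopen's first argument points to (one dereference).
  override predicate isSink(DataFlow::Node node) {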
    exists(FunctionCall fopenCall |
      node.asIndirectArgument(1) = fopenCall.getArgument(0) and
      fopenCall.getTarget().hasGlobalOrStdName("fopen")
    )
  }
}

Alternatively, there is a more general predicate node.asIndirectExpr(int) that can be used to describe the value of an indirection of any expression of pointer type in the program.

To opt in to the new library, replace import semmle.code.cpp.dataflow.DataFlow with import semmle.code.cpp.dataflow.new.DataFlow, and likewise replace import semmle.code.cpp.dataflow.TaintTracking with import semmle.code.cpp.dataflow.new.TaintTracking. Then, consider whether your isSource() and isSink() definitions using node.asExpr() or node.asParameter() are specifying the value you want to track, or just a pointer or reference to that value. If the latter, replace them with the new node.asIndirectArgument(int), node.asIndirectExpr(int), node.asDefiningArgument(int), or node.asParameter(int). Note that the existing node.asParameter() and node.asExpr() still exist, but they now refer to the value of the node itself, not to the dereferences of the node.

Learn more about GitHub security solutions

GitHub is committed to helping build safer and more secure software without compromising on the developer experience. To learn more or enable GitHub’s security features like code scanning, check out our documentation.
