Improvements to CodeQL’s data flow library for C++
These changes will improve the experience for custom query authors and enable better precision in some of our standard queries. Learn how to enable them for your custom queries.
We’ve recently made some changes to CodeQL’s data flow and taint tracking libraries for C++, which will improve the experience for custom query authors and enable better precision in some of our standard queries. While these changes are included in the standard queries already, you can also enable them for custom queries. We’ll show you how in this blog post.
CodeQL
CodeQL is the static analysis engine behind code scanning. CodeQL works by constructing a database of your code, and then running queries against that database. These queries depend on a variety of shared libraries that perform specific analyses, such as taint tracking and range analysis.
Dataflow
CodeQL’s dataflow library performs an analysis of what expressions may have their value copied to which other expressions, across the entire program. The taint tracking library generalizes this to what expressions may influence the value of which other expressions. Because this is potentially a very large relation, we use a query-specific configuration to restrict the analysis to a set of interesting sources and sinks for a given query before performing any interprocedural analysis. This allows the dataflow or taint tracking configuration to also include query-specific sanitizers and guards that prevent dangerous data from flowing.
Def-use vs use-use
Historically, the C++ dataflow library has followed a “def-use” pattern, where reads from a variable are modeled as steps going from a “definition” (an assignment to the variable) to a “use” (a read from that variable), and two subsequent uses of the same definition aren’t directly connected. Our analyses for other languages follow a “use-use” pattern, where reads from a variable are modeled as steps from the previous read (or the definition, if there is no previous read).
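To make the difference concrete, here is a minimal C++ sketch that models the accesses to a single variable in straight-line code and builds the step relation under each pattern. The names Access, Edge, defUseEdges, and useUseEdges are made up for this illustration and are not part of CodeQL. The real libraries work over a control-flow graph rather than a flat list, so treat this as a toy model only.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model: accesses to one variable, listed in program order.
enum class Access { Def, Use };

// A step from the access at index `first` to the access at index `second`.
using Edge = std::pair<std::size_t, std::size_t>;

// Def-use: every use gets a step directly from the most recent definition.
std::vector<Edge> defUseEdges(const std::vector<Access>& accesses) {
    std::vector<Edge> edges;
    bool haveDef = false;
    std::size_t lastDef = 0;
    for (std::size_t i = 0; i < accesses.size(); ++i) {
        if (accesses[i] == Access::Def) {
            lastDef = i;
            haveDef = true;
        } else if (haveDef) {
            edges.emplace_back(lastDef, i);
        }
    }
    return edges;
}

// Use-use: every use gets a step from the previous use, or from the
// definition if there has been no use since that definition.
std::vector<Edge> useUseEdges(const std::vector<Access>& accesses) {
    std::vector<Edge> edges;
    bool havePrev = false;
    std::size_t prev = 0;
    for (std::size_t i = 0; i < accesses.size(); ++i) {
        if (accesses[i] == Access::Def) {
            prev = i;
            havePrev = true;
        } else if (havePrev) {
            edges.emplace_back(prev, i);
            prev = i;  // this use becomes the predecessor of the next use
        }
    }
    return edges;
}
```

For one definition followed by three uses, defUseEdges yields steps from index 0 to each of 1, 2, and 3, while useUseEdges yields a chain from 0 to 1, 1 to 2, and 2 to 3. In this toy model, anything that blocks flow at the first use also cuts off the later uses, which is not the case for the def-use edges.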
The major advantage of the use-use pattern for query authors is that it’s much simpler to implement a query that takes conditional sanitizers into account. Consider the following code:
char *str = source();
if (isSafe(str)) {
    sink(str);
}
Previously, in order to exclude this result, a query would need to use a separate control-flow analysis to show that the check occurs on every path to the dangerous use of the value. With use-use flow, that analysis is baked into the dataflow graph, simplifying queries for the end user.
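As a rough illustration of why the position of the check matters, the sketch below turns the snippet above into compilable C++ and adds a second function in which one use of the value is not covered by the check. The bodies of source, isSafe, and sink, the sunk vector, and the two function names are hypothetical stand-ins added only so the example runs. They are not CodeQL APIs, and the comments describe what a query author would typically want, not the output of any particular query.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-ins so the snippet compiles and runs.
static std::vector<std::string> sunk;  // records every value that reached sink()
static char inputBuffer[] = "user input";

char *source() { return inputBuffer; }
bool isSafe(const char *s) { return std::strchr(s, ';') == nullptr; }
void sink(const char *s) { sunk.push_back(s); }

// Every path to the sink passes through the isSafe(str) check.
void guardedOnly() {
    char *str = source();
    if (isSafe(str)) {
        sink(str);  // guarded use: a sanitizer-aware query should not flag this
    }
}

// The second sink call is reachable without the check having succeeded.
void guardedAndUnguarded() {
    char *str = source();
    if (isSafe(str)) {
        sink(str);  // guarded use
    }
    sink(str);      // unguarded use: the check does not cover this call
}
```

Calling guardedOnly() records one value in sunk, and calling guardedAndUnguarded() records two more, since the stand-in input passes the stand-in check.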
Pointers and indirections
We now separate the value of a pointer from the value it points to in our analysis, and can include multiple levels of pointers between the value being passed at a function boundary and the tainted data. This enables more precise tracking of certain flows, especially for string values.
#include <stdio.h>
int main(int argc, char **argv) {
    if (argc >= 2) {
        fopen(argv[1], "r");
    }
}
In the above example, the user-controlled value is not argv itself, but the value that it points to after being dereferenced twice. Previously, in queries that consider command-line input dangerous, we’d mark argv as a tainted value, and then any access to it would also be considered tainted. This would mean that dereferencing argv was considered dangerous in some queries, even though the value of the pointer is safe. In the new system, we’re able to express that the tainted data is only accessed by dereferencing argv twice, so those queries no longer treat the first dereference as being unsafe.
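The levels of indirection can be spelled out in plain C++. The sketch below names the type at each level of argv and shows a small function that only touches character data at the second dereference. The aliases Level0, Level1, and Level2 and the helper firstCharOfFirstArg are hypothetical names chosen for this illustration.

```cpp
#include <type_traits>

// The three "levels" of argv, expressed as C++ types.
using Level0 = char **;                                   // argv itself
using Level1 = std::remove_pointer<Level0>::type;         // *argv or argv[i]: pointer to a string
using Level2 = std::remove_pointer<Level1>::type;         // **argv or argv[i][j]: a character

static_assert(std::is_same<Level1, char *>::value, "one dereference yields char *");
static_assert(std::is_same<Level2, char>::value, "two dereferences yield char");

// Hypothetical helper: returns the first character of the first
// command-line argument, or '\0' if there is no such argument.
char firstCharOfFirstArg(int argc, char **argv) {
    if (argc < 2) {
        return '\0';
    }
    char *arg = argv[1];  // one dereference: still only a pointer value
    return arg[0];        // two dereferences: the character data itself
}
```

Only the final read, arg[0], observes the characters supplied on the command line. The intermediate value arg is a pointer, which matches the post's point that the first dereference is not where the user-controlled data lives.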
How to adopt it for custom queries
While the new library is now used in our standard queries, it is opt-in for custom queries because some of the changes to indirection handling may need changes to the query. Generally, the new library will behave much like the old one. However, there will be a few differences because the new library will handle the distinction between a pointer and its indirection more precisely. For instance, in the previous example, argv is a pointer to a pointer to a series of characters, and the first argument to fopen is a pointer to a series of characters. To find the flow from argv to that argument, we’d previously write a query like this:
import cpp
import semmle.code.cpp.dataflow.TaintTracking

class ArgvTaintedFopenConfig extends TaintTracking::Configuration {
  ArgvTaintedFopenConfig() { this = "ArgvTaintedFopenConfig" }

  override predicate isSource(DataFlow::Node node) {
    exists(Parameter argv |
      node.asParameter() = argv and
      argv.hasName("argv") and
      argv.getFunction().hasGlobalName("main")
    )
  }

  override predicate isSink(DataFlow::Node node) {
    exists(FunctionCall fopenCall |
      node.asExpr() = fopenCall.getArgument(0) and
      fopenCall.getTarget().hasGlobalOrStdName("fopen")
    )
  }
}
In the new system, we can instead use node.asParameter(2) to specify that rather than the value of argv itself, what we’re interested in is the series of characters it points to after being dereferenced twice. Similarly, we can use node.asIndirectArgument(1) to specify that the potentially dangerous data going into fopen isn’t the pointer, but instead the value it points to after one dereference.
import cpp
import semmle.code.cpp.dataflow.new.TaintTracking

class ArgvTaintedFopenConfig extends TaintTracking::Configuration {
  ArgvTaintedFopenConfig() { this = "ArgvTaintedFopenConfig" }

  override predicate isSource(DataFlow::Node node) {
    exists(Parameter argv |
      node.asParameter(2) = argv and
      argv.hasName("argv") and
      argv.getFunction().hasGlobalName("main")
    )
  }

  override predicate isSink(DataFlow::Node node) {
    exists(FunctionCall fopenCall |
      node.asIndirectArgument(1) = fopenCall.getArgument(0) and
      fopenCall.getTarget().hasGlobalOrStdName("fopen")
    )
  }
}
Alternatively, there is a more general predicate node.asIndirectExpr(int) that can be used to describe the value of an indirection of any expression of pointer type in the program.
To opt in to the new library, replace import semmle.code.cpp.dataflow.DataFlow with import semmle.code.cpp.dataflow.new.DataFlow (and likewise for the TaintTracking import, as in the example above). Then, consider whether your isSource() and isSink() definitions using node.asExpr() or node.asParameter() are specifying the value you want to track, or just a pointer or reference to that value. If the latter, replace them with the new node.asIndirectArgument(int), node.asIndirectExpr(int), node.asDefiningArgument(int), or node.asParameter(int). Note that the existing node.asParameter() and node.asExpr() still exist, but they now refer to the value of the node itself, not to the dereferences of the node.
Learn more about GitHub security solutions
GitHub is committed to helping build safer and more secure software without compromising on the developer experience. To learn more or enable GitHub’s security features like code scanning, check out our documentation.