- Analyzed language: C/C++
If you are attending this workshop at GitHub Satellite, or watching a recording, the facilitators will guide you through the steps below. You can use this document as a written reference.
To take part in the workshop you will need to set up a CodeQL development environment. See the Prerequisites section in the README for full instructions.
When you have completed setup, you should have:
- Installed the Visual Studio Code IDE.
- Installed the CodeQL extension for Visual Studio Code.
- Cloned this repository with `git clone --recursive`.
- Opened this repository in VS Code.
- Downloaded, imported, and selected the `example_db` CodeQL database from within VS Code.
- A `workshop-queries` folder within your workspace, containing an example query.
- A `codeql` folder within your workspace, containing the CodeQL standard libraries for most target languages.
- A copy of this `workshop.md` guide in your workspace.

Open the query `workshop-queries/example.ql` and try running it!
Use-after-free vulnerabilities occur when a program retains a pointer to memory after it has been freed, and then attempts to reference the freed memory. Once memory has been freed, the system may allocate it for another purpose. Referencing the freed memory can therefore result in a variety of unsafe behaviour: crashing the program, retrieving an unexpected value, corrupting data that now occupies that memory, or executing unsafe code.
The following C code shows a simple example of using memory after it has been freed.
```c
free(s->x);
...
use(s->x);
```
The code frees the field `x` of a struct `s`, but does not immediately reset the field's value to zero. As a result, the struct now contains a 'dangling' pointer, which creates the potential for a use-after-free vulnerability. This becomes a real vulnerability when the code references `s->x` again, passing it to `use`.
A safer coding practice is to always immediately zero the field after freeing it, like this:
```c
free(s->x);
s->x = 0;
```
Then, until `s->x` is reassigned, any attempt to reference it will simply obtain the null pointer rather than a dangling address.
This is a well-known class of vulnerability, documented as CWE-416. A relatively recent example in the `curl` tool was assigned CVE-2018-16840, and inspired the material here.
In security terminology, a reference to freed memory is considered a source of tainted data, and a pointer that is dereferenced (used) is considered a sink for a use-after-free vulnerability.
If the tainted reference is reassigned (e.g. to zero) before it reaches a use, it is considered safe.
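As a minimal sketch of the safe case, the C function below frees a field and reassigns it before the next use. The `struct holder` type and the `holder_replace` helper are hypothetical names chosen for illustration and are not part of the workshop code. The comments mark where the source, the reassignment, and the sink fall.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical struct, mirroring the `s->x` field used in the text. */
struct holder {
    char *x;
};

/* Replaces the buffer held in s->x and returns the new length.
 * The freed pointer is overwritten before anything reads it again. */
static size_t holder_replace(struct holder *s, const char *text) {
    assert(s != NULL && text != NULL);
    free(s->x);                        /* source: s->x now refers to freed memory */
    s->x = malloc(strlen(text) + 1);   /* reassignment: the dangling value is gone */
    if (s->x == NULL) {
        return 0;                      /* allocation failed; the field is null, not dangling */
    }
    strcpy(s->x, text);
    return strlen(s->x);               /* sink: a dereference, but of the new buffer */
}
```

Because the reassignment sits between the `free` and the dereference, the freed value never reaches the sink, which is the situation the text describes as safe.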
In this workshop, we will use CodeQL to analyze a sample of C++ source code that demonstrates simple variants of use-after-free vulnerabilities, and write a CodeQL query to identify the vulnerable pattern with reasonable precision.
- Use the IDE's autocomplete suggestions (`Ctrl+Space`) and jump-to-definition command (`F12`) to explore the CodeQL libraries.
- To run a query, open the Command Palette (`Cmd+Shift+P` or `Ctrl+Shift+P`), and click CodeQL: Run Query. You can also see this command when right-clicking on a query file in the editor.
- Try this out by running the example query `example.ql` in the workshop repository!
- When the query completes, click on the results to jump to the corresponding location in the source code.
- To run a part of a query, such as a single predicate, open the Command Palette and click CodeQL: Quick Evaluation. You can also see this command when right-clicking on selected query text in the editor.
- To understand how the source code is represented in the CodeQL libraries, use the AST Viewer. You can see this in the left panel of the CodeQL view. Click on a query result to get to a source file, and then click View AST, or run CodeQL: View AST from the Command Palette.
The rest of the workshop is split into several steps. You can write one query per step, or work with a single query that you refine at each step.
Each step has a Hint that describes useful classes and predicates in the CodeQL standard libraries for C/C++ and keywords in CodeQL.
Each step has a Solution that indicates one possible answer. Note that all queries will need to begin with `import cpp` to use the standard libraries, but for simplicity this may be omitted below.
- Find all calls to the `free` function in the code
- Find all variables which are freed in the course of the program
- Find all references to variables after they are freed
- Find all the variables which are used at any point
- Hint: a variable is dereferenced at each point where it is used
- Wire the results of both of our queries above to find if there's a path between our Source and Sink
- Find all function call expressions, such as `free(x)` and `use(y, z)`.

  Hint

  After you have run the example query and clicked on a result, look at the AST Viewer for the `example.cpp` source file. A function call is called a `FunctionCall` in the CodeQL C/C++ library.

  Solution

  ```ql
  from FunctionCall call
  select call
  ```
- Identify the expression that is used as the first argument for each call, such as `free(<first arg>)` and `use(<first arg>, z)`.

  Hint

  - Add another variable to your `from` clause. Declare its type (this can be `Expr`) and give it a name.
  - Add a `where` clause.
  - The AST viewer and autocomplete tell us that `FunctionCall` has a predicate `getArgument(int)` to find the argument at a 0-based index.

  Solution

  ```ql
  from FunctionCall call, Expr arg
  where arg = call.getArgument(0)
  select arg
  ```
- Filter your results to only those calls to a function named `free`.

  Hint

  - `FunctionCall` has a predicate `getTarget()` to find the `Function` being called.
  - A `Function` (and most other named elements) has predicates `getName()` and `hasName(string)` to identify its name as a string.
  - You may also be interested in the predicate `hasGlobalOrStdName(string)`, which identifies named elements from the global or `std` namespaces.
  - Use the `and` keyword to add conditions to your query.
  - If you use `getName()`, use the `=` operator to assert that two values are equal. If you use `has*Name(string)`, passing the name into the predicate makes the assertion.

  Solution

  ```ql
  from FunctionCall call, Expr arg
  where
    arg = call.getArgument(0) and
    call.getTarget().hasGlobalOrStdName("free")
  select arg
  ```
- (Bonus) What other operations might free memory? Try looking for `delete` expressions using CodeQL. The example for this workshop only uses `free`, but another codebase may use variations of this function name, or use different delete operators.

- Factor out your logic into a predicate: `predicate isSource(Expr arg) { ... }`.

  Hint

  - The `predicate` keyword declares a relation that has no explicit result / return value, but asserts a logical property about its variables.
  - The `from` clause of a query allowed you to declare variables, and the `where` clause described conditions on those variables.

    Within a predicate definition, variables are either declared as the parameters of the predicate, or 'locally' using the `exists` keyword. The first part of the `exists` declares some variables, and the body acts like a `where`, enforcing some conditions on the variables.

    ```ql
    exists(<type> <variableName> |
      // some logic about the variable here
    )
    ```
  - Use Quick Evaluation to evaluate the predicate on its own.

  Solution

  ```ql
  predicate isSource(Expr arg) {
    exists(FunctionCall call |
      arg = call.getArgument(0) and
      call.getTarget().hasGlobalOrStdName("free")
    )
  }
  ```
- We are going to track the flow of information from the pointer that was freed. For this, we will use the CodeQL library for data flow analysis, which helps us answer questions like: does this expression ever hold a value that originates from a particular other place in the program?

  We can visualize the data flow analysis problem as one of finding paths through a directed graph, where the nodes of the graph are places in the source code that may have a value, and the edges represent the flow of data between those places. If a path exists between two nodes, then data can flow from the first to the second.

  The class `DataFlow::Node` describes all data flow nodes. These are different from the abstract syntax tree (AST) nodes, which only represent the structure of the source code. `DataFlow::Node` has various subclasses that describe different types of node, depending on the type of program syntax element they correspond to.

  You can find out more in the documentation.

  Modify your predicate to describe `arg` as a `DataFlow::Node`, not an `Expr`.

  Instructions

  - Add `import semmle.code.cpp.dataflow.DataFlow` to your query file.
  - Change your predicate so that the parameter has type `DataFlow::Node`.
  - This will give you a compile error, since the types no longer match. Convert the data flow node back into an `Expr` using the predicate `asExpr()`.

  Solution

  ```ql
  import semmle.code.cpp.dataflow.DataFlow

  predicate isSource(DataFlow::Node arg) {
    exists(FunctionCall call |
      arg.asExpr() = call.getArgument(0) and
      call.getTarget().hasGlobalOrStdName("free")
    )
  }
  ```
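The graph view of data flow described in this step can be sketched as plain reachability over a small adjacency matrix. This is a toy model in C, not how the CodeQL library is implemented: the node names, the `edges` table, and the `has_flow` helper are all hypothetical and exist only to show that "does data flow from A to B?" becomes "is there a path from A to B?".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical nodes for a tiny program; the names are illustrative only. */
enum node {
    FREED_PTR,        /* s->x after free(s->x) */
    COPY_P,           /* p = s->x */
    USE_P,            /* use(p) */
    ZERO_LITERAL,     /* the 0 in s->x = 0 */
    USE_AFTER_RESET,  /* use(s->x) after the reset */
    NODE_COUNT
};

/* edges[a][b] is true when a value can flow directly from node a to node b. */
static const bool edges[NODE_COUNT][NODE_COUNT] = {
    /* FREED_PTR       */ { false, true,  false, false, false },
    /* COPY_P          */ { false, false, true,  false, false },
    /* USE_P           */ { false, false, false, false, false },
    /* ZERO_LITERAL    */ { false, false, false, false, true  },
    /* USE_AFTER_RESET */ { false, false, false, false, false }
};

/* Depth-first search that records visited nodes so cycles cannot loop forever. */
static bool reach(int from, int to, bool visited[NODE_COUNT]) {
    if (from == to) {
        return true;
    }
    visited[from] = true;
    for (int next = 0; next < NODE_COUNT; next++) {
        if (edges[from][next] && !visited[next] && reach(next, to, visited)) {
            return true;
        }
    }
    return false;
}

/* Returns true when some path of edges leads from source to sink. */
static bool has_flow(int source, int sink) {
    bool visited[NODE_COUNT] = { false };
    assert(source >= 0 && source < NODE_COUNT);
    assert(sink >= 0 && sink < NODE_COUNT);
    return reach(source, sink, visited);
}
```

In this model `has_flow(FREED_PTR, USE_P)` holds because the freed pointer is copied into `p` before the use, while `has_flow(FREED_PTR, USE_AFTER_RESET)` does not, because the value read after the reset comes from the zero literal instead.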
- Let's think about the meaning of the `free` function and the value of its argument.

  Before the function runs, the function argument is a pointer to memory, so the memory is effectively passed to the function by reference.

  After the function body, the memory that was referenced by the pointer has been freed.

  So the one expression for the function call argument in the program syntax actually has two possible values to think about in the data flow graph:

  - the pointer before it was freed
  - the dangling pointer after it was freed.

  See the Hint for how to distinguish between these two cases. Modify your predicate so that `arg` describes the memory after it has been freed, not before.

  Hint

  - The value before the call is a `DataFlow::ExprNode`, a subtype of `DataFlow::Node`.
  - We can call `asExpr()` on such a node to get the original syntactic expression.
  - The value after the call is a `DataFlow::DefinitionByReferenceNode`.
  - We can call `asDefiningArgument()` on such a node to get the original syntactic expression.
  - Jump to the definition of `DataFlow::Node` to read more.
  - Modify your predicate to describe `arg` using `asDefiningArgument()`.

  Solution

  ```ql
  predicate isSource(DataFlow::Node arg) {
    exists(FunctionCall call |
      arg.asDefiningArgument() = call.getArgument(0) and
      call.getTarget().hasGlobalOrStdName("free")
    )
  }
  ```
A dereference is a place in the program that uses the memory referenced by a pointer.
- Write a `predicate isSink(DataFlow::Node sink)` that describes expressions that may be dereferenced.

  Hint

  - Think of some examples of operations that might dereference a pointer. The `*` operator? Passing it to a function? Performing pointer arithmetic? Use autocomplete or the AST viewer to explore how these are modelled in CodeQL.
  - Search for `dereference` in autocomplete to find a predicate from the standard library that models all these patterns for you.

  Solution

  ```ql
  predicate isSink(DataFlow::Node sink) {
    dereferenced(sink.asExpr())
  }
  ```
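To make the kinds of dereference mentioned in the hint concrete, here is a small C sketch of four ways a pointer's target memory can be read. The `struct record` type and the function names are hypothetical and chosen only for illustration. The sketch makes no claim about exactly which of these shapes the library predicate matches, so check that with the AST viewer or autocomplete.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record type used only for this illustration. */
struct record {
    int id;
    char name[16];
};

/* Each function below reads memory through its pointer parameter. */

static int read_with_star(const int *p) {
    assert(p != NULL);
    return *p;          /* explicit dereference with `*` */
}

static int read_with_index(const int *p, size_t i) {
    assert(p != NULL);
    return p[i];        /* indexing is pointer arithmetic plus a dereference */
}

static int read_field(const struct record *r) {
    assert(r != NULL);
    return r->id;       /* `->` dereferences the struct pointer */
}

static size_t read_in_callee(const char *s) {
    assert(s != NULL);
    return strlen(s);   /* the callee dereferences the pointer on our behalf */
}
```

If any of these pointers referred to freed memory, each of the marked lines would be a use after free.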
We have now identified (a) places in the program which reference freed memory and (b) places in the program which dereference a pointer to memory. We now want to tie these two together to ask: does a pointer to freed memory ever flow to a potentially unsafe dereference operation?
This is a data flow problem. We could approach it using local data flow analysis, whose scope would be limited to a single function. However, it is possible for the free and dereference operations to be in different functions. We call this a global data flow problem, and use CodeQL's libraries for this purpose.
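The following C sketch shows why a single-function view can miss the problem: the `free` and the dereference live in different functions and are only linked by a caller. The `struct conn` type and all three function names are hypothetical, not taken from the workshop sources. The code is written in its safe form so that it can be run, with a comment marking the line whose removal would create the vulnerability.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical connection type; not part of the workshop sources. */
struct conn {
    char *buf;
};

/* The free happens in this function ... */
static void conn_close(struct conn *c) {
    assert(c != NULL);
    free(c->buf);
    c->buf = NULL;  /* without this reset, conn_first_byte would read freed memory */
}

/* ... while the dereference happens in this one. */
static int conn_first_byte(const struct conn *c) {
    assert(c != NULL);
    if (c->buf == NULL) {
        return -1;  /* nothing to read: the buffer has been released */
    }
    return (unsigned char)c->buf[0];
}

/* A caller that links the two. Only an analysis that follows calls
 * sees both the free and the dereference on one path. */
static int conn_close_then_peek(struct conn *c) {
    conn_close(c);
    return conn_first_byte(c);
}
```

An analysis limited to `conn_first_byte` alone sees a dereference with no `free`, and one limited to `conn_close` sees a `free` with no later use, which is why the query needs to follow data across function boundaries.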
In this section we will create a path-problem query capable of looking for global data flow, by populating this template:
```ql
/**
 * @name Use after free
 * @kind path-problem
 * @id cpp/workshop/use-after-free
 */

import cpp
import semmle.code.cpp.dataflow.DataFlow
import DataFlow::PathGraph

class Config extends DataFlow::Configuration {
  Config() { this = "Config: name doesn't matter" }

  /* TODO move over solution from Section 1 */
  override predicate isSource(DataFlow::Node source) {
    exists(/* TODO fill me in from Section 1 */ |
      /* TODO fill me in from Section 1 */
    )
  }

  /* TODO move over solution from Section 2 */
  override predicate isSink(DataFlow::Node sink) {
    /* TODO fill me in from Section 2 */
  }
}

from Config config, DataFlow::PathNode source, DataFlow::PathNode sink
where config.hasFlowPath(source, sink)
select sink, source, sink, "Memory is $@ and $@, causing a potential vulnerability.", source, "freed here", sink, "used here"
```
- Fill in or move the `isSource` predicate you wrote for Section 1.

- Fill in or move the `isSink` predicate you wrote for Section 2.

- You can now run the completed query. Use the path explorer in the results view to check the results.

  Completed query

  ```ql
  /**
   * @name Use after free
   * @kind path-problem
   * @id cpp/workshop/use-after-free
   */

  import cpp
  import semmle.code.cpp.dataflow.DataFlow
  import DataFlow::PathGraph

  class Config extends DataFlow::Configuration {
    Config() { this = "Config: name doesn't matter" }

    override predicate isSource(DataFlow::Node source) {
      exists(FunctionCall call |
        source.asDefiningArgument() = call.getArgument(0) and
        call.getTarget().hasGlobalOrStdName("free")
      )
    }

    override predicate isSink(DataFlow::Node sink) {
      dereferenced(sink.asExpr())
    }
  }

  from Config config, DataFlow::PathNode source, DataFlow::PathNode sink
  where config.hasFlowPath(source, sink)
  select sink, source, sink, "Memory is $@ and $@, causing a potential vulnerability.", source, "freed here", sink, "used here"
  ```
-
Bonus: Does your query handle the false positives in the example code? How can we expand it to handle more real-world codebases?
- CodeQL overview
- CodeQL for C/C++
- Analyzing data flow in C/C++
- Using the CodeQL extension for VS Code
- CodeQL on GitHub Learning Lab
- CodeQL on GitHub Security Lab
This is a modified version of a Capture-the-Flag challenge devised by @kevinbackhouse, available at https://securitylab.github.com/ctf/eko2020.