CodeQL workshop for C/C++: Finding use-after-free security vulnerabilities

Analyzed language: C/C++

If you are attending this workshop at GitHub Satellite, or watching a recording, the facilitators will guide you through the steps below. You can use this document as a written reference.

Overview

Setup instructions
Problem statement
Workshop

Setup instructions for Visual Studio Code

To take part in the workshop you will need to set up a CodeQL development environment. See the Prerequisites section in the README for full instructions.

When you have completed setup, you should have:

Installed the Visual Studio Code IDE.
Installed the CodeQL extension for Visual Studio Code.
Cloned this repository with git clone --recursive.
Opened this repository in VS Code.
Downloaded, imported, and selected the example_db CodeQL database from within VS Code.
A workshop-queries folder within your workspace, containing an example query.
A codeql folder within your workspace, containing the CodeQL standard libraries for most target languages.
A copy of this workshop.md guide in your workspace.
Open the query workshop-queries/example.ql and try running it!

Problem statement

Use-after-free vulnerabilities occur when a program retains a pointer to memory after that memory has been freed, and then attempts to reference it. Once memory is freed, the system may allocate it for another purpose. Attempting to reference the freed memory can result in a variety of unsafe behaviour: crashing the program, retrieving an unexpected value, corrupting data used by another part of the program, or executing unsafe code.

The following C code shows a simple example of using memory after it has been freed.

```
free(s->x);
...
use(s->x);
```

The code frees the field x of a struct s, but does not immediately reset the field's value to zero. As a result, the struct now contains a 'dangling' pointer, which creates the potential for a use-after-free vulnerability. This becomes a real vulnerability when the code references s->x again, passing it to use.

A safer coding practice is to always immediately zero the field after freeing it, like this:

```
free(s->x);
s->x = 0;
```

Then, until s->x is reassigned, any attempt to read it simply obtains the null pointer rather than the address of freed memory.
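As a minimal sketch of this practice, the free and the reset can be bundled into one helper so that callers cannot forget the second step. The struct layout and the name free_and_clear are hypothetical, chosen only for illustration; they do not come from the workshop's example code.

```c
#include <stdlib.h>

/* Hypothetical struct with one heap-allocated field, mirroring s->x above. */
struct s_t {
  char *x;
};

/* Frees s->x and immediately zeroes the field, so no dangling pointer remains. */
void free_and_clear(struct s_t *s) {
  free(s->x); /* free(NULL) is a no-op, so calling this twice is harmless */
  s->x = 0;
}
```

Because the field is zeroed, a second call to free_and_clear on the same struct does not cause a double free.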

This is a well-known class of vulnerability, documented as CWE-416. A relatively recent example in the curl tool was assigned CVE-2018-16840, and inspired the material here.

In security terminology, a reference to freed memory is considered a source of tainted data, and a pointer that is dereferenced (used) is considered a sink for a use-after-free vulnerability.

If the tainted reference is reassigned (e.g. to zero) before it reaches a use, it is considered safe.
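The sketch below illustrates this safe case under stated assumptions: the field is freed, cleared, and then pointed at a fresh allocation before anything reads it. The struct record and the function set_label are hypothetical names used only for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type that owns a heap-allocated string. */
struct record {
  char *label;
};

/*
 * Replaces r->label with a copy of new_label.
 * Returns 0 on success, -1 if the allocation fails.
 */
int set_label(struct record *r, const char *new_label) {
  free(r->label);
  r->label = NULL; /* cleared first, so a failed allocation leaves no dangling pointer */

  char *copy = malloc(strlen(new_label) + 1);
  if (copy == NULL) {
    return -1;
  }
  strcpy(copy, new_label);
  r->label = copy; /* the freed reference is overwritten before any use */
  return 0;
}
```

Any later read of r->label sees either the new allocation or a null pointer, never the freed memory.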

In this workshop, we will use CodeQL to analyze a sample of C++ source code that demonstrates simple variants of use-after-free vulnerabilities, and write a CodeQL query to identify the vulnerable pattern with reasonable precision.

Workshop

Getting started

Use the IDE's autocomplete suggestions (Ctrl+Space) and jump-to-definition command (F12) to explore the CodeQL libraries.
To run a query, open the Command Palette (Cmd+Shift+P or Ctrl+Shift+P), and click CodeQL: Run Query. You can also see this command when right-clicking on a query file in the editor.
Try this out by running the example query example.ql in the workshop repository!
When the query completes, click on the results to jump to the corresponding location in the source code.
To run a part of a query, such as a single predicate, open the Command Palette and click CodeQL: Quick Evaluation. You can also see this command when right-clicking on selected query text in the editor.
To understand how the source code is represented in the CodeQL libraries, use the AST Viewer. You can see this in the left panel of the CodeQL view. Click on a query result to get to a source file, and then click View AST, or run CodeQL: View AST from the Command Palette.

The rest of the workshop is split into several steps. You can write one query per step, or work with a single query that you refine at each step.

Each step has a Hint that describes useful classes and predicates in the CodeQL standard libraries for C/C++ and keywords in CodeQL.

Each step has a Solution that indicates one possible answer. Note that all queries will need to begin with import cpp to use the standard libraries, but for simplicity this may be omitted below.

Plan

Source

Find all calls to the free function in the code
- Find all variables which are freed in the course of the program
- Find all references to variables after they are freed

Sink

Find all the variables which are used at any point
- Hint: A variable counts as used whenever it is dereferenced

Find Paths

Combine the results of the two queries above to find whether there is a path between our Source and Sink

Finding references to freed memory

Find all function call expressions, such as free(x) and use(y, z).

Hint

After you have run the example query and clicked on a result, look at the AST Viewer for the example.cpp source file. A function call is called a FunctionCall in the CodeQL C/C++ library.
Solution
```
from FunctionCall call
select call
```
Identify the expression that is used as the first argument for each call, such as free(<first arg>) and use(<first arg>, z).
Hint
- Add another variable to your from clause. Declare its type (this can be Expr) and give it a name.
- Add a where clause.
- The AST viewer and autocomplete tell us that FunctionCall has a predicate getArgument(int) to find the argument at a 0-based index.
Solution
```
from FunctionCall call, Expr arg
where arg = call.getArgument(0)
select arg
```
Filter your results to only those calls to a function named free.
Hint
- FunctionCall has a predicate getTarget() to find the Function being called.
- A Function (and most other named elements) has predicates getName() and hasName(string) to identify its name as a string.
- You may also be interested in the predicate hasGlobalOrStdName(string), which identifies named elements from the global or std namespaces.
- Use the and keyword to add conditions to your query.
- If you use getName(), use the = operator to assert that two values are equal. If you use has*Name(string), passing the name into the predicate makes the assertion.
Solution
```
from FunctionCall call, Expr arg
where
  arg = call.getArgument(0) and
  call.getTarget().hasGlobalOrStdName("free")
select arg
```
(Bonus) What other operations might free memory? Try looking for delete expressions using CodeQL. The example for this workshop only uses free but another codebase may use variations of this function name, or use different delete operators.
Factor out your logic into a predicate: predicate isSource(Expr arg) { ... }.
Hint
- The predicate keyword declares a relation that has no explicit result / return value, but asserts a logical property about its variables.
- The from clause of a query allowed you to declare variables, and the where clause described conditions on those variables.
  
  Within a predicate definition, variables are either declared as the parameters of the predicate, or 'locally' using the exists keyword. The first part of the exists declares some variables, and the body acts like a where, enforcing some conditions on the variables.
```
exists(<type> <variableName> |
  // some logic about the variable here
)
```
- Use Quick Evaluation to evaluate the predicate on its own.
Solution
```
predicate isSource(Expr arg) {
  exists(FunctionCall call |
    arg = call.getArgument(0) and
    call.getTarget().hasGlobalOrStdName("free")
  )
}
```
We are going to track the flow of information from the pointer that was freed. For this, we will use the CodeQL library for data flow analysis, which helps us answer questions like: does this expression ever hold a value that originates from a particular other place in the program?

We can visualize the data flow analysis problem as one of finding paths through a directed graph, where the nodes of the graph are places in the source code that may have a value, and the edges represent the flow of data between those elements. If a path exists, then the data flows between those two nodes.

The class DataFlow::Node describes all data flow nodes. These are different from the abstract syntax tree (AST) nodes, which only represent the structure of the source code. DataFlow::Node has various subclasses that describe different types of node, depending on the type of program syntax element they correspond to.

You can find out more in the documentation.

Modify your predicate to describe arg as a DataFlow::Node, not an Expr.
Instructions
- Add import semmle.code.cpp.dataflow.DataFlow to your query file.
- Change your predicate so that the parameter has type DataFlow::Node.
- This will give you a compile error, since the types no longer match. Convert the data flow node back into an Expr using the predicate asExpr().
Solution
```
import semmle.code.cpp.dataflow.DataFlow

predicate isSource(DataFlow::Node arg) {
  exists(FunctionCall call |
    arg.asExpr() = call.getArgument(0) and
    call.getTarget().hasGlobalOrStdName("free")
  )
}
```
Let's think about the meaning of the free function and the value of its argument.

Before the function runs, the function argument is a pointer to memory. Passing the pointer gives the function access to that memory, so in effect the memory is passed to the function by reference.

After the function body, the memory that was referenced by the pointer has been freed.

So the one expression for the function call argument in the program syntax actually has two possible values to think about in the data flow graph:
1. the pointer before it was freed
2. the dangling pointer after it was freed.
Expand the Hint to see how to distinguish between these two cases. Modify your predicate so that arg describes the memory after it has been freed, not before.
Hint
- The value before the call is a DataFlow::ExprNode, a subtype of DataFlow::Node.
- We can call asExpr() on such a node to get the original syntactic expression.
- The value after the call is a DataFlow::DefinitionByReferenceNode.
- We can call asDefiningArgument() on such a node to get the original syntactic expression.
- Jump to the definition of DataFlow::Node to read more.
- Modify your predicate to describe arg using asDefiningArgument().
Solution
```
predicate isSource(DataFlow::Node arg) {
  exists(FunctionCall call |
    arg.asDefiningArgument() = call.getArgument(0) and
    call.getTarget().hasGlobalOrStdName("free")
  )
}
```

Finding dereferences

A dereference is a place in the program that uses the memory referenced by a pointer.

Write a predicate isSink(DataFlow::Node sink) that describes expressions that may be dereferenced.
Hint
- Think of some examples of operations that might dereference a pointer. The * operator? Passing it to a function? Performing pointer arithmetic? Use autocomplete or the AST viewer to explore how these are modelled in CodeQL.
- Search for dereference in autocomplete to find a predicate from the standard library that models all these patterns for you.
Solution
```
predicate isSink(DataFlow::Node sink) {
  dereferenced(sink.asExpr())
}
```

Finding use-after-free vulnerabilities

We have now identified (a) places in the program which reference freed memory and (b) places in the program which dereference a pointer to memory. We now want to tie these two together to ask: does a pointer to freed memory ever flow to a potentially unsafe dereference operation?

This is a data flow problem. We could approach it using local data flow analysis, whose scope would be limited to a single function. However, it is possible for the free and dereference operations to be in different functions. We call this a global data flow problem, and use CodeQL's libraries for this purpose.
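To make the cross-function case concrete, here is a small C sketch in which the free and the dereference live in different functions. All names (struct conn, conn_release, conn_release_safe, conn_pending) are hypothetical and are not taken from the workshop's example code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical connection object that owns a heap buffer. */
struct conn {
  char *buf;
};

/* Frees the buffer but leaves c->buf dangling: a 'source' in our terminology. */
void conn_release(struct conn *c) {
  free(c->buf);
}

/* Same as above, but clears the field so later readers see a null pointer. */
void conn_release_safe(struct conn *c) {
  free(c->buf);
  c->buf = NULL;
}

/* Dereferences c->buf: a 'sink'. Returns the string length, or 0 if there is no buffer. */
size_t conn_pending(const struct conn *c) {
  if (c->buf == NULL) {
    return 0;
  }
  return strlen(c->buf);
}
```

Calling conn_release and then conn_pending on the same struct would read freed memory, yet neither function looks wrong when inspected in isolation. That is why an analysis limited to a single function is not enough here. Using conn_release_safe instead makes the later call return 0.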

In this section we will create a path-problem query capable of looking for global data flow, by populating this template:

```
/**
 * @name Use after free
 * @kind path-problem
 * @id cpp/workshop/use-after-free
 */
import cpp
import semmle.code.cpp.dataflow.DataFlow
import DataFlow::PathGraph

class Config extends DataFlow::Configuration {
  Config() { this = "Config: name doesn't matter" }

  /* TODO move over solution from Section 1 */
  override predicate isSource(DataFlow::Node source) {
    exists(/* TODO fill me in from Section 1 */ |
      /* TODO fill me in from Section 1 */
    )
  }

  /* TODO move over solution from Section 2 */
  override predicate isSink(DataFlow::Node sink) {
    /* TODO fill me in from Section 2 */
  }
}

from Config config, DataFlow::PathNode source, DataFlow::PathNode sink
where config.hasFlowPath(source, sink)
select sink, source, sink, "Memory is $@ and $@, causing a potential vulnerability.", source, "freed here", sink, "used here"
```

Fill in or move the isSource predicate you wrote for Section 1.
Fill in or move the isSink predicate you wrote for Section 2.

You can now run the completed query. Use the path explorer in the results view to check the results.

Completed query

```
/**
 * @name Use after free
 * @kind path-problem
 * @id cpp/workshop/use-after-free
 */
import cpp
import semmle.code.cpp.dataflow.DataFlow
import DataFlow::PathGraph

class Config extends DataFlow::Configuration {
  Config() { this = "Config: name doesn't matter" }

  override predicate isSource(DataFlow::Node source) {
    exists(FunctionCall call |
      source.asDefiningArgument() = call.getArgument(0) and
      call.getTarget().hasGlobalOrStdName("free")
    )
  }

  override predicate isSink(DataFlow::Node sink) {
    dereferenced(sink.asExpr())
  }
}

from Config config, DataFlow::PathNode source, DataFlow::PathNode sink
where config.hasFlowPath(source, sink)
select sink, source, sink, "Memory is $@ and $@, causing a potential vulnerability.", source, "freed here", sink, "used here"
```

Bonus: Does your query handle the false positives in the example code? How can we expand it to handle more real-world codebases?

Follow-up material

CodeQL overview
CodeQL for C/C++
Analyzing data flow in C/C++
Using the CodeQL extension for VS Code
CodeQL on GitHub Learning Lab
CodeQL on GitHub Security Lab

Acknowledgements

This is a modified version of a Capture-the-Flag challenge devised by @kevinbackhouse, available at https://securitylab.github.com/ctf/eko2020.
