Adding back the llvm frontend (#200)

* Added back LLVM stuff * Added hacks around too aggressive inlining * Made LLVM parts compile * Reformatted & updated headers * Removed calls to FrontendMetadata * Updated license * Added tests and library for LLVM * Added native library as subproject * Added doc * Modified doc * Added library * Added LLVM-16.0.6 to github actions * Only bulding on linux, added install llvm action * Updated llvm install script * Not copying to /lib anymore * Not building when llvm is not installed * Added logging for native flags * Added logging on enabledness * Added this@Build_gradle to enabled flag * Added exception handling to javac lookup * Updated library
ftsrg · Jul 24, 2023 · a90beeb · a90beeb
1 parent 4480126
commit a90beeb
Show file tree

Hide file tree

Showing 141 changed files with 10,517 additions and 14 deletions.
diff --git a/.github/actions/install-llvm/action.yml b/.github/actions/install-llvm/action.yml
@@ -0,0 +1,23 @@
+name: 'Install LLVM'
+inputs:
+    version:
+        required: true
+
+runs:
+  using: "composite"
+  steps:
+    - name: Run install script
+      shell: bash
+      run: |
+        wget https://apt.llvm.org/llvm.sh
+        chmod +x llvm.sh
+        sudo ./llvm.sh ${{ inputs.version }}
+        sudo ln -sf $(which clang-${{inputs.version}}) /usr/bin/clang
+        sudo ln -sf $(which llvm-config-${{inputs.version}}) /usr/bin/llvm-config
+    - name: Test version
+      shell: bash
+      run: |
+        ls -l $(which clang)
+        ls -l $(which llvm-config)
+        clang --version
+        llvm-config --version
diff --git a/.github/workflows/linux-build-test-deploy.yml b/.github/workflows/linux-build-test-deploy.yml
@@ -61,6 +61,10 @@ jobs:
     steps:
       - name: Checkout repository
         uses: actions/checkout@c85c95e3d7251135ab7dc9ce3241c5835cc595a9 # v3.5.3
+      - name: Install LLVM and Clang
+        uses: ./.github/actions/install-llvm
+        with:
+          version: "15"
       - name: Run tests
         uses: ./.github/actions/test-action
       - name: Positive outcome badge

diff --git a/.github/workflows/sonar.yml b/.github/workflows/sonar.yml
@@ -15,6 +15,10 @@ jobs:
         with:
           distribution: temurin
           java-version: 17
+      - name: Install LLVM and Clang
+        uses: ./.github/actions/install-llvm
+        with:
+          version: "15"
       - name: Analyze
         env:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

diff --git a/build.gradle.kts b/build.gradle.kts
@@ -57,7 +57,7 @@ tasks {
         }
 
         val reportTasks = subprojects.mapNotNull { subproject ->
-            subproject.tasks.named("jacocoTestReport", JacocoReport::class).orNull
+            subproject.tasks.findByName("jacocoTestReport")?.let { it as JacocoReport }
         }
 
         dependsOn(reportTasks.flatMap { it.dependsOn })

diff --git a/lib/.gitignore b/lib/.gitignore
@@ -1,3 +1,3 @@
 /msvcp110.dll
 /msvcr110.dll
-/vcomp110.dll
+/vcomp110.dll
diff --git a/lib/libtheta-llvm.so b/lib/libtheta-llvm.so
diff --git a/scripts/reformat.sh b/scripts/reformat.sh
diff --git a/settings.gradle.kts b/settings.gradle.kts
@@ -23,6 +23,7 @@ include(
 
     "frontends/c-frontend",
     "frontends/chc-frontend",
+    "frontends/llvm",
 
     "cfa/cfa",
     "cfa/cfa-analysis",
@@ -37,6 +38,7 @@ include(
     "xcfa/exec-graph-cli",
     "xcfa/c2xcfa",
     "xcfa/litmus2xcfa",
+    "xcfa/llvm2xcfa",
     "xcfa/xcfa-analysis",
     "xcfa/xcfa-cli",
     "xcfa/litmus-cli",

diff --git a/subprojects/frontends/llvm/README.md b/subprojects/frontends/llvm/README.md
@@ -0,0 +1,44 @@
+# Theta C Frontend
+This repository contains a library, which enables Theta to parse C programs into the XCFA or CFA format using LLVM.
+
+# Motivation
+Trying to directly transform a C program into a formal representation comes with several problems due to the complexity of the C language. Using an assembly-like internal representation, such as LLVM IR can largely decrease this complexity.
+
+Currently there is no official LLVM Java API, thus a C++ project is necessary to handle the parsing of LLVM bytecode.
+
+# Connection with Theta
+[Link to Theta](https://github.com/ftsrg/theta)
+
+Theta is a configurable and modular, CEGAR-based verification framework. It is capable of handling several formalisms, including Control Flow Automaton *(CFA)* and an extended version of CFA, called XCFA. *(The latter can handle functions and (concurrent) processes.)* 
+
+Theta has a CLI tool and a module for parsing CFA/XCFA files and can convert an XCFA to a CFA, if it is possible *(if it has only one function and one process)*, but has no capability to transform C programs or any other representation of them to these formalisms.
+
+# Connection with Gazer
+[Link to Gazer](https://github.com/ftsrg/gazer)
+
+Gazer is a BMC verification tool and Theta frontend written in C++, using the LLVM framework. It handles the transformation to CFA from the C program by itself with an intermediate CFA representation used in its BMC engine. It works as a literal frontend, a CLI tool which handles the transformation and which can call Theta for analysis, parsing and using it's output to get a counterexample. 
+
+Gazer is the predecessor of this library. Many of this its techniques and implementation details are used in this library, but this project has a different, simpler approach. It aims to carry out only a fraction of Gazer's functionality, namely it has no BMC engine, so it cannot carry out software verification and it will not output a CFA or an XCFA in any way. It carries out only the transformation of a C program to an intermediate representation containing the necessary information for the construction of an XCFA, adding some LLVM passes on the way.
+
+As opposed to Gazer, this library approaches its usage in a reversed manner, being called and used by Theta as a native C++ library, not the other way around. It's long term goal is to be a fairly static library in a way, that extensions and implementation of new features related to the formalisms or the analysis can happen mainly in Theta, as it has better maintainability and is actively developed.
+
+# Transforming the program
+![architecture](doc/theta-c-arch.png)
+*The red parts are modules/classes of Theta, the green ones are of this project and the blue ones are yet to be implemented - they are not properly connected yet, as their place in the project isn't certain yet*
+
+## Input
+The input file can either be bitcode (`.ll` pr `.bc`, LLVM passes are not used in this case) or a C program (a single `.c` or `.i` file). Implementing a linking step of more than one input files is not yet implemented. *(But it is an important feature for the future.)* 
+
+## Compilation (clang)
+This is a simple step - we just call clang with a few flags to compile the C program to bitcode. We use O0, as most optimization passes that we benefit from are executed in the next step.
+
+## Documentation of the remaining steps
+[Our intermediate representation](doc/intermediate-representation.md)
+
+[Passes and optimizations](doc/passes.md)
+
+[JNI interface](doc/jni-interface.md)
+
+[Analysis in the frontend](doc/analysis.md)
+
+[A simple example from input program to xcfa](doc/simple-example.md)
diff --git a/subprojects/frontends/llvm/build.gradle.kts b/subprojects/frontends/llvm/build.gradle.kts
@@ -0,0 +1,96 @@
+/*
+ *  Copyright 2023 Budapest University of Technology and Economics
+ *
+ *  Licensed under the Apache License, Version 2.0 (the "License");
+ *  you may not use this file except in compliance with the License.
+ *  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing, software
+ *  distributed under the License is distributed on an "AS IS" BASIS,
+ *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ *  See the License for the specific language governing permissions and
+ *  limitations under the License.
+ */
+
+import org.gradle.internal.os.OperatingSystem.current
+import java.io.ByteArrayOutputStream
+import java.io.IOException
+
+plugins {
+    id("cpp-library")
+}
+
+val enabled = current().isLinux &&
+        try {
+            runCommandForOutput("llvm-config")
+            true
+        } catch (e: IOException) {
+            println("LLVM not installed, not building native library.")
+            false
+        }
+
+fun runCommandForOutput(vararg args: String): Array<String> {
+    val process = ProcessBuilder(*args).start()
+    val outputStream = ByteArrayOutputStream()
+    process.inputStream.copyTo(outputStream)
+    process.waitFor()
+    val ret = outputStream.toString()
+            .trim()
+            .split(" ")
+            .filter { it.length > 1 }
+            .map { it.trim() }
+            .toTypedArray()
+    return ret
+}
+
+fun llvmConfigFlags(vararg args: String): Array<String> {
+    if (!enabled) return arrayOf()
+    return try {
+        runCommandForOutput("llvm-config", *args)
+    } catch (e: IOException) {
+        e.printStackTrace()
+        arrayOf()
+    }.also { println("LLVM flags (${args.toList()}): ${it.toList()}") }
+}
+
+fun jniConfigFlags(): Array<String> {
+    if (!enabled) return arrayOf()
+    val jdkHomeArr = runCommandForOutput("bash", "-c", "dirname \$(cd \$(dirname \$(readlink \$(which javac) || which javac)); pwd -P)")
+    check(jdkHomeArr.size == 1)
+    val jdkHome = File(jdkHomeArr[0])
+    check(jdkHome.exists())
+    val mainInclude = jdkHome.resolve("include")
+    val linuxInclude = mainInclude.resolve("linux")
+    return arrayOf(
+            "-I${mainInclude.absolutePath}",
+            "-I${linuxInclude.absolutePath}",
+    ).also { println("JNI flags: ${it.toList()}") }
+}
+
+library {
+    targetMachines.add(machines.linux.x86_64)
+    tasks.withType(CppCompile::class) {
+        compilerArgs.addAll(listOf(
+                "-Wall",
+                "-fpic",
+                *jniConfigFlags(),
+                *llvmConfigFlags("--cxxflags")))
+        onlyIf {
+            println("CppCompile is enabled: $enabled")
+            this@Build_gradle.enabled
+        }
+    }
+
+    tasks.withType(LinkSharedLibrary::class) {
+        linkerArgs.addAll(listOf(
+                "-rdynamic",
+                *llvmConfigFlags("--cxxflags", "--ldflags", "--libs", "core", "bitreader"),
+                "-ldl"))
+        onlyIf {
+            println("LinkSharedLibrary is enabled: $enabled")
+            this@Build_gradle.enabled
+        }
+    }
+}
diff --git a/subprojects/frontends/llvm/doc/analysis.md b/subprojects/frontends/llvm/doc/analysis.md
@@ -0,0 +1,20 @@
+## What analysis steps mean in this case
+These are simple "checks", executed while the IR is parsed into our representation (so we won't need to explicitly iterate through it twice).
+
+### Struct check
+Using an `llvm::Typefinder`, we check if there are any struct (or union) type definitions in the module. This information is then stored and can be queried.
+
+### Logical operator check
+This check is called for each instruction. The goal of this check is to decide, if Theta can use integer arithmetics or only the more costly bitwise arithmetics.
+
+#### How it works
+Steps executed on each instruction:
+- Is it an `and/or/xor` operation?
+    - If **yes**, is it only used later in `icmp` operations with 0 as the other operand?
+        - If **yes**, *integer arithmetics* are suitable
+        - If **no**, we'll need *bitwise arithmetics*
+- Else, is it a shift (`shl/lshr/ashr`) operation?
+    - If **yes**, we'll need *bitwise arithmetics*
+    - If **no**, *integer arithmetics* are suitable
+
+*Clarification: this is a global attribute in the sense, that if any instruction requires integer arithmetics, then the whole program has to be handled that way.*
diff --git a/subprojects/frontends/llvm/doc/intermediate-representation.md b/subprojects/frontends/llvm/doc/intermediate-representation.md
@@ -0,0 +1,10 @@
+## Intermediate representation between LLVM IR and the XCFA models
+As we can see in the architecture, the C input program is compiled into LLVM IR. LLVM intermediate representation is an assembly-like representation with around 60 instructions and lots of possibilities for adding metadata. 
+
+It can be easily parsed through the LLVM API, but contains a large quantity of complex information, from which we only need to extract particular elements required for the XCFA and the projection of counterexamples on to the original program. To accomplish this we use a simpler representation, which is in many points similar to the programmatic representation of LLVM IR, but differs in some particular places and contains no superfluous data (from our standpoint).
+
+The classes for this representation can be found in the [types](https://github.com/ftsrg/theta-c-frontend/tree/master/src/types) directory.
+
+It is similar to LLVM IR in that there is a module, which contains global variables and functions. Functions contain basic blocks, basic blocks are made of instructions, which have operands. It differs in that there are no explicit metadata classes, rather the above mentioned classes contain metadata about themselves. 
+
+Furthermore on instruction level the LLVM IR has no classes explicitly representing (virtual) registers and instruction operands - these are handled as instructions, constants or other appropriate types. In our representation instruction contains Operands, an abstract class and parent class of Register, BasicBlockOperand and StringOperand. The latter is used to handle function operands in `call` instructions and special strings like the condition code in `icmp` operations.
diff --git a/subprojects/frontends/llvm/doc/jni-interface.md b/subprojects/frontends/llvm/doc/jni-interface.md
@@ -0,0 +1,25 @@
+## JNI Interface
+*"In software design, the Java Native Interface (JNI) is a foreign function interface programming framework that enables Java code running in a Java virtual machine (JVM) to call and be called by native applications (programs specific to a hardware and operating system platform) and libraries written in other languages such as C, C++ and assembly."* (from [wikipedia](https://en.wikipedia.org/wiki/Java_Native_Interface))
+
+This project behaves as a native library of the Theta framework and communicates through a JNI interface. Although it is possible to send objects through a native interface, we are currently not using that feature. 
+
+The easiest way to get familiar with the interface is to check out the C++ side [here](https://github.com/ftsrg/theta-c-frontend/blob/master/src/hu_bme_mit_theta_llvm2xcfa_LlvmIrProvider.h) and the Java side in Theta (`LllvmIrProvider` class of the XCFA subproject).
+
+### Configuring passes and parsing the input program
+Currently there are four switches to configure pass groups (the rest of the passes are mandatory).
+- Function inlining (we do not inline global variables)
+- Cleanup passes
+- Optimizations
+- Debug printing pass
+
+After configuration the native function `JniParseIr` has to be called to let the frontend iterate through the LLVM IR.
+
+### Querying the resulting data
+The data can be queried based on these groups:
+- global variables
+- functions
+    - basic blocks
+        - instructions
+
+The number of these in a given parent and their attributes can be queried one by one.
+Furthermore analysis results can be queried as well.
diff --git a/subprojects/frontends/llvm/doc/passes.md b/subprojects/frontends/llvm/doc/passes.md
@@ -0,0 +1,22 @@
+## Passes used
+### About LLVM passes in general
+Passes provide a system to run optimization, transformation and analysis steps on LLVM IR. These can be LLVMs official passes, but custom passes as well. This project uses both.
+[more on LLVM passes](https://llvm.org/docs/Passes.html)
+
+This part of the project adopts the most from Gazer. LLVM and custom passes are used there similarly as well, although Gazer has a more complex and less naive approach, e.g. using early and late optimization sets. In this project we are approaching the problems and functionality in a direct manner, thus we have only a simpler, single execution line of passes.
+
+### What are we using passes for
+1. Optimizations
+2. Transformations 
+3. Annotations
+
+*Disclaimer: this part of the project isn't static, as we are constantly experimenting with changing/adding new passes. Due to this, some important custom passes are described here, but to get the precise state of what passes we use and what they do, start from [this](https://github.com/ftsrg/theta-c-frontend/blob/master/src/utilities/CPipeline.cpp) source file.*
+### Custom passes
+- **Toposort pass**
+    Strongly connected components (e.g. loops) of basic blocks aren't necessarily in topological order in LLVM IR as this way it is possible to iterate through the blocks in order and have unknown values only in phi nodes, which makes the model transformation in Theta simpler
+- **Eliminate GEP instructions pass**
+    The getelementptr instruction is a special instruction in LLVM IR, which usually pairs with one or more store and load instructions to get/set memory values. In the XCFA this instruction pair is handled as a single instruction on a single edge, but transforming to this format is complex. Instead of that we use this transformation pass to combine the information from these pairs into a single `theta.dbg.getArrayElement_typename` or `theta.dbg.setArrayElement_typename` function calls while removing the `getelementptr` and `load/store` instructions. *Support in this pass is currently incomplete, as there can be special parametrizations in these instructions when structs or unions are used, so that has to be added when extending the frontend with struct support in the future.*    
+- **Eliminate Phi nodes pass**
+    When verifying a model with CEGAR, the number of variables can be a crucial point. Phi nodes can basically add a variable per each node, although it wasn't present in the original program and wouldn't be necessary. This pass eliminates these unnecessary variables, where possible and uses a global variable with unique identifiers to store these values instead.
+- **Branch dbg call pass** *(work in progress)*
+    This pass adds a function called `theta.dbg.control` before each `br` *(branch)* instruction. On one hand, adding metadata to this call will make it easier to add *control* information to witnesses, on the other hand it tones down the CFG simplification pass, which is an important optimization, but can also be quite aggressive when merging control flow branches with which we lose *(or even get wrong)* program line metadata. False information on program lines can lead to unusable witnesses.