The purpose of this section is to explain several key concepts used throughout the CPROVER framework at a high level, ignoring the details of the actual implementation. In particular, we will discuss different ways to represent programs in memory, three important analysis methods and some commonly used terms.
One of the first questions we should be considering is how we represent programs in such a way that we can easily analyze and reason about them.
As it turns out, the best way to do this is to use a variety of different representations, each representing a different level of abstraction. These representations are designed in such a way that for each analysis we want to perform, there is an appropriate representation, and it is easy to go from representations that are close to the source code to representations that focus on specific semantic aspects of the program.
The representations that the CPROVER framework uses mirror those used in modern compilers such as LLVM and gcc. We will point out the places where the CPROVER framework does things differently, attempting to give rationales wherever possible.
One in-depth resource for most of this section is the classic compiler construction textbook ''Compilers: Principles, Techniques, and Tools'' by Aho, Lam, Sethi and Ullman.
To illustrate the different concepts, we will consider a small example program. While the program is in C, the general ideas apply to other languages as well - see later sections of this manual to understand how the specific features of those languages are handled. Our running example will be a program that calculates factorials.
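For concreteness, factorial.c might look roughly as follows (a minimal sketch consistent with the discussion below: forward declarations of atoi and printf, followed by definitions of factorial and main):

```c
int atoi(const char *);
int printf(const char *, ...);

unsigned long factorial(unsigned int n)
{
  unsigned long fac = 1;
  for (unsigned int i = 1; i <= n; i++)
    fac *= i;
  return fac;
}

int main(int argc, char **argv)
{
  unsigned int n = atoi(argv[1]);
  printf("%lu\n", factorial(n));
  return 0;
}
```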
The question of this first section is: how do we represent this program in memory so that we can do something useful with it?
One possibility would be to just store the program as a string, but this is clearly impractical: even finding whether there is an assignment to a specific variable would require significant parsing effort. For this reason, the first step is to parse the program text into a more abstract representation.
The first step in representing a program in memory is to parse the program, at the same time checking for syntax errors, and store the parsing result in memory.
The key data structure that stores the result of this step is known as an Abstract Syntax Tree, or AST for short (cf. Wikipedia). ASTs are still relatively close to the source code, and represent the structure of the source code while abstracting away from syntactic details, e.g., dropping parentheses, semicolons and braces as long as those are only used for grouping.
Considering the example of the C program given above, we first notice that the program describes (in C terms) a single translation unit, consisting of four top-level declarations (the two function forward declarations of atoi and printf, and the function definitions of factorial and main). Let us start by considering the declaration of atoi. This gives rise to a subtree modeling that we have a function called atoi whose return type is int, with an unnamed argument of type const char *. We can represent this using a tree that has, for instance, the following structure (this is a simplified version of the tree that the CPROVER framework uses internally):
This graph shows the (simplified) AST structure for the atoi function. The top level says that this is a global entity, namely one that has code (i.e., a function), called atoi and yielding int. Furthermore, it has a child node initiating a parameter list, and there is a node in the parameter list for each parameter, giving its type and name, if a name is given.
Extending this idea, we can represent the structure of the factorial function using ASTs. The idea here is that the code itself has a hierarchical structure. In the case of C, this starts with the block structure of the code: at the top, we start with a block of code, having three children, each being a ''statement'' node:
The first statement is already a basic statement: we represent it as a local declaration (similar to the global declarations above) of a variable.
The second statement is a compound statement, which we can decompose further. At the top level, we have a node stating that this is a for statement, with four child nodes: one for the initialization statement, one for the loop condition, one for the per-iteration increment statement, and one for the loop body.
Finally, the third statement is again simple: it consists of a return statement node, with a child node for the variable expression fac. Since the AST is very similar to the first AST above, we omit it here. All in all, the AST for the function body looks like this:
Using the AST for the function body, we can easily produce the definition of the factorial function:
In the end, we produce a sequence of trees modeling each declaration in the translation unit (i.e., the file factorial.c).
This data structure is already useful: at this level, we can easily derive simple information such as ''which functions are being defined?'', ''what are the arguments to a given function'' and so on.
Nevertheless, for many analyses, this representation needs to be transformed further. In general, the first step is to resolve variable names. This is done using an auxiliary data structure known as the symbol table.
The issue that this step addresses is that the meaning of a variable name depends on a given scope - for instance, in a C program, we can define the variable i as a local variable in two different functions, so that the name i refers to two different objects, depending on which function is being executed at a given point in time.
To resolve this problem, a first transformation step is performed, changing (short) variable names to some form of unique identifier. This ensures that each variable identifier is used only once, and all accesses to a given variable can be easily found by just looking for the variable identifier. For instance, we could change each variable name into a pair of the name and a serial number. The serial number gets increased whenever a variable of that name is declared. In the example, the ASTs for factorial and main after resolving variable names would look roughly like this:
Note that the parameter n in factorial and the local variable n in main are now disambiguated as n.1 and n.2; furthermore, we leave the names of global objects as-is. In the CPROVER framework, a more elaborate system is used: local variables are prefixed with the function names, and further disambiguation is performed by adding indices. For brevity, we use indices only.
Further information on ASTs can be found in the Wikipedia page and the materials linked there. Additionally, there is an interesting discussion on StackOverflow about abstract versus concrete syntax trees.
At this level, we can already perform a number of interesting analyses, such as basic type checking. But for more advanced analyses, other representations are better: the AST contains many different kinds of nodes (essentially, one per type of program construct), and has a rather intricate structure.
The ASTs in the previous section represent the syntax of a program, including all the features of a given programming language. But in practice, most programming languages have a large amount of ''syntactic sugar'': constructs that are, technically speaking, redundant, but make programming a lot easier. For analysis, this means that if we immediately try to work on the initial AST, we would have to handle all these various cases.
To simplify analysis, it pays off to bring the AST into simpler forms, known as intermediate representations (short IR). An IR is usually given as some form of AST, using a more restricted subset or a variant of the original language that is easier to analyze than the original version.
Taking the example from above, we rewrite the program into a simpler form of the C language: instead of allowing powerful control constructs such as while and for, we reduce everything to if and goto. In fact, we even restrict if statements: an if statement should always be of the form if (*condition*) goto *target* else goto *target*;. As it turns out, this is sufficient to represent every C program. The factorial function in our example program can then be rewritten as follows:
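A sketch of the rewritten function (the label names are illustrative, chosen to match the statements quoted later in this section):

```
unsigned long factorial(unsigned int n)
{
  unsigned long fac = 1;
  unsigned int i = 1;
for_loop_start:
  if (i <= n) goto for_loop_entry else goto for_loop_end;
for_loop_entry:
  fac *= i;
  i += 1;
  goto for_loop_start;
for_loop_end:
  return fac;
}
```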
We leave it up to the reader to verify that both versions of the function behave the same way, and to draw the function as an AST.
In the CPROVER framework, a number of different IRs are employed to simplify the program under analysis into a simple core language step-by-step. In particular, expressions are brought into much simpler forms. This sequence of transformations is described in later chapters.
TODO: Link to corresponding sections.
Another important representation of a program can be gained by transforming the program structure into a control flow graph, short CFG. While the AST focuses more on the syntactic structure of the program, keeping constructs like while loops and similar forms of structured control flow, the CFG uses a unified graph representation for control flow.
In general, for analyses based around Abstract Interpretation (see Abstract Interpretation), it is usually preferable to use a CFG representation, while other analyses, such as variable scope detection, may be easier to perform on ASTs.
The general idea is to present the program as a graph. The nodes of the graph are instructions or sequences of instructions. In general, the nodes are basic blocks: a basic block is a sequence of statements that is always executed in order from beginning to end. The edges of the graph describe how the program execution may move from one basic block to the next. Note that single statements are always basic blocks; this is the representation used inside the CPROVER framework. In the examples below, we try to use maximal basic blocks (i.e., basic blocks that are as large as possible); this can be advantageous for some analyses.
Let us consider the factorial function as an example. As a reminder, here is the code in IR form, now rewritten with disambiguated variables - in sketch form, with illustrative label names (building the AST from it is left as an exercise):
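```
unsigned long factorial(unsigned int n.1)
{
  unsigned long fac.1 = 1;
  unsigned int i.1 = 1;
for_loop_start:
  if (i.1 <= n.1) goto for_loop_entry else goto for_loop_end;
for_loop_entry:
  fac.1 *= i.1;
  i.1 += 1;
  goto for_loop_start;
for_loop_end:
  return fac.1;
}
```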
This function consists of four basic blocks:
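In sketch form, reusing the labels from the listing above:

```
/* BB1 (function entry) */
unsigned long fac.1 = 1;
unsigned int i.1 = 1;

/* BB2 (loop condition) */
for_loop_start:
if (i.1 <= n.1) goto for_loop_entry else goto for_loop_end;

/* BB3 (loop body) */
for_loop_entry:
fac.1 *= i.1;
i.1 += 1;
goto for_loop_start;

/* BB4 (function exit) */
for_loop_end:
return fac.1;
```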
One way to understand which instructions form basic blocks is to consider the successors of each instruction. If we have two instructions A and B, we say that B is a successor of A if, after executing A, we can execute B without any intervening instructions. For instance, in the example above, the loop initialization statement unsigned int i.1 = 1 is a successor of unsigned long fac.1 = 1. On the other hand, return fac.1 is not a successor of unsigned long fac.1 = 1: we always have to execute some other intermediate statements to reach the return statement.
Now, consider the if statement, if (i.1 <= n.1) goto for_loop_entry else goto for_loop_end. This statement has two successors: fac.1 *= i.1 and return fac.1.
Similarly, we say that A is a predecessor of B if B is a successor of A. We find that the if statement has two predecessors, unsigned int i.1 = 1 and goto for_loop_start.
A basic block is a sequence of instructions with the following property: if one instruction of the sequence is executed, all of them are executed, in order from beginning to end. In particular, each member of the sequence but the first must have exactly one predecessor, and each member of the sequence but the last must have exactly one successor. These criteria explain why we have the basic blocks described above.
Putting everything together, we get a control flow graph like this:
The graph can be read as follows: each node corresponds to a basic block. The initial basic block (where the function is entered) is marked with a double border, while those basic blocks that leave the function have a gray background. An edge from a basic block B to a basic block B' means that if the execution reaches the end of B, execution may continue in B'. Some edges are labeled: the edge leaving the comparison basic block with the label true, for instance, can only be taken if the comparison did, in fact, return true.
Note that this representation makes it very easy to interpret the program, keeping just two pieces of state: the current position (which basic block and which line), and the values of all variables (in real software, this would also include parts of the heap and the call stack). Execution proceeds as follows: as long as there are still instructions to execute in the current basic block, run the next instruction and move the current position to right after that instruction. At the end of a basic block, take one of the available outgoing edges to another basic block.
In the CPROVER framework, we often do not construct CFGs explicitly, but instead use an IR that is constructed in a way very similar to CFGs, known as ''GOTO programs''. This IR is used to implement a number of static analyses, as described in the sections on frameworks and specific analyses.
While control flow graphs are already quite useful for static analysis, some techniques benefit from a further transformation to a representation known as static single assignment, short SSA. The point of this step is to ensure that we can talk about the entire history of assignments to a given variable. This is achieved by renaming variables again: whenever we assign to a variable, we clone this variable by giving it a new name. This ensures that each variable appearing in the resulting program is written to exactly once (but it may be read arbitrarily many times). In this way, we can refer to earlier values of the variable by just referencing the name of an older incarnation of the variable.
We illustrate this transformation by first showing how the body of the for loop of factorial is transformed. We currently have:
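```
/* the loop body, as in the IR above */
fac.1 *= i.1;
i.1 += 1;
```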
We now give a second number to each variable, counting the number of assignments so far. Thus, the SSA version of this code turns out to be
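```
/* SSA: the second index counts the assignments to each variable */
fac.1.2 = fac.1.1 * i.1.1;
i.1.2 = i.1.1 + 1;
```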
This representation now allows us to state facts such as ''i is increasing'' by writing i.1.1 < i.1.2.
At this point, we run into a complication. Consider the following piece of code for illustration:
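A representative example (the concrete function is our own illustration; any conditional assignment to a variable exhibits the issue):

```c
int example(int x)
{
  if (x < 10)
    x = x + 1;
  return x;
}
```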
The corresponding control flow graph looks like this:
When we try to transform to SSA, we get:
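Continuing the illustrative example:

```
int example(int x.1)
{
  if (x.1 < 10)
    x.2 = x.1 + 1;
  return /* x.1 or x.2? */;
}
```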
Depending on which path the execution takes, we have to return either x.1 or x.2! The way to make this work is to introduce a function Φ that selects the right instance of the variable; in the example, we would have
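```
return Φ(x.1, x.2);
```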
In the CPROVER framework, we provide a precise implementation of Φ, using explicitly tracked information about which branches were taken by the program. There are also some differences in how loops are handled (finite unrolling in CPROVER, versus a Φ-based approach in compilers); this approach will be discussed in a later chapter.
For the time being, let us come back to factorial. We can now give an SSA using Φ functions:
The details of SSA construction, plus some discussion of how it is used in compilers, can be found in the original paper.
The SSA is an extremely helpful representation when one wishes to perform model checking on the program (see next section), since it is much easier to extract the logical formulas used in this technique from an SSA than from a CFG (or, even worse, an AST). That being said, the CPROVER framework takes a different route, opting to convert to an intermediate representation known as GOTO programs instead.
Field sensitivity controls the granularity at which objects are accessed during symbolic execution. It is a transformation of the instructions of a goto program which mainly replaces some expressions by symbols, but can also add assignments to the target equations produced by symbolic execution. The main goal of this transformation is to allow more constants to be propagated during symbolic execution. Note that field sensitivity is not applied as a single pass over the whole goto program but is instead applied as the symbolic execution unfolds.
On a high level, field sensitivity replaces member operators and array accesses with atomic symbols representing a field when possible. In cases where this is not immediately possible, such as whole-struct assignments, additional assignments are introduced. The possible cases are described below.
A member expression struct_expr.field_name is replaced by the symbol struct_expr..field_name; note the use of .. to visually distinguish the symbol from the member expression. This is valid for both lvalues and rvalues. See field_sensitivityt::apply.
In an rvalue, a symbol struct_expr which has a struct type with fields field1, field2, etc., will be replaced by {struct_expr..field1; struct_expr..field2; …}. See field_sensitivityt::get_fields.
When the symbol is on the left-hand-side, for instance in an assignment struct_expr = other_struct, the assignment is replaced by a sequence of assignments: struct_expr..field1 = other_struct..field1; struct_expr..field2 = other_struct..field2; etc. See field_sensitivityt::field_assignments.
An index expression array[index] when index is constant and array has constant size is replaced by the symbol array[[index]]; note the use of [[ and ]] to visually distinguish the symbol from the index expression. When index is not a constant, array[index] is replaced by {array[[0]]; array[[1]]; …}[index]. Note that this process does not apply to arrays whose size is not constant, nor to arrays whose size exceeds the bound max_field_sensitivity_array_size. See field_sensitivityt::apply.
In an rvalue, a symbol array which has array type will be replaced by {array[[0]]; array[[1]]; …}[index]. See field_sensitivityt::get_fields.
When the array symbol is on the left-hand-side, for instance for an assignment array = other_array, the assignment is replaced by a sequence of assignments: array[[0]] = other_array[[0]]; array[[1]] = other_array[[1]]; etc. See field_sensitivityt::field_assignments.
One of the most important analysis techniques provided by the CPROVER framework, implemented in the CBMC (and JBMC) tools, is bounded model checking, a specific instance of a method known as Model Checking.
The basic question that model checking tries to answer is: given some system (in our case, a program) and some property, can we find an execution of the system such that it reaches a state where the property holds? If yes, we would like to know how the program reaches this state - at the very least, we want to see what inputs are required, but in general, we would prefer having a trace, which shows what statements are executed and in which order.
In general, a trace describes which statements of the program were executed, and which intermediate states were reached. Often, it is sufficient to provide only part of the intermediate states (omitting some entirely, and for others mentioning only the parts that cannot easily be reconstructed).
As it turns out, model checking for programs is, in general, a hard problem. Part of the reason for this is that many model checking algorithms strive for a form of ''completeness'' where they either find a trace or return a proof that such a trace cannot possibly exist.
Since we are interested in generating test cases, we prefer a different approach: it may be that a certain target state is reachable only after a very long execution, or not at all, but this information does not help us in constructing test cases. For this reason, we introduce an execution bound that describes how deep we go when analyzing a program.
Model checking techniques using such execution bounds are known as bounded model checking; they will return either a trace, or a statement that says ''the target state could not be reached in n steps'', for a given bound n. Thus, for a given bound, we always get an underapproximation of all states that can be reached: we can certainly find those reachable within the given bound, but we may miss states that can be reached only with more steps. Conversely, we will never claim that a state is not reachable within a certain bound if there is, in fact, a way of reaching this state.
The bounded model checking techniques used by the CPROVER framework are based on symbolic model checking, a family of model checking techniques that work on sets of program states and use advanced tools such as SAT solvers (more on that below) to calculate the set of reachable states. The key step here is to encode both the program and the set of states using an appropriate logic, mostly propositional logic and (fragments of) first-order logic.
In the following, we will quickly discuss propositional logic, in combination with SAT solving, and show how to build a simple bounded model checker. Actual bounded model checking for software requires a number of additional steps and concepts, which will be introduced as required later on.
Many of the concepts in this section can be found in more detail in the Wikipedia article on Propositional logic.
Let us start by looking at propositional formulas. A propositional formula consists of propositional variables, say x, y and z, that can take the Boolean values true and false, connected together with logical operators (often called junctors), namely and, or and not. Sometimes, one introduces additional junctors, such as xor or implies, but these can be defined in terms of the three basic junctors just described.
Examples of propositional formulas would be ''x and y'' or ''not x or y or z''. We can evaluate formulas by setting each variable to a Boolean value and reducing using the following rules: ''not true'' reduces to false and ''not false'' to true; ''true and b'' reduces to b, while ''false and b'' reduces to false; ''true or b'' reduces to true, while ''false or b'' reduces to b.
An important related question is: given a propositional formula, is there a variable assignment that makes it evaluate to true? This is known as the SAT problem. The most important things to know about SAT are that it is NP-complete, so no worst-case-efficient algorithm for it is known, and that, nevertheless, modern SAT solvers are remarkably effective in practice, routinely handling formulas with hundreds of thousands or even millions of variables. This is what makes SAT solvers a workable back-end for program analysis.
As an example in how to use a SAT solver, consider the following formula (in conjunctive normal form):
''(x or y) and (x or not y) and x and y''
We can represent this formula in the minisat input format as:
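Assuming x is encoded as variable 1 and y as variable 2, the input (in the DIMACS CNF format, one clause per line, each terminated by 0) reads:

```
p cnf 2 4
1 2 0
1 -2 0
1 0
2 0
```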
Compare the Minisat user guide. Try to run minisat on this example. What would you expect, and what result do you get?
Next, try running minisat on the following formula: ''(x or y) and (x or not y) and (not x) and y''. What changed? Why?
So far, we have seen how to reason about simple propositional formulas. This seems to be quite far from our goal of reasoning about the behavior of programs. As a first step, let us see how we can lift SAT-based reasoning to reasoning about data such as machine integers.
Remember the factorial function. It starts with the line ''unsigned long fac = 1;''.
Now, suppose that fac will be represented as a binary number with 64 bits (this is standard on, e.g., Linux). So, if we wish to reason about the contents of the variable fac, we might as well represent it as a vector of 64 propositional variables, say fac0 to fac63, where fac = 2^63 fac63 + ... + 2^0 fac0. We can then assert that fac=1 using the propositional formula fac63 = 0 and ... and fac1 = 0 and fac0 = 1, where we define the formula A = B as ''(A or not B) and (B or not A)''.
We call this a bit vector representation. Compare the Wikipedia page on Binary numbers.
Bit vector representations can also be used to describe operations on binary numbers. For instance, suppose we have two four-bit numbers A_3, ..., A_0 (representing a number A) and B_3, ..., B_0 (representing a number B) and wish to add them. As detailed on the page on Binary adders, we define three additional bit vectors, the carries C_0, ..., C_3, the partial sums P_0, ..., P_4 and the sum S_0, ..., S_4, representing a number S such that S = A + B. Note that the sum vector has one bit more - why? How is this related to arithmetic overflow in C?
For convenience, we first define the half-adder. This is given as two formulas, HA_S(A,B) = (A and not B) or (B and not A), which gives the sum of the bits A and B, and HA_C(A,B) = A and B, which indicates whether the result of the sum was too big to fit into one bit (so we carry a one to the next bit).
Using the half-adder formulas, we can define the full adder, again given as two formulas, one for the sum and the other for the carry. We have FA_S(A,B,C_in) = HA_S(HA_S(A,B),C_in), giving the sum of A, B and C_in, and FA_C(A,B,C_in) = HA_C(A,B) or (C_in and HA_S(A,B)), which states whether the result is too big to fit into a bit. Note that the full adder has an additional input for carries; in the following step, we will use it to chain the full adders together to compute the actual sum.
Using the helper variables we have above, we can give the sum of the four-bit numbers as
C_0 = FA_C(A_0,B_0,0) and C_1 = FA_C(A_1,B_1,C_0) and C_2 = FA_C(A_2,B_2,C_1) and C_3 = FA_C(A_3,B_3,C_2) and S_0 = FA_S(A_0,B_0,0) and S_1 = FA_S(A_1,B_1,C_0) and S_2 = FA_S(A_2,B_2,C_1) and S_3 = FA_S(A_3,B_3,C_2) and S_4 = FA_S(0,0,C_3).
Other arithmetic operations on binary numbers can be expressed using propositional logic as well; the details can be found in the linked articles, as well as Two's complement for handling signed integers and IEEE 754 for floating point numbers.
In the following, we will simply write formulas such as S=A+B, with the understanding that this is internally represented using the appropriate bit vectors.
At this point, we can start considering how to express program behavior in propositional logic.
Let us start with a very simple program:
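For instance (the function name is illustrative; the parameters a and b and the return value match the discussion below):

```c
int add(int a, int b)
{
  return a + b;
}
```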
To describe the behavior of this program, we introduce the appropriately-sized bit vectors A and B, and an additional helper vector return. The A and B bit vectors reflect the values of the parameters a and b, while return contains the return value of the function. As we have seen above, we can describe the value of a+b as A+B – remember that this is an abbreviation for a moderately complex formula on bit vectors!
From the semantics of the return instruction, we know that this program will return the value a+b, so we can describe its behavior as return = A+B.
Let us consider a slightly more complex program.
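Again a sketch (the function name is illustrative), with the three instructions referenced below:

```c
int square_plus(int x)
{
  int y = x * x;
  y = y + x;
  return y;
}
```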
We again introduce several bit vectors. For the parameter x, we introduce a bit vector X, and for the return value, return. But we also have to deal with the (local) variable y, which gets two assignments. Furthermore, we now have a program with three instructions.
Thinking about the approach so far, we find that we cannot directly translate y=y+x into a propositional formula: On the left-hand side of the =, we have the ''new'' value of y, while the right-hand side uses the ''old'' value. But propositional formulas do not distinguish between ''old'' and ''new'' values, so we need to find a way to distinguish between the two. The easiest solution is to use the Static Single Assignment form described in section SSA. We transform the program into SSA (slightly simplifying the notation):
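Keeping the illustrative function name from above:

```
int square_plus(int x.1)
{
  int y.1 = x.1 * x.1;
  y.2 = y.1 + x.1;
  return y.2;
}
```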
In this form, we know that each variable is assigned to at most once. To capture the behavior of this program, we translate it statement-by-statement into a propositional formula. We introduce two bit vectors Y1 and Y2 to stand for y.1 and y.2 (we map X to x.1 and return to the return value). int y.1 = x.1 * x.1 becomes Y1 = X * X, y.2 = y.1 + x.1 becomes Y2 = Y1 + X and return y.2 becomes return = Y2.
To tie the three formulas together into a description of the whole program, we note that the three instructions form a single basic block, so we know they are always executed as a unit. In this case, it is sufficient to simply connect them with ''and'': Y1 = X * X and Y2 = Y1 + X and return = Y2. Note that thanks to SSA, we do not need to pay attention to control flow here.
One example of non-trivial control flow are if statements. Consider the following example:
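This is the max function referenced below, reconstructed from the encoding that follows:

```c
int max(int a, int b)
{
  int result;
  if (a < b)
    result = b;
  else
    result = a;
  return result;
}
```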
Bringing this into SSA form, we have the following program (we write Phi for Φ):
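```
int max(int a, int b)
{
  if (a < b)
    result.1 = b;
  else
    result.2 = a;
  return Phi(result.1, result.2);
}
```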
We again introduce bit vectors A (for parameter a), B (for parameter b), R1 (for result.1), R2 (for result.2) and return (for the return value). The interesting question in this case is how we can handle the Φ node: so far, it is a ''magic'' operator that selects the correct value.
As a first step, we modify the SSA form slightly by introducing an additional propositional variable C that tracks which branch of the if was taken. We call this variable the code guard variable, or guard for short. Additionally, we add C to the Φ node as a new first parameter, describing which input to use as a result. The corresponding program looks something like this:
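```
int max(int a, int b)
{
  C = a < b;
  if (C)
    result.1 = b;
  else
    result.2 = a;
  return Phi(C, result.1, result.2);
}
```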
For the encoding of the program, we introduce the implication junctor, written ⇒, where ''A ⇒ B'' is equivalent to ''(not A) or B''. It can be understood as ''if A holds, B must hold as well''.
With these ingredients, we can encode the program. First of all, we translate the basic statements of the program: C = a < b becomes C = A<B, result.1 = b becomes R1 = B, and result.2 = a becomes R2 = A.
Finally, the Φ node is again resolved using the ⇒ junctor: we can encode the return statement as (C ⇒ return = R1) and ((not C) ⇒ return = R2).
At this point, it remains to tie the statements together; we find that we can again simply connect them with ''and''. We get:
C = A<B and R1 = B and R2 = A and (C ⇒ return = R1) and ((not C) ⇒ return = R2).
So far, we have only discussed how to encode the behavior of programs as propositional formulas. To actually reason about programs, we also need a way to describe the property we want to prove. To do this, we introduce a primitive ASSERT. Let e be some expression; then ASSERT(e) is supposed to do nothing if e evaluates to true, and to abort the program if e evaluates to false.
For instance, we can try to prove that max(a,b) <= a by modifying the max function into
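```c
int max(int a, int b)
{
  int result;
  if (a < b)
    result = b;
  else
    result = a;
  ASSERT(result <= a);
  return result;
}
```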
The corresponding SSA would be
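```
int max(int a, int b)
{
  C = a < b;
  if (C)
    result.1 = b;
  else
    result.2 = a;
  ASSERT(Phi(C, result.1, result.2) <= a);
  return Phi(C, result.1, result.2);
}
```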
We translate ASSERT(Phi(C,result.1,result.2) <= a) into (C ⇒ R1 <= A) and ((not C) ⇒ R2 <= A). The resulting formula would be C = A<B and R1 = B and R2 = A and (C ⇒ R1 <= A) and ((not C) ⇒ R2 <= A) and (C ⇒ return = R1) and ((not C) ⇒ return = R2).
We can extend this approach quite straightforwardly to other constructs, but one obvious problem remains: we have not described how to handle loops. This turns out to be a substantial issue for theoretical reasons: programs without loops (and without recursive functions) always run for a bounded number of execution steps, so we can easily write down formulas that, essentially, track their entire execution history. But programs with loops can take arbitrarily many steps, or even run into infinite loops, so we cannot describe their behavior using a finite propositional formula in the way we have done above.
There are multiple approaches to deal with this problem, all with different trade-offs. CBMC chooses bounded model checking as the underlying approach. The idea is that we only consider program executions that are, in some measure, ''shorter'' than a given bound. This bound then implies an upper bound on the size of the required formulas, which makes the problem tractable.
To make this concrete, let's go back to the factorial example. We consider the function in the following form (in CPROVER, we actually use a goto-based IR, but the while-based version is a bit simpler to understand):
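```c
unsigned long factorial(unsigned int n)
{
  unsigned long fac = 1;
  unsigned int i = 1;
  while (i <= n)
  {
    fac *= i;
    i += 1;
  }
  return fac;
}
```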
We set the following ''length bound'': The loops in the program are allowed to be executed at most three times; we will ignore all other executions. With this in mind, we observe that the following two classes of programs behave exactly the same:
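```
/* condition and body stand for an arbitrary loop condition and loop body */
while (condition)
  body;
```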
and
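```
/* the same loop, unrolled once */
if (condition)
{
  body;
  while (condition)
    body;
}
```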
This transformation is known as loop unrolling or unwinding: we can always replace a loop by an if that checks the loop condition, followed by a copy of the loop body and the loop statement itself.
So, to reason about the factorial function, we unroll the loop three times and then replace the loop with ASSERT(!(condition)). We get:
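```
/* loop unrolled three times; deeper executions excluded via ASSERT */
unsigned long factorial(unsigned int n)
{
  unsigned long fac = 1;
  unsigned int i = 1;
  if (i <= n)
  {
    fac *= i;
    i += 1;
    if (i <= n)
    {
      fac *= i;
      i += 1;
      if (i <= n)
      {
        fac *= i;
        i += 1;
        ASSERT(!(i <= n));
      }
    }
  }
  return fac;
}
```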
Since this program is now in a simpler form without loops, we can use the techniques we learned above to transform it into a propositional formula. First, we transform into SSA (with code guard variables already included):
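```
/* a sketch, reconstructed to match the formula below */
unsigned long fac.1 = 1;
unsigned int i.1 = 1;
C1 = i.1 <= n;
if (C1)
{
  fac.2 = fac.1 * i.1;
  i.2 = i.1 + 1;
  C2 = i.2 <= n;
  if (C2)
  {
    fac.3 = fac.2 * i.2;
    i.3 = i.2 + 1;
    C3 = i.3 <= n;
    if (C3)
    {
      fac.4 = fac.3 * i.3;
      i.4 = i.3 + 1;
      C4 = i.4 <= n;
      ASSERT(!(i.4 <= n));
    }
  }
}
return Phi(...); /* fac.4, fac.3, fac.2 or fac.1, depending on C1..C3 */
```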
Note that we may be missing possible executions of the program due to this translation; we come back to this point later.
The corresponding propositional formula can then be written as (check that this is equivalent to the formula you would be getting by following the translation procedure described above):
fac.1 = 1 and i.1 = 1 and C1 = i.1 <= n and fac.2 = fac.1 * i.1 and i.2 = i.1 + 1 and C2 = i.2 <= n and fac.3 = fac.2 * i.2 and i.3 = i.2 + 1 and C3 = i.3 <= n and fac.4 = fac.3 * i.3 and i.4 = i.3 + 1 and C4 = i.4 <= n and not (i.4 <= n) and ((C1 and C2 and C3) ⇒ result = fac.4) and ((C1 and C2 and not C3) ⇒ result = fac.3) and ((C1 and not C2) ⇒ result = fac.2) and ((not C1) ⇒ result = fac.1)

In the following, we reference this formula as FA(n, result).
At this point, we know how to encode programs as propositional formulas. Our goal was to reason about programs, and in particular, to check whether a certain property holds. Suppose, for example, that we want to check if there is a way that the factorial function returns 6. One way to do this is to look at the following propositional formula: FA(n, result) and result = 6. If this formula has a model (i.e., if we can find a satisfying assignment to all variables, and in particular, to n), we can extract the required value for the parameter n from that model. As we have discussed above, this can be done using a SAT solver: If you run, say, MiniSAT on this formula, you will get a model that translates to n=3.
Be aware that this method has very clear limits: We know that the factorial of 5 is 120, but with the formula above, evaluating ''FA(n, result) and result=120'' would yield ''unsatisfiable''! This is because we limited the number of loop iterations, and to reach 120, we have to execute the loop more than three times. In particular, a ''VERIFICATION SUCCESSFUL'' message, as output by CBMC, must always be interpreted keeping the bounds that were used in mind.
In the case that we find a model, we can get even more information: we can even reconstruct the program execution that would lead to the requested result. This can be achieved by tracing the SSA, using the guard variables to decide which branches of if statements to take. In this way, we can simulate the behavior of the program. Writing down the steps of this simulation yields a trace, which describes what the corresponding program execution would look like.
The above section gives only a superficial overview on how SAT solving and bounded model checking work. Inside the CPROVER framework, we use a significantly more advanced engine, with numerous optimizations to the basic algorithms presented above. One feature that stands out is that we do not reduce everything to propositional logic, but instead use a more powerful logic, namely quantifier-free first-order logic. The main difference is that instead of propositional variables, we allow expressions that return Boolean values, such as comparisons between numbers or string matching expressions. This gives us a richer logic to express properties. Of course, a simple SAT solver cannot deal with such formulas, which is why we go to SMT solvers instead - these solvers can deal with specific classes of first-order formulas (like the ones we produce). One well-known SMT solver is Z3.
While BMC analyzes the program by transforming everything to logic formulas and, essentially, running the program on sets of concrete states, another approach to learn about a program is based on the idea of interpreting an abstract version of the program. This is known as abstract interpretation. Abstract interpretation is one of the main methods in the area of static analysis.
The key idea is that instead of looking at concrete program states (e.g., ''variable x contains value 1''), we look at some sufficiently-precise abstraction (e.g., ''variable x is odd'', or ''variable x is positive''), and perform interpretation of the program using such abstract values. Coming back to our running example, we wish to prove that the factorial function never returns 0.
An abstract interpretation is made up from four ingredients: an abstract domain, describing the information we track about values; abstract state transformers, describing the effect of expressions and statements on abstract values; a map from program locations and variables to abstract values; and a fixpoint algorithm that propagates abstract values through the program until nothing changes anymore.
The first ingredient we need for abstract interpretation is the abstract domain. The domain allows us to express what we know about a given variable or value at a given program location; in our example, whether it is zero or not. We use the abstract domain as follows: for each program point, we have a map from visible variables to elements of the abstract domain, describing what we know about the values of the variables at this point.
For instance, consider the factorial example again. After running the first basic block, we know that fac and i both contain 1, so we have a map that associates both fac and i to "not 0".
An abstract domain is a set $D$ (or, if you prefer, a data type) with the following properties: it carries a partial order that ranks elements by precision, it has a merge (join) operation that combines any two elements into their least upper bound, and it has a least element, "bottom", standing for "no information yet".
Algebraically speaking, $D$ needs to be a semi-lattice.
For our example, we use the following domain: a bottom element (meaning "no information yet"), the values "equals 0" and "not 0", and a top value "could be 0" (meaning we cannot say anything about whether the value is zero).
The second ingredient we need are the abstract state transformers. An abstract state transformer describes how a specific expression or statement processes abstract values. For the example, we need to define abstract state transformers for multiplication and addition.
Let us start with multiplication, so let us look at the expression x*y. We know that if x or y are 0, x*y is zero. Thus, if x or y are "equals 0", the result of x*y is also "equals 0". If x or y are "could be 0" (but neither is "equals 0"), we simply don't know what the result is - it could be zero or not. Thus, in this case, the result is "could be 0".
What if x and y are both "not 0"? In a mathematically ideal world, we would have x*y be non-zero as well. But in a C program, multiplication could overflow, yielding 0. So, to be correct, we have to yield "could be 0" in this case.
Finally, when x is bottom, we can just return whatever value we had assigned to y, and vice versa.
For addition, the situation looks like this: consider x+y. If neither x nor y is "not 0", but at least one is "could be 0", the result is "could be 0". If both are "equals 0", the result is "equals 0". What if both are "not 0"? It seems that, since the variables are unsigned and not zero, it should be "not 0". Sadly, overflow strikes again, and we have to make do with "could be 0". The bottom cases can be handled just like for multiplication.
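To make this concrete, here is a small sketch in C (our own illustration, not code from the framework) of the domain and the transformers just described:

```c
/* The four abstract values; BOTTOM means "no information yet". */
enum zerot { BOTTOM, EQUALS_0, NOT_0, COULD_BE_0 };

/* Merge (join), as used when control-flow paths meet. */
enum zerot merge(enum zerot a, enum zerot b)
{
  if (a == BOTTOM) return b;
  if (b == BOTTOM) return a;
  if (a == b) return a;
  return COULD_BE_0; /* differing non-bottom values: we only know it could be 0 */
}

/* Abstract transformer for x*y. */
enum zerot abstract_mul(enum zerot x, enum zerot y)
{
  if (x == BOTTOM) return y;
  if (y == BOTTOM) return x;
  if (x == EQUALS_0 || y == EQUALS_0) return EQUALS_0;
  /* even NOT_0 * NOT_0 may overflow to 0 */
  return COULD_BE_0;
}

/* Abstract transformer for x+y. */
enum zerot abstract_add(enum zerot x, enum zerot y)
{
  if (x == BOTTOM) return y;
  if (y == BOTTOM) return x;
  if (x == EQUALS_0) return y; /* x + 0 = x */
  if (y == EQUALS_0) return x;
  /* NOT_0 + NOT_0 may overflow to 0; anything involving COULD_BE_0 stays COULD_BE_0 */
  return COULD_BE_0;
}
```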
The way of defining the transformation functions above showcases another important property: they must reflect the actual behavior of the underlying program instructions. There is a formal description of this property, using Galois connections; for the details, it is best to look at the literature.
The third ingredient is straightforward: we use a simple map from program locations and variable names to values in the abstract domain. In more complex analyses, more involved forms of maps may be used (e.g., to handle arbitrary procedure calls, or to account for the heap).
At this point, we have almost all the ingredients we need to set up an abstract interpretation. To actually analyze a function, we take its CFG and perform a fixpoint algorithm.
Concretely, let us consider the CFG for factorial again. This time, we have named the basic blocks, and simplified the variable names.
We provide an initial variable map: n is "could be 0" before BB1. As a first step, we analyze BB1 and find that the final variable map should be: n ↦ "could be 0", fac ↦ "not 0", i ↦ "not 0". Call this map N.
At this point, we look at all the outgoing edges of BB1 - we wish to propagate our new results to all blocks that follow BB1. There is only one such block, BB2. We analyze BB2 and find that it doesn't change any variables. Furthermore, the result of i <= n does not allow us to infer anything about the values in the variable map, so we get the same variable map at the end of BB2 as before.
Again, we look at the outgoing edges. There are two successor blocks, BB3 and BB4. The information in our variable map does not allow us to rule out either of the branches, so we need to propagate to both blocks. We start with BB3 and remember that we need to visit BB4.
At this point, we know that n "could be 0", while fac and i are "not 0". Applying the abstract transformers, we learn that afterwards, both fac and i "could be 0" (fac ends up as "could be 0" since both fac and i were "not 0" initially, and i ends up as "could be 0" because both i and 1 are "not 0"). So, the final variable map at the end of BB3 is: n ↦ "could be 0", fac ↦ "could be 0", i ↦ "could be 0". Call this map S.
At this point, we propagate again, this time to BB2. But wait, we have propagated to BB2 before! The way this is handled is as follows: we first calculate the result of running BB2 from S; this yields S again. Now, we know that at the end of BB2, we can be either in state S or in state N. To get a single state out of these two, we merge: for each variable, merge the mapping of that variable in S and N. We get: n ↦ "could be 0", fac ↦ "could be 0", i ↦ "could be 0" - merging "not 0" with "could be 0" yields "could be 0", so the result is S again.
At this point, we propagate to BB3 and BB4 again. Running BB3, we again end up with state S at the end of BB3, so things don't change. We detect this situation and figure out that we do not need to propagate from BB3 to BB2 - this would not change anything! Thus, we can now propagate to BB4. The state at the end of BB4 is also S.
Now, since we know the variable map at the end of BB4, we can look up the properties of the return value: in S, fac maps to "could be 0", so we could not prove that the function never returns 0. In fact, this is correct: calculating factorial(200) will yield 0, since the 64-bit integer fac overflows.
Nevertheless, let us consider what would happen in a mathematically ideal world (e.g., if we used big integers). In that case, we would get "not 0" * "not 0" = "not 0", "not 0" + x = "not 0" and x + "not 0" = "not 0". Running the abstract interpretation with these semantics, we find that if we start BB3 with variable map N, we get variable map N at the end as well, so we end up with variable map N at the end of BB4 - but this means that fac maps to "not 0"!
The algorithm sketched above is called a fixpoint algorithm because we keep propagating until we find that applying the transformers does not yield any new results (which, in a mathematically precise way, can be shown to be equivalent to calculating, for a specific function f, an x such that f(x) = x).
This overview only describes the most basic way of performing abstract interpretation. For one, it only works on the procedure level (we say it is an intra-procedural analysis); there are various ways of extending it to work across function boundaries, yielding an inter-procedural analysis. Additionally, there are situations where it can be helpful to make a variable map more abstract (widening) or to use information that can be gained from various sources to make it more precise (narrowing); these advanced topics can be found in the literature as well.
To instrument a piece of code means to modify it by (usually) inserting new fragments of code that, when executed, tell us something useful about the code that has been instrumented.
For instance, imagine you are given the following function:
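For illustration, suppose aha looks like this (the body is hypothetical; any function with branches works):

```c
int aha(int x)
{
  if (x > 3)
    return x * 2;
  else
    return x + 1;
}
```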
and you want to design an analysis that figures out which lines of code will be covered when aha is executed with the input 5.
We can instrument the code so that the function will look like:
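Continuing the hypothetical example, with illustrative line numbers:

```c
void __runtime_line_seen(int line);

int aha(int x)
{
  __runtime_line_seen(1);
  if (x > 3)
  {
    __runtime_line_seen(2);
    return x * 2;
  }
  else
  {
    __runtime_line_seen(3);
    return x + 1;
  }
}
```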
All we have to do now is to implement the function void __runtime_line_seen(int line) so that when executed it logs somewhere what line is being visited. Finally we execute the instrumented version of aha and collect the desired information from the log.
More generally speaking, and especially within the CPROVER code base, instrumenting the code often refers to modifying its behavior in a manner that makes it easier for a given analysis to do its job, regardless of whether that is achieved by executing the instrumented code or by just analyzing it in some other way.
As we have seen above, we often operate on many different representations of programs, such as ASTs, control flow graphs, SSA programs, logical formulas in BMC and so on. Each of these forms is good for certain kinds of analyses, transformations or optimizations.
One important kind of step in dealing with program representations is going from one representation to the other. Often, such steps are going from a more ''high-level'' representation (closer to the source code) to a more ''low-level'' representation. Such transformation steps are known as flattening or lowering steps, and tend to be more-or-less irreversible.
An example of a lowering step is the transformation from ASTs to the GOTO IR, given above.
In the CPROVER framework, the term verification condition is used in a somewhat non-standard way. Let a program and a set of assertions be given. We transform the program into an (acyclic) SSA (i.e., an SSA with all loops unrolled a finite number of times) and turn it into a logical formula, as described above. Note that in this case, the formula will also contain information about what the program does after the assertion is reached: this part of the formula is, in fact, irrelevant for deciding whether the program can satisfy the assertion or not. The verification condition is the part of the formula that only covers the program execution until the line that checks the assertion has been executed, with everything that comes after it removed.
This section provides a graphical overview of how CBMC fits together. CBMC takes C code or a goto-binary as input and tries to emit traces of executions that lead to crashes or undefined behaviour. The diagram below shows the intermediate steps in this process.
For an explanation of the data structures involved in the modeling of a GOTO program (the GOTO Intermediate Representation used by CBMC and assorted tools) please see Central Data Structures.
To be documented.
For an overview of the transformations applied to goto programs after the generation of the initial goto program and before BMC, see Goto Program Transformations.
To be documented.
For an explanation of part of how the BMC (Symex) process works, please refer to Symex and GOTO program instructions.
To be documented.
The following is some light technical documentation of the major data structures used in the input transformation to Intermediate Representation (IR) inside CBMC and the assorted CProver tools.
The goto_modelt class is the top-level data structure that CBMC (and the other tools) use for holding the GOTO intermediate representation. The following diagram is a simplified view of how the data structures relate to each other.
A goto_modelt is effectively a pair, consisting of a symbol table (holding the symbols referenced by the program) and a collection of goto functions (a map from function identifiers to the functions themselves).
In pseudocode, the type looks like this:
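A sketch (the member names follow the actual class; see goto_model.h):

```
type goto_modelt {
  symbol_tablet symbol_table;      // the symbols referenced by the program
  goto_functionst goto_functions;  // map from function names to goto functions
}
```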
There is an abstract interface for goto models provided by the abstract_goto_modelt class. This is defined and documented in the file src/goto-programs/abstract_goto_model.h. Ideally code which takes a goto model as input should accept any implementation of the abstract interface rather than accepting the concrete goto_modelt exclusively. This helps with compatibility with jbmc which uses lazy loading. See the lazy_goto_modelt class which is defined and documented in jbmc/src/java_bytecode/lazy_goto_model.h for details of lazy loading.
For further information about symbols see the symbolt class which is defined and documented in src/util/symbol.h and the symbol_exprt (symbol expression) class which is defined and documented in src/util/std_expr.h.
A goto_functiont is also defined as a pair. It's designed to represent a function at the goto level, and effectively it's the following data type (in pseudocode):
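A sketch (see goto_function.h for the concrete class):

```
type goto_functiont {
  goto_programt body;                      // the function body; may be empty
  list<identifier> parameter_identifiers;  // parameters, resolved via the symbol table
}
```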
The goto_programt denoting the body of the function will be the subject of a more elaborate explanation in the next section. The in-memory structure of a goto function allows the body to be optional, but only functions with bodies are included in the serialised goto binaries.
The parameters subcomponent is a list of identifiers that are to be looked up in the symbol table for their definitions.
A goto program is a sequence of GOTO instructions (goto_instructiont). For details of the goto_programt class itself see the documentation in goto_program.h. For further details of the life cycle of goto programs see goto programs.
An instruction (goto_instructiont) is a triple (an element with three subcomponents), describing a GOTO level instruction with the following 3 component subtypes, again in pseudocode:
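A sketch (the member names are approximate; see goto_program.h for the concrete class):

```
type goto_instructiont {
  instruction_typet instruction_type;  // e.g. GOTO, ASSIGN, ASSERT, FUNCTION_CALL
  codet code;                          // the statement performed by the instruction
  exprt guard;                         // Boolean condition, e.g. for conditional gotos
}
```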
The pseudocode subtypes above require some more elaboration, so we will provide some extra information to illuminate the details at the code level:
Another important data structure that needs to be discussed at this point is source_locationt.
Instances of source_locationt are attached to various exprts (the data structure representing expressions, usually the result of some early processing, e.g. of the frontend parsing a file).
This means that codets, representing GOTO instructions, also have a source_locationt attached to them, by virtue of inheriting from exprt.
For the most part, source_locationt is only used when we print various nodes (for property listing, counterexample/trace printing, etc).
A specific source location may point to a CBMC instrumentation primitive (which may be reported as a built-in addition), or there may be no source location at all (for example, for parts of automatically generated harnesses, which have no presence in the user code).
ireps are the underlying data structure used to implement many of the data structures in CBMC and the assorted tools. These include the exprt, typet, codet and source_locationt classes. This is a tree data structure where each node is expected to contain a string/ID and may have child nodes stored in both a sequence of child nodes and a map of strings/IDs to child nodes. This enables the singular irept data structure to be used to model graphs such as ASTs, CFGs, etc.
The classes extending irept define how the higher level concepts are mapped onto the underlying tree data structure. For this reason it is usually advised that the member functions of the sub-classes should be used to access the data held, rather than the member functions of the base irept class. This aids potential future restructuring and associates accesses of the data with the member functions which have the doxygen explaining what the data is.
The strings/IDs held within irept are of type irep_idt. These can be converted to std::string using the id2string function. There is a mechanism provided for casting expressions and types in src/util/expr_cast.h. In depth documentation of the irept class itself can be found in src/util/irep.h.
The core structures used for representing abstract syntax trees are all documented in util.
See goto-programs, goto_programt and instructiont.
See langapi.
See ansi-c.
See cpp.
See java_bytecode.
See analyses and pointer-analysis.
The CBMC source code is available on its GitHub page.
Instructions for compiling CBMC using makefiles are available in COMPILING.md in the root of the CBMC repository. They cover Linux, Solaris 11, FreeBSD 11, MacOS X and Windows.
There is also support for compiling using CMake. Instructions are available in COMPILING.md in the root of the CBMC repository.
Two files, config.inc and common, are included in every makefile. config.inc contains configuration options relating to compilation so that they can be conveniently edited in one place. common contains commands that are needed in every makefile but which the developer is not expected to edit. (There is another config.inc in the jbmc folder which is included in every makefile there.)
Note, these files are not involved in the CMake build.
If the macro DEBUG is defined during compilation of CBMC (for example by using a compiler flag) then extra debug code will be included. This includes print statements and code checking that data structures are as expected.
The regression tests are contained in regression/ and jbmc/regression/. Inside these folders there is a directory for each of the modules. Each of these contains multiple test directories, with names describing what they test. When there are multiple tests in a test directory then they should all test very similar aspects of the program's behaviour. Each test directory contains input files and one or more test description files, which have the ending .desc. The test description files specify what command to run, what output is expected and so on. The test framework is a Perl script, test.pl, located in regression/test.pl.
The --help option to test.pl outlines the format of the test description files. Most importantly, the first word in a test description file is its level, which is one of: CORE (should be run in CI, should succeed), THOROUGH (takes too long to be run in CI, should succeed), FUTURE (will succeed when a planned feature is added) or KNOWNBUG (will succeed when a bug is fixed).
Test descriptions may also include a number of tags. test.pl -I<tag> will only run tests with a particular <tag>, and test.pl -X<tag> will run all tests except for those with a particular <tag>. See regression/README.md for the current set of tags and their intended use.
If you have compiled using make then you can run the regression tests using make test. Run it from regression/ to run all the regression tests, or any of its subfolders to just run the tests for that module. The number of tests that are run in parallel can be controlled through the environment variable TESTPL_JOBS.
If you have not compiled using make then this won't work, because the makefile is expecting to find binaries like cbmc and jbmc in the source folders.
If you have compiled using CMake then you can run the regression tests using CTest. (Note: this will also run the unit tests.)
Here are two example commands, to be run from the build/ directory:
ctest -V -L CORE -R cpp
ctest -V -L CORE -R cpp -E cbmc-cpp
-V makes it print out more useful output. -L CORE makes it only run tests that have been tagged CORE. -R regular_expression limits the tests run to those matching the given regular expression, and -E regular_expression excludes tests matching the given regular expression. So the first command will run all the CORE tests in regression/cbmc/cpp and regression/cbmc/cbmc-cpp, and the second will run all the CORE tests in regression/cbmc/cpp only. Another useful option is -N, which makes CTest list which tests it will run without actually running them.
It can be useful to run a single test folder in isolation. This can be done by running regression/test.pl directly. The way that test.pl is run varies between the different modules, and can be ascertained by looking at the test target in the makefile. The simple case is when there isn't a file called chain.sh. Then you can directly run test.pl on a single test folder with the following command from the module directory (note that it is recommended to use absolute paths as it avoids many issues, e.g. the path to the binary should be relative to <test-folder>):
<absolute-path-to-test.pl> -p -c <absolute-path-to-binary> <test-folder>
-p makes it print a log of failed tests and -c tells it where to find the binary to run, e.g. cbmc, jbmc or goto-analyzer. If <test-folder> is not provided then all test directories are run. The --help option lists all command line options, including -j for running multiple tests in parallel and -C, -T, -F and -K for controlling whether CORE, THOROUGH, FUTURE or KNOWNBUG tests are run.
When there is a file called chain.sh then the command should look like
<absolute-path-to-test.pl> -p -c '<absolute-path-to-chain-sh> <arg-1> ... <arg-n>' <test-folder>
Note that the test binary and its initial command line arguments are a single value for the -c option, so they must be wrapped in quotes if they contain any unescaped spaces. What to put for the arguments <arg-1> to <arg-n> varies from module to module. To find out, look in chain.sh and see what arguments it expects. You can also look in the Makefile and see how it calls chain.sh in the test target.
The unit tests are contained in the unit/ folder. They are written using the Catch unit test framework.
If you have compiled with make, you can run the unit tests for CBMC directly by going to unit/, running make to compile the unit tests and then make test to run them. You can run the unit tests for JBMC directly by going to jbmc/unit/ and running the same commands.
If you have compiled with CMake, you can run the unit tests for CBMC directly by going to unit/ and running
../<build-folder>/bin/unit
and you can run the unit tests for JBMC directly by going to jbmc/unit/ and running
../../<build-folder>/bin/java-unit
If you have compiled with CMake you can also run the unit tests through CTest, with the names unit and java-unit. So, from the build/ directory, run
ctest -V -L CORE -R ^unit
ctest -V -L CORE -R java-unit
to run the CBMC unit tests and the JBMC unit tests respectively. (The ^ is needed to make it a regular expression that matches unit but not java-unit.)
Note that some tests which are expected to fail are run - see the summary at the end of the run to see how many tests passed, how many failed as expected, and how many failed unexpectedly.
For more information on the structure of unit/ and how to tag tests, see the section on unit tests in CODING_STANDARD.md in the root of the CBMC repository.
On Unix-style systems you can automatically generate a code coverage report. To obtain an HTML report for the regression and unit tests, first build the dedicated coverage configuration using CMake (setting enable_coverage and building the coverage target):
cmake -S . -Bcov-build -Denable_coverage=1 -Dparallel_tests=2
make -C cov-build coverage
This configures a build environment in the cov-build/ folder with coverage recording enabled at runtime. The actual build (using make in the above case) will run the test suite, running as many tests concurrently as specified by parallel_tests (2 in the above case). The HTML report is generated using lcov and stored in cov-build/html/.
By default, CBMC will assume MiniSat 2 as the SAT back-end. Several other solvers are supported (see also [config.inc](compilation-and-development-subsubsection-config-inc) above). As a more general option, which is not limited to a single SAT solver, you may use the IPASIR interface. For example, to use the SAT solver RISS, proceed as follows:
1) Build RISS (in particular its IPASIR configuration):
git clone https://github.com/conp-solutions/riss riss.git
cd riss.git
mkdir build
cd build
cmake ..
make riss-coprocessor-lib-static
cd ../..
2) Build CBMC while enabling the IPASIR back-end:

make -C src IPASIR=../../riss.git/riss \
  LIBS="../../riss.git/build/lib/libriss-coprocessor.a -lpthread"
3) Run CBMC - note that RISS is highly configurable via the RISSCONFIG environment variable:

export RISSCONFIG=VERBOSE:BMC1
... run CBMC ...
Apart from the (user-orientated) CBMC user manual and this document, most of the rest of the documentation is inline in the code, as doxygen comments and ordinary comments. A man page for CBMC, goto-cc and goto-instrument is contained in the doc/ directory and describes some options for these tools. All of these could be improved and patches are very welcome. In some cases the algorithms used are described in the relevant papers.
The doxygen documentation can be accessed online. To build it locally, run scripts/run_doxygen.sh. HTML output will be created in doc/html/. The index page is doc/html/index.html. This script will filter out expected warning messages from doxygen, so that new problems are more obvious. It is important to use the correct version of doxygen, as specified in run_doxygen.sh, so that there are no unexpected changes to the list of expected warnings. In the event that any change fixes an old warning, then the corresponding line(s) should be deleted from scripts/expected_doxygen_warnings.txt. We want to avoid adding any more warnings to this list of expected warnings, but that can be done to work around limitations in Doxygen (where the code and documentation are completely correct).
The CODING_STANDARD.md file in the root of the CBMC repository contains guidance on how to write code for the CBMC repository. This includes which language features can be used and formatting rules.
C++ code can be automatically reformatted in the correct way by running clang-format. There are more details in CODING_STANDARD.md.
There is also a linting script, scripts/cpplint.py. There is a wrapper script to run cpplint.py only on lines that differ from another branch, e.g. to run it on lines that have been changed from develop:
scripts/run_lint.sh develop
There are also instructions for adding this as a git pre-commit hook in CODING_STANDARD.md.
To do time profiling with a tool like gprof, the flags -g (build with debug symbols) and -pg (enable profiling information) must be used when compiling, and -pg must be used when linking. If you are building with cmake you can just add "-Denable_profiling=1" to your cmake invocation, and reload cmake before building your desired binary. Note that these flags require everything to be rebuilt, so it will take a long time even if you are using ccache.
Run your binary as normal. A file called gmon.out will be created in the working directory of the binary at the end of execution. In most instances this will be the same as the working directory at the beginning of execution. It is also possible to choose the output location by setting the environment variable GMON_OUT_PREFIX - the output file location is then whatever you set it to with the process id appended to the end.
Make sure gprof is installed by running gprof -v. If it is not installed then run sudo apt install binutils.
Run gprof <path-to-binary> <path-to-gmon.out> and redirect the output to a file. This will take a while to run - e.g. 12 minutes for test-gen run on a trivial function.
The output file will now be a large text file. There are two sections: the "flat profile", which ignores context, and just tells you how much time was spent in each function; and the "call graph", which includes context, and tells you how much time was spent within each call stack. For more information see online tutorials, like https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_chapter/gprof_5.html
The source code is divided into a number of sub-directories, each containing the code for a different part of the system.
In the top level of src there are only a few files:
The doc/ directory contains the CBMC man page. Doxygen HTML pages are generated into the doc/html directory when running doxygen from src.
The regression/ directory contains the regression test suites. See Compilation and Development for information on how to run and develop regression tests.
The unit/ directory contains the unit test suites. See Compilation and Development for information on how to run and develop unit tests.
This diagram shows intended directory dependencies. Arrows should be read transitively - dependencies of dependencies are often used directly.
There are module_dependencies.txt files in each directory, which are checked by the linter. Where directories in module_dependencies.txt are marked with comments such as 'dubious' or 'should go away', these dependencies have generally not been included in the diagram.
This section lists the transformation passes that must have been applied for the goto model to be in symex ready goto format.
Note that the passes are listed below in the order they are currently called in CBMC. While the dependencies between the passes are not fully documented, the following order should be used.
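To make the ordering concrete, the following sketch chains the passes in the documented order. This is not CBMC's actual driver code: only the entry points whose signatures are given in this section are called directly, the three passes whose full argument lists are not given here appear as comments, and the includes (each pass's header, as named in the subsections below) are omitted.

```cpp
// Schematic sketch of the mandatory pass pipeline, in documented order.
void make_symex_ready(
  goto_modelt &goto_model,
  const optionst &options,
  message_handlert &message_handler)
{
  remove_asm(goto_model, message_handler); // Removal/Lowering of Assembly
  // Linking of Standard Libraries: link_to_library(...)
  // Removal/Lowering of Function Pointers: remove_function_pointers(...)
  // Memory Mapped IO Instrumentation: mm_io(...)
  instrument_preconditions(goto_model);
  remove_returns(goto_model);
  remove_vector(goto_model);
  remove_complex(goto_model);
  rewrite_union(goto_model);
  goto_check_c(options, goto_model, message_handler);
  adjust_float_expressions(goto_model);
  goto_model.goto_functions.update();
  add_failed_symbols(goto_model.symbol_table);
  remove_skip(goto_model);
  label_properties(goto_model);
}
```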
This pass goes through the goto model and removes or lowers instances of assembly instructions. Assembly instructions are stored in instances of the OTHER instruction.
The implementation of this pass is called via the remove_asm(goto_modelt &, message_handlert &) function. For full documentation of this pass see remove_asm.h
This pass has no predecessor.
This pass links the function definitions from the CPROVER standard library models into the current goto model.
The implementation of this pass is called via the link_to_library function.
Predecessor pass is Removal/Lowering of Assembly.
This pass finds all the instructions which are function calls to the value of a function pointer. Each instruction is then replaced with a switch block. The switch block contains a case for each of the potential targets of the function pointer. The targets are found by looking for all functions in the goto program that approximately match the type of the function pointer.
Note that for this pass to work as intended, all potential targets of calls through function pointers must be present in the model; otherwise they will not be added to the relevant switch blocks. This may cause the assertion that a switch block's fallthrough is unreachable to be reported as violated via a spurious counterexample.
The implementation of this pass is called via the remove_function_pointers function. Detailed documentation of this pass belongs in remove_function_pointers.h
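As a hypothetical illustration (not literal output of the pass), consider a call through a function pointer and the schematic shape of the lowered code:

```cpp
// Hypothetical input: an indirect call through fp.
int add(int x, int y) { return x + y; }
int sub(int x, int y) { return x - y; }

int apply(int (*fp)(int, int), int a, int b)
{
  return fp(a, b);
  // After the pass, the call is (schematically) replaced by a branch
  // over the type-compatible candidates found in the goto model:
  //   if(fp == &add)      r = add(a, b);
  //   else if(fp == &sub) r = sub(a, b);
  //   else                assert(false); // fallthrough must be unreachable
}
```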
Predecessor pass is the Linking of Standard Libraries or the optional String Instrumentation if it is being used.
This pass finds pointer dereferences and adds corresponding calls to the __CPROVER_mm_io_r and __CPROVER_mm_io_w modelling functions if they exist. See the device behaviour section of modeling-mmio.md for details of modeling memory-mapped I/O regions of device interfaces. This pass is always carried out but will only make changes if one of the modelling functions exist.
The implementation of this pass is called via the mm_io function. Further documentation of this pass can be found in mm_io.h
Predecessor pass is Removal/Lowering of Function Pointers .
This pass moves preconditions associated with functions from the head of the function to its call sites. Note that a function may have multiple call sites, in which case the instrumentation is copied to each of them.
The implementation of this is called via instrument_preconditions(goto_modelt &goto_model). Documentation of this pass belongs in instrument_preconditions.h
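The effect can be pictured on a hypothetical example, using assert as a stand-in for the precondition instrumentation:

```cpp
#include <cassert>

// Hypothetical input: a check guarding f's body...
void f(int *p)
{
  assert(p != nullptr); // precondition at the head of the function
  *p = 0;
}

void caller(int *q)
{
  // ...is, after the pass, performed at each call site instead:
  //   assert(q != nullptr);
  f(q);
}
```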
Predecessor pass is Memory Mapped IO Instrumentation.
This pass replaces returns of values with writes and reads to global variables.
The implementation of this is called via remove_returns(goto_modelt &). Detailed documentation of this pass can be found in remove_returns.h
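Schematically, on a hypothetical example (the #return_value naming is per the documentation in remove_returns.h):

```cpp
// Hypothetical before/after, shown as comments:
//
// before:                     after (schematically):
//   int f() { return e; }       f() { f#return_value = e; }
//   r = f();                    f(); r = f#return_value;
//
// where f#return_value is a fresh global variable added by the pass.
```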
The predecessor pass is Instrument/Remove Preconditions, or the optional Partial Inlining if it is being used.
This pass removes/lowers expressions corresponding to CPU instruction sets such as MMX, SSE and AVX. For more details on how this is done see vector_typet and remove_vector.cpp. The implementation of this is called via remove_vector(goto_modelt &goto_model)
Predecessor pass is Removal/Lowering of Return Statements.
This pass handles complex numbers in the mathematical sense: numbers consisting of paired real and imaginary parts. Expressions over such numbers are removed/lowered by this pass. The implementation of this is called via remove_complex(goto_modelt &). Documentation for this pass belongs in remove_complex.h
Predecessor pass is Remove/Lower Vector Typed Expressions.
This pass converts union member access into byte extract and byte update operations.
The implementation of this pass is called via rewrite_union(goto_modelt &)
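For instance (a hypothetical illustration; byte_extract is the internal operation, not C syntax):

```cpp
// Hypothetical input: reading one member of a union...
union ut { int i; float f; };

float get_f(union ut u)
{
  return u.f;
  // ...is rewritten (schematically) into a byte-level reinterpretation
  // of the union's storage, roughly:
  //   byte_extract_little_endian(u, 0, float)
}
```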
Predecessor pass is Remove/Lower Complex Typed Expressions.
This is a pass that can add many checks and much instrumentation, largely related to the definition of the C language. Note that this transformation pass is mandatory in the current implementation; however, it may have no effect depending on the command line options specified.
The implementation of this pass is called via goto_check_c(const optionst &, goto_modelt &, message_handlert &)
Predecessor pass is Rewrite Unions.
This pass transforms general mathematical operations over floating point types into floating-point-specific operator variants that include a rounding mode.
The implementation of this pass is called via adjust_float_expressions(goto_modelt &). Documentation of this pass can be found in adjust_float_expressions.h
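For example (schematic; the internal operator name and the rounding-mode symbol are assumptions based on CPROVER's internal naming):

```cpp
// Hypothetical input: an IEEE float addition...
float add(float x, float y)
{
  return x + y;
  // ...is rewritten (schematically) into a variant that carries the
  // rounding mode as an explicit operand, roughly:
  //   floatbv_plus(x, y, __CPROVER_rounding_mode)
}
```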
Predecessor pass is goto_check_c or the optional Transform Assertions Assumptions if it is being used.
This transformation updates in-memory goto program fields which may have been invalidated by previous passes. In particular, this pass recomputes incoming edges, target numbers, location numbers and loop numbers.
The implementation of this pass is called via goto_functionst::update()
Predecessor pass is Adjust Float Expressions or the optional String Abstraction if it is being used.
This pass adds failed symbols to the symbol table. See src/pointer-analysis/add_failed_symbols.h for details. The implementation of this pass is called via add_failed_symbols(symbol_table_baset &). The purpose of failed symbols is explained in the documentation of the function goto_symext::dereference(exprt &, goto_symex_statet &, bool)
Predecessor pass is Goto Functions Update or the optional Add Non-Deterministic Initialisation of Global Scoped Variables if it is being used.
This transformation removes skip instructions. Note that this transformation is called in many places and may be called as part of other transformations. This instance here is part of the mandatory transformation to reach symex ready goto format. If there is a use case where it is desirable to preserve a "no operation" instruction, a LOCATION type instruction may be used in place of a SKIP instruction. The LOCATION instruction has the same semantics as the SKIP instruction, but is not removed by the remove skip instructions pass.
The implementation of this pass is called via remove_skip(goto_modelt &)
Predecessor pass is Add Failed Symbols or the optional Remove Unused Functions if it is being used.
This transformation adds newly generated unique property identifiers to assert instructions. The property identifiers are stored in the location data structure associated with each instruction.
The implementation of this pass is called via label_properties(goto_modelt &)
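For instance, after this pass each assert instruction carries an identifier of roughly the following shape (illustrative, not literal output; the exact format is an assumption):

```cpp
// Hypothetical goto-program fragment, shown as comments:
//   ASSERT x > 0   // property "f.assertion.1"
//   ASSERT y != 0  // property "f.assertion.2"
// i.e. identifiers are unique per function and numbered consecutively.
```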
Predecessor pass is Remove Skip Instructions or the optional Add Coverage Goals if it is being used.
This section lists the optional transformation passes that will modify a goto model. Note that these are documented here for completeness, but are not required for symex ready goto format.
Note that for each optional pass there is a listed predecessor pass. This is the pass currently called before the listed pass in CBMC. While the ordering dependencies may not be fully documented, this order should be followed.
This (optional) pass adds checks on calls to some of the C standard library string functions. It uses the tracking information added by the String Abstraction pass. It is switched on by the --string-abstraction command line option.
The implementation of this pass is called via the string_instrumentation(goto_modelt &goto_model) function. Detailed documentation of this pass belongs in string_instrumentation.h
The predecessor pass is Linking of Standard Libraries.
This pass performs partial inlining when the option is switched on. Partial inlining is the inlining of functions that are either marked as inline or smaller than a specified limit. For further detail see the implementation function goto_partial_inline(goto_modelt &, message_handlert &, unsigned, bool)
Predecessor pass is Instrument/Remove Preconditions.
This pass removes user-provided assertions and assumptions when the user has opted to do so. Note that this pass removes skip instructions if any other changes are made. The implementation of this pass is called via the transform_assertions_assumptions(const optionst &, goto_modelt &) function.
Predecessor pass is goto_check_c.
This (optional) transformation pass adds tracking of the length of C strings. It is switched on by the --string-abstraction command line option. The changes made by this pass are documented as part of the documentation for the string_abstractiont class. The implementation of this pass is called via the string_abstraction(goto_modelt &, message_handlert &) function.
Predecessor pass is Adjust Float Expressions.
This transformation adds non-deterministic initialisation of global scoped variables including static variables. For details see src/goto-instrument/nondet_static.h. The initialisation code is added to the CPROVER_initialize function in the goto model. The implementation of this pass is called via the nondet_static(goto_modelt &) function.
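Schematically, on a hypothetical example:

```cpp
// Hypothetical input: a static global with a constant initialiser...
static int counter = 0;
// ...is instead initialised non-deterministically by the code this pass
// adds to the goto model's initialisation function, roughly:
//   counter = NONDET(int);
```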
Predecessor pass is Goto Functions Update.
This pass removes unused functions from the goto model. In practice this builds a collection of all the functions that are potentially called, and then removes any function not in this collection.
This pass cannot handle function calls via function pointers. Attempting to run this pass against a goto model which contains such a function call will result in an invariant violation. Therefore the function pointer removal pass must always be applied to the goto model before the remove unused functions pass.
The implementation of this pass is called via the remove_unused_functions(goto_modelt &, message_handlert &) function.
Predecessor pass is Add Failed Symbols.
This transformation adds coverage goals instrumentation and is only applied if the --cover option has been specified. The implementation of this pass is called via the instrument_cover_goals(const cover_configt &, goto_modelt &, message_handlert &) function.
Predecessor pass is Remove Skip Instructions.
This includes several slicing passes as specified by various slicing command line options. The reachability slicer is enabled by the --reachability-slice command line option. The implementation of this pass is called via the reachability_slicer(goto_modelt &, message_handlert &) function. The full slicer is enabled by the --full-slice command line option. The implementation of this pass is called via the full_slicer(goto_modelt &, message_handlert &) function.
Predecessor pass is Label Properties.
This is an introduction to hacking on the cprover codebase. It is not intended as a user guide to CBMC or related tools. It is structured as a series of programming exercises that aim to acclimatise the reader to the basic data structures and workflow needed for contributing to CBMC.
Clone the CBMC repository and build it. The build instructions are written in a file called COMPILING.md in the top level of the repository. I recommend that you build using CMake - this will place all of the CBMC tools in a single directory, which you can add to your $PATH. For example, if you built the codebase with the following two commands at the top level of the repository:
cmake -DCMAKE_BUILD_TYPE=Debug \
  -DCMAKE_CXX_COMPILER=/usr/bin/clang++ \
  -DCMAKE_C_COMPILER=/usr/bin/clang \
  -B build -S .
cmake --build build
then all the CBMC binaries will be built into build/bin, and you can run the following commands to make CBMC invokable from your terminal.
# Assuming you cloned CBMC into ~/code
export PATH=~/code/cbmc/build/bin:$PATH
# Add this to your shell's startup configuration file so that you
# don't have to run that command every time:
echo 'export PATH=~/code/cbmc/build/bin:$PATH' >> ~/.bashrc
If you are using Make instead of CMake, the built binaries will be scattered throughout the source tree. This tutorial uses the binaries src/cbmc/cbmc, src/goto-instrument/goto-instrument, and src/goto-cc/goto-gcc, so you will need to add each of those directories to your $PATH, or symlink to those binaries from a directory that is already in your $PATH.
Ensure that graphviz is installed on your system (in particular, you should be able to run a program called dot). Install Doxygen and generate doxygen documentation:
# In the src directory
doxygen doxyfile
# View the documentation in a web browser
firefox ~/code/cbmc/doc/html/index.html
If you've never used doxygen documentation before, get familiar with the layout. Open the generated HTML page in a web browser; search for the class goto_programt in the search bar, and jump to the documentation for that class; and read through the copious documentation.
CBMC's code is located under the src directory. Even if you plan to contribute only to CBMC, it is important to be familiar with several of cprover's other auxiliary tools.
If you built using CMake on Unix, you should be able to run the goto-gcc command. Find or write a moderately-interesting C program; we'll call it main.c. Run the following commands:
goto-gcc -o main.gb main.c
cc -o main.exe main.c
Invoke ./main.gb and ./main.exe and observe that they run identically. The version that was compiled with goto-gcc is larger, though:
du -hs main.gb main.exe
Programs compiled with goto-gcc are mostly identical to their clang- or gcc-compiled counterparts, but contain additional object code in cprover's intermediate representation. The intermediate representation is (informally) called a goto-program.
goto-instrument is a Swiss army knife for viewing goto-programs and performing single program analyses on them. Run the following command:
goto-instrument --show-goto-functions main.gb
Many of the instructions in the goto-program intermediate representation are similar to their C counterparts. if and goto statements replace structured programming constructs.
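For example, here is a hypothetical C function and the schematic shape of its loop in the goto-program dump:

```cpp
// Hypothetical input program:
int sum(int n)
{
  int s = 0;
  while(n > 0)
  {
    s += n;
    n--;
  }
  return s;
}
// In the --show-goto-functions output the loop appears roughly as:
//   1: IF !(n > 0) THEN GOTO 2
//      s = s + n
//      n = n - 1
//      GOTO 1
//   2: ...
```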
Find or write a small C program (2 or 3 functions, each containing a few varied statements). Compile it using goto-gcc as above into an object file called main.gb. goto-instrument can render the goto-program's control-flow graph in graphviz dot format; you can write the diagram to a file and then view it:
goto-instrument --dot main.gb | tail -n +2 | dot -Tpng > main.png
open main.png
(The invocation of tail filters out the first line of goto-instrument's output. If goto-instrument writes more or fewer lines of debug output by the time you read this, inspect the output of goto-instrument --dot main.gb and adjust the invocation of tail accordingly.)
There are a few other views of goto-programs. Run goto-instrument -h and try the various switches under the "Diagnosis" section.
In this section, you will learn about the basic goto-program data structures. Reading from and manipulating these data structures form the core of writing an analysis for CBMC.
The entry point of goto-instrument is in goto_instrument_main.cpp. Follow the control flow into goto_instrument_parse_optionst::doit(), located in goto_instrument_parse_options.cpp. At some point in that function, there will be a long sequence of if statements.
As an exercise, add a --greet switch to goto-instrument that prints a greeting, using the supplied name if one is given:

$ goto-instrument --greet main.gb
hello, world!
$ goto-instrument --greet Leperina main.gb
hello, Leperina!
You will also need to add the greet option to the goto_instrument_parse_options.h file in order for this to work. Notice that in the .h file, options that take an argument are followed by a colon (like (property):), while simple switches have no colon. Make sure that you return 0; after printing the message.
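A minimal sketch of the new branch might look as follows (hypothetical code, assuming the usual cmdlinet accessors used by the neighbouring if statements):

```cpp
// Inside goto_instrument_parse_optionst::doit(), alongside the other
// if statements; "greet" must also be declared (as "(greet):") in
// goto_instrument_parse_options.h.
if(cmdline.isset("greet"))
{
  const std::string name = cmdline.get_value("greet");
  if(name.empty())
    std::cout << "hello, world!\n";
  else
    std::cout << "hello, " << name << "!\n";
  return 0; // return after printing the message
}
```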
The idea behind goto-instrument is that it parses a goto-program, performs one single analysis on that goto-program, and then returns. Each of the switches in the doit function of goto_instrument_parse_options does something different with the goto-program that was supplied on the command line.
At this point in goto-instrument_parse_options (where the if statements are), the goto-program will have been loaded into the object goto_functions, of type goto_functionst. This has a field called function_map, a map from function names to functions.
The following is quite difficult to follow from doxygen, but: the value type of function_map is goto_function_templatet<goto_programt>.
Each goto_programt object contains a list of goto_programt::instructiont called instructions. Each instruction has a field called code, which has type codet.
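Putting these together, a small (hypothetical) traversal that prints the size of every function body might look like this:

```cpp
#include <iostream>

#include <goto-programs/goto_model.h>

// Hypothetical helper: walk every function in a loaded goto model and
// print how many instructions its body contains.
void print_instruction_counts(const goto_modelt &goto_model)
{
  for(const auto &named_function : goto_model.goto_functions.function_map)
  {
    const goto_programt &body = named_function.second.body;
    std::cout << named_function.first << ": "
              << body.instructions.size() << " instructions\n";
  }
}
```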
For example, calling the pretty() method on an expression yields a tree-shaped dump like the following:

index
 * type: unsignedbv
     * width: 8
     * #c_type: char
 0: symbol
     * type: array
         * size: nil
         * type:
     * #source_location:
         * file: src/main.c
         * line: 18
         * function:
         * working_directory: /some/dir
     0: unsignedbv
         * width: 8
         * #c_type: char
...
The sub-nodes of a particular node in the pretty representation are numbered, starting from 0. They can be accessed through the op0(), op1() and op2() methods in the exprt class.
Every node in the pretty representation has an identifier, accessed through the id() function. The file util/irep_ids.def lists the possible values of these identifiers; have a quick scan through that file. In the pretty representation above, the following facts are true of that particular node: its id() is index; its type (accessed through the type() function) has id unsignedbv; and its 0th operand (accessed through op0()) is a node whose id is symbol.
The fact that the op0() child has a symbol ID means that you could cast it to a symbol_exprt (which is a subtype of exprt) using the function to_symbol_expr.
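For example (a hypothetical fragment; to_index_expr and to_symbol_expr are the checked casts provided by util/std_expr.h):

```cpp
#include <iostream>

#include <util/std_expr.h>

// Hypothetical: given an expression known to be an index expression,
// recover the name of the symbol being indexed.
void print_indexed_symbol(const exprt &expr)
{
  if(expr.id() == ID_index)
  {
    const index_exprt &index_expr = to_index_expr(expr);
    if(index_expr.array().id() == ID_symbol)
    {
      const symbol_exprt &symbol = to_symbol_expr(index_expr.array());
      std::cout << "indexing into " << symbol.get_identifier() << '\n';
    }
  }
}
```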
FIXME: The text in this section is a bit outdated.
The CPROVER subversion archive contains a number of separate programs. Others are developed separately as patches or separate branches. Interfaces have been and are continuing to stabilise, but older code may require work to compile and function correctly.
In the main archive:
Model checkers and similar tools:
Test case generation:
Alternative front-ends and input translators:
Other tools:
There are tools based on the CPROVER framework from other research groups which are not listed here.