Data Flow Analysis (DFA) is a type of static analysis. It is a global optimization technique that gathers information about how data flows along the possible execution paths of a program, with the goal of computing facts at each program point. The most important characteristic of Data Flow Analysis is its flow sensitivity: DFA takes into account both the absolute order of the instructions within the program and their order relative to one another, which is crucial to the analysis's outcomes.
This flow sensitivity matters because of dependency relations: an instruction that comes later in the program may depend on a variable defined by an earlier instruction. Because these dependence relations are so important, Data Flow Analysis (DFA) is usually represented in a format that not only visualizes the possible paths through the program but also captures the interrelations and ordering of the individual instructions/statements.
Data Flow Analysis
Conventionally, a format known as the data flow graph (DFG) is used in Data Flow Analysis: the entire program is treated as a flow graph in which all the possible outcomes of a statement can be visualized, making it possible to describe how a specific statement shifts the program onto an alternate path.
Constituents of Data Flow Analysis
Different constituents of DFA help in the optimization of the program in a specific way. These are as follows:
Available Expression Analysis
Available Expression Analysis is a forward analysis, i.e. it analyzes the program from entry to exit in the direction of execution. The goal of available expression analysis is to find out whether, at a given program point, an expression has already been computed along every path reaching that point, so that its value can be reused instead of being recomputed.
It is a useful optimization technique for preventing the recomputation of expressions, i.e. for removing common subexpressions from the program/code. Common subexpressions occur frequently in real code, and eliminating them avoids redundant work. Consider the following block of code:
x = a + b + p
d = q + r
y = a + b + d
By observing carefully, it is clear that the expression (a + b) has already been computed in the first statement. Its reappearance in the third statement qualifies as a common subexpression, provided that the values of the variables "a" and "b" have not changed in between. Available expression analysis optimizes the program by eliminating common subexpressions such as (a + b): the expression is computed once, stored in a temporary variable, and that variable is substituted for every later occurrence.
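As a hedged illustration, the optimized form of the block above might look like the following, where t is an assumed name for the temporary that holds the computed value:

t = a + b
x = t + p
d = q + r
y = t + d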
Reaching Definition Analysis
It is also a forward analysis and is usually applied when program optimization involves one of the following scenarios:
- Elimination of dead code from Def-Use (DU) and Use-Def (UD) chains:
Consider a typical Def-Use chain in which the variables x, y, z and m are all assigned, but only the variable "x" is active/live at the termination point (a minimal sketch is given after this list). All the other assignments, i.e. those to y, z and m, are regarded as dead code and should be removed with the help of reaching definition analysis in order to optimize the program.
- Copy Propagation:
The concept of copy propagation is related to that of common subexpressions. If dx: x = y is the only definition of x that reaches a point p, and y is not redefined between dx and p, then copy propagation replaces the use of x at p with y. For example:
x = y
a = x + z
Here, x in the second statement can be replaced with y. The advantage of this optimization is that if every use of "x" gets replaced with "y", the definition of x becomes useless and can be removed. Even if not every use is replaced, the code is still simplified, with fewer instructions, which improves its running speed (see the sketches after this list).
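As a hedged sketch of the dead-code case described in the first bullet above (the concrete values are assumptions), consider:

x = 10
y = 20
z = 30
m = 40
return x

Only "x" is live at the return; the assignments to y, z and m are never used afterwards and can be deleted.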
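The Python sketch below shows what a minimal copy-propagation pass over a straight-line block might look like. The representation of statements as (target, token-list) pairs and the function name propagate_copies are assumptions made for this example, not part of any particular compiler's API.

# A hedged sketch of copy propagation over a straight-line block.
# Each statement is a (target, tokens) pair, e.g. ("a", ["x", "+", "z"]).
def propagate_copies(block):
    copies = {}                      # maps x -> y for live copies x = y
    result = []
    for target, tokens in block:
        # replace propagated variables appearing on the right-hand side
        new_tokens = [copies.get(tok, tok) for tok in tokens]
        # record simple copies of the form x = y, forget anything else
        if len(new_tokens) == 1 and new_tokens[0].isidentifier():
            copies[target] = new_tokens[0]
        else:
            copies.pop(target, None)
        # a redefinition of y invalidates every recorded copy x = y
        copies = {x: y for x, y in copies.items() if y != target}
        result.append((target, new_tokens))
    return result

block = [("x", ["y"]), ("a", ["x", "+", "z"])]
print(propagate_copies(block))       # [('x', ['y']), ('a', ['y', '+', 'z'])]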
Live Variable Analysis
Live variable analysis computes the live variables at each program point, i.e. the variables holding a value that might still be needed in the future. For example, consider the following block of code:
d = 6
b = 10
a = f(d * b)
In the above example, b and d are live variables between lines 2 and 3.
In the control flow graph (CFG) of the program, the liveness of a variable is propagated backwards along the edges. For each basic block b:
in[b] = fb(out[b])
Join node: a node with multiple successors
Meet operator:
out[b] = in[s1] ∪ in[s2] ∪ ... ∪ in[sn], where s1, ..., sn are all successors of b
Consider the following algorithm, which operates on the CFG described above and will be used to illustrate a few properties of live variables in a general dataflow analysis framework:
input: CFG = (N, E, Entry, Exit)
// Boundary condition
in[Exit] = Φ
// Initialization for the iterative algorithm
For each basic block B other than Exit
    in[B] = Φ
// Iterate
While (changes to any in[] occur) {
    For each basic block B other than Exit {
        out[B] = ∪ in[s], over all successors s of B
        in[B] = fB(out[B])    // in[B] = Use[B] ∪ (out[B] - Def[B])
    }
}
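As a hedged sketch of how this iterative algorithm could be implemented, the Python code below computes in[B] and out[B] for the three-statement example from earlier (d = 6; b = 10; a = f(d * b)), treating each statement as its own basic block. The function name live_variables and the block names B1, B2, B3 are assumptions made for the example.

# A hedged Python sketch of the iterative live-variable algorithm above.
# The block names and the use/def sets below are assumptions modelling
# the earlier example: d = 6; b = 10; a = f(d * b).
def live_variables(blocks, successors, use, define):
    live_in = {b: set() for b in blocks}     # initialization: in[B] = Φ
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:                            # iterate until no in[] changes
        changed = False
        for b in blocks:
            # meet operator: out[B] is the union of in[S] over all successors S
            new_out = set()
            for s in successors.get(b, []):
                new_out |= live_in[s]
            # transfer function: in[B] = Use[B] ∪ (out[B] - Def[B])
            new_in = use[b] | (new_out - define[b])
            if new_in != live_in[b]:
                changed = True
            live_in[b], live_out[b] = new_in, new_out
    return live_in, live_out

blocks = ["B1", "B2", "B3"]
successors = {"B1": ["B2"], "B2": ["B3"], "B3": []}
use = {"B1": set(), "B2": set(), "B3": {"d", "b"}}
define = {"B1": {"d"}, "B2": {"b"}, "B3": {"a"}}
live_in, live_out = live_variables(blocks, successors, use, define)
print(live_in["B2"])    # {'d'}
print(live_out["B2"])   # {'b', 'd'} (set order may vary): b and d are live between lines 2 and 3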
Properties of Live Variables
The properties of live variables in a general framework of a dataflow analysis in relation to the above algorithm are:
- The direction of live variable analysis is backwards:
in[b] = fb(out[b])
out[b] = ∪ in[s], over all successors s of b
- The transfer function associated with live variables is:
fb(x) = Useb ∪ (x - Defb)
- The boundary condition for live variables in the dataflow analysis framework is:
in[Exit] = Φ
Types of Liveness
Each node of the flowgraph has an associated set of live variables. Variable liveness is of two types:
- Semantic liveness: A variable x is semantically live at a node n if there is some execution sequence starting at n whose behaviour is affected by changing the value of x at n. Semantic liveness is in general undecidable, because it depends on the run-time behaviour of the program.
- Syntactic liveness: A variable x is syntactically live at a node n if there is a path from n to the exit of the flowgraph along which x may be used before being redefined. Syntactic liveness is decidable, because it depends only on the syntax of the program/code.
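As a hedged illustration of the difference (the always-false condition is an assumption chosen purely to make the point), consider the following Python snippet:

x = 1
if 1 > 2:           # this condition can never be true
    print(x)        # the only use of x on any path
x = 2

After the first assignment, x is syntactically live, because the flowgraph contains a path on which x is used before being redefined. It is not semantically live, because no actual execution ever takes that path, so changing the value of x cannot affect the program's behaviour.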
Conclusion
Data Flow Analysis is a form of static analysis that evaluates the data flow at each program point. It is a flow-sensitive technique, so it considers both the order of instructions within the program and their order relative to one another, which matters for dependency relations. DFA optimizes a program through available expression analysis, reaching definition analysis, and live variable analysis. All of these are crucial steps in static analysis for identifying errors in the code, which is why we discussed them here. We also have articles on other analysis types, such as lexical and taint analysis. For more help, give them a read!