当前位置:网站首页>On the execution of static analysis technology symbols, starting with a story······

On the execution of static analysis technology symbols, starting with a story······

2020-11-10 15:13:40 Huawei cloud developer community

Abstract : The outline of this paper introduces symbolic execution , Let's understand the main role and challenges of this technology , Lead everyone into the pit .

1. introduction

Program static analysis (Program Static Analysis) In a way that doesn't run code , Through lexical analysis 、 Syntax analysis 、 control flow 、 Data flow analysis and other technologies scan program code , Verify that the code meets the specification 、 Security 、 reliability 、 Maintainability and other indicators of a code analysis technology [8]. The history of program static analysis is almost as long as that of program , Since there was a program, there has been program analysis . Especially with the development of compiler technology , It greatly promotes the automatic analysis technology of the program . At present, static analysis technology is developing to simulation execution technology to find more defects that can be found by dynamic testing in the traditional sense , For example, symbol execution 、 Abstract interpretation 、 Value dependence analysis and so on, and use mathematical constraint solving tools to reduce the path or analyze the accessibility to find more problems 、 Reduce false positives 、 Increase of efficiency .

As the program becomes larger and more complex , In the specific application of modern program analysis , The examination of a problem is often not a single use of one technology , It's about using a variety of analytical techniques at the same time , Using multiple iterations , The process of repeated refinement . This requires us to plan in a unified way , Strategies deal with different issues . It's like a chef , Make every dish , With what materials , What to put first , What's on the back , What kind of heat , We should control how much we put in it .

Here I'm going to make a series of this technology introduction , Explore various techniques in static analysis , I hope through these introductions , Can let more programmers can be involved in the ranks of program analysis , Together with a deeper level of software automation and Research .

2. A story about symbolic execution

Everyone likes to listen to stories , Let's start with a story about symbol execution as an introduction .

The annual network security industry event RSA The conference , Top 10 innovation sandboxes selected every year , Will become the pursuit of investors in the industry , At the same time, from these projects , We can also infer the hot direction of network security technology innovation .

One of the top ten this year is called Mayhem Products , about Mayhem The name , I'm afraid people who engage in symbolic execution are not unfamiliar .

First of all, the name Mayhem It's very impressive , Chinese translation is “ confusion ”, The world was a mess , And call the tool “ confusion ”, It's "Complete mayhem"( It's a mess ).

And Mayhem Relevant information , Arrange it in chronological order , You can see the main line of the story :

· ForAllSecure Founded on 2012 year , The technology comes from Carnegie Mellon University (CMU) Patent technology obtained through research ;

· from 2012 Year to 2017 year , Us department of defense (DoD) Vulnerabilities have been found in almost all of the critical weapon systems being developed .

· 《M National Defense Law 》 Report on strengthening software security of critical systems is required , It's suggested that DARPA “ The big challenge of the Internet ” Under the development of binary analysis and symbol execution tools ;

· 2016 American Advanced Research Projects Agency (DAPRA) Hosted the automatic network attack and defense competition (CyberGrandChallenge), It aims to establish a real-time automated network defense system , And be able to respond quickly to a large number of new attacks , In response to frequent network attacks . Symbol execution plays an important role in the team's automatic attack and defense system , It is widely used in the analysis of program vulnerability . Carnegie Mellon University ForAllSecure Team Mayhem To obtain the CGC The winner of the ;

· 2018 year ,GAO-19-128 According to the report no , Because of the computerization of weapon systems , Weapon systems rely more on software than ever before , And more networked , In order to protect weapon systems from increasingly complex network security attacks , Us department of defense (DoD) More secure weapon systems are being developed ;

· Mayhem It's a security solution recommended by the federal government , Used continuously , automation , Accurate testing ;

· Multiple entities in the U.S. Department of defense use Mayhem, Including but not limited to : The air force's first 96 Cyberspace testing team , The air force's first 90 Cyberspace operations squadron , Naval Sea Systems Command (NAVSEA) And the U.S. Army Command , control , signal communication , Computer , The Internet , intelligence , Surveillance and Reconnaissance Center (C5ISR).

Mayhem yes ForAllSecure The solution of auxiliary intelligent behavior testing is introduced . It uses symbolic execution to generate test cases automatically , And then through the fuzzy test (fuzz) Finding software defects .Mayhem Combine these two testing techniques , Integrated in the DevOps in , Continuous discovery of security vulnerabilities .

notes : from ForAllSecure Website (https://forallsecure.com/government-and-defense), You can get the information above .

another :Mayhem[4] Using a binary symbolic execution framework , It belongs to the category of dynamic analysis , It's just for us to start with , There's no need to tangle .

Next , Our outline introduces symbolic execution , Let's understand the main role and challenges of this technology , Lead everyone into the pit , More details can be obtained from the following references .

3. Classic symbolic execution technology

Symbol execution [1] It's a static analysis technique , It can be analyzed to get input for specific areas to execute . Initially in 1976 Year by year King JC stay ACM Put forward on . That is, by using abstract symbols instead of specific values to simulate the execution of the program , When a branch statement is encountered , It will explore every branch , The branching condition is added to the corresponding path constraint , If the constraint is solvable , The path is reachable . The purpose of symbol execution is to , Explore as many paths as possible . According to the purpose of the operation , There are two main ones :

  • From a testing point of view , It can simulate the input values of each path , To create a high coverage test suite . Here is the static analysis program to get the input needed for the test , It's different from dynamic execution testing , Dynamic execution program testing relies more on complete test cases to improve test coverage , To find out the problem . For example, once used IBM Of Purify To detect memory leaks in code . You need to read the code manually , Develop a large number of test cases , Then execute these inputs separately by letting the program execute , To improve code coverage , So as to find the memory leak problem as much as possible .
  • From the perspective of defect finding , It provides developers with specific input to the triggered defect , With this input , Programs can be used to confirm or debug defects . Symbolic execution is not limited to finding problems such as buffer overflow , And it can be done by using the conditions of defect discovery , Generating complex assertions , To determine the possibility of defects .

Here's a classic example [2], To illustrate the specific process of symbol execution .

1    int twice(int v){
 2        return 2*v;
 3    }
 4    
 5    void testme(int x, int y){
 6        z = twice(y);
 7        if (z == x){
 8            if (x > y+10)
 9                ERROR;
10        }
11    }   
12   
13    int main(){
14       x = sym_input();
15       y = sym_input();
16       testme(x, y);
17       return 0;
18    }

The function in this code testme() There are three execution paths . The purpose of symbol execution , Within a given time budget , Generate a set of inputs , And explore the execution path as much as possible through these inputs .

  • Execution path (execution path): One true and false Sequence seq ={p0, p1, …, pn}. among , If it's a conditional statement , that pi=ture The value of the conditional statement is :true, Otherwise take false;
  • Execution tree (execution tree): All execution paths of a program can be represented as an execution tree .
    The following figure shows the execution tree of the sample code :

Symbol execution maintains symbol state and path constraints , In order to transfer information during operation .

  • Sign state (symbolic state): Symbol execution maintains a symbol state σ, Mapping variables to symbolic expressions .
  • Symbolic path constraints (symbolic path constraint): Symbolic path constraints PC, It is the first-order formula of quantifier in symbolic expression .
  • At the beginning of execution , take σ Initialize to an empty map , take PC Initialize to true;
  • In the execution of symbols ,σ and PC Will update .
  • At the end of symbol execution along the program execution path , Use the constraint solver to PC To solve the , To generate specific input values . If the program executes on these specific input values , It will take exactly the same path as the symbol execution , And terminate in the same way .

For example code , The specific process is as follows

  • initialization : Initialize symbol state σ It's empty , Symbolic path constraints PC by true;
  • In each assignment v = e It's about , The execution of a symbol is done by passing v Mapping to σ(e) To update σ, The mapping is done by evaluating the current symbol state , And the resulting symbolic expression . for example :
    • main() The first two lines of the function ( The first 16-17 That's ok ) The execution of the symbol leads to σ= {x ↦x0,y ↦ y0}, among x0,y0 Are two initial unconstrained symbolic values ;
    • In the implementation of the 6 After line ,σ = {x ↦x0,y ↦y0,z ↦ 2y0}.
  • For each conditional statement :if(e) then S1 else S2.
    • In the 7 After line , Two symbol execution instances are created , Each has path constraints x0 = 2y0 and x ≠ 2y0;
    • In the 8 After line , Create two symbol execution instances with path constraints respectively (x0 = 2y0)∧(x0 > y0 + 10) and (x0 = 2y0)∧(x0 ≤ y0 + 10).
    • “then” Branch : PC Is updated to PC ∧ σ(e);
    • “else” Branch : Make a new one PC', PC' Is initialized to :PC∧¬ σ(e);
    • If the state of the branch σ Of PC To satisfy , Then symbol execution continues along the branch , Otherwise, the path ends .
      for example :
  • If a symbol execution instance encounters exit or error when , The current symbol execution instance is terminated , And use a ready-made constraint solver to generate an assignment that can satisfy the current path constraint . for example :
    • The three paths are solved according to the constraints , We get the three sets of inputs we expect respectively :{x=0, y=1},{x=2, y=1} and {x=30, y=15}.
  • If the code contains a loop or recursive structure , And their termination conditions are symbolic , It can lead to an infinite number of paths . In the process of practice , You need to set a limit on the path search , for example timeout, Limit the number of paths , Loop iterations or probe depth .

Classical symbolic execution has a key drawback , If the symbolic path constraint in accordance with the execution path cannot be solved effectively by constraint solver , The input cannot be generated .

4. Modern symbolic execution technology

Classical symbolic execution , Excessive dependence on the constraint solving ability of symbolic execution , This limits the ability of traditional symbols to perform . It soon became clear that in the course of the analysis , If you can add specific values for analysis , Will greatly simplify the analysis process , Reduce the difficulty of analysis and improve efficiency ; But in the process of analysis , It is still inevitable to express various conditions , After symbolic abstraction, it becomes a constraint to participate in execution . The precision of converting program statements into symbolic constraints , It has a significant impact on the coverage of symbol execution and the scalability of constraint solving . So how to do a good job of mixing concrete (Concrete) Execution and symbols (Symbolic) The balance of the ability to execute , It becomes the key point of modern symbol execution .

Mixed execution testing (Concolic testing)[5][6] And executing build tests (Execution-Generated Testing (EGT))[7] These two representatives of modern symbolic execution are based on this idea . The next step is to perform the test in a mixed fashion (Concolic testing) As an example to illustrate the main process of modern symbol execution .

Unlike the classical symbolic execution , Because of the mixed execution, in the whole execution process , The specific state of the program needs to be maintained , Therefore, its input needs initial specific value . Mixed execution testing starts with a given or random input to execute the program , Collect symbolic constraints on input along the executed conditional statements , Then use constraint solving to infer the change of previous input , In order to guide the program to the next execution path . Repeat the process , Until all the execution paths have been explored , Or meet user-defined coverage standards 、 Until the time setting expires .

Mixed execution tests maintain both states at the same time :

· Specific state : Mapping all variables with specific values ;

· Sign state : Only map variables that have no specific values .

For example code , The execution process is as follows :

· Mixed execution generates random input values , such as {x=22, y=7}, And then symbolically and concretely execute the program together .

· basis {x=22, y=7}, The procedure is in 7 That's ok , This specific implementation will lead to else Branch ; Symbol execution along the execution path generates x ≠ 2y0 Path constraints of ;

· The hybrid test will link in the path constraint (x ≠ 2y0) Take the opposite , Generate a new constraint x0=2y0, And solve to get the test input {x=2, y=1}. This new program will force then route .

· basis {x=2, y=1}, The procedure is in 8 Do it else Branch . The mixed test performs symbolic execution along with the specific execution , And generate path constraints (x0 = 2y0)∧(x0 > y0 + 10);

· The hybrid test will link in the path constraint ((x0 > y0 + 10)) Take the opposite , A new constraint will be generated (x0 = 2y0)∧(x0 ≤ y0 + 10) Test of , Solve to get the test input {x=30, y=15}. With this input, the program goes to ERROR sentence .

· The hybrid test reports all explored execution paths , And terminate the generation of test input .

Compare mixed execution testing to traditional symbolic execution , It is not difficult to find that due to the introduction of specific values , The difficulty of constraint solving is simplified .

5. Main challenges and solutions

Symbol execution technology already has 40 Years of development , stay 2017 year 8 month Google In academia , The title has “ Symbol execution ” The article has 742 piece [3], To 2020.11 month , The number of articles has risen to 1490, It can be seen that symbol execution has developed rapidly . But the complexity and scale of the program is also growing rapidly , Symbol execution still has path explosion ( Code size 、 Complexity )、 Constraint solving ( Calculation algorithm )、 Memory model ( Tool design ) Etc .

5.1. Path explosion (Path Explosion)

By default, symbol execution has filtered the following two paths in process processing :

· Independent of the path of symbolic input ;

· For the current path constraint , The unsolvable path . however , Although symbol execution has done this filtering , Path explosion is still the biggest challenge of symbol execution . Path explosion is not a unique challenge to symbol execution , It is the biggest problem that needs to be considered in the whole program analysis .

The solution to the path explosion , There are two ways to think about it :

· Reduce the total number of paths ( Give priority to the most promising path , Path merging , prune );

· No more similar path analysis ( Function summary , cache );

According to this idea, the industry has proposed two solutions .

· heuristic (Heuristic): Most heuristics focus on achieving high statement and branch coverage .

1. A particularly effective approach is to use static control flow diagrams (CFG) To guide exploration , To the nearest path or preferentially select the statement with the least previous execution times ;

2. In every possible branch of the symbol , Randomly select the side to explore ; Or the symbol test and random test are interleaved ;

3. Using prior knowledge , Explore the error prone functions in the past ; At present, there are also studies through AI The way to get these recommended analysis paths ;

· Using sophisticated program analysis techniques (Sound program analysis techniques): This method mainly uses various ideas in program analysis and software verification , Reduce the complexity of path exploration in a reasonable way .

1. Merge paths statically , And then solve it ;

2. Through the function summary , Cache or reuse calculated information for subsequent calculations ;

3. By pruning , Remove the influence of irrelevant variables on the path solution ;

5.2. Constraint solving (Constraint Solving)

Although in the last few years , Constraint solution technology has made great progress , But constraint solving is the technical bottleneck of symbol execution .

· Reduce irrelevant constraints (Irrelevant constraint elimination) Most queries in symbolic execution are to determine the feasibility of a branch , An effective optimization method is to remove the constraints which are irrelevant to the results of the current branch from the path conditions .

· Incremental solution (incremental solving) An important feature of the constraint set generated during symbol execution is , They are represented by a fixed set of static branches from the program source code . therefore , Many paths have similar sets of constraints , So a similar solution can be used .

1. By reusing the results of previous similar queries , To improve the speed of constraint solving ;

2. By the superset of the constraint set , Reduce the occurrence of no solution ; We currently use symbol execution tools KLEE, There are two ways in the design .

5.3. Memory modeling (Memory Modeling)

In symbolic execution, we map variables to a memory model , To represent the type of the variable 、 Value or range of values . This abstract pattern of variables has a great impact on the accuracy of program statements converted into symbolic constraints and the coverage of symbolic execution . Too precise , It's easy to get into complex calculations , We can't get a concrete solution ; It's too general , It will also lead to underreporting . So there is a trade-off between precision and scalability .

At present, the main reference for this trade-off is :

1. Analyze the nature of the problem ;

2. The constraints of the constraint solver used ;

6. Symbol execution tool

The following is a list of our commonly used symbol execution tools as your reference .

Language Symbol actuators link Notes
LLVM KLEE https://klee.github.io/ Cadar et al., 2006
LLVM Cloud9 http://cloud9.epfl.ch/ Bucur et al., 2011, be based on KLEE Parallel symbolic execution of
LLVM Kite http://www.cs.ubc.ca/labs/isd/Projects/Kite/ do Val, 2014, be based on KLEE
Java JPF-Symbc https://github.com/SymbolicPathFinder/jpf-symbc 2008, Used for testing NASA Software for
Java jayhorn http://jayhorn.github.io/jayhorn/ be based on soot, Support Z3, 2016
Java JDart https://github.com/psycopaths/jdart Luckow et al., 2016
Python PyExZ3 https://github.com/thomasjball/PyExZ3 Ball and Daniel,2015
JavaScript Jalangi https://github.com/Samsung/jalangi2 Sen et al., 2013
Binaries angr http://angr.io/ Shoshitaishvili et al., 2015, python frame

7. Conclusion

At present, symbols execute , In practical application, it is also mainly used with fuzz Use a combination of , Used to zoom out fuzz Value range of . Because of the main bottleneck of symbol execution -- Performance and limitations of constraint solving , Not in the business tools of static analysis , Large scale use . But we have reason to believe that , In the near future, with parallel technology 、 Computing performance improvement 、 And the optimization of the solver , Symbolic execution can play an increasingly important role in static analysis .

8. reference

[1] King JC. Symbolic execution and program testing. Communications of the ACM, 1976,19(7):385−394.

[2] Cadar C, Sen K. Symbolic execution for software testing: three decades later[J]. Communications of the ACM, 2013, 56(2): 82-90.

[3] Baldoni R, Coppa E, D’elia D C, et al. A survey of symbolic execution techniques[J]. ACM Computing Surveys (CSUR), 2018, 51(3): 50.

[4] Sang Kil Cha, Thanassis Avgerinos, Alexandre Rebert and David Brumley. Unleashing Mayhem on Binary Code. 2012 IEEE Symposium on Security and Privacy.

[5] P. Godefroid, N. Klarlund, and K. Sen. DART: Directed Automated Random Testing. In PLDI’05, June 2005.

[6] K. Sen, D. Marinov, and G. Agha. CUTE: A concolic unit testing engine for C. In ESEC/FSE’05, Sep 2005.、

[7] C. Cadar and D. Engler. Execution generated test cases: How to make systems code crash itself (invited paper). In SPIN’05,Aug 2005.

[8] Baidu Encyclopedia : Program static analysis  https://baike.baidu.com/item/ Program static analysis

[9] C. Cadar, D. Dunbar, and D. Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In OSDI’08, Dec 2008.

[10] L. De Moura and N. Bjørner. Satisfiability modulo theories: introduction and applications. Commun. ACM, 54:69–77, Sept. 2011.

[11] Ye Zhibin , Yan Bo . A survey of symbolic execution [J]. Computer science , 2018, 45(6A): 28-35.

 

Click to follow , The first time to learn about Huawei's new cloud technology ~

版权声明
本文为[Huawei cloud developer community]所创,转载请带上原文链接,感谢