Code Analysis

Michael L. Collard, Ph.D.

Department of Computer Science, The University of Akron

Artifacts for Analysis

  • Source Code: The primary component for analysis.
  • Design Documentation: Includes UML and other design diagrams.
  • Issue Tracking Systems: Help in understanding bugs, fixes, and features.
  • Additional Artifacts: For instance, requirement documents, test cases, etc.

Types of Code Analysis

  • Static Program Analysis: Analyzing software source code and related artifacts without executing them.
  • Dynamic Program Analysis: Evaluating the output or trace data when the program is executed.
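
The contrast between the two can be sketched in a few lines of Python. This is only an illustration: the sample program in SOURCE and the names defined and namespace are made up for this sketch. The static half inspects the text with the standard ast module and never runs it; the dynamic half runs the program and looks at what it produced.

```python
import ast

# Hypothetical sample program used only for this illustration.
SOURCE = '''
def square(x):
    return x * x

def unused(x):
    return x + 1

result = square(4)
'''

# Static analysis: parse the text and inspect it; nothing is executed.
tree = ast.parse(SOURCE)
defined = sorted(node.name for node in ast.walk(tree)
                 if isinstance(node, ast.FunctionDef))

# Dynamic analysis: execute the program and observe its output.
namespace = {}
exec(compile(SOURCE, "<sample>", "exec"), namespace)

print("functions found statically:", defined)
print("value observed at run time:", namespace["result"])
```

The static pass reports both functions, including the one that is never called; the dynamic pass only sees what this particular run produced.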

Benefits of Static Analysis

  • Code Search & Query: Helps in quick codebase navigation.
  • Metrics Extraction: Offers insights about code quality, complexity, etc.
  • Code Comprehension & Review: Assists developers in understanding and reviewing the code.
  • Reverse Engineering: Deciphering how the software operates.
  • Program Transformation: Altering code structure while retaining functionality.
  • Code Optimization: Making software run more efficiently.
  • Ensuring Program Correctness: Helps identify bugs and ensure code correctness.
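
As one possible illustration of metrics extraction, the sketch below counts branching constructs per function using Python's ast module. The sample function and the helper branch_count are hypothetical, and "1 + number of if/for/while nodes" is only a rough stand-in for a complexity metric, not a standard definition.

```python
import ast

# Hypothetical sample code to measure.
SOURCE = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0:
            return "composite"
    return "other"
'''

# Node types treated as decision points in this rough metric (an assumption).
BRANCHES = (ast.If, ast.For, ast.While)

def branch_count(func):
    """Count branching nodes anywhere inside a function definition."""
    return sum(isinstance(node, BRANCHES) for node in ast.walk(func))

tree = ast.parse(SOURCE)
metrics = {f.name: 1 + branch_count(f)
           for f in ast.walk(tree) if isinstance(f, ast.FunctionDef)}
print(metrics)
```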

Source-Code Granularity/Levels

  • Tokens: Smallest lexical elements, such as identifiers, keywords, and operators
  • Statements: Individual instructions, often one per line
  • Methods/Functions: Collections of statements that perform a specific task
  • Classes: Encapsulation of data and methods
  • Individual Files: Can contain multiple classes or methods
  • Group of Files: Collections of related files
  • Complete Programs: Entire software consisting of all files
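
The first few levels can be seen on a single small file. The sketch below uses Python's standard tokenize and ast modules on a made-up Counter class; the variable names are illustrative only.

```python
import ast
import io
import tokenize

# Hypothetical single-file example.
SOURCE = '''class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count
'''

# Token level: names (including keywords), operators, and numbers.
tokens = [tok.string
          for tok in tokenize.generate_tokens(io.StringIO(SOURCE).readline)
          if tok.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER)]

# Statement, method/function, and class levels, taken from the syntax tree.
tree = ast.parse(SOURCE)
statements = [n for n in ast.walk(tree) if isinstance(n, ast.stmt)]
functions = [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
classes = [n.name for n in ast.walk(tree) if isinstance(n, ast.ClassDef)]

print("first tokens:", tokens[:3])
print("statements:", len(statements))
print("functions:", functions)
print("classes:", classes)
```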

Code-Level Approaches

  • Regular Expressions
      • Lexical view: "Program is a stream of tokens"
  • Abstract Syntax Tree (AST)
      • Fully parsed syntax view
      • Really an Abstract Syntax Graph (ASG)

Source Code is Messy

  • Comments
  • Literal values
  • Preprocessor statements
  • Code fragments
  • Uncompilable code
  • Incomplete set of files

Regular Expressions

  • Example: ^((From|To)|Subject): ((?(2)\w+@\w+\.[a-z]+|.+))
  • grep
  • Fast, faster, fastest
  • APIs in most languages
  • Great for simple "parsing"
  • Corresponds to lexical analysis (lexer): Characters into tokens
  • Works with code of any kind
  • Major disadvantage: No context, e.g., cannot tell whether "if" is a keyword or just text inside a comment or string
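
A short sketch with Python's re module shows both sides. The header pattern is the example from this slide (Python's re supports the conditional group it uses); the sample header lines and the C-like snippet in CODE are made up for illustration.

```python
import re

# The example pattern: From/To must be followed by an address,
# Subject may be followed by anything.
HEADER = re.compile(r"^((From|To)|Subject): ((?(2)\w+@\w+\.[a-z]+|.+))")

from_match = HEADER.match("From: alice@example.com")
subject_match = HEADER.match("Subject: hello world")
bad_match = HEADER.match("To: not an address")

print(from_match.group(3))     # the address
print(subject_match.group(3))  # the free text
print(bad_match)               # no match: "To" requires an address

# The context problem: a lexical search cannot tell the keyword "if"
# from the word "if" in a comment or a string literal.
CODE = '''
// check if the input is valid
if (n > 0) {
    puts("if only");
}
'''
hits = re.findall(r"\bif\b", CODE)
print(len(hits), "matches, but only one is an if-statement")
```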

Abstract Syntax Tree

  • Compiler view
  • Better for more complex parsing
  • Corresponds to the parser in compilers: Tokens into trees
  • Understands syntax
  • Answers questions that compilers need to ask
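
A minimal sketch of the syntax-aware view, using Python's own ast module on a made-up snippet (ast.unparse assumes Python 3.9 or later). The word "if" appears three times in the text, in a comment, in a string, and as a keyword, but the tree contains exactly one if-statement.

```python
import ast

# Hypothetical sample: "if" occurs in a comment, a string, and as a keyword.
SOURCE = '''
# check if the input is valid
def check(n):
    if n > 0:
        print("if only")
    return n
'''

tree = ast.parse(SOURCE)
if_nodes = [node for node in ast.walk(tree) if isinstance(node, ast.If)]

# The tree knows where each if-statement is and what its condition is.
lines = [node.lineno for node in if_nodes]
condition = ast.unparse(if_nodes[0].test)  # requires Python 3.9+

print("if-statements on lines:", lines)
print("condition:", condition)
```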

Abstract Syntax Tree: Disadvantages

  • Compiler view
  • No code fragments
  • Uni-preprocessor view: Sees only one preprocessor configuration at a time
  • Cannot handle non-compilable code
  • Takes a lot of space
  • Slow, slow, slow
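
The code-fragment limitation is easy to reproduce. In this sketch, using Python's ast and re modules, the fragment is a made-up if-header without its body and try_parse is a hypothetical helper. The parser rejects the fragment outright, while a lexical search still finds the keyword.

```python
import ast
import re

# Hypothetical code fragment: an if-header without its body.
FRAGMENT = "if n > 0:"

def try_parse(text):
    """Report whether the text can be turned into a syntax tree."""
    try:
        ast.parse(text)
        return "parsed"
    except SyntaxError:
        return "SyntaxError"

outcome = try_parse(FRAGMENT)
lexical_hits = re.findall(r"\bif\b", FRAGMENT)

print("parser:", outcome)
print("regular expression:", lexical_hits)
```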