A Sample Measurement

Table 8-1 contains a summary of all the metrics used in this study, and the symbols denoting these metrics in the rest of the tables and figures.

Table 8-1. Selected metrics for the study

VariableMetric (symbol)
SizeSource Lines of Code (SLOC), Lines of Code (LOC)
 Number of C functions (FUNC)
ComplexityMcCabe’s cyclomatic complexity—maximum of all functions (MCYCLO)
 McCabe’s cyclomatic complexity—average (ACYCLO)
 Halstead’s length (HLENG), volume (HVOLUM), level (HLEVE), and mental discriminations (HMD)

The elements of code that provide input to all these metrics are illustrated in the sample source code file shown in Example 8-2. This file was extracted from the package urlgfe, a cross-platform download manager. (urlgfe has recently changed its name to uget, so it is no longer found in the ArchLinux repositories with its original name.) The file contains preprocessor directives (such as 1), comments (such as 3), and only one function (starting at line 4) containing a while loop.

Example 8-2. A sample C source code file

#ifdef HAVE_CONFIG_H 1
# include <config.h>
#endif
2
/* Specification.  */ 3
#include "hash-string.h"


/* Defines the so called `hashpjw' function by P.J. Weinberger
   [see Aho/Sethi/Ullman, COMPILERS: Principles, Techniques and Tools,
   1986, 1987 Bell Telephone Laboratories, Inc.]  */
unsigned long int 4
__hash_string (const char *str_param)
{
  unsigned long int hval, g;
  const char *str = str_param;

  /* Compute the hash value for the given string.  */
  hval = 0;
  while (*str != '')
    {
      hval <<= 4;
      hval += (unsigned char) *str++;
      g = hval & ((unsigned long int) 0xf << (HASHWORDBITS - 4));
      if (g != 0)
        {
          hval ^= g >> (HASHWORDBITS - 8);
	  hval ^= g;
        }
    }
  return hval;
}
1

Preprocessor directives, counted both for LOC and SLOC.

2

Blank lines. Counted for LOC but not for SLOC.

3

Comment lines. Counted for LOC but not for SLOC.

4

Code, counted both for LOC and SLOC.

The simplest complexity metrics that we can measure are the ones related to lines of code (total lines of code and source lines of code). The number of functions also can be extracted easily from the source code. The rest of the complexity metrics that we measured are slightly more sophisticated: McCabe’s cyclomatic complexity and the set of Halstead’s Software Science metrics.

Except for McCabe’s cyclomatic complexity, all the metrics are defined at the file level. So we measured all of them for all the C files (as identified by SlocCount), ignoring header files, which we identified by filename (all files ending in .h).

Because of the definition of McCabe’s cyclomatic complexity, it must be measured over complete functions or programs because it is defined for entities that have one starting point and one or more exit points. We decided to run the formula over each function and to summarize the cyclomatic complexity of a whole file using two values: the maximum of all the functions included in the file and the average value over all the functions in the file.

Additionally, we also calculated the MD5 hash for every file, so we could discard repeated files from the statistical analysis, as including the same file more than once would introduce a bias in the results.

Source Lines of Code (SLOC)

For this classic measure, we use the definition given by Conte [Conte 1986]:

A line of code is any line of program text that is not a comment or blank line, regardless of the number of statements or fragments of statements on the line. This specifically includes all lines containing program headers, declarations, and executable and non-executable statements.

In the case of our sample file in Example 8-2, when we ignore blanks and comments but include preprocessor directives and all the rest of the lines, the file contains 23 SLOC.

Lines of Code (LOC)

For this we measured the total number of lines in each source code file, including comments, blank lines, etc., using the Unix wc utility.

This is straightforward to measure because it counts blanks, comments, etc. Example 8-2 contains 32 LOC (plus 18 LOC of the license comment text, which was removed for clarity purposes).

Number of C Functions

We counted the number of functions inside each file using the exuberant-ctags tool combined with wc.

This metric is even easier to measure. Example 8-2 contains only one function, so CFUNC is 1 for this file.

McCabe’s Cyclomatic Complexity

We use the definition given in the original paper by McCabe [McCabe 1976], which indicates the number of regions in a graph representing a source code file. Any program can be represented as a graph. The simplest element is a flat series of statements with no conditions, loops, or branches, which is represented by graph (a) in Figure 8-1. An if statement is a bifurcation, as shown in graph (b) in Figure 8-1. A loop would be shown through an edge that returns back to an earlier node.

Two sample graphs. Graph (a) has a CYCLO of 1, and graph (b) has a CYCLO of 3.

Figure 8-1. Two sample graphs. Graph (a) has a CYCLO of 1, and graph (b) has a CYCLO of 3.

For a graph G with n vertices, e edges, and p exit points (e.g., function returns in the case of C), the complexity v is defined as follows:

Two sample graphs. Graph (a) has a CYCLO of 1, and graph (b) has a CYCLO of 3.

The minimum value for the cyclomatic complexity metric (CYCLO) is 1, corresponding to the flat series of statements with no bifurcations or loops. Every additional region in the control flow graph increases the CYCLO by one unit. For instance, a program that contains an if statement (with no else) has a CYCLO of 2, because it creates a new region in the control flow graph, in addition to the surrounding region.

The program shown in Example 8-2, whose control flow graph is shown in the (b) graph of Figure 8-1, has a CYCLO of 3. The if bifurcation creates one new region (A in Figure 8-1), and the while loop creates another (B in Figure 8-1). Finally, C is the surrounding region, which always counts as 1 in the cyclomatic complexity.

Halstead’s Software Science Metrics

For this we use the definition given by Kan [Kan 2003]. We measured four metrics: length, volume, level, and mental discriminations. These metrics are based on the redundancy of operands and operators in a program.

In the C language, operands are string constants and variable names. Operators are symbols (such as +, -, ++, and --), the * indirection operator, the sizeof operator, preprocessor constants, control flow statements (such as if and while), storage class specifiers (such as static and extern), type specifiers (such as int and char), and structure specifiers (struct and union).

The metrics are obtained by counting the number of distinct operators n1, the number of distinct operands n2, along with the total number of operators N1, and the total number of operands N2. The length L of a program is the total number of operators and operands:

Halstead’s Software Science Metrics

The volume V of a program is its physical size, defined as:

Halstead’s Software Science Metrics

The level lv of a program is a parameter with an upper limit of 1; the closer it is to 1, the less complex is the program. The level is defined as the ratio between the volume of the program and its potential volume (the least redundant implementation of the algorithm):

Halstead’s Software Science Metrics

The inverse of this metric is sometimes called the code’s difficulty. The minimum difficulty is 1, and it increases without an upper bound.

The effort E that a programmer needs to invest to comprehend a program is defined as:

Halstead’s Software Science Metrics

This metric is sometimes called the number of mental discriminations that a developer must do to understand a program.

Halstead obtained all these formulas by making an analogy between programming and natural languages. The main idea is that the implementation of an algorithm is an expression written in a language, and this expression will be easier to understand if it is shorter or contains more redundancy of operators and operands because the number of different concepts that a programmer must retain in memory at once will be smaller. For instance, Halstead’s level is related to the redundancy of operands. If the redundancy is very high, with lots of repeated operands, the Halstead value will be lower, indicating a less complex program. This approach makes intuitive sense, because redundancy should help one learn and understand a program more quickly.

Halstead’s Software Science metrics are similar to McCabe’s cyclomatic complexity, being defined over whole, non-modular programs, without imports. But contrary to McCabe’s metric, Halstead’s metrics do not rely on the structure of the code to measure complexity. This is to say, Halstead’s metrics are defined purely for the textual elements making up a program, not for any kind of semantic unit such as functions, and so we apply Halstead’s metrics to whole files.

The sample file shown in Example 8-2 has a Halstead’s length of 97, which represents the total number of operators (string constants and variable names) and operands (symbols, statements, etc.). The Halstead’s volume is 526, and the Halstead’s level is 0.036. The number of mental discriminations is 14,490, the quotient between the volume and level. All those metrics indicate a greater complexity for higher values, except in the case of Halstead’s level, where a lower value indicates a more complex program.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset