Introduction

Software organizations can benefit greatly from an early estimation regarding the quality of their product. Because product quality information is available late in the process, corrective actions tend to be expensive [Boehm 1981]. The IEEE standard [IEEE 1990] for software engineering terminology defines the waterfall software development cycle as “a model of the software development process in which the constituent activities, typically a concept phase, requirements phase, design phase, implementation phase, test phase, and installation and checkout phase, are performed in that order, possibly with overlap but with little or no iteration.” During the development cycle, different metrics can be collected that can be related to product quality. The goal is to use such metrics to make estimates of post-release failures early in the software development cycle, during the implementation and testing phases. For example, such estimates can help focus testing as well as code and design reviews, and can guide corrective actions affordably.

The selection of metrics should be guided by empirical techniques such as the Goal-Question-Metric (GQM) approach [Basili et al. 1994], which ties the choice of metrics to explicit measurement goals. An internal metric (measure), such as cyclomatic complexity [McCabe 1976], is a measure derived from the product itself. An external metric is a measure derived from an external assessment of the behavior of the system; the number of failures found in the field, for example, is an external metric. The ISO/IEC standard [ISO/IEC 1996] states that an internal metric is of little value unless there is evidence that it is related to an externally visible attribute. Internal metrics have been shown to be useful as early indicators of externally visible product quality when they are related in a statistically significant and stable way to the field quality/reliability of the product. The validation of internal metrics therefore requires a convincing demonstration that (1) the metric measures what it purports to measure and (2) the metric is associated with important external metrics such as field reliability, maintainability, or fault-proneness [El Emam 2000].
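
To make this validation step concrete, the sketch below checks whether a hypothetical internal metric (cyclomatic complexity per binary) is associated with an external metric (post-release failures per binary) using a rank correlation. The per-binary numbers are invented for illustration, and SciPy's spearmanr routine is used only as a convenient stand-in for the statistical machinery applied in the studies cited in this chapter.

    # A minimal sketch of validating an internal metric against an external one.
    # The per-binary values below are illustrative, not from any Microsoft study.
    from scipy.stats import spearmanr

    cyclomatic_complexity = [12, 45, 7, 88, 23, 54, 9, 67]   # internal metric
    field_failures        = [1, 4, 0, 9, 2, 5, 0, 6]         # external metric

    rho, p_value = spearmanr(cyclomatic_complexity, field_failures)
    print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")

    # A large, statistically significant correlation that holds up across releases
    # is the kind of evidence the validation requirement asks for; a single toy
    # sample like this one only demonstrates the mechanics.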

In this chapter we discuss six different sets of metrics for failure prediction. Each set of metrics is described, and its importance to failure prediction is illustrated. Results of published industrial case studies at Microsoft are provided, with references for further reading. We then provide a summary of failure prediction and discuss areas of future importance. The six sets of internal metrics we use to predict failures are:

  1. Code coverage

  2. Code churn

  3. Code complexity

  4. Code dependencies

  5. People and organizational metrics

  6. Integrated/combined approach

Executable binaries are the level of measurement we use in the Windows case studies. A binary is the result of compiling source files into an .exe, .dll, or .sys file. Our choice of binaries was governed by three facts: (1) binaries are the lowest level to which field failures can be mapped back; (2) fixes usually involve changes to several files, most of which are compiled into one binary; and (3) binaries are the lowest level at which code ownership is maintained, which makes our results more actionable. Historically, Microsoft has used binaries as the unit of measurement because customer failures can be mapped back to them accurately.

For each of the six sets of metrics, we provide evidence from a case study at Microsoft predicting failures in Windows Vista and/or Windows Server 2003. For each of the predictions, we provide the related precision and recall values or accuracy values. Precision is the proportion of binaries classified as failure-prone that actually are failure-prone; low precision means many false positives, that is, not-failure-prone binaries classified as failure-prone. Recall is the proportion of actual failure-prone binaries that are classified as failure-prone; low recall means many false negatives, that is, failure-prone binaries classified as not-failure-prone. We focus here on the metrics for failure prediction rather than on the statistical or machine-learning techniques, for which standard methods such as logistic regression, decision trees, and support vector machines can be used [Han and Kamber 2006].
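
As a concrete illustration of these definitions, the sketch below fits a standard logistic regression classifier to a handful of made-up binary-level metric values and then computes precision and recall by counting true positives, false positives, and false negatives. The feature names, numbers, and labels are hypothetical and are not drawn from the Windows studies.

    # A minimal sketch of binary-level failure-proneness classification and its
    # evaluation. All values are illustrative placeholders.
    from sklearn.linear_model import LogisticRegression

    # Each row describes one binary: [code churn, cyclomatic complexity, dependencies]
    X = [[320, 45, 9], [12, 6, 2], [280, 38, 7], [15, 4, 1],
         [240, 30, 8], [20, 8, 2], [350, 50, 11], [10, 5, 1]]
    y = [1, 0, 1, 0, 1, 0, 1, 0]   # 1 = failure-prone, 0 = not failure-prone

    model = LogisticRegression().fit(X, y)
    predicted = model.predict(X)   # in practice, predict on held-out binaries

    tp = sum(1 for p, a in zip(predicted, y) if p == 1 and a == 1)  # true positives
    fp = sum(1 for p, a in zip(predicted, y) if p == 1 and a == 0)  # false positives
    fn = sum(1 for p, a in zip(predicted, y) if p == 0 and a == 1)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0  # classified failure-prone that truly are
    recall = tp / (tp + fn) if tp + fn else 0.0     # truly failure-prone that were identified

    print(f"precision = {precision:.2f}, recall = {recall:.2f}")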

As with all empirical research in software engineering, drawing general conclusions from these studies is difficult because any process depends to a large degree on a potentially large number of relevant context variables [Basili et al. 1999]. For this reason, we cannot assume a priori that the results of a study generalize beyond the specific environment in which it was conducted [Basili et al. 1999]. Researchers become more confident in a theory when similar findings emerge in different contexts [Basili et al. 1999]. We encourage readers to investigate and, if possible, replicate the studies in their specific environments before drawing conclusions. We hope this chapter serves as documentation of the metrics and feature sets that have been shown to be successful (or not) at predicting failures at Microsoft.
