Identifying fault-prone files in large industrial software systems

E. Weyuker; T. Ostrand    Mälardalen University, Västerås, Sweden

Abstract

We provide an overview of a decade-long research program aimed at identifying the most fault-prone files of large industrial software systems. We describe the motivation, approach used and results observed when we applied this technology to 170 releases of 9 different systems running continuously in the field. In all cases the files predicted to be most fault-prone accounted for most of the bugs in the system.

Keywords

Software fault prediction; Fault-proneness; Predictive model; Standard Model; Industrial software

Acknowledgment

This work was supported in part by the Swedish Research Council through grant 621-2014-4925.

A person walks down the street and sees a man on his hands and knees under a lamp post.

What are you doing?

I’m looking for my keys.

You dropped them here?

No, but the light is so much better here than where I dropped them, I thought I’d look here!

A silly old joke, but one that motivated the work that we did on software fault prediction for almost a decade while working at AT&T. A common paradigm used at AT&T to develop large systems was to design and create an initial version, and then continue to maintain it with subsequent releases at roughly 3-month intervals. Each new release might contain added functionality, redesign of existing code, removal of parts no longer needed, and fixes for bugs that had been documented since the last release.

Testing is a critical and expensive step needed to create a software system or new release. Every tester would love to be able to focus their lamp posts exactly where the bugs are so that they could test the relevant program parts most intensively. If only we knew beforehand where the bugs were likely to be, we could test there first, test there hardest, assign our best testers, allocate sufficient time where it’s needed, etc. In short, we could use that information to prioritize our testing efforts to find bugs faster and more efficiently, leading to more reliable software. That is exactly what our software fault prediction algorithms aim to do—identify those parts of a software system that are most likely to contain the largest numbers of bugs.

To help identify the parts of a system most likely to contain bugs, we needed to determine which file characteristics correlated most closely with high bug counts. We identified both static code characteristics, such as file size and programming language, and historical characteristics, such as the number of recent bugs and the number of recent changes made to each file, to be used to predict the number of bugs each file would contain in the next release of the system. As the basis for our research, we looked for real industrial software projects with multiple releases, a comprehensive version history, and an accessible bug database. We needed to work with practitioners and gain their interest in the project so that they would give us access to their software and data. As we studied their systems and built our first prediction models, we continued working with practitioners to make sure we were asking the right questions and providing feedback in a form that would be useful to them.

We first looked at an inventory control system containing roughly a half million lines of code that had been in the field for about 3 years. We had data for 12 releases. Like many large industrial systems, this software had been developed using a traditional waterfall model, and had four releases a year. The development team used a proprietary configuration management system that included both version control and change management functionality. Underlying the configuration management system was a very large database that contained all of the information about the system, including the source code and bug repository. That was the primary source of the data used to make predictions.

We identified the most buggy files in each release and looked at what they had in common in order to identify file characteristics that correlated with fault-proneness. We started with characteristics that our intuition, experience, and the folklore told us were most important. For example, everyone knows that “big is bad”—that has been a defining mantra of the software engineering world for the last few decades, and so we looked at file size as a potential characteristic, and found that it did indeed correlate with fault-proneness. Another common belief is “once buggy, always buggy”; files that had bugs in the past are likely to have bugs later. We found, in fact, that while past bugginess was important, it was only the most recent release’s status that had a strong influence on bugs in the next release. We continued looking at other characteristics we expected to be most important, and carried out empirical studies to assess the relevance and predictive power of each characteristic we considered. Eventually we determined that by mining the database associated with the configuration management system, five simple file characteristics were sufficient to allow us to build a model that made accurate predictions: file size, the number of changes in the two previous releases, the number of bugs in the most recent previous release, the age of the file in terms of the number of releases the file had been part of the system, and the language in which the file was written.
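
To make the shape of that data concrete, here is a minimal sketch (in Python, which is not the language of our own tooling) of a per-file record holding the five predictors; the field names are ours for exposition, not the schema of the configuration management database.

from dataclasses import dataclass

# Illustrative per-file record holding the five predictors described above.
# The field names are for exposition only, not the schema of the
# configuration management database.
@dataclass
class FileRecord:
    path: str
    loc: int               # file size in lines of code
    changes_prev_2: int    # changes made in the two previous releases
    faults_prev_1: int     # faults recorded in the most recent previous release
    age_releases: int      # number of releases the file has been in the system
    language: str          # e.g., "java", "c", "sql"
    faults_next: int = 0   # faults observed in the next release (known only afterwards)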

It took us a number of tries to arrive at a model that made accurate predictions. We eventually decided to use the statistical method of negative binomial regression. For the initial system, we found that, averaged over the 17 releases of this system, the files predicted to have the largest numbers of bugs did indeed contain a large majority of the bugs. To assess the effectiveness of the predictions, for each release we measured the percent of all real bugs that turned out to be located in the 20% of files that were predicted to have the largest numbers of bugs. Over the system’s 17 releases, the percent of real bugs actually detected in the predicted “worst 20%” of files averaged 83%. Armed with these promising results, we were able to attract other projects whose software we could study and for which we could make predictions.
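
As a rough illustration of the mechanics, and not a reproduction of our production model, the following Python sketch fits a negative binomial regression with the statsmodels library on one release's data, ranks the next release's files by predicted fault count, and computes the percentage of faults that fall in the predicted worst 20% of files. The log transform of the predictors, the omission of the language variable, the default dispersion setting, and the column names are all simplifying assumptions made for this example.

import numpy as np
import pandas as pd
import statsmodels.api as sm

PREDICTORS = ["loc", "changes_prev_2", "faults_prev_1", "age_releases"]

def fit_and_rank(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    # Fit on one release, then rank the files of the next release by their
    # predicted fault counts.  Column names follow the FileRecord sketch above.
    X_train = sm.add_constant(np.log1p(train[PREDICTORS]))
    X_test = sm.add_constant(np.log1p(test[PREDICTORS]))
    model = sm.GLM(train["faults_next"], X_train,
                   family=sm.families.NegativeBinomial())  # default dispersion
    result = model.fit()
    ranked = test.assign(predicted_faults=result.predict(X_test))
    return ranked.sort_values("predicted_faults", ascending=False)

def top20_fault_capture(ranked: pd.DataFrame) -> float:
    # Percent of the release's actual faults located in the 20% of files
    # predicted to contain the most faults.
    n_top = max(1, int(0.2 * len(ranked)))
    return 100.0 * ranked["faults_next"].head(n_top).sum() / ranked["faults_next"].sum()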

We refined and improved the prediction model while working with three additional projects, eventually settling on a Standard Model that uses the five characteristics mentioned above. The basic ideas and method to calculate the predictions are explained in [1], although that paper describes a slightly different early version of the prediction model.

Because easy access to the prediction technology is just as important as the technology itself, we built a tool that provides testers and developers with a simple GUI to input the information needed to request bug predictions for their software. Users provide their system name, the types of files they are interested in (Java, C, C++, SQL, etc.), and the release they want predictions for. The tool does the calculations and presents a list of the release’s files sorted in decreasing order of the number of predicted bugs. The results can be used to prioritize the order and intensity of file testing, to assign the most appropriate tester(s) to specific files or parts of the system, to help decide which files should be revised or completely rewritten, and to determine whether other quality assurance procedures such as a detailed code review should be carried out. Because a typical run of the tool is very quick, often under a minute even for a multi-million line system, users can run it repeatedly to get different views, perhaps to restrict their interest to just Java files, or to a subset of the entire system, or to find out if a refactoring of the code changes the expected fault likelihood.
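
The sketch below suggests how such a filtered, sorted view might be produced from the ranked output of the model sketched earlier; our tool's actual interface, query options, and report format differed, so this is purely illustrative.

import pandas as pd

def prediction_report(ranked: pd.DataFrame, languages=None, top_n: int = 20) -> pd.DataFrame:
    # Return the top_n files with the largest predicted fault counts,
    # optionally restricted to files written in the given languages.
    view = ranked if languages is None else ranked[ranked["language"].isin(languages)]
    return view.nlargest(top_n, "predicted_faults")[["path", "language", "predicted_faults"]]

# For example, a tester interested only in the Java files might ask for
#     prediction_report(ranked, languages={"java"})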

Over the course of several years we studied a total of nine industrial projects and made predictions for a total of 170 releases. All nine systems were in the field for multiple years and ran continuously (24/7); they ranged from a system with just nine releases over the course of 2 years to two systems that were in the field for almost 10 years, each with 35 quarterly releases. The systems performed all sorts of different tasks, were written in different languages, and ranged in size from under 300,000 lines of code to over 2,100,000 lines of code. The prediction accuracy, measured as the percent of each release's bugs found in the predicted worst 20% of files, ranged from 75% to 93%.

An unexpected episode at AT&T gave us even more confidence in the usefulness of the prediction models. At a meeting attended by several development managers, we demonstrated the prediction tool and showed the results we had obtained for the last released version of one of the major systems. After seeing this, that system’s manager asked whether we could run the model on the version that was currently under development and next in line to be released. On the spot, we were able to access the system’s version control and bug database, and generate predictions. The results were highly useful to the manager, as they provided fresh insight into his system. The files that were predicted to be most faulty included several that the manager already knew had potential weaknesses, but also included several that he had not considered problematic. His first action on leaving the meeting was to instruct his testing team to intensify testing of the files that had previously been considered safe.

The Standard Model does not account for some variables that many people (including us!) felt might have an effect on the potential bugginess of code. We experimented with augmented models that included counts of the number of different programmers who had changed the code [2], the complexity of the code’s calling structure [3], and detailed counts of the number of lines added, changed or deleted in a file [4]. None of these additional characteristics significantly improved the Standard Model’s predictions, and we even found that augmenting the model with additional variables sometimes made the predictions worse.

A variety of approaches to software fault prediction have been investigated over the past 20 years, with mixed results. Much of the research has attempted to categorize parts of the software as either fault-prone or not fault-prone, and validation has typically been done using so-called hold-out experiments, in which the algorithms are trained on a subset of a release's files and then used to predict faults in the remaining files of that same release. In contrast, our algorithms perform the more ambitious task of ordering the files from most fault-prone to least fault-prone, using data extracted from earlier releases to make predictions for later releases, always using industrial software systems that are running in the field. Our consistent positive results on 170 releases demonstrate that, at least for the types of systems and development processes we’ve studied, accurate predictions are possible and can provide useful information to testers, developers, and project managers.
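
In code, that protocol amounts to walking forward through the release history, reusing the fit_and_rank and top20_fault_capture helpers sketched earlier; again, this illustrates the evaluation scheme rather than the exact pipeline we ran.

def cross_release_evaluation(releases: list) -> list:
    # releases: per-release data frames in chronological order.
    # Train on each release and score the ranking on the release that follows,
    # in contrast to within-release hold-out experiments.
    scores = []
    for train, test in zip(releases, releases[1:]):
        ranked = fit_and_rank(train, test)
        scores.append(top20_fault_capture(ranked))
    return scores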

References

[1] Ostrand T.J., Weyuker E.J., Bell R.M. Predicting the location and number of faults in large software systems. IEEE Trans Softw Eng. 2005;31:340–355.

[2] Weyuker E.J., Ostrand T.J., Bell R.M. Do too many cooks spoil the broth? Using the number of developers to enhance defect prediction models. Empir Softw Eng. 2008;13:539–559.

[3] Shin Y., Bell R.M., Ostrand T.J., Weyuker E.J. On the use of calling structure information to help predict software fault proneness. Empir Softw Eng. 2012;17:390–423.

[4] Bell R.M., Ostrand T.J., Weyuker E.J. Does measuring code change improve fault prediction? In: Proceedings of the 7th international conference on predictive models in software engineering (PROMISE '11); 2011.
