The Warning Label

In the process of developing our fault prediction technology and tool, we carried out a long series of empirical studies. Because it is fairly unusual to see repeated large industrial empirical studies that follow multiple systems for multiple years, it may be interesting for readers to hear which parts of the work were most challenging and why, so they can understand why this sort of evidence is so critical even though it is difficult to provide.

Where will I get systems to study?

Many people seem to feel that because we work for a large company that has hundreds of millions, or perhaps billions, of lines of code running continuously, obtaining systems to study would be trivial. Nothing could be further from the truth, especially at the early stages of the research. It is something of a chicken-and-egg problem. System owners are reluctant to give you access to their systems for a number of reasons, most of which are perfectly sensible from their points of view.

The first thing they worry about is that you will take up their time, and they typically have very tight deadlines. If they spend time answering your questions, they will see no direct benefit and will have less chance of deploying their system on schedule. They are therefore unwilling to get involved with what they perceive to be high-risk research projects. Most research projects are in fact high risk, in the sense that they never reach a stage at which practitioners can use them.

System owners may also fear that you will modify the system or their data in some way that will have a negative impact on their project. Although you may promise to look and not touch, they fear you will deliberately or inadvertently impact their system.

A third issue is similar to a problem common to studies in other fields of research. For example, in medical treatment or drug trials, it is rare that the people who are the study subjects will actually benefit from the study, and it is critical to make sure that subjects understand that and know the risks of participation.

In fact, the projects that served as our earliest study subjects derived no benefit, because we were only identifying the characteristics most closely associated with faulty files and just beginning to develop the prediction models.

Even after we had a prediction model that seemed to work, the next couple of systems were used to validate the model. Only after that was done did we begin to build an automated tool. Until the tool was complete, doing the predictions required an understanding of data mining and statistics, and familiarity with a statistics tool such as R. This knowledge is outside the normal set of skills held by software developers or testers.

Once you have a number of successes with the technology under your belt, it will be much easier to find projects eager to try it, but of course there first need to be projects willing to act as guinea pigs. Finding them usually requires relationships, developed through previous work with project personnel, that convince them you understand their time constraints and the importance of not impacting the project while studying it.

Preliminary studies are difficult and time-consuming

As explained earlier, before we could begin to do any actual work on fault prediction, we had to first assure ourselves that faults were not uniformly distributed and determine which characteristics were most closely correlated with faulty files by doing preliminary empirical studies to collect evidence. Of course, there was a fair amount of anecdotal evidence, especially about the Pareto distribution, and there had been earlier studies that considered which properties were associated with fault-proneness, but anecdotal evidence is not the same as a careful study done on a large industrial system. We needed to verify that we would observe this behavior in our environment. Finally, some of the earlier studies had conflicting results about the importance of different file characteristics or did not consider both code-based and history-based characteristics.

We spent roughly two years doing these preliminary studies and getting to the point at which we were ready to consider building a statistical prediction model for the first system.

Data acquisition and analysis are difficult and time-consuming

Once we had determined the most important code and history metrics to use in building the statistical model, we had to understand the change management and version control systems that the projects used and determine exactly how to extract the necessary data. In the course of looking at the second system we studied, we learned that different projects used certain fields of the underlying database in different or non-obvious ways. This helped us determine which data was usable and which was not. Following that, we needed to write scripts to do the actual data extraction, and to build the initial prediction model.
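
As a rough illustration only, the sketch below shows the kind of per-file aggregation such extraction scripts perform. The CSV layout, the field names, and the rule for deciding which change records count as fault fixes are all hypothetical stand-ins; the real extraction depended on project-specific database fields and conventions, which is precisely why this step took so long.

    # Minimal sketch (hypothetical data layout): count fault-fixing changes
    # per file for a given release from a CSV export of change records.
    import csv
    from collections import Counter

    def fault_counts(change_log_path, release):
        """Return a Counter mapping file name -> number of fault fixes in `release`."""
        counts = Counter()
        with open(change_log_path, newline="") as f:
            for row in csv.DictReader(f):
                if row["release"] == release and row["change_type"] == "fault_fix":
                    counts[row["file"]] += 1
        return counts

    # Example (hypothetical file): fault_counts("changes.csv", "R3")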

This took us roughly one additional year until we were able to make the preliminary predictions for the first system using a custom-built prediction model.

Measuring success is difficult

Our model associates a predicted number of faults in the next release with every file. We considered a number of different ways of measuring how well the predictions were doing.

One way might be to compare how close the predicted number of faults came to the actual number of faults in each file. For the first system we studied, we observed that the predicted and actual numbers typically did not match closely. However, the files with the largest numbers of actual faults tended to be among the files with the largest numbers of predicted faults. After discussing this with practitioners, we decided that the most useful information, and the best way to assess our success, was the 20% metric described earlier in this chapter.
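
In outline, that metric ranks the files by predicted fault count, selects the top 20% of files, and reports the percentage of the release's actual faults that those files contain. The following is a minimal sketch of the computation, using hypothetical prediction and fault-count data rather than results from any of our systems.

    # Sketch of the 20% metric: the percentage of actual faults contained in the
    # 20% of files with the highest predicted fault counts. All data shown here
    # are hypothetical illustrations.
    def top20_metric(predicted, actual):
        """predicted, actual: dicts mapping file name -> fault count."""
        ranked = sorted(predicted, key=predicted.get, reverse=True)
        top = ranked[: max(1, len(ranked) // 5)]      # top 20% of the files
        total = sum(actual.values())
        caught = sum(actual.get(f, 0) for f in top)
        return 100.0 * caught / total if total else 0.0

    # Example with made-up numbers:
    # predicted = {"a.c": 7.2, "b.c": 0.4, "c.c": 3.1, "d.c": 0.1, "e.c": 1.0}
    # actual    = {"a.c": 5,   "b.c": 0,   "c.c": 4,   "d.c": 1,   "e.c": 0}
    # top20_metric(predicted, actual) -> 50.0  (a.c alone contains 5 of the 10 faults)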

We have considered other metrics proposed in the literature and found many of them inappropriate for this work, because they were designed to evaluate binary predictions, such as whether or not a file will contain faults, rather than rankings of files like the one our model produces. We have continued to assess and propose alternative metrics to make sure that the assessment we use is the most appropriate for our needs.

Getting “customers” is difficult

Even after completing several large empirical studies that showed consistently good results, and having developed a fully automated tool, it is still difficult to convince project leaders that they should change their development methodology and incorporate our prediction models and tool into their process. We have recently identified a project whose management believes this will be very useful and we are negotiating the transfer of our technology. It remains to be seen whether the use of the prediction tool will impact the types and numbers of faults identified by the users.

So what is the bottom line here? We have spent eight years working on this research. Was all that time necessary? It is our hope that the factors outlined earlier will convince you that this was indeed time well spent. Collecting evidence on the scale that we did is often very time-consuming, especially in the early stages, while the technology is still being developed and matured and most steps must be done manually. But having provided a considerable amount of evidence that our prediction model works well in many different settings and circumstances, and having built a fully automated tool, we believe we are now ready for prime time!
