Replicated results are more trustworthy

M. Shepperd    Brunel University London, Uxbridge, United Kingdom

Abstract

The need to replicate results is a central tenet throughout science, and empirical software engineering is no exception. The reasons are threefold.

Keywords

Replication; Meta-analysis; Reproducible studies; Validity; Blind analysis; Reporting protocols

The Replication Crisis

The need to replicate results is a central tenet throughout science, and empirical software engineering is no exception. The reasons are threefold.

First, it's a means of testing for errors, perhaps in the experimental set up or instrumentation. The Fleischmann-Pons cold fusion experiment is a famous example where other teams were unable to replicate the initial findings, which then cast doubt over the initial experimental results and analysis.

Second, it helps the scientific community better understand how general the findings are, throughout time, and through different settings. For example, an intervention may work well for small agile projects, and be unhelpful for large safety-critical systems.

Third, it helps us better approximate the confidence limits on a reported effect size. Note that even for well-conducted studies undertaken by the most scrupulous researchers, we can still find Type I (ie, false positive) or Type II (ie, false negative) errors at a frequency approximately determined by the α and β settings. So, for example, we would typically expect 1 in 20 studies to wrongly reject the no-effect hypothesis, and perhaps as many as 2 in 10 studies to fail to reject the null hypothesis (of no effect) when there actually is an effect. This means relying upon a single study carries an appreciable degree of risk. For a more detailed, but accessible, account see Ellis [14].
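To make these error rates concrete, the following minimal sketch shows the expected numbers of Type I and Type II errors across a hypothetical batch of studies, assuming the conventional α = 0.05 and β = 0.2 thresholds; the number of studies and the proportion with a genuine underlying effect are assumptions chosen purely for illustration.

```python
# Minimal illustration: expected error counts across many independent studies,
# assuming the conventional thresholds alpha = 0.05 and beta = 0.20 (power = 0.80).
alpha = 0.05   # Type I rate: probability of wrongly rejecting a true null hypothesis
beta = 0.20    # Type II rate: probability of failing to detect a real effect

n_studies = 100            # hypothetical number of published studies (assumption)
prop_true_effects = 0.5    # assumed share of studies where a real effect exists

n_real = int(n_studies * prop_true_effects)
n_null = n_studies - n_real

false_positives = n_null * alpha   # roughly 1 in 20 of the no-effect studies
false_negatives = n_real * beta    # roughly 2 in 10 of the real-effect studies

print(f"Of {n_null} studies with no real effect, ~{false_positives:.0f} "
      f"will still report a 'significant' result.")
print(f"Of {n_real} studies with a real effect, ~{false_negatives:.0f} will miss it.")
```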

So it's quite surprising to observe that there is no single agreed-upon set of terminology for replication of research. Schmidt [1], writing for an audience of psychologists, observed the paucity of clear-cut guidance or a single set of definitions. He differentiates between narrow replication, which entails replication of the experimental procedure, and wider, or conceptual replication, which entails testing of the same hypothesis or research question, but via different means.

Both narrow and wide replications are carried out in software engineering. An example of the former is Basili et al.'s [2] idea of families of experiments and materials being explicitly shared. Clearly, in such circumstances, it's essential to ensure that you have confidence in the validity of such materials. Wide, or conceptual, replications are more commonplace, but frequently appear in a less structured fashion, which can lead to considerable difficulties in making meaningful comparisons. This has been highlighted by various meta-analyses such as the Hannay et al. [3] study on pair programming and the Shepperd et al. [4] analysis of software defect classifiers.

In a mapping study de Magalhães et al. [5] report that whilst there has been some growth in the number of replication studies in recent years for empirical software engineering, the numbers remain a very small proportion of the total number of studies conducted. They found a total of 135 papers reporting replications, published between 1994 and 2012. Miller [6] comments that one reason for the poor uptake of replication as an important research technique is that it's perceived as “only [my emphasis] about proving robustness of pre-existing results,” in other words, a narrow view of replication with potentially negative connotations. This needs to be changed.

In fact, the problem is even more serious, and came forcibly to my attention when attempting a meta-analysis of results from the field of software defect prediction. For some time I and other researchers were concerned about the lack of agreement between individual studies; informally it seemed that for every study by Team A that reported strong evidence for X, the next study by Team B reported strong evidence for NOT(X) (Menzies and Shepperd [7]).

Lack of agreement between studies triggered a formal meta-analysis, conducted with co-authors Bowes and Hall, of defect classifiers derived from competing statistical and machine-learning approaches. Despite locating over 200 primary studies, we had to reject more than 75% due to incompatible experimental designs, lack of reported details, and inconsistent or inappropriate response variables [4]. This is a dispiriting waste of research energy and effort. Nevertheless, we were still able to meta-analyze 600 individual experimental results. Our conclusions were quite stark. The choice of machine learning algorithm had almost no bearing upon the results of the study. We modeled this with a random-effects linear model using ANalysis Of VAriance (ANOVA), and then added moderators to characterize aspects of the experimental design such as training data, input metrics, and research group (the latter being determined through a single-linkage clustering algorithm based upon co-authorship; a minimal sketch of this grouping step is given below). To our astonishment, we found that research group is many times (of the order of 25×) more strongly associated with the actual results than the choice of prediction algorithm. We believe the causes of this researcher bias include:

 differing levels of expertise

 incomplete reporting of experimental details

 the preference for some results over others; scientists are human after all!

In this setting it is no surprise that there is presently little agreement, that is, little reliability, between research results.
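As an illustration of the grouping step mentioned above, the sketch below derives "research groups" from co-authorship, assuming that single-linkage clustering over a binary co-authored-or-not relation amounts to finding connected components of the co-authorship graph; the papers and author names are entirely hypothetical and not taken from the study.

```python
# Sketch: deriving "research groups" from co-authorship, assuming single-linkage
# clustering over a binary co-authorship relation, which is equivalent to finding
# connected components of the co-authorship graph. All names are illustrative.
from collections import defaultdict

papers = {
    "P1": ["Alice", "Bob"],
    "P2": ["Bob", "Carol"],
    "P3": ["Dave", "Erin"],
}

# Union-find over authors: two authors end up in the same group if they are
# connected by any chain of shared papers (the single-linkage criterion).
parent = {}

def find(a):
    parent.setdefault(a, a)
    while parent[a] != a:
        parent[a] = parent[parent[a]]   # path compression
        a = parent[a]
    return a

def union(a, b):
    parent[find(a)] = find(b)

for authors in papers.values():
    for other in authors[1:]:
        union(authors[0], other)

groups = defaultdict(list)
for author in parent:
    groups[find(author)].append(author)

print(list(groups.values()))   # e.g. [['Alice', 'Bob', 'Carol'], ['Dave', 'Erin']]
```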

Note that replication for nonexperimental investigations is an interesting and complex question that is not covered by this chapter (see, for instance, Eisenhardt [8] on replication logic for case studies).

Reproducible Studies

In order for a study to be repeated, it must naturally be reproducible; by this I mean there must be sufficient information available to enable this to happen. Schmidt [1] quotes Popper, who states "Any empirical scientific statement can be presented (by describing experimental arrangements, etc.) in such a way, that anyone who has learned the relevant technique can test it." However, this raises the question of what characteristics a study should report. There are four areas that must be considered, although their relative importance may vary from study to study:

1. Explicit research question: this needs to be carefully articulated, as exploratory trawls through data sets are difficult to replicate.

2. Data and pre-processing procedures: this militates against studies that use data not in the public domain, and such authors need to find persuasive arguments as to why using such data is necessary.

3. Experimental design: differences in design can significantly impact reliability. Note, for example, that in machine learning the impact of using leave-one-out cross-validation (LOOCV) as opposed to m × n cross-validation is not well understood (see the sketch following this list).

4. Details of algorithms employed: most machine learners are complex, with large free spaces for their parameter settings. Choices are often governed by experience or trial and error.
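To illustrate the design choice raised in point 3, the following sketch contrasts LOOCV with an m × n (here 10 × 10) cross-validation using scikit-learn; the synthetic data set and the choice of classifier are assumptions made purely for illustration, not an attempt to reproduce any particular study's setup.

```python
# Sketch: two cross-validation designs that are sometimes treated as
# interchangeable in defect-prediction experiments, run on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=100, n_features=10, random_state=1)
clf = LogisticRegression(max_iter=1000)   # arbitrary illustrative classifier

# Design 1: leave-one-out -- n folds, each holding out a single instance.
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())

# Design 2: m x n cross-validation -- here 10 repeats of 10-fold CV.
rkf_scores = cross_val_score(
    clf, X, y, cv=RepeatedKFold(n_splits=10, n_repeats=10, random_state=1))

print(f"LOOCV mean accuracy:      {loo_scores.mean():.3f}")
print(f"10x10-fold mean accuracy: {rkf_scores.mean():.3f}")
```

Even when the mean scores are similar, the two designs differ in how much variance they expose, which is one reason mixing them across studies complicates comparison.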

Much of the preceding list could be simplified once specific research communities are able to define and agree upon standard reporting protocols. This would be particularly beneficial for new entrants to the field and for meta-analysts.

Reliability and Validity in Studies

We need studies that are reliable, that is, if repeated will tend to produce the same result. They also need to be valid since reliably invalid is, to put it mildly, unhelpful! Hannay et al. [3] in their meta-analysis report that “the relatively small overall effects and large between-study variance (heterogeneity) indicate that one or more moderator variables might play a significant role.” This is a similar finding to the Shepperd et al. [4] meta-analysis in that experimental results are typically confounded by differences in experimental design and conduct. This harms reliability. Likewise, Jørgensen et al. [9] randomly sampled 150 software engineering experiments from their systematic review. Again, they found significant problems with the validity. Two particular problems appear to be researcher bias and low experimental power.

So What Should Researchers Do?

There is much consensus from researchers, both within software engineering (for example, Jørgensen et al. [9]) and beyond (for example, the psychologist Schooler [10]), on the need for more replications and meta-studies. In order to accomplish this I believe we need:

 Blind analysis: this is a technique to reduce researcher bias, whereby the experimental treatments are blinded by means of re-labeling, so that the statistician or analyst is unaware which is the new "pet" treatment and which are the benchmarks. This makes fishing for a particular result more difficult [11]; a minimal sketch of such re-labeling follows this list.

 Agreed-upon reporting protocols: so that replications are more easily possible without having to guess as to the original researchers' implementation.

 To place more value on replications and nonsignificant results: Ioannidis [12] demonstrates—as a somewhat hypothetical exercise—how a game-theoretic analysis of the different goals of the different stakeholders in the research community can lead to problematic outcomes and that we need to therefore address the “currency” or payoff matrix. I see no easy shortcuts to achieving this but, minimally, awareness may lead to better understanding and more fruitful discussions.
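By way of illustration, the sketch below shows one possible way to implement the re-labeling step of a blind analysis: treatment names (hypothetical here) are replaced by anonymous codes, and the blinding key is withheld from the analyst until the analysis is complete.

```python
# Sketch of blinding treatment labels before handing results to the analyst.
# Treatment names and scores are illustrative; in practice the key would be
# held by a third party and only revealed once the analysis is finished.
import random

treatments = ["NewPetClassifier", "RandomForest", "NaiveBayes", "Logistic"]

rng = random.Random(42)          # fixed seed so the blinding key is reproducible
shuffled = treatments[:]
rng.shuffle(shuffled)

# Map each real treatment to an anonymous code such as "T1", "T2", ...
blinding_key = {name: f"T{i + 1}" for i, name in enumerate(shuffled)}

# The analyst only ever sees the blinded labels in the results table.
results = [("NewPetClassifier", 0.41), ("RandomForest", 0.38),
           ("NaiveBayes", 0.29), ("Logistic", 0.33)]   # illustrative scores
blinded_results = [(blinding_key[name], score) for name, score in results]

print(blinded_results)   # analysis proceeds on T1..T4; key revealed afterwards
```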

So What Should Practitioners Do?

Clearly the (long-term) purpose of software engineering research is to influence its practice. Hence, practitioners are concerned with research that generates “actionable” results. This implies:

 Practitioners should be wary of implementing the results of a single study. The findings should not simply be extrapolated to all settings because there are likely to be many sources of potentially important variation between different contexts.

 Given the importance of context, “one size fits all” types of result should be treated with caution.

 More generally, we need to promote evidence-based practice [13], which in turn requires that all relevant evidence be collated, combined, and made available.

References

[1] Schmidt S. Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Rev Gen Psychol. 2009;13(2):90–100.

[2] Basili V., et al. Building knowledge through families of experiments. IEEE Trans Softw Eng. 1999;25(4):456–473.

[3] Hannay J., et al. The effectiveness of pair programming: a meta-analysis. Inf Softw Technol. 2009;51(7):1110–1122.

[4] Shepperd M., et al. Researcher bias: the use of machine learning in software defect prediction. IEEE Trans Softw Eng. 2014;40(6):603–616.

[5] de Magalhães C., et al. Investigations about replication of empirical studies in software engineering: a systematic mapping study. Inf Softw Technol. 2015;64:76–101.

[6] Miller J. Replicating software engineering experiments: a poisoned chalice or the Holy Grail? Inf Softw Technol. 2005;47(4):233–244.

[7] Menzies T., Shepperd M. Special issue on repeatable results in software engineering prediction. Empir Softw Eng. 2012;17(1–2):1–17.

[8] Eisenhardt K. Building theories from case study research. Acad Manag Rev. 1989;14(4):532–550.

[9] Jørgensen M., et al. Incorrect results in software engineering experiments: how to improve research practices. J Syst Softw. 2015. doi:10.1016/j.jss.2015.03.065 (available online).

[10] Schooler J. Metascience could rescue the ‘replication crisis’. Nature. 2014;515(7525):9.

[11] Sigweni B., Shepperd M. Using blind analysis for software engineering experiments. In: Proceedings of the 19th ACM international conference on evaluation and assessment in software engineering (EASE'15); 2015. doi:10.1145/2745802.2745832.

[12] Ioannidis J. How to make more published research true. PLoS Med. 2014;11(10):e1001747.

[13] Kitchenham B., et al. Evidence-based software engineering. In: Proceedings of the 26th IEEE international conference on software engineering (ICSE 2004), Edinburgh; 2004.

[14] Ellis P. The essential guide to effect sizes: statistical power, meta-analysis, and the interpretation of research results. Cambridge University Press; 2010.
