M. Shepperd, Brunel University London, Uxbridge, United Kingdom
Replication; Meta-analysis; Reproducible studies; Validity; Blind analysis; Reporting protocols
The need to replicate results is a central tenet throughout science, and empirical software engineering is no exception. The reasons are threefold.
First, it's a means of testing for errors, perhaps in the experimental setup or instrumentation. The Fleischmann-Pons cold fusion experiment is a famous example where other teams were unable to replicate the initial findings, which cast doubt on the original experimental results and analysis.
Second, it helps the scientific community better understand how general the findings are, over time and across different settings. For example, an intervention may work well for small agile projects, but be unhelpful for large safety-critical systems.
Third, it helps us better approximate the confidence limits on a reported effect size. Note that even for well-conducted studies undertaken by the most scrupulous researchers, we can still find Type I (ie, false positive) or Type II (ie, false negative) errors at a frequency approximately determined by the α and β settings, so, for example, we would typically expect 1 in 20 studies to wrongly reject the no-effect hypothesis, and perhaps as many as 2 in 10 studies to fail to reject the null hypothesis (of no effect) when there actually is an effect. This means relying upon a single study has an appreciable degree of risk. For a more detailed, but accessible, account see Ellis [14].
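The error rates above can be checked by simulation. The sketch below is purely illustrative (the sample size, number of runs, and use of a one-sample z-test are my own assumptions, not anything from the studies discussed): it repeats a hypothetical study many times under a true null and under a real effect sized for 80% power, then counts how often the wrong conclusion is drawn.

```python
import random
from statistics import NormalDist

# Illustrative simulation: run many simulated studies (two-sided one-sample
# z-test, known sd = 1) and count Type I and Type II errors empirically.
random.seed(1)
ALPHA = 0.05          # nominal Type I rate: ~1 in 20 null studies rejected
N = 25                # observations per study (arbitrary illustrative choice)
RUNS = 4000           # simulated studies per condition
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)

def one_study(true_mean):
    """Simulate one study; return True if H0 (mean = 0) is rejected."""
    xs = [random.gauss(true_mean, 1.0) for _ in range(N)]
    z = (sum(xs) / N) * (N ** 0.5)   # standard error is 1/sqrt(N)
    return abs(z) > z_crit

# (a) no real effect: any rejection is a false positive (Type I error)
false_pos = sum(one_study(0.0) for _ in range(RUNS)) / RUNS

# (b) real effect d chosen so that power = 1 - beta is approximately 0.8
d = (z_crit + NormalDist().inv_cdf(0.8)) / (N ** 0.5)
false_neg = sum(not one_study(d) for _ in range(RUNS)) / RUNS

print(f"empirical Type I rate  ~ {false_pos:.3f} (nominal {ALPHA})")
print(f"empirical Type II rate ~ {false_neg:.3f} (nominal 0.20)")
```

Both empirical rates land close to the nominal 0.05 and 0.20, which is exactly the "1 in 20" and "2 in 10" risk carried by any single study.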
So it's quite surprising to observe that there is no single agreed-upon set of terminology for replication of research. Schmidt [1], writing for an audience of psychologists, observed the paucity of clear-cut guidance or a single set of definitions. He differentiates between narrow replication, which entails replication of the experimental procedure, and wider, or conceptual replication, which entails testing of the same hypothesis or research question, but via different means.
Both narrow and wide replications are carried out in software engineering. An example of the former is Basili et al.'s [2] idea of families of experiments and materials being explicitly shared. Clearly, in such circumstances, it's essential to ensure that you have confidence in the validity of such materials. Wide, or conceptual, replications are more commonplace, but frequently appear in a less structured fashion, which can lead to considerable difficulties in making meaningful comparisons. This has been highlighted by various meta-analyses such as the Hannay et al. [3] study on pair programming and the Shepperd et al. [4] analysis of software defect classifiers.
In a mapping study de Magalhães et al. [5] report that whilst there has been some growth in the number of replication studies in recent years for empirical software engineering, the numbers remain a very small proportion of the total number of studies conducted. They found a total of 135 papers reporting replications, published between 1994 and 2012. Miller [6] comments that one reason for the poor uptake of replication as an important research technique is that it's perceived as “only [my emphasis] about proving robustness of pre-existing results,” in other words, a narrow view of replication with potentially negative connotations. This needs to be changed.
In fact, the problem is even more serious, and came forcibly to my attention when attempting a meta-analysis of results from the field of software defect prediction. For some time I and other researchers were concerned about the lack of agreement between individual studies; informally it seemed that for every study by Team A that reported strong evidence for X, the next study by Team B reported strong evidence for NOT(X) (Menzies and Shepperd [7]).
Lack of agreement between studies triggered a formal meta-analysis of defect classifiers all derived from competing statistical and machine-learning approaches, conducted with co-authors Bowes and Hall. Despite locating over 200 primary studies, we had to reject more than 75% due to incompatible experimental designs, lack of reported details, and inconsistent or inappropriate response variables [4]. This is a dispiriting waste of research energy and effort. Nevertheless, we were still able to meta-analyze 600 individual experimental results. Our conclusions were quite stark. The choice of machine learning algorithm had almost no bearing upon the results of the study. We modeled this using a random effects linear model using ANalysis Of VAriance (ANOVA), and then added moderators to characterize aspects of the experimental design such as training data, input metrics, and research group (the latter being determined through a single linkage clustering algorithm based upon co-authorship). To our astonishment, we found that research group is many times (of the order of 25×) more strongly associated with the actual results than the choice of prediction algorithm. We believe the causes of this researcher bias include:
• differing levels of expertise
• incomplete reporting of experimental details
• the preference for some results over others; scientists are human after all!
In this setting it is no surprise that there is presently little agreement, that is, low reliability, between research results.
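The "research group" moderator mentioned above was derived by single linkage clustering over co-authorship. A minimal sketch of that idea (with invented paper labels and author names; the actual study's data and thresholds are not reproduced here): because single linkage merges transitively, any shared author links two papers, and the groups are simply the connected components of the co-author graph.

```python
from collections import defaultdict

# Hypothetical sketch: group papers into "research groups" by single linkage
# on co-authorship. Any shared author links two papers; linkage is transitive,
# so groups are connected components of the co-author graph.
papers = {
    "P1": {"Alice", "Bob"},
    "P2": {"Bob", "Carol"},     # shares Bob with P1 -> same group
    "P3": {"Dave"},             # no shared authors -> isolated group
    "P4": {"Carol", "Erin"},    # shares Carol with P2 -> joins P1's group
}

def research_groups(papers):
    """Union-find over papers; merge two papers whenever they share an author."""
    parent = {p: p for p in papers}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]   # path halving
            p = parent[p]
        return p

    by_author = defaultdict(list)
    for p, authors in papers.items():
        for a in authors:
            by_author[a].append(p)
    for coauthored in by_author.values():
        for other in coauthored[1:]:
            parent[find(coauthored[0])] = find(other)

    groups = defaultdict(set)
    for p in papers:
        groups[find(p)].add(p)
    return sorted(sorted(g) for g in groups.values())

print(research_groups(papers))
# -> [['P1', 'P2', 'P4'], ['P3']]
```

Note how P1 and P4 end up in the same group despite sharing no author directly, purely through the chain P1-Bob-P2-Carol-P4; this transitivity is characteristic of single linkage.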
Note that replication for nonexperimental investigations is an interesting and complex question that is not covered by this chapter (see, for instance, Eisenhardt [8] on replication logic for case studies).
In order for a study to be repeated it must, naturally, be reproducible; by this I mean there must be sufficient information available to enable this to happen. Schmidt [1] quotes Popper, who states "Any empirical scientific statement can be presented (by describing experimental arrangements, etc.) in such a way, that anyone who has learned the relevant technique can test it." However, this raises the question of what characteristics a study should report. There are four areas that must be considered, although their relative importance may vary from study to study:
1. Explicit research question: this needs to be carefully articulated, as exploratory trawls through data sets are difficult to replicate.
2. Data and pre-processing procedures: this militates against studies that use data not in the public domain, and such authors need to find persuasive arguments as to why using such data is necessary.
3. Experimental design: differences in design can significantly impact reliability. Note, for example, that in machine learning the impact of using leave-one-out cross-validation (LOOCV) as opposed to m × n cross-validation is not well understood.
4. Details of algorithms employed—most machine learners are complex, with large free spaces for their parameter settings. Choices are often governed by experience or trial and error.
Much of the preceding list could be simplified once specific research communities are able to define and agree upon standard reporting protocols. This would be particularly beneficial for new entrants to the field and for meta-analysts.
We need studies that are reliable, that is, if repeated they will tend to produce the same result. They also need to be valid, since reliably invalid is, to put it mildly, unhelpful! Hannay et al. [3] in their meta-analysis report that "the relatively small overall effects and large between-study variance (heterogeneity) indicate that one or more moderator variables might play a significant role." This is a similar finding to the Shepperd et al. [4] meta-analysis in that experimental results are typically confounded by differences in experimental design and conduct. This harms reliability. Likewise, Jørgensen et al. [9] randomly sampled 150 software engineering experiments from their systematic review. Again, they found significant problems with validity. Two particular problems appear to be researcher bias and low experimental power.
There is considerable consensus among researchers, both within software engineering, for example, Jørgensen et al. [9], and beyond, for example, the psychologist Schooler [10], that more replications and meta-studies are needed. In order to accomplish this I believe we need:
• Blind analysis: this is a technique to reduce researcher bias, whereby the experimental treatments are blinded by means of re-labeling so that the statistician or analyst is unaware which is the new “pet” treatment and which are the benchmarks. This makes fishing for a particular result more difficult [11].
• Agreed-upon reporting protocols: so that replications are more easily possible without having to guess as to the original researchers' implementation.
• To place more value on replications and nonsignificant results: Ioannidis [12] demonstrates—as a somewhat hypothetical exercise—how a game-theoretic analysis of the different goals of the different stakeholders in the research community can lead to problematic outcomes and that we need to therefore address the “currency” or payoff matrix. I see no easy shortcuts to achieving this but, minimally, awareness may lead to better understanding and more fruitful discussions.
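The blind analysis technique in the first bullet above can be sketched in a few lines. This is a hypothetical illustration (the treatment names and helper function are invented): the treatment labels are replaced by neutral codes before any results are examined, and the key mapping codes back to names is kept sealed until the analysis is frozen.

```python
import random

# Hedged sketch of blind analysis: relabel treatments with neutral codes so
# the analyst cannot tell which is the new "pet" technique and which are the
# benchmarks. Treatment names below are purely illustrative.

def blind(treatments, seed):
    """Return (mapping treatment -> code, sealed key code -> treatment)."""
    rng = random.Random(seed)
    codes = [f"T{i}" for i in range(1, len(treatments) + 1)]
    rng.shuffle(codes)
    mapping = dict(zip(treatments, codes))
    key = {code: name for name, code in mapping.items()}  # sealed until the end
    return mapping, key

treatments = ["our_new_classifier", "naive_bayes", "logistic_regression"]
mapping, key = blind(treatments, seed=2024)

# The analyst works only with the neutral codes; the key is opened to
# unblind the results only after the statistical analysis is frozen.
blinded_results = {mapping[t]: None for t in treatments}
print(sorted(blinded_results))   # -> ['T1', 'T2', 'T3']
```

In practice the sealed key would be held by someone other than the analyst (or committed in advance, eg, as part of a registered protocol), so that unblinding is an explicit, auditable step rather than something the analyst can do quietly mid-analysis.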
Clearly the (long-term) purpose of software engineering research is to influence its practice. Hence, practitioners are concerned with research that generates “actionable” results. This implies:
• Practitioners should be wary of implementing the results of a single study. The findings should not simply be extrapolated to all settings because there are likely to be many sources of potentially important variation between different contexts.
• Given the importance of context, “one size fits all” types of result should be treated with caution.
• More generally, we need to promote evidence-based practice [13], which in turn requires that all relevant evidence be collated, combined, and made available.