Aggregating Evidence

One way to deal with the limitations of evidence is to combine different forms or sources: to aggregate results from different studies on the same question. The notion behind combining evidence is that if different forms of evidence or evidence from independent sources agree (or at least don’t contradict each other), they gather a sort of “weight” and are more credible together.
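
One way to make that “weight” concrete is the standard fixed-effect meta-analysis calculation from statistics: independent estimates are pooled with inverse-variance weights, and the pooled estimate has a smaller standard error than any single study. The following sketch (in Python) uses made-up effect sizes and standard errors purely to illustrate the arithmetic; it is not a procedure taken from the sources discussed in this chapter.

    import math

    def combine_fixed_effect(estimates, std_errors):
        # Inverse-variance (fixed-effect) pooling of independent estimates.
        # Each study is weighted by 1/SE^2; the pooled standard error shrinks
        # as consistent, independent studies accumulate.
        weights = [1.0 / se ** 2 for se in std_errors]
        pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
        pooled_se = math.sqrt(1.0 / sum(weights))
        return pooled, pooled_se

    # Hypothetical effect sizes from three independent studies (not real data).
    effects = [0.30, 0.25, 0.40]
    ses = [0.15, 0.20, 0.10]
    print(combine_fixed_effect(effects, ses))  # pooled SE is smaller than any single SE

The point is not the particular formula but the behavior it illustrates: agreeing, independent results reinforce one another, while a single noisy study contributes comparatively little weight.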

Software engineering does not yet have a consolidated body of evidence. However, various efforts are being made at consolidation. Barbara Kitchenham and colleagues have been fueling a movement for systematic literature reviews, which examine and aggregate published evidence about a specific topic using a framework of assessment against specified criteria. For instance, Jørgensen and Shepperd’s work on cost modeling compiles and integrates the evidence comparing the performance of models and humans in estimation, and suggests that the two are roughly comparable [Jørgensen and Shepperd 2007]. Interestingly, doing systematic reviews—and doing tertiary reviews of systematic reviews, as done by Kitchenham et al.—exposes notable weaknesses in the evidence base [Kitchenham et al. 2009]:

  • Little credible evidence exists on a variety of topics in software engineering.

  • Concerns about the quality of published evidence plague the endeavor.

  • Concerns about the quality of the reporting of evidence (e.g., whether the method is described fully and accurately) limit its evaluation.

Systematic reviews are not the final say in validating research results, however. One of their weaknesses is that the way they aggregate studies makes it difficult to pay proper attention to context, whose importance in validating and applying studies has been well established in human studies. As one consequence, systematic reviewers find it difficult to handle qualitative studies, and therefore often exclude them from their reviews, thereby also excluding the evidence those studies offer. Another consequence is a danger of over-generalization when results from relevantly different contexts (say, student practice and professional practice) are put together as if those contexts were equivalent.

Methodologies—agreed systems of inquiry that include standard applications of methods—offer the important advantage of allowing researchers to compare and contrast results, so that evidence can accumulate over time and consistent evidence can provide a strong basis for knowledge. Disciplines such as chemistry, and particularly subdisciplines that have well-specified concerns and standard forms of questions, have well-defined methodologies. They may also have standard reporting practices that reinforce the standard methodologies through standard formats of reporting.

Any methodology can become a sort of lens through which researchers view the world, but it’s important to recognize that not everything will be within focal range. Blind adherence to methodology can lead to all sorts of embarrassing slips, especially when the researcher has failed to understand the conventions and assumptions that underpin that methodology. The availability of an accepted methodology does not absolve researchers from justifying the choice of technique in terms of the evidence required.

Benchmarking is one example of a method used in software engineering that produces results suitable for aggregation. It involves measuring performance according to a very precisely defined procedure, or benchmark. A good example is the SPEC CPU benchmark, which measures, despite its name, the combined performance of a CPU, memory subsystem, operating system, and compiler for CPU-heavy applications. It consists of a collection of batch-mode application programs in source form, plus detailed instructions for how to compile them, how to run them on specified inputs, and how to measure the results.

If the benchmark is well specified and fairly applied, you know very precisely what benchmarking results mean, and can without further ado compare Sun’s results to HP’s to Intel’s, and so on. This reliability and comparability are exactly what benchmarks were invented for, and they are what make benchmarks a hallmark of credibility.
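
Part of what makes such comparisons possible is that the benchmark also fixes how component results are aggregated. SPEC CPU, for example, expresses each program’s runtime as a ratio against a reference machine and reports the geometric mean of those ratios. The sketch below (in Python) illustrates that style of aggregation; the program names and timings are invented for the example, not actual SPEC data.

    from math import prod

    def spec_style_score(reference_times, measured_times):
        # Express each benchmark as the ratio of reference runtime to measured
        # runtime (higher means faster), then take the geometric mean so that
        # no single program dominates the overall score.
        ratios = [reference_times[name] / measured_times[name]
                  for name in reference_times]
        return prod(ratios) ** (1.0 / len(ratios))

    # Hypothetical runtimes in seconds; not real SPEC results.
    reference = {"progA": 1000.0, "progB": 2000.0, "progC": 1500.0}
    system_x = {"progA": 250.0, "progB": 400.0, "progC": 500.0}
    system_y = {"progA": 200.0, "progB": 500.0, "progC": 450.0}

    print(spec_style_score(reference, system_x))  # about 3.9
    print(spec_style_score(reference, system_y))  # about 4.1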

So is benchmarking a free lunch? Of course not. The real concern with benchmarking evidence is relevance: whether the measures used are appropriate for and representative of the phenomenon of interest. A benchmark’s content is always a matter of concern and is usually in dispute. SPEC CPU is a quite successful benchmark in this regard, but others, such as TPC-C for transaction loads on RDBMSes, attract a good share of skepticism.

Limitations and Bias

Even high-quality evidence is usually partial. We’re often unable to assess a phenomenon directly, and so we study those consequences of it that we can study directly, or we look at only a part of a phenomenon, or we look at it from a particular perspective, or we look at something that we can measure in the hope that it maps onto what we’re interested in. Measures are a shorthand, a compact expression or reflection of the phenomenon. But they are not the phenomenon itself; a measure is typically a simplification. Good credibility requires justifying the choices made.

Worse than that, evidence can be biased. How many software engineers would be convinced by the sort of “blind tests” and consumer experiments presented in old-fashioned detergent ads (“Duz washes more dishes...”)? Advertising regulations mean that such consumer experiments must conform to certain standards to make conditions comparable: same dirt, same water, same amount of detergent, and so on. But the advertisers are free to shape the conditions; they can optimize things like the kind of dirt and the water temperature for their product. “Hah!” we say, “the bias is built-in.” And yet many of the published evaluations of software engineering methods and tools follow the same model: wittingly or not, the setting is designed to demonstrate the virtues of the method or tool being promoted, rather than to make a fair comparison to other methods or tools based on well-founded criteria that were determined independently.

Bias occurs when things creep in unnoticed to corrupt or compromise the evidence. It is the distortion of results due to factors that have not been taken into consideration, such as other influences, conflated variables, inappropriate measures, or selectivity in a sample that renders it unrepresentative. Bias threatens the validity of research, and we look for possible bias when we’re assessing credibility.

We need to understand not just the value of specific evidence and its limitations, but also how different forms of evidence compare, and how they might fit together to compensate for the limitations of each form.
