Types of Evidence and Their Strengths and Weaknesses

For the sake of illustration, let’s consider an example. Imagine that we are assessing a new software engineering technology, AWE (A Wonderful New Excitement), which has been developed to replace BURP (Boring but Usually Respected Predecessor). What sort of evidence might we consider in deciding whether to adopt AWE? In the following sections, we describe common types of studies and evaluate each with respect to the credibility and relevance issues that typically arise.

Controlled Experiments and Quasi-Experiments

Controlled experiments are suitable if we want to perform a direct comparison of two or more conditions (such as using AWE versus using BURP) with respect to one or more criteria that can be measured reliably, such as the amount of time needed to complete a certain task. They can often also be helpful when measurement is tricky, such as when counting the number of defects in the work products produced with AWE or BURP. “Control” means keeping everything else (other than exchanging AWE for BURP) constant, which we can do straightforwardly for the work conditions and the task to be solved. For the gazillions of human variables involved, the only way to implement control is to use a group of subjects (rather than just one) and count on all the differences averaging out across that group. This hope is justified (at least statistically) if we assign the subjects to the groups at random (a randomized experiment), which makes sense only if all subjects are equally competent with AWE as they are with BURP. Randomized experiments are the only research method that can prove causation: if all we changed is AWE versus BURP, any change in the results (beyond statistical fluctuations) must have been caused by that difference.
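To make the mechanics concrete, here is a minimal sketch (in Python) of what random assignment might look like; the participant identifiers and the simple two-group split are illustrative assumptions, not part of the AWE/BURP scenario itself.

    import random

    # A minimal sketch of random assignment; the participant identifiers
    # below are invented purely for illustration.
    participants = ["p01", "p02", "p03", "p04", "p05", "p06", "p07", "p08"]

    random.shuffle(participants)            # randomize the order
    half = len(participants) // 2
    awe_group = participants[:half]         # first half works with AWE
    burp_group = participants[half:]        # second half works with BURP

    print("AWE: ", awe_group)
    print("BURP:", burp_group)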

Sometimes we can’t assign developers at random, because all we have are preexisting groups. Consider, for instance, the comparison of C, C++, Java, Perl, Python, Rexx, and TCL, described in Chapter 14. How many programmers do you know who are equally versed in all seven languages? For a randomized experiment, you would need enough of them to fill seven groups! Training enough people to that level is infeasible. In such cases, we use the groups as they are (quasi-experiment) and must be concerned about whether, say, all brainy, bright, brilliant people only use languages whose names start with B or P, which would mean that our groups reflect different talent pools and hence distort the comparison (although we could then attribute this effect to the languages themselves).

Credibility

Given a clear, well-formed hypothesis, a clean design and execution, and sensible measures, the credibility of controlled experiments is usually quite high. The setup is notionally fair and the interpretation of the results is in principle quite clear. The typical credibility issues with experiments are these:

Subject bias

Were the subjects really equally competent with AWE and BURP? If not, the experiment may reflect participant characteristics more than technology characteristics. Subtler variants of this problem involve motivational issues: perhaps BURP is boring and AWE is exciting, but this will not remain true in the long run. Such a difference will be all the more pronounced if the experimenter happens to be the inventor of AWE and is correspondingly enthusiastic about it.

Task bias

The task chosen by such an experimenter is likely to emphasize the strengths of AWE more than the strengths of BURP.

Group assignment bias

Distortions due to inherent group differences are always a risk in quasi-experiments.

Relevance

This is the real weak spot of experiments. A good experiment design is precise, and the narrowness of the experimental setting may make the results hard to apply to the outside world. Experiments are prone to a “sand through the fingers” effect: one may grasp at a concept such as productivity, but by the time the concept has been mapped through many refinements to a particular measure in a setting that has sufficient controls, it may seem to slip away.

Surveys

Surveys—simply asking many people the same questions about an issue—are the method of choice for measuring attitude. What do I think is great about AWE? About BURP? How boring do I personally find BURP? Why? Sociology and psychology have worked out an elaborate (and costly) methodology for figuring out appropriate questions to measure attitude correctly for a particular topic. Unfortunately, this methodology is typically ignored in software engineering surveys.

Surveys can also be used (if less reliably) to collect experiences: for instance, the top five things that often go wrong with AWE or BURP. Surveys scale well and are the cheapest method for involving a large set of subjects. If they contain open questions with free-text answers, they can serve as a foundation for qualitative studies as well as quantitative ones.

Credibility

Although surveys are cheap and convenient, it is quite difficult to obtain high credibility with them. The typical credibility issues are these:

Unreliable questions

Questions that are vague, misleading, or ambiguous will be interpreted differently by different respondents and lead to results that are not meaningful. These issues can be quite subtle.

Invalid questions or conclusions

This happens in particular when surveys are abused to gather information about complex factual issues: “How many defects does your typical AWE transphyxication diagram contain?” People simply do not know such things well enough to give accurate answers. This is not much of a problem if we view the answers as what they are: statements of what respondents think, which, as we stressed at the beginning of this section, reflect their attitude. Most often, however, the answers are treated as if they were facts, and the conclusions drawn from them are correspondingly wrong. (In such cases, critics will often call the survey “too subjective,” but subjectivity is supposed to be what surveys are all about! If subjectivity is a problem, the method has been abused.)

Subject bias

Sometimes respondents will not answer what they really think but rather suppress or exaggerate some issues for socio-political reasons.

Unclear target population

The results of a survey can be generalized to any target population of which the respondents are a representative sample. Such a population always exists, but the researchers can rarely characterize what it is.

To generalize credibly, two conditions must be met. First, the survey has to be sent out to a well-defined and comprehensible population. “All readers of web forums X, Y, and Z in date range D to E” is well defined, but not comprehensible; we have no clear idea who these people are. Second, a large fraction (preferably 50% or more) of this population must answer, or else the respondents may be a biased sample. Perhaps the 4% AWE-fans (“the only thing that keeps me awake at work”) and the 5% BURP-haters (“so boring that I make twice as many mistakes”) in the population felt most motivated to participate and now represent half of our sample?
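To see how a low response rate lets small but highly motivated subgroups take over a sample, consider the following back-of-the-envelope sketch in Python. The population size and the per-group response rates are invented for illustration; only the 4% and 5% subgroup shares come from the example above.

    # Back-of-the-envelope illustration of nonresponse bias.
    # Population size and response rates are assumptions for illustration.
    population = 10_000
    groups = {
        "AWE fans":      {"share": 0.04, "response_rate": 0.80},
        "BURP haters":   {"share": 0.05, "response_rate": 0.70},
        "everyone else": {"share": 0.91, "response_rate": 0.07},
    }

    respondents = {
        name: population * g["share"] * g["response_rate"]
        for name, g in groups.items()
    }
    total = sum(respondents.values())

    for name, n in respondents.items():
        print(f"{name:14s} {n:6.0f} respondents ({n / total:5.1%} of the sample)")
    print(f"overall response rate: {total / population:.1%}")

Under these assumed rates, the two fringe groups make up roughly half of the respondents even though they are only 9% of the population, at an overall response rate of about 13%.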

Relevance

The questions that can be addressed with surveys are often not the questions of most interest. The limited forms of response are often unsatisfying, can be hard to interpret satisfactorily, and are often hard to apply more generally. Surveys can be effective, however, in providing information that supplements and provides context for other forms of evidence: e.g., providing background information about participants in an experiment or providing satisfaction and preference information to supplement performance data.

Experience Reports and Case Studies

Given the limitations of experiments and surveys, and given an interest in assessments based on real-world experience, perhaps looking carefully and in depth at one or two realistic examples of adoption would give us the kind of information we need to make our decision, or at least focus our attention on the questions we would need to answer first.

A case study (or its less formal cousin, the experience report) describes a specific instance of a phenomenon (typically a series of events and their results) that happened in a real software engineering setting. In principle, a case study is the result of applying a specific, fairly sophisticated methodology, but the term is most often used more loosely. Although the phenomenon is inherently unique, it is described in the hope that other situations will be similar enough for this one to be interesting.

Case studies draw on many different kinds of data, such as conversations, activities, documents, and data records, that researchers collect by many different potential methods, such as direct observation, interviews, document analysis, and special-purpose data analysis programs. Case studies address a broad set of questions, such as: What were the main issues in this AWE process? How did AWE influence the design activities? How did it influence the resulting design? How did it structure the testing process? And so on.

In our AWE versus BURP example, BURP can be addressed either in a separate case study or in the same case study as a second case where the researchers attempt to match the structure of investigation and discussion as best they can. If the description is sufficiently detailed, we potentially can “convert” the observations to our own setting.

Credibility

For experience reports and case studies, credibility issues abound because these studies are so specific to their context. Most of the issues have to do with:

  • Selectivity in what is recorded

  • Ambiguity

Therefore, precise description of setup and data and careful interpretation of findings are of utmost importance. The credibility of case studies hinges crucially on the quality of the reporting.

Relevance

Experience reports and case studies are usually published because researchers and publishers believe them to have some general relevance. Interpreting relevance may be challenging, because there are likely to be many important differences between one setting and another. But often the relevance of the questions raised by such reports can be recognized, even if the relevance of the conclusions may be hard to ascertain.

Other Methods

The previous sections aren’t comprehensive. For instance, we might evaluate tools that support AWE and BURP for their scalability using benchmarking methods or for their learnability and error-proneness by user studies. Each of these (and other) methods and approaches has its own credibility and relevance issues.

Indications of Credibility (or Lack Thereof) in Reporting

When one takes into account the fundamental issues discussed in the previous sections, credibility largely boils down to a handful of aspects that have been done well or poorly. We now discuss what to look for in a study report with respect to each of these.

General characteristics

For a highly credible study:

  • Its report will be detailed (rather than vague) and precise (rather than ambiguous).

  • The style of discussion will be honest (rather than an exercise in window-dressing), and the logic and argumentation will be correct (rather than inconsistent, false, or inconsequential).

  • The setup of the study will maximize relevance (rather than being overly simple or over-specialized).

  • The writing style will be engaging (rather than dull), despite the dry subject matter.

A clear research question

A clearly stated and unambiguous research question is the mother of all credibility, because if the authors do not explain what they were looking for, what can they turn up for you to believe? The research question need not be announced with a fanfare, but it must be clearly discernible from the abstract and introduction.

Often there is a clearly discernible research question, but it’s vague. In such cases, the results will either be vague as well, or they will be somewhat accidental and random. Such a study may sometimes be of sufficient interest to read on, but it likely has modest credibility at best.

An informative description of the study setup

Understanding the setup of the study is at the heart of credibility and is the main difference between press releases and scientific reports. Even the most appealing result has little credibility if it was generated by staring into a crystal ball. It may be quite impressive to read that “an empirical study with almost 600 participants has shown that Java is superior to C++ in almost all respects: programming times are shorter by 11%, debugging times are shorter by 47%, and long-term design stability is better by 42%. Only the runtime performance of C++ programs is still superior to Java by 23%.”

The same results are far less credible if you know that they were found by means of a survey. And they may become still less credible if you look at the questions more closely: How did they compare programming and debugging times when the programs are all different? Ah, they asked how often these tasks took longer than expected! And what is “long-term design stability”? Ah, they asked what fraction of methods never changed! The devil is in the details: the method, the sample, the data, the analysis.

As a coarse rule of thumb, you can safely ignore all studies whose setup is not described at all, and you should be skeptical of all studies whose setup description leaves open any of your inquisitive “W-questions”: What type of study is it? What tasks have the subjects undertaken? In what kind of work context? Who were the subjects? How was the data collected? How was it validated? How, precisely, are the main measures defined? A good report on an empirical study will answer all of these questions satisfactorily.

A meaningful and graspable data presentation

Once you know how the study was constructed and how the data were collected, you need information about the data itself. There may not be space to publish the raw data, so even small studies summarize data by what statisticians call descriptive statistics.

Assume you have a study involving six different groups and five different measures of interest. Some authors will present their data as a table, with each line listing the group name, group size, and, for instance, the minimum, maximum, average, and standard deviation of one measure. The table has five blocks of six such lines each, making a table of more than 30 lines that consumes more than half a page. What do you do when confronted with such information? You could ignore the table (thus failing to scrutinize the data), or you could delve into it, trying to compare the groups for each measure. In the latter case, you will create a mental image of the overall situation by first making sense of the individual entries and then visualizing their relationships. In more colorful terms, you will reanimate the corpses from this data graveyard and then teach them to dance to your music.
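As a sketch of how such a summary block is typically produced, assuming an invented dataset with six groups and a single measure (task time in minutes), the following Python fragment computes the descriptive statistics described above; the numbers are random and carry no meaning.

    import numpy as np
    import pandas as pd

    # Invented data: six groups of 20 subjects, one measure ("task_time" in minutes).
    rng = np.random.default_rng(1)
    frames = []
    for group in ["A", "B", "C", "D", "E", "F"]:
        frames.append(pd.DataFrame({
            "group": group,
            "task_time": rng.normal(loc=60, scale=10, size=20),
        }))
    data = pd.concat(frames, ignore_index=True)

    # One block of the table: n, min, max, mean, and standard deviation per group.
    summary = data.groupby("group")["task_time"].agg(["count", "min", "max", "mean", "std"])
    print(summary.round(1))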

Preferably, the authors keep their data alive by presenting it in ways that help the reader engage: by finding visual representations that relate to the key questions and relationships the authors wish to discuss, and that relate clearly and directly to the tabular or textual data presentation. Good data visualizations associate salient visual qualities (e.g., scale, color) with information of interest in a consistent way. Poor data visualizations use visual qualities that obscure or distract viewers from the data, for example, by connecting discrete data with lines that imply some sort of continuity, by using different scales on related graphs or scales that do not start at zero, or by adding extraneous elements that are more salient than the data. Edward R. Tufte wrote the seminal book on proper visualization [Tufte 1983].
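To illustrate the alternative, here is a small matplotlib sketch (again with invented task times) that shows one box per group on a single shared scale starting at zero, so the groups can be compared at a glance; it is just one reasonable choice among many, not a recommendation drawn from any particular study.

    import numpy as np
    import matplotlib.pyplot as plt

    # Invented data again: task times (minutes) for six groups of 20 subjects.
    rng = np.random.default_rng(1)
    groups = ["A", "B", "C", "D", "E", "F"]
    samples = [rng.normal(loc=60, scale=10, size=20) for _ in groups]

    fig, ax = plt.subplots(figsize=(6, 3))
    ax.boxplot(samples, labels=groups)      # one box per group, shared scale
    ax.set_xlabel("group")
    ax.set_ylabel("task time (minutes)")
    ax.set_ylim(bottom=0)                   # scale starts at zero, as suggested above
    fig.tight_layout()
    plt.show()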

A transparent statistical analysis (if any)

Statistical analysis (inferential statistics) is about separating the “noise” from the “signal”: separating results that are likely to have arisen from chance or error from results that are reasonably likely to have arisen as a meaningful part of the phenomenon under study. Statistics are applied in order to reduce your uncertainty about how your results should be interpreted.

In contrast, many authors use statistical analysis for intimidation. They present all sorts of alpha levels, p values, degrees of freedom, residual sum of squares, parameters mu, sigma, theta, coefficients beta, rho, tau, and so on, all the way to Crete and back—all to tell you, “Dare not doubt my declamations or I’ll hit you over the head with my significance tests.” Credible studies use statistics to explain and reassure; bad ones use them to obfuscate (because the authors have weaknesses to hide) or daunt (because the authors are not so sure themselves what all this statistical hocus pocus really means).

In a good study, the authors explain in simple words each statistical inference they are using. They prefer inferences whose meaning is easy to grasp (such as confidence intervals) over inferences whose meaning is difficult to interpret (such as p-values and power, or effect sizes that are normalized by standard deviations). They will clearly interpret each result as saying one of the following: “There is probably some real difference here” (a positive result), “There is likely no effect here or only a very small one; we see mostly random noise” (a negative result), or “It remains unclear what this means” (a null result). Even the latter case is reassuring because it tells you that the uncertainty you felt about the meaning of the data when you looked at it was there for a reason and cannot be removed, even by statistical inference (at least not by this one; there may be a different analysis that could shed more light on the situation).
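As an illustration of an inference whose meaning is easy to grasp, the following sketch computes a 95% confidence interval for the difference in mean task times between two invented groups; the data and the pooled-variance t interval are assumptions made for this example, not results from any real study.

    import numpy as np
    from scipy import stats

    # Invented task times (minutes) for two groups; a real study would use its own data.
    awe  = np.array([52, 61, 58, 49, 65, 55, 60, 57, 63, 54], dtype=float)
    burp = np.array([66, 59, 72, 68, 61, 70, 64, 75, 62, 69], dtype=float)

    diff = awe.mean() - burp.mean()

    # Pooled-variance two-sample t interval (assumes roughly equal variances).
    n1, n2 = len(awe), len(burp)
    sp2 = ((n1 - 1) * awe.var(ddof=1) + (n2 - 1) * burp.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)

    low, high = diff - t_crit * se, diff + t_crit * se
    print(f"mean difference (AWE - BURP): {diff:.1f} minutes")
    print(f"95% confidence interval: [{low:.1f}, {high:.1f}] minutes")

The interval is stated directly in minutes, which is exactly the easy-to-grasp quality argued for above: if the whole interval lies below zero, the (invented) AWE group was faster, and the width of the interval shows how much uncertainty remains.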

An honest discussion of limitations

Any solid report of an empirical study needs to have a section discussing the limitations of the study, often titled “threats to validity.” This discussion offers information about what could not be achieved by the study, which of the interpretations involved are problematic (construct validity), what has gone wrong or may have gone wrong in the execution of the study (internal validity), and what limits the generalizability of the results (external validity). For a good study and report, you will often already be aware of these points, and this section will not offer much in the way of surprises. Authors of credible studies accept that possible points of criticism remain. It is not a good sign when a report attempts to discuss them all away.

If done well, the most interesting part is usually the discussion of generalizability, because it is directly relevant (forgive the pun) to the results’ relevance. Good reports will offer points both in favor of and against generalizability in several respects and for different generalization target domains.

Conclusions that are solid yet relevant

Critical readers of reports on empirical studies will form their own opinions based on their own assessment of the evidence presented, and will not rely just on the statements provided by the authors in the abstract or conclusions.

If the authors’ conclusions oversell the results, generalizing them to areas not backed by solid argumentation or even drawing conclusions regarding phenomena only loosely connected to the phenomenon being studied, you should be doubly wary regarding the credibility of the rest of the report. In particular, you should take your widest red pen and strike out the abstract and conclusion so you will never rely on them when refreshing your memory with respect to this particular study. Be advised that few readers ever do this, and yet all people need to refresh their memories. As a consequence, many references to a study in other reports or in textbooks will wrongly refer to overstated conclusions as if they were true. (Nobody’s perfect, and scientists are no exception.)

If, on the other hand, the conclusions appear thoughtful and visibly attempt to strike a good balance between reliability and relevance, their credibility is reinforced and their relevance is maximized. Such authors put both the results and the limitations of their study into one scale pan and all possible generalizations (viewed in light of other results elsewhere in the literature) into the other, and then tell you what they think are likely correct generalizations. Such information is valuable because the authors possess a lot of information about the study that is not explicit anywhere in the article but has been used for this judgment.
