Credibility and Relevance

Our discussion revolves around two central concepts: credibility and relevance.

Credibility

The degree to which you are (or should be) willing to believe the evidence offered and the claims made about it. Part of credibility is validity, the extent to which a study and the claims made from it accurately represent the phenomenon of interest. Validity has a number of facets, including:

  • Whether what you observed was what you wanted to observe and thought you were observing

  • Whether measures used actually measure what they are intended to measure

  • Whether the account of why something is happening is accurate

  • Whether we can generalize from what has been studied to other, conceptually comparable settings

Credibility requires a study to embody not just high validity but also good reporting so readers know how and when to apply the study.

Relevance

The degree to which you are (or ought to be) interested in the evidence and claims. In most cases, you won’t even look at an irrelevant study. The typical exception is when the question is of interest to you but the environment in which the answer was found is very different from yours. Relevance in that case is the degree to which the result can be generalized to your environment—which unfortunately is usually a difficult question.

Some people consider relevance to be a facet of credibility, but we believe it is helpful to keep them apart because a low-credibility statement is hardly more than noise (given the gazillions of things you would like to learn about), whereas a high-credibility, low-relevance statement is respectable information you just do not currently consider very important for your purpose—but that might change.

Fitness for Purpose, or Why What Convinces You Might Not Convince Me

Evidence is not proof. In general, evidence is whatever empirical data is sufficient to cause us to conclude that one account is more probably true than not, or is probably more true than another. Different purposes require different standards of evidence.

Some purposes need strong evidence. For example, deciding whether to switch all software development to an aspect-oriented design and implementation approach demands compelling evidence that the benefits would exceed the investment (both financial and cultural).

Some purposes need only weak evidence. For instance, an example application of aspect-orientation for tracing might be sufficient evidence that aspect-oriented programming (AOP) can be worthwhile, leaving it up to further research to clarify how well its goals are accomplished (and in what environments). An evaluation study whose goal is to identify deficiencies in a design may require responses from only a handful of participants. We know of one prototype voice feedback system for manufacturing that was abandoned after the initial users pointed out that they wore ear plugs to block out factory noise.

Some purposes need only counter-examples. For example, when one is trying to dispute an assumption or a universal claim, one needs only a single counter-example. An example is the work that put the “it’s good because it’s graphical” claims of visual programming proponents in perspective by showing in an experiment that a nested structure was comprehended more effectively using a textual representation [Green and Petre 1992].

So there’s a relationship between the question you want to answer and the respective evidence required for your particular context. And there’s a relationship between the evidence you need and the sorts of methods that can provide that evidence. For example, informants cannot tell you about their tacit knowledge when asked, which is one of the limitations of surveys. Software is created and deployed in a socio-technical context, so claims and arguments about software need to draw on evidence that takes into account both the social and the technical aspects as well as their inter-relationship.

For instance, understanding the implications of aspect-oriented design on your project’s communication structure requires observing that technology in its full context of use. Good application of methods means matching the method to the question—and to the evidence you need to answer that question for your purposes.

Quantitative Versus Qualitative Evidence: A False Dichotomy

Discussions of research often distinguish studies as quantitative or qualitative. Roughly speaking, the difference resides in the questions of interest. Quantitative studies revolve around measurement and usually ask comparison questions (“Is A faster than B?”), if questions (“If A changes, does B change?”), or how much questions (“How much of their time do developers spend debugging?”). In contrast, qualitative studies revolve around description and categorization and usually ask why questions (“Why is A easier to learn than B?”) or how questions (“How—by what methods—do developers approach debugging?”).

Like most models, this is a simplification. Some issues regarding this notional distinction lead to confusion and deserve a short discussion:

It is nonsense to say that quantitative evidence is generally better than qualitative evidence (or vice versa).

Rather, the two have different purposes and are hence not directly comparable. Qualitative evidence is required to identify phenomena of interest and to untangle phenomena that are known to be of interest but are convoluted. Quantitative evidence is for “nailing down” phenomena that are well-enough understood or simple enough to be isolated for study.

Hence, qualitative research has to precede quantitative research and will look at situations that are more complicated. When only a few factors are involved (such as in physics), one can proceed to quantitative investigation quickly; when many are involved (such as in human social interactions), the transition either takes a lot longer or will involve premature simplification. Many of the credibility problems of software engineering evidence stem from such premature simplification.

It’s not a dichotomy, but a continuum.

Qualitative and quantitative studies are not as distinct as the simplification suggests. Qualitative studies may collect qualitative data (e.g., records of utterances or actions), then code that data systematically (e.g., putting it into categories), and eventually quantify the coded data (i.e., count the instances in a category) and analyze it statistically. Quantitative studies may, in turn, have qualitative elements. For example, when one compares the efficacy of two methods A and B, one may compare the products of the methods, assessing them as “good,” “adequate,” or “poor.” The answers are qualitative; they lie on an ordinal scale. However, one may then use a statistical procedure (such as the Wilcoxon rank-sum test) to determine whether the outcome with A is significantly better than with B. The study’s structure is experimental, and the analysis is quantitative, using the same techniques as if the measure were centimeters or seconds. (Even in studies that use performance metrics, the choice of metrics and the association of metric with quality may be based on a qualitative assessment.)
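To make the step from ordinal ratings to a statistical comparison concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the procedure of any particular study: the ratings are invented, the numeric coding of “poor,” “adequate,” and “good” is our own choice, and the function name rank_sum_test is hypothetical. The sketch counts the coded ratings and then applies the Wilcoxon rank-sum test, using midranks for ties and a tie-corrected normal approximation.

```python
from collections import Counter
from math import erfc, sqrt

# Assumed ordinal coding of the qualitative ratings (illustrative only).
SCALE = {"poor": 1, "adequate": 2, "good": 3}


def rank_sum_test(a, b):
    """Wilcoxon rank-sum test with a tie-corrected normal approximation.

    Returns (w, z, p): the rank sum of sample a, the standardized
    statistic, and the two-sided p-value.
    """
    n1, n2 = len(a), len(b)
    n = n1 + n2
    pooled = sorted(a + b)

    # Assign midranks: tied values share the average of their positions.
    midrank = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                              # size of this tie group
        midrank[pooled[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        tie_term += t**3 - t
        i = j

    w = sum(midrank[x] for x in a)
    mean = n1 * (n + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:                               # every observation is tied
        return w, 0.0, 1.0
    z = (w - mean) / sqrt(var)
    p = erfc(abs(z) / sqrt(2))                 # two-sided normal tail
    return w, z, p


# Invented ratings of the products of methods A and B.
ratings_a = ["good", "good", "adequate", "good",
             "adequate", "good", "good", "poor"]
ratings_b = ["poor", "adequate", "poor", "adequate",
             "poor", "good", "adequate", "poor"]

# Quantify the coded data: count the instances in each category.
counts_a = Counter(ratings_a)
counts_b = Counter(ratings_b)

scores_a = [SCALE[r] for r in ratings_a]
scores_b = [SCALE[r] for r in ratings_b]
w, z, p = rank_sum_test(scores_a, scores_b)

print("A:", dict(counts_a))
print("B:", dict(counts_b))
print(f"W = {w}, z = {z:.2f}, two-sided p = {p:.3f}")
```

Only the order of the three categories matters here: any coding that preserves poor < adequate < good gives the same ranks and therefore the same result. With samples this small and this heavily tied, the normal approximation is rough, so an exact or permutation version of the test would be the more careful choice in a real analysis.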

Some of the most powerful studies of software systems combine quantitative evidence (e.g., performance measures) with qualitative evidence (e.g., descriptions of process) to document a phenomenon, identify key factors, and offer a well-founded explanation of the relationship between factors and phenomenon.

Qualitative studies are not necessarily “soft,” nor are quantitative ones “hard.”

The results of good studies are not at all arbitrary. Good science seeks reproducible (“hard”) results because reproducing results suggests that they are reliable, and the process of reproduction exposes the study method to critical scrutiny. Reproducing a result means repeating (“replicating”) the respective study, either under conditions that are almost exactly the same (close replication) or in a different but like-minded manner (loose replication). Qualitative studies almost always involve unique human contexts and are therefore hard to replicate closely. However, that does not mean the result of the study cannot be reproduced; it often can. And from the point of view of relevance, a loose replication is even more valuable than a close one because it signals that the result is more generalizable. On the other hand, the results of quantitative studies sometimes cannot be reproduced. For example, John Daly’s experiments on inheritance depth were replicated by three different research groups, all with conflicting results [Prechelt et al. 2003].

Summing up, the strength of quantitative studies is that they capture a situation in a few simple statements and thus can sometimes make things very clear. Their disadvantage is that they ignore so much information that it is often difficult to decide what the results actually mean and when they apply. The strength of qualitative studies is that they reflect and exhibit the complexity of the real world in their results. The disadvantage is that they are therefore much harder to evaluate. Any research results may be hard to apply in real life if it is not clear how the results map onto one’s real-world context, either because an experiment is too narrow to generalize or because an observational setting is too different.

The “bottom line” is that the method must be appropriate for the question. Issues about social context, process, and the contrast between what people believe they do and what they actually do require different forms of inquiry than issues about algorithm performance.
