Lies, damned lies, and analytics

Why big data needs thick data

M.-A. Storey    University of Victoria, Victoria, BC, Canada

Abstract

Software analytics and the use of computational methods on “big data” in software engineering are transforming the ways software is developed, used, improved, and deployed. Software engineering researchers and practitioners are witnessing an increasing trend in the availability of diverse trace and operational data and of methods to analyze it. This information is being used to paint a picture of how software is engineered and to suggest ways it may be improved. But we have to remember that software engineering is inherently a socio-technical endeavour, with complex practices, activities, and cultural aspects that cannot be externalized or captured by tools alone. In this short article, I suggest that we consider the risks of blindly applying software analytics techniques and that we augment software big data with “thick data” gathered directly by observing or listening to software stakeholders.

Keywords

Qualitative data; Reality distortion field; Ethnography; Thick data

How Great It Is, to Have Data Like You

Software Analytics and the use of computational methods on “big” data in Software Engineering are transforming the way software is developed, used, improved, and deployed [1].

We now have vast amounts of software data at our fingertips: data on usage patterns, development methods, software quality, user opinions, and more. This is data at a scale that researchers and practitioners alike could only dream of in the past. But more than that, we also have sophisticated and “intelligent” methods for mining, cleaning, classifying, clustering, predicting, recommending, sharing, and visualizing this data. We can discover unproductive developers and teams, identify development processes and programming languages that lead to buggier software, spot unusable or insecure software features, and recommend how the software should be used in different contexts. There is still much to be done to improve the types and the scale of data that are collected and to improve tools for analyzing such data, but the path to our desired future is clearly illuminated, isn’t it?

In this short chapter, I ask you, the reader, to consider the risks of following and pursuing this utopian Software Analytics Path and to ponder: are we asking the right questions, are we answering the questions correctly, are we anticipating the impact these answers may have, and, more importantly, are we ready to handle the inevitable changes these potentially disruptive insights may bring?

Looking for Answers in All the Wrong Places

As humans we often have a tendency to “look for the car keys where the light shines” or to chase after the “low hanging fruit.” Indeed, some technologists may be particularly attracted by the “shininess” of data that are quantifiable and relatively easy to collect. But important insights that our stakeholders may care about will often lie in qualitative data that are unstructured, messy, and resistant to automatic collection and analysis methods.

Consider, for example, data from A/B testing: although these data may help a designer understand which feature is preferred, there may be “thick” qualitative data from blog posts and tweets that provide insights into why a particular feature is shunned by its users. Qualitative data require the application of sound and rigorous manual analysis methods to make sense of them [2]. But a warning: these manual qualitative methods are both time consuming and expensive, and such efforts to reveal rich and insightful stories are unfortunately often not valued by stakeholders who tend to trust only “numbers” and “statistics.”
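
To make the contrast concrete, here is a minimal Python sketch (the variant names, visitor counts, and conversion counts are entirely hypothetical): a two-proportion z-test can tell us which variant converts better, but nothing in these numbers explains why users shun the losing feature; that “why” has to come from thick data such as posts, tweets, or interviews.

# A minimal sketch with hypothetical A/B data: the z-score says *which*
# variant is preferred, but not *why* the other one is shunned.
from math import sqrt

conversions = {"A": 480, "B": 530}    # hypothetical conversion counts
visitors = {"A": 5000, "B": 5000}     # hypothetical sample sizes

p_a = conversions["A"] / visitors["A"]
p_b = conversions["B"] / visitors["B"]
p_pool = (conversions["A"] + conversions["B"]) / (visitors["A"] + visitors["B"])
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors["A"] + 1 / visitors["B"]))
z = (p_b - p_a) / se

print(f"conversion A = {p_a:.1%}, B = {p_b:.1%}, z = {z:.2f}")
# The z-score indicates whether B's advantage is statistically credible;
# the reasons behind it still require qualitative ("thick") data.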

Beware the Reality Distortion Field

But no matter which data are analyzed (quantitative or qualitative), it is important for software analysts to do a reality check and to ask whether the data under consideration are really bringing about an “epiphany” or an “apophany” [3]. That is, when we have so much data, meaningless patterns and correlations may emerge. The New York Times mentioned this example when discussing the limitations of big data: “The first thing to note is that although big data is very good at detecting correlations, especially subtle correlations that an analysis of smaller data sets might miss, it never tells us which correlations are meaningful. A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it’s hard to imagine there is any causal relationship between the two” [4].
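
As a toy illustration of this apophany risk, the following Python sketch uses made-up, purely illustrative numbers (not the actual murder-rate or browser-share figures) to show how two unrelated series that both happen to decline over the same years can exhibit a near-perfect correlation.

# Two fabricated, purely illustrative series that both decline from 2006 to
# 2011: their correlation is strongly positive even though no causal link
# exists between them.
import numpy as np

years = np.arange(2006, 2012)
murder_rate = np.array([5.8, 5.7, 5.4, 5.0, 4.8, 4.7])           # illustrative values only
ie_market_share = np.array([80.0, 75.0, 68.0, 60.0, 52.0, 45.0])  # illustrative values only

r = np.corrcoef(murder_rate, ie_market_share)[0, 1]
print(f"Pearson correlation over {years[0]}-{years[-1]}: {r:.2f}")  # close to 1.0, yet not causal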

Furthermore, we may start our exploration of a phenomenon with an awareness that the metrics or measurements defined are only loosely connected to the concepts to be studied, but over time this connection becomes hardened and the risks of relying on the construct are long forgotten. The answers that emerge may be completely wrong (verging on lies) if the constructs that are used are poor representations of reality (eg, number of commits as a proxy for developer productivity). The data that can be collected may also have hidden biases (eg, some developers may game the number of commits when they realize commits are being counted), or the insights gained may be completely wrong (eg, if layers of tool integrations result in multiple counts of certain commits).
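
The following Python sketch (the event records and field names are hypothetical, not drawn from any particular tool) illustrates how the last of these problems can creep in: a naive commit count is inflated when integrated tools report the same commit more than once, and even the deduplicated count remains a weak construct for “productivity.”

# Hypothetical commit events as they might arrive from several integrated tools.
from collections import Counter

commit_events = [
    {"sha": "a1f3", "author": "kim"},
    {"sha": "a1f3", "author": "kim"},   # same commit, re-reported by another tool
    {"sha": "b7c9", "author": "kim"},
    {"sha": "c2d4", "author": "lee"},
]

# Naive metric: count every event.
naive = Counter(event["author"] for event in commit_events)

# Deduplicate by commit hash before counting.
deduped = Counter()
seen = set()
for event in commit_events:
    if event["sha"] not in seen:
        seen.add(event["sha"])
        deduped[event["author"]] += 1

print("naive counts:  ", dict(naive))    # {'kim': 3, 'lee': 1}
print("deduped counts:", dict(deduped))  # {'kim': 2, 'lee': 1}
# Neither number says anything reliable about how productive kim or lee actually is.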

Finally, software analysts need to remember that although “data” may be rational, the creators of much of this data (the humans), whether they create it passively or actively, may not be rational and may not do things in the way we anticipate [5]. All human stakeholders may have hidden motivations that even they do not wish to acknowledge, or they may hold their own distorted views of reality.

Build It and They Will Come, but Should We?

Even assuming we can find a reasonably fair way to represent and provide insights on some version of “reality,” should we? We sometimes witness a “build it because we can” attitude with the tools and methods software analysts develop. But access to data—especially derived data—brings many potential opportunities for misuse, or worse, for abuse, especially when there are power imbalances or politics at play.

We have yet to consider the ethics behind the kinds of trace data—ie, data arising from the direct or indirect actions of one or more humans—that can be collected, aggregated, analyzed, repurposed, and presented. In some cases, it is of course very hard, if not impossible, to predict the negative implications that may arise. But software analysts need to acquire skills in recognizing and anticipating some of these ethical issues, or to form alliances with those who do. The fallout from a “mood manipulation” study at Facebook led to data scientists across many domains calling for an ethical framework or set of policies they can use as guidance (http://www.informationweek.com/big-data/big-data-analytics/data-scientists-want-big-data-ethics-standards/d/d-id/1315798). Some domains, notably health, do have such guidelines in place, but this is not the case for many other domains, such as web-based businesses.

To Classify Is Human, but Analytics Relies on Algorithms

Even when we are armed with the best intentions and do pay attention to ethical issues, there may be other risks lurking in the shadows.

Human beings have an innate desire not just to count things, but also to classify them. Bowker and Star, in their landmark work [6], warn us about the drastic implications that some classification systems have had on the world we live in and on people. They describe how health classifications (such as those going back to the 1900s) are used to determine which diseases are “counted” and thus recognized, so that their treatments will be paid for by insurance companies. People suffering from rare diseases may be out of luck, as such diseases do not occur frequently enough to be part of a classification scheme designed to support statistics.

Similarly, when we use software analytics and choose to count some features but not others, we make decisions about what is valued and what is not. Managers, for example, may implicitly make decisions about the value of certain activities their employees do: they may count the number of code reviews done in a week, signaling that those are valued, but they may not count (because they can’t!) the number of times those same employees mentored newcomers to do more effective code reviews.

Furthermore, when humans make judgments, we, as humans ourselves, may have at least some insight into the biases they are likely to have. But when judgments are made by, or based on, algorithms, many of the biases will be opaque to the consumers of those analytics [7].

Lean in: How Ethnography Can Improve Software Analytics and Vice Versa

Software analytics is an important field of research and practice, there is no doubt. But how can it be improved? It is too simplistic to invoke the mythical quantitative-qualitative divide in software engineering. The more important distinction to consider here is who generates the data. Is it the participants of the phenomenon under study who create the trace data (or, as McGrath refers to them, “outcroppings of past behavior” [8]), or are the data generated by researchers (eg, through interviews, surveys, or observations in the field)?

This latter kind of “thick” data (much of which may be qualitative, but not always) may be much harder to collect and analyze (http://www.gousios.gr/blog/Scaling-qualitative-research/), but it can be used to augment and enrich the big data that are harvested and analyzed, forming richer insights about the software scenario under study. Ethnographic methods [9], which involve observations or contextual interviews [10] with stakeholders in their workplace, can be used to inform which questions we should spend time trying to answer, which data we should collect (and, conversely, which data we should not collect), what the meaning of the collected data is, and how those insights should be shared and used.

Moreover, data may tell us what is going on, but they won’t necessarily tell us why a phenomenon is happening nor how we can fix a noted problem [11]. For example, we may be able to tell which developers engage in code reviews, but from the code reviewing data that can be captured we cannot tell why some do not do code reviews.

On the other side, ethnography can likewise benefit from data science! Ethnographic methods can be subject to respondent and researcher biases, as well as to issues with ambiguity and lack of precision in the data collected. Furthermore, findings from such methods can be dangerous to generalize to broader populations of actors. In light of these limitations, data science is being seen by many social scientists as the “new kid on the block,” and many ethnographers are thus turning to “ethnomining” methods to enhance the work they do and to benefit from the big data they can gain access to (http://ethnographymatters.net/editions/ethnomining/). There is no doubt that this big data can increase the reliability of many kinds of insights and speed up their discovery, an improvement that is much appreciated in today’s world of rapid technology development, diffusion, and adoption.

Finding the Ethnographer Within

So far we have asked you, the reader, to consider the risks that may arise from developing and using software analytics methods, and we have made a call for software analysts to partner with ethnographers to improve the questions that are asked, to bring more meaning to the data being analyzed, and to call into question the ethics of the analytics being applied. In an ideal world, both approaches would be applied in tandem by experts from both disciplines. This is perhaps more feasible, and more important, in research settings than in practice, as research relies on social theories as a foundation for analytics.

In terms of practitioners, we do recognize that using ethnographic methods to collect thick data to enrich big data is challenging, expensive, and time consuming, and therefore in many cases just not feasible. Instead, we suggest that practitioner software analysts spend much more time thinking like an ethnographer: seeking out stories about real people that will bring richness to the data being analyzed, and considering and imagining the possible short- and long-term implications of the analytics that may arise.

Tool builders should consider not just the analyst perspective but also that of the consumers of the analyses, which may be presented to them through reports, visualizations, or interactive and customizable dashboards. Do these tools support stakeholders in adding “meaning” to these reports, dashboards, and visualizations through conversations, annotations, and links to other resources? Should these tools be designed so that data concerning their own use can be analyzed to gain more insights into how software analytics is used?

In summary, we make a plea for software analytics researchers and practitioners to consider these issues carefully and to discuss them, sharing positive and negative experiences with one another.

References

[1] Zhang D., Xie T. Software analytics in practice: mini-tutorial at ICSE 2012; 2012.

[2] Seaman C. Qualitative methods in empirical studies of software engineering. IEEE Trans Softw Eng. 1999;25(4):557–572.

[3] Boyd D., Crawford K. Six provocations for big data. In: A decade in internet time: symposium on the dynamics of the internet and society; 2011.

[4] Marcus G., Davis E. Eight (no, nine!) problems with big data. New York Times; April 6, 2014.

[5] Harper R., Bird C., Zimmermann T., Murphy B. Dwelling in software: aspects of the felt-life of engineers in large software projects. In: Proceedings of the 13th European conference on computer supported cooperative work (ECSCW '13); Springer; 2013.

[6] Bowker G.C., Star S.L. Sorting things out: classification and its consequences. Boston, MA: MIT Press; 1999.

[7] Tufekci Z. Algorithms in our midst: information, power and choice when software is everywhere. In: Proceedings of the 18th ACM conference on computer supported cooperative work & social computing; ACM; 2015:1918.

[8] McGrath J.E. Methodology matters: doing research in the behavioral and social sciences. In: Baecker R.M., Grudin J., eds. Human-computer interaction. San Francisco, CA: Morgan Kaufmann Publishers Inc.; 1995:152–169.

[9] Fetterman D.M. Ethnography: step-by-step. vol. 17. London: Sage; 2010.

[10] Holtzblatt K., Jones S. Contextual inquiry: a participatory technique for system design. In: Participatory design: principles and practices. CRC Press; 1993:177–210.

[11] Easterbrook S., Singer J., Storey M.-A., Damian D. Selecting empirical methods for software engineering research. In: Guide to advanced empirical software engineering. London: Springer; 2008:285–311.
