Foreword

Through the last decades of the twentieth century and into the twenty-first, data was largely a medium for bottom-line accounting: making sure that the books were balanced, the rules were followed, and the right numbers could be rolled up for executive decision-making. It was an era focused on a select group of IT staff engineering the “golden master” of organizational data; an era in which mantras like “garbage in, garbage out” captured the attitude that only carefully engineered data was useful.

Attitudes toward data have changed radically in the past decade, as new people, processes, and technologies have come forward to define the hallmarks of a data-driven organization. In this context, data is a medium for top-line value generation, providing evidence and content for the design of new products, new processes, and ever more efficient operation. Today’s data-driven organizations have analysts working broadly across departments to find methods to use data creatively. It is an era in which new mantras like “extracting signal from the noise” capture a different attitude of agile experimentation and exploitation of large, diverse sources of data.

Of course, accounting still needs to get done in the twenty-first century, and the need remains to curate select datasets. But the data sources and processes for accountancy are relatively small and slow to change. The data that drives creative and exploratory analyses represents an (exponentially!) growing fraction of the data in most organizations, driving widespread rethinking of processes for data and computing—including the way that IT organizations approach their traditional tasks.

The phrase data wrangling, born in the modern context of agile analytics, is meant to describe the lion’s share of the time people spend working with data. There is a common misperception that data analysis is mostly a process of running statistical algorithms on high-performance data engines. In practice, this is just the final step of a longer and more complex process; 50 to 80 percent of an analyst’s time is spent wrangling data to get it to the point at which this kind of analysis is possible. Not only does data wrangling consume most of an analyst’s workday, it also represents much of the analyst’s professional process: it captures activities like understanding what data is available; choosing what data to use and at what level of detail; understanding how to meaningfully combine multiple sources of data; and deciding how to distill the results to a size and shape that can drive downstream analysis. These activities represent the hard work that goes into both traditional data “curation” and modern data analysis. And in the context of agile analytics, these activities also capture the creative and scientific intuition of the analyst, which can dictate different decisions for each use case and data source.

We have been working on these issues with data-centric folks of various stripes—from the IT professionals who fuel data infrastructure in large organizations, to professional data analysts, to data-savvy “enthusiasts” in roles from marketing to journalism to science and social causes. Much is changing across the board here. This book is our effort to wrangle the lessons we have learned in this context into a coherent overview, with a specific focus on the more recent and quickly growing agile analytic processes in data-driven organizations. Hopefully, some of these lessons will help to clarify the importance—and yes, the satisfaction—of data wrangling done well.
