Foreword

It is common knowledge that every computer scientist, programmer, statistician, informatician, and knowledge domain expert must know how to analyze data. It is less commonly known that very few of these data professionals are given the skills required to prepare their data in a form that supports credible analysis. As Berman points out, data analysis has no value unless it can be verified, validated, reanalyzed, and repurposed, as needed. Data Simplification: Taming Information with Open Source Tools is a practical guide to the messiness of complex and heterogeneous data used in discovery science (ie, our optimistic resolve to understand what our data is trying to tell us). Berman successfully makes the case for data simplification as a discipline within data science.

In this important work, Berman deals with the practical aspects of complex data sets and creates a workflow for tackling problems in so-called big data research. No book to date has effectively dealt with the sources of data complexity in such a comprehensive, yet practical fashion. Speaking from my own area of involvement, biomedical researchers wrestling with genome/imaging/computational phenotype analyses will find Berman's approach to data simplification particularly constructive.

The book opens with a convincing demonstration that complex data requires simplification in order to answer high impact questions. Berman shows that the process of data simplification is not, itself, simple. He provides a set of principles, methods and tools to unlock the secrets of “big data.” More importantly, he provides a roadmap to the use of free, open source tools in the data simplification process, skills that need to be emphasized to the data science community irrespective of scientific discipline. It is fair to acknowledge that our customary reliance on costly and “comprehensive” software/development solutions will sometimes increase the likelihood that a data project will fail.

As there is a “gold rush” encouraging the workforce training of data scientists, this gritty “Rules of the Road” monograph should serve as a constant companion for modern data scientists. Berman convincingly portrays the value of programmers and analysts who have facility with Perl, Python, or Ruby and who understand the critical role of metadata, indexing, and data visualization. These professionals will be high on my shopping list of talent to add to our biomedical informatics team in Pittsburgh.

Data science is currently the focus of an intense, worldwide effort extending to all biomedical institutions. It seems that we have reached a point where progress in the biomedical sciences is paused, waiting for us to draw useful meaning from the dizzying amount of new data being collected by high throughput technologies, electronic health records, mobile medical sensors, and the exabytes generated from imaging modalities in research and clinical practice. Here at the University of Pittsburgh, we are deeply involved in the efforts of the U.S. National Institutes of Health to tame complex biomedical data, through our membership in the NIH Big Data to Knowledge (BD2K) Consortium (https://datascience.nih.gov/bd2k). Our continuing fascination with “more” data and “big” data has been compounded by the amplification and hype of an array of software tools and solutions that claim to “solve” big data problems. Although much of data science innovation focuses on hardware, cloud computing, and novel algorithms to solve BD2K problems, the critical issues remain at the level of the utility of the data (eg, simplification) addressed in this important book by Berman.

Data Simplification provides easy, free solutions to the unintended consequences of data complexity. This book should be the first (and probably most important) guide to success in the data sciences. I will be providing copies to my trainees, programmers, analysts, and faculty, as required reading.

Michael J. Becich, MD, PhD, Associate Vice-Chancellor for Informatics in the Health Sciences, Chairman and Distinguished University Professor, Department of Biomedical Informatics, Director, Center for Commercial Application (CCA) of Healthcare Data, University of Pittsburgh School of Medicine
