This book is about data science: a field that uses results from statistics, machine learning, and computer science to create predictive models. Because data science is so broad, we begin by discussing the field and outlining the approach we take in this book.
The statistician William S. Cleveland defined data science as an interdisciplinary field larger than statistics itself. We define data science as managing the process that can transform hypotheses and data into actionable predictions. Typical predictive analytic goals include predicting who will win an election, what products will sell well together, which loans will default, or which advertisements will be clicked on. The data scientist is responsible for acquiring the data, managing the data, choosing the modeling technique, writing the code, and verifying the results.
Because data science draws on so many disciplines, it’s often a “second calling.” Many of the best data scientists we meet started as programmers, statisticians, business intelligence analysts, or scientists. By adding a few more techniques to their repertoire, they became excellent data scientists. That observation drives this book: we introduce the practical skills needed by the data scientist by concretely working through all of the common project steps on real data. Some steps you’ll know better than we do, some you’ll pick up quickly, and some you may need to research further.
Much of the theoretical basis of data science comes from statistics. But data science as we know it is strongly influenced by technology and software engineering methodologies, and has largely evolved in groups that are driven by computer science and information technology. We can call out some of the engineering flavor of data science by listing some famous examples:
These systems share a lot of features:
This book teaches the principles and tools needed to build systems like these. We teach the common tasks, steps, and tools used to successfully deliver such projects. Our emphasis is on the whole process—project management, working with others, and presenting results to nonspecialists.
This book covers the following:
We’ve arranged the book topics in an order that we feel increases understanding. The material is organized as follows.
Part 1 describes the basic goals and techniques of the data science process, emphasizing collaboration and data.
Chapter 1 discusses how to work as a data scientist, and chapter 2 works through loading data into R and shows how to start working with R.
Chapter 3 teaches what to first look for in data and the important steps in characterizing and understanding data. Data must be prepared for analysis, and data issues will need to be corrected, so chapter 4 demonstrates how to handle those things.
Part 2 moves from characterizing data to building effective predictive models. Chapter 5 supplies a starting dictionary mapping business needs to technical evaluation and modeling techniques.
Chapter 6 teaches how to build models that rely on memorizing training data. Memorization models are conceptually simple and can be very effective. Chapter 7 moves on to models that have an explicit additive structure. Such functional structure adds the ability to usefully interpolate and extrapolate situations and to identify important variables and effects.
Chapter 8 shows what to do in projects where there is no labeled training data available. Advanced modeling methods that increase prediction performance and fix specific modeling issues are introduced in chapter 9.
Part 3 moves away from modeling and back to process. We show how to deliver results. Chapter 10 demonstrates how to manage, document, and deploy your models. You’ll learn how to create effective presentations for different audiences in chapter 11.
The appendixes include additional technical details about R, statistics, and more tools that are available. Appendix A shows how to install R, get started working, and work with other tools (such as SQL). Appendix B is a refresher on a few key statistical ideas. Appendix C discusses additional tools and research ideas. The bibliography supplies references and opportunities for further study.
The material is organized in terms of goals and tasks, bringing in tools as they’re needed. The topics in each chapter are discussed in the context of a representative project with an associated dataset. You’ll work through 10 substantial projects over the course of this book. All the datasets referred to in this book are at the book’s GitHub repository, https://github.com/WinVector/zmPDSwR. You can download the entire repository as a single zip file (one of GitHub’s services), clone the repository to your machine, or copy individual files as needed.
To work the examples in this book, you’ll need some familiarity with R, statistics, and (for some examples) SQL databases. We recommend you have some good introductory texts on hand. You don’t need to be an expert in R, statistics, and SQL before starting the book, but you should be comfortable tutoring yourself on topics that we mention but can’t cover completely in our book.
For R, we recommend R in Action, Second Edition, by Robert Kabacoff (www.manning.com/kabacoff2/), along with the text’s associated website, Quick-R (www.statmethods.net). For statistics, we recommend Statistics, Fourth Edition, by David Freedman, Robert Pisani, and Roger Purves. For SQL, we recommend SQL for Smarties, Fourth Edition, by Joe Celko.
In general, here’s what we expect from our ideal reader:
This book is not an R manual. We use R to concretely demonstrate the important steps of data science projects. We teach enough R for you to work through the examples, but a reader unfamiliar with R will want to refer to appendix A as well as to the many excellent R books and tutorials already available.
This book is not a set of case studies. We emphasize methodology and technique. Example data and code are given only to make sure we’re giving concrete, usable advice.
This book is not a big data book. We feel most significant data science occurs at a database- or file-manageable scale (often larger than memory, but still small enough to be easy to manage). Valuable data that maps measured conditions to dependent outcomes tends to be expensive to produce, and that tends to bound its size. For some report generation, data mining, and natural language processing, you’ll have to move into the area of big data.
This is not a theoretical book. We don’t emphasize the absolutely rigorous theory of any one technique. The goal of data science is to be flexible, have a number of good techniques available, and be willing to research a technique more deeply if it appears to apply to the problem at hand. We prefer R code notation over beautifully typeset equations even in our text, as the R code can be directly used.
This is not a machine learning tinkerer’s book. We emphasize methods that are already implemented in R. For each method, we work through the theory of operation and show where the method excels. We usually don’t discuss how to implement these methods (even when implementation is easy), as that information is readily available.
This book is example driven. We supply prepared example data at the GitHub repository (https://github.com/WinVector/zmPDSwR), with R code and links back to original sources. You can explore this repository online or clone it onto your own machine. We also supply the code to produce all results and almost all graphs found in the book as a zip file (https://github.com/WinVector/zmPDSwR/raw/master/CodeExamples.zip), since copying code from the zip file can be easier than copying and pasting from the book. You can also download the code from the publisher’s website at www.manning.com/PracticalDataSciencewithR.
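As a minimal sketch of fetching the code archive from within R, the snippet below wraps the download in a small helper. The helper name `fetch_examples` and the directory name in the example call are our own illustrative choices, not part of the book’s code, and the URL is simply the one quoted above, which may have moved since publication.

```r
# URL of the code archive as given in the text (assumed to still resolve).
zip_url <- "https://github.com/WinVector/zmPDSwR/raw/master/CodeExamples.zip"

# Hypothetical helper (not from the book): download the archive into
# dest_dir, unpack it there, and return the extracted file paths.
fetch_examples <- function(dest_dir = tempdir()) {
  # Create the target directory if it does not exist yet.
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  zip_path <- file.path(dest_dir, "CodeExamples.zip")
  # mode = "wb" keeps the binary zip file intact on Windows.
  download.file(zip_url, destfile = zip_path, mode = "wb")
  unzip(zip_path, exdir = dest_dir)
}

# Example call (requires network access):
# files <- fetch_examples("zmPDSwR_code")
```

Keeping the download inside a function means nothing is fetched until you call it, so you can pick the destination directory first.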
We encourage you to try the example R code as you read the text; even when we discuss fairly abstract aspects of data science, we illustrate examples with concrete data and code. Every chapter includes links to the specific dataset(s) that it references.
In this book, code is set with a fixed-width font like this to distinguish it from regular text. Concrete variables and values are formatted similarly, whereas abstract math will be in italic font like this. R is a mathematical language, so many phrases read correctly in either font. In our examples, any prompts such as > and $ are to be ignored. Inline results may be prefixed by R’s comment character #.
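As a small illustration of these conventions (our own toy example, not one of the book’s listings), the following R snippet is shown without the > prompt, and the printed result appears on a line prefixed with the comment character #:

```r
# A toy numeric vector; the values are made up for illustration.
x <- c(1, 2, 3)

# Typing this at the R console prints the result, shown here as a comment.
mean(x)
# [1] 2
```

Because the result line is a comment, the whole block can be pasted into an R session unchanged.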
To work through our examples, you’ll need a computer running Linux, OS X, or Windows, with the required software installed (installation is described in appendix A). All of the software we recommend is fully cross-platform, freely available, and usually open source.
We suggest installing at least the following:
The purchase of Practical Data Science with R includes free access to a private web forum run by Manning Publications, where you can make comments about the book, ask technical questions, and receive help from the authors and from other users. To access the forum and subscribe to it, point your web browser to www.manning.com/PracticalDataSciencewithR. This page provides information on how to get on the forum once you are registered, what kind of help is available, and the rules of conduct on the forum.
Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the authors can take place. It is not a commitment to any specific amount of participation on the part of the authors, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the authors some challenging questions lest their interest stray!
The Author Online forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.
NINA ZUMEL has worked as a scientist at SRI International, an independent, nonprofit research institute. She has worked as chief scientist of a price optimization company and founded a contract research company. Nina is now a principal consultant at Win-Vector LLC. She can be reached at [email protected].
JOHN MOUNT has worked as a computational scientist in biotechnology and as a stock trading algorithm designer, and has managed a research team for Shopping.com. He is now a principal consultant at Win-Vector LLC. John can be reached at [email protected].