About this Book

This book is about data science: a field that uses results from statistics, machine learning, and computer science to create predictive models. Because of the broad nature of data science, it’s important to discuss it a bit and to outline the approach we take in this book.

What is data science?

The statistician William S. Cleveland defined data science as an interdisciplinary field larger than statistics itself. We define data science as managing the process that can transform hypotheses and data into actionable predictions. Typical predictive analytic goals include predicting who will win an election, what products will sell well together, which loans will default, or which advertisements will be clicked on. The data scientist is responsible for acquiring the data, managing the data, choosing the modeling technique, writing the code, and verifying the results.

Because data science draws on so many disciplines, it’s often a “second calling.” Many of the best data scientists we meet started as programmers, statisticians, business intelligence analysts, or scientists. By adding a few more techniques to their repertoire, they became excellent data scientists. That observation drives this book: we introduce the practical skills needed by the data scientist by concretely working through all of the common project steps on real data. Some steps you’ll know better than we do, some you’ll pick up quickly, and some you may need to research further.

Much of the theoretical basis of data science comes from statistics. But data science as we know it is strongly influenced by technology and software engineering methodologies, and has largely evolved in groups that are driven by computer science and information technology. We can call out some of the engineering flavor of data science by listing some famous examples:

  • Amazon’s product recommendation systems
  • Google’s advertisement valuation systems
  • LinkedIn’s contact recommendation system
  • Twitter’s trending topics
  • Walmart’s consumer demand projection systems

These systems share a lot of features:

  • All of these systems are built on large datasets. That’s not to say they’re all in the realm of big data. But none of them could’ve been successful if they’d only used small datasets. To manage the data, these systems require concepts from computer science: database theory, parallel programming theory, streaming data techniques, and data warehousing.
  • Most of these systems are online or live. Rather than producing a single report or analysis, the data science team deploys a decision procedure or scoring procedure to either directly make decisions or directly show results to a large number of end users. The production deployment is the last chance to get things right, as the data scientist can’t always be around to explain defects.
  • All of these systems are allowed to make mistakes at some non-negotiable rate.
  • None of these systems are concerned with cause. They’re successful when they find useful correlations and are not held to correctly sorting cause from effect.

This book teaches the principles and tools needed to build systems like these. We teach the common tasks, steps, and tools used to successfully deliver such projects. Our emphasis is on the whole process—project management, working with others, and presenting results to nonspecialists.

Roadmap

This book covers the following:

  • Managing the data science process itself. The data scientist must have the ability to measure and track their own project.
  • Applying many of the most powerful statistical and machine learning techniques used in data science projects. Think of this book as a series of explicitly worked exercises in using the programming language R to perform actual data science work.
  • Preparing presentations for the various stakeholders: management, users, deployment team, and so on. You must be able to explain your work in concrete terms to mixed audiences with words in their common usage, not in whatever technical definition is insisted on in a given field. You can’t get away with just throwing data science project results over the fence.

We’ve arranged the book topics in an order that we feel increases understanding. The material is organized as follows.

Part 1 describes the basic goals and techniques of the data science process, emphasizing collaboration and data.

Chapter 1 discusses how to work as a data scientist, and chapter 2 works through loading data into R and shows how to start working with R.

Chapter 3 teaches what to look for first in data and the important steps in characterizing and understanding it. Data must be prepared for analysis and data issues corrected, so chapter 4 demonstrates how to do both.

Part 2 moves from characterizing data to building effective predictive models. Chapter 5 supplies a starting dictionary mapping business needs to technical evaluation and modeling techniques.

Chapter 6 teaches how to build models that rely on memorizing training data. Memorization models are conceptually simple and can be very effective. Chapter 7 moves on to models that have an explicit additive structure. Such functional structure adds the ability to usefully interpolate and extrapolate situations and to identify important variables and effects.

Chapter 8 shows what to do in projects where there is no labeled training data available. Advanced modeling methods that increase prediction performance and fix specific modeling issues are introduced in chapter 9.

Part 3 moves away from modeling and back to process. We show how to deliver results. Chapter 10 demonstrates how to manage, document, and deploy your models. You’ll learn how to create effective presentations for different audiences in chapter 11.

The appendixes include additional technical details about R, statistics, and more tools that are available. Appendix A shows how to install R, get started working, and work with other tools (such as SQL). Appendix B is a refresher on a few key statistical ideas. Appendix C discusses additional tools and research ideas. The bibliography supplies references and opportunities for further study.

The material is organized in terms of goals and tasks, bringing in tools as they’re needed. The topics in each chapter are discussed in the context of a representative project with an associated dataset. You’ll work through 10 substantial projects over the course of this book. All the datasets referred to in this book are at the book’s GitHub repository, https://github.com/WinVector/zmPDSwR. You can download the entire repository as a single zip file (one of GitHub’s services), clone the repository to your machine, or copy individual files as needed.
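As one way to get the data (a sketch only, assuming an internet connection; cloning the repository with Git or copying individual files works equally well), you can fetch and unpack the zip archive from within R:

```r
# Download the book's repository as a single zip file (one of GitHub's
# services) and unpack it into the current working directory.
download.file(
  "https://github.com/WinVector/zmPDSwR/archive/master.zip",
  destfile = "zmPDSwR.zip", mode = "wb"
)
unzip("zmPDSwR.zip")   # extracts into a zmPDSwR-master/ directory
```

After unpacking, the individual datasets referenced in each chapter are available as local files.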

Audience

To work the examples in this book, you’ll need some familiarity with R, statistics, and (for some examples) SQL databases. We recommend you have some good introductory texts on hand. You don’t need to be an expert in R, statistics, and SQL before starting the book, but you should be comfortable tutoring yourself on topics that we mention but can’t cover completely in our book.

For R, we recommend R in Action, Second Edition, by Robert Kabacoff (www.manning.com/kabacoff2/), along with the text’s associated website, Quick-R (www.statmethods.net). For statistics, we recommend Statistics, Fourth Edition by David Freedman, Robert Pisani, and Roger Purves. For SQL, we recommend SQL for Smarties, Fourth Edition by Joe Celko.

In general, here’s what we expect from our ideal reader:

  • An interest in working examples. By working through the examples, you’ll learn at least one way to perform all steps of a project. You must be willing to attempt simple scripting and programming to get the full value of this book. For each example we work, you should try variations and expect both some failures (where your variations don’t work) and some successes (where your variations outperform our example analyses).
  • Some familiarity with the R statistical system and the will to write short scripts and programs in R. In addition to Kabacoff, we recommend a few good books in the bibliography. We work specific problems in R; to understand what’s going on, you’ll need to run the examples and read additional documentation to understand variations of the commands we didn’t demonstrate.
  • Some experience with basic statistical concepts such as probabilities, means, standard deviations, and significance. We introduce these concepts as needed, but you may need to read additional references as we work through examples. We define some terms and refer to some topic references and blogs where appropriate. But we expect you will have to perform some of your own internet searches on certain topics.
  • A computer (OS X, Linux, or Windows) to install R and other tools on, as well as internet access to download tools and datasets. We strongly suggest working through the examples, examining R help() on various methods, and following up some of the additional references.

What is not in this book?

This book is not an R manual. We use R to concretely demonstrate the important steps of data science projects. We teach enough R for you to work through the examples, but a reader unfamiliar with R will want to refer to appendix A as well as to the many excellent R books and tutorials already available.

This book is not a set of case studies. We emphasize methodology and technique. Example data and code are given only to make sure we’re giving concrete, usable advice.

This book is not a big data book. We feel most significant data science occurs at a scale that a database or files can comfortably manage (often larger than memory, but still small enough to be easy to handle). Valuable data that maps measured conditions to dependent outcomes tends to be expensive to produce, and that tends to bound its size. For some report generation, data mining, and natural language processing tasks, you’ll have to move into the area of big data.

This is not a theoretical book. We don’t emphasize the absolute rigorous theory of any one technique. The goal of data science is to be flexible, have a number of good techniques available, and be willing to research a technique more deeply if it appears to apply to the problem at hand. We prefer R code notation over beautifully typeset equations even in our text, as the R code can be directly used.

This is not a machine learning tinkerer’s book. We emphasize methods that are already implemented in R. For each method, we work through the theory of operation and show where the method excels. We usually don’t discuss how to implement them (even when implementation is easy), as that information is readily available.

Code conventions and downloads

This book is example driven. We supply prepared example data at the GitHub repository (https://github.com/WinVector/zmPDSwR), with R code and links back to original sources. You can explore this repository online or clone it onto your own machine. We also supply the code to produce all results and almost all graphs found in the book as a zip file (https://github.com/WinVector/zmPDSwR/raw/master/CodeExamples.zip), since copying code from the zip file can be easier than copying and pasting from the book. You can also download the code from the publisher’s website at www.manning.com/PracticalDataSciencewithR.

We encourage you to try the example R code as you read the text; even when we discuss fairly abstract aspects of data science, we illustrate examples with concrete data and code. Every chapter includes links to the specific dataset(s) that it references.

In this book, code is set with a fixed-width font like this to distinguish it from regular text. Concrete variables and values are formatted similarly, whereas abstract math will be in italic font like this. R is a mathematical language, so many phrases read correctly in either font. In our examples, any prompts such as > and $ are to be ignored. Inline results may be prefixed by R’s comment character #.
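For example, a line of code and its result, rendered under these conventions, might look like the following (the computation itself is purely illustrative):

```r
# Code appears in fixed-width font; R's printed result follows,
# prefixed by the comment character #.
mean(c(1, 2, 3, 4))
# [1] 2.5
```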

Software and hardware requirements

To work through our examples, you’ll need some sort of computer (Linux, OS X, or Windows) with software installed (installation is described in appendix A). All of the software we recommend is fully cross-platform, freely available, and usually open source.

We suggest installing at least the following:

  • R itself: http://cran.r-project.org.
  • Various packages from CRAN (installed by R itself using the install.packages() command and activated using the library() command).
  • Git for version control: http://git-scm.com.
  • RStudio, an integrated editor, execution, and graphing environment: http://www.rstudio.com.
  • A bash shell for system commands. This is built in for Linux and OS X, and can be added to Windows by installing Cygwin (http://www.cygwin.com). We don’t write any scripts, so an experienced Windows shell user can skip installing Cygwin if they’re able to translate our bash commands into the appropriate Windows commands.
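As a sketch of the CRAN package workflow from the list above (ggplot2 is only an illustrative package choice; substitute whatever package a chapter calls for):

```r
# Install a package from CRAN (needed only once per machine).
install.packages("ggplot2")

# Attach the package in each R session where you want to use it.
library(ggplot2)
```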

Author Online

The purchase of Practical Data Science with R includes free access to a private web forum run by Manning Publications, where you can make comments about the book, ask technical questions, and receive help from the authors and from other users. To access the forum and subscribe to it, point your web browser to www.manning.com/PracticalDataSciencewithR. This page provides information on how to get on the forum once you are registered, what kind of help is available, and the rules of conduct on the forum.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the authors can take place. It is not a commitment to any specific amount of participation on the part of the authors, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the authors some challenging questions lest their interest stray!

The Author Online forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.

About the authors

NINA ZUMEL has worked as a scientist at SRI International, an independent, nonprofit research institute. She has worked as chief scientist of a price optimization company and founded a contract research company. Nina is now a principal consultant at Win-Vector LLC. She can be reached at [email protected].

JOHN MOUNT has worked as a computational scientist in biotechnology and as a stock trading algorithm designer, and has managed a research team for Shopping.com. He is now a principal consultant at Win-Vector LLC. John can be reached at [email protected].
