Chapter 17: Statistical Storytelling

The Path from Multivariate Data to the Modeling Process

Early Applications of Data Mining

Numerous JMP Customer Stories of Modern Applications

Definitions of Data Mining

Data Mining

Predictive Analytics

A Framework for Predictive Analytics Techniques

The Goal, Tasks, and Phases of Predictive Analytics

The Difference between Statistics and Data Mining

SEMMA

The Path from Multivariate Data to the Modeling Process

As you read the earlier chapters, you likely came to realize that we feel strongly that, before discussing predictive analytics or performing a modeling project, you need to understand how to deal with multivariate data. That is one of this book’s main objectives. In particular, you need a foundation beyond the univariate/bivariate analysis taught in a basic statistics course to understand some of the issues involved in addressing real-world (that is, multivariate) data. We hope the previous chapters have achieved that goal. Now you are better prepared to understand data mining (or predictive analytics and modeling) and to conduct a modeling project.

We conclude here with a basic overview of data mining (or predictive analytics or predictive modeling) and of the modeling process using JMP.

For the past 25 years, the big buzzword in the business analytics (BA) area has been data mining. The roots of data mining techniques run deep and can be traced back to three areas: statistics, artificial intelligence (AI), and machine learning. All data mining tools and techniques have a strong foundation in classical statistical analysis. In the 1970s and 1980s, AI techniques based on heuristics that attempted to simulate human thought processes were developed. Subsequently, the field of machine learning, which is the union of statistics and AI, evolved. An example of machine learning is a computer program that learns more about the game of chess as it plays more and more games.

Early Applications of Data Mining

Two areas of early successful application of data mining have been credit card fraud detection and customer relationship management (CRM).

Credit Card Fraud Detection

By analyzing customers’ historical buying patterns, data mining models identify potential credit card fraud in transactions that are out of the “norm.” For example, suppose you have never traveled to South America, but you want to go to the World Cup in Brazil in 2014. So, you book a flight to Brazil and charge it on your credit card. While in Brazil, you think it would be nice to go on a few side trips, for example, to see Iguassu Falls. You then book a flight, accommodations, and a tour package with your credit card. Subsequently, you receive an email from your credit card company saying that your card transactions are temporarily suspended and asking you to contact the company. The credit card company wants to make sure your card has not been stolen, because you were making significant purchases outside your normal spending pattern. Data mining models are used to flag such behavior.

Customer Relationship Management

The other area of early successful data mining applications is customer relationship management (CRM). CRM is a process or business strategy taken by companies to improve overall customer satisfaction, especially for their best customers. For example, large companies with multiple product offerings might have several customers that buy products across the company’s product line. However, each division of the company might have its own sales and support staff as well as its own independent database. A CRM solution to this situation would be one company-wide database that allows everyone in the company to access the data on a particular customer, which improves customer satisfaction and promotes cross-selling opportunities.

Numerous JMP Customer Stories of Modern Applications

SAS provides numerous customer stories of successful data mining or statistical applications that use JMP in several industry areas (aerospace, conservation, education, energy, genomics, government, health care, manufacturing, pharmaceuticals, and semiconductor) and by statistical application areas (see JMP Customer Stories: http://www.jmp.com/en_us/success.html).

Definitions of Data Mining

Defined broadly, data mining is a process of finding patterns in data to help us make better decisions. Or more simply, as a good old friend of ours would say, it is mining data. In a nutshell, he is basically right. Furthermore, as Stuart Ewen wrote in the New York Times, “Probably at no time in the last decade has the actual knowledge of consumer buying habits been as vital to successful and profitable retailing as it is today” (1996, p. 184).

However, this statement was first written in 1931. So data mining is not new, and successful decision makers have always done it. Then why has the area of data mining grown so much recently? What has changed? Three things have come together to amplify this area called data mining:

   The confluence of the three areas of data mining: statistics, AI, and machine learning.

   The exponential increase in computing power.

   The scale of data accumulation.

Nonetheless, data mining is not the current buzzword anymore. It has been replaced by the terms predictive analytics and predictive modeling. What is the difference among these terms? As with the many evolving definitions of business intelligence discussed in Chapter 1, these terms seem to have many different yet quite similar definitions. One SAS expert defines these terms as follows.

Data Mining

Data mining has been defined in a lot of ways, but at the heart of all of these definitions is a process for analyzing data that typically includes the following steps:

   Formulate the problem.

   Accumulate data.

   Transform and select data.

   Train models.

   Evaluate models.

   Deploy models.

   Monitor results.

Predictive Analytics

Predictive analytics is an umbrella term that encompasses both data mining and predictive modeling—as well as a number of other analytical techniques. We define predictive analytics as a collection of statistics and data mining techniques that analyze data to make predictions about future events.

Predictive modeling is one such technique that answers questions such as the following:

   Who's likely to respond to a campaign?

   How much do first-time purchasers usually spend?

   Which customers are likely to default?

Predictive analytics is a subset of analytics, which more broadly includes other areas of statistics like experimental design, time series forecasting, operations research, and text analytics.1

A Framework for Predictive Analytics Techniques

There appear to be two common major characterizations of the terms data mining, predictive analytics, and predictive modeling. You can view them as a collection of advanced statistical techniques. Or you can view them as a modeling process.

In terms of a collection of advanced statistical techniques, several approaches have been used to classify data mining, predictive analytics, and predictive modeling.2 We categorize these predictive analytics techniques into supervised (directed) or unsupervised (undirected) learning techniques as shown in Figure 17.1.

Figure 17.1: A Framework for Predictive Analytics Techniques

image

With the unsupervised learning techniques, there is no target, or dependent variable (or variables). An example of an unsupervised learning predictive analytics technique is association rules (or market basket analysis or affinity grouping). With the association rules technique, we try to identify which things (in most cases, products) go together. For example, when you go grocery shopping, which products are sold together? An example would be milk and cereal or the unexpected classic data mining example of diapers and beer.
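To make the idea of "which things go together" concrete, here is a minimal Python sketch of three measures commonly used to score association rules: support, confidence, and lift. It is not JMP output and not a full association rules algorithm. The six baskets, the function names, and the choice of Python are all our assumptions, made up purely for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, invented for illustration only.
baskets = [
    {"milk", "cereal", "bread"},
    {"milk", "cereal"},
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"bread", "milk"},
    {"diapers", "beer", "cereal"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Share of baskets containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence relative to how common the consequent is overall."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

# Count how often each pair of products appears in the same basket.
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

print(pair_counts.most_common(3))
print(support({"diapers", "beer"}, baskets))
print(confidence({"diapers"}, {"beer"}, baskets))
print(lift({"diapers"}, {"beer"}, baskets))
```

A lift above 1 suggests that the two products appear together more often than their individual popularity alone would explain. Notice that no dependent variable appears anywhere in the sketch, which is what makes the technique unsupervised.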

With supervised learning techniques, the goal is to develop a model that describes what affects one variable of interest (and occasionally more than one). The variable of interest is called the dependent variable. The goal is to establish one or more significant relationships among the other variables (called independent variables) and this dependent variable. We have examined several such supervised techniques in this book:

   regression

   logistic regression

   ANOVA

   decision trees

   k-nearest neighbors

   neural networks

   bootstrap forests

   boosted trees

The decision trees, k-nearest neighbors, neural networks, bootstrap forests, and boosted trees techniques are usually considered supervised learning predictive analytics techniques.

Another step found in many data mining/predictive analytics projects, especially when you have a large data set, is to divide the data set into training and validation sets and then compare the various models, as discussed in Chapter 14.
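The following Python sketch illustrates the train-and-validate idea outside of JMP. It is only an illustration under stated assumptions: the data is synthetic, the 70/30 split is an arbitrary choice, and the two candidate models (a mean-only baseline and a simple least-squares line) and all function names are ours, not taken from Chapter 14.

```python
import random
from statistics import mean

rng = random.Random(17)  # fixed seed (arbitrary) so the sketch is repeatable

# Synthetic data, invented for illustration: y = 2x + 1 plus random noise.
xs = [rng.uniform(0, 10) for _ in range(200)]
data = [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in xs]

# Randomly hold out 30% of the rows for validation (assumed split).
rng.shuffle(data)
cut = int(0.7 * len(data))
train, valid = data[:cut], data[cut:]

def fit_mean(rows):
    """Baseline model: always predict the training mean of y."""
    y_bar = mean(y for _, y in rows)
    return lambda x: y_bar

def fit_line(rows):
    """Simple least-squares regression of y on x."""
    x_bar = mean(x for x, _ in rows)
    y_bar = mean(y for _, y in rows)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    sxx = sum((x - x_bar) ** 2 for x, _ in rows)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def rmse(model, rows):
    """Root mean squared prediction error of a model on the given rows."""
    return mean((y - model(x)) ** 2 for x, y in rows) ** 0.5

# Fit each candidate on the training rows only, then score it on the held-out rows.
models = {"mean only": fit_mean(train), "linear": fit_line(train)}
scores = {name: rmse(model, valid) for name, model in models.items()}
best = min(scores, key=scores.get)

print(scores)
print("best on validation data:", best)
```

The point of the design is that the models never see the validation rows while being fitted, so the comparison reflects how each model performs on data it has not seen.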

Finally, notice that, in classifying and listing these predictive analytics techniques (see Figure 17.1), we include the basic statistical tools and techniques that you learned in an introductory statistics course, as well as the techniques for addressing dirty data and the multivariate techniques discussed in this book. These tools and techniques are also part of predictive analytics and the modeling process.

The Goal, Tasks, and Phases of Predictive Analytics

The goal of these advanced statistical techniques, whether supervised or unsupervised, is to extract information from the data. The six main tasks of predictive analytics with their associated activities are as follows:

Discovery

describes, summarizes, and visualizes the data and develops a basic understanding of their relationships.

Classification

places each object into a predefined set of classes or groups.

Estimation

is similar to classification but the dependent/target variable is continuous.

Clustering

segments each object into a number of subgroups or clusters, with the difference between classification and clustering being that with clustering the classes or groups are not predefined but are developed by the technique.

Association

determines which items go together (for example, which items are concurrently bought together).

Prediction

identifies variables that are related to one or more other variables so as to predict or estimate their future values.

The tasks of discovery, clustering, and association are all examples of unsupervised (undirected) learning. The other three tasks—classification, estimation, and prediction—are examples of supervised (directed) learning.

In this text, you have examined several of the fundamental-to-advanced statistical techniques:

   discovery tools

   clustering

   principal component analysis and factor analysis

   ANOVA

   regression

   logistic regression

   decision trees

   k-nearest neighbors

   neural networks

   bootstrap forests

   boosted trees

   model comparison

JMP is a comprehensive statistical and predictive analytics package. So, in addition to the JMP techniques and tools discussed in the text, JMP provides other predictive analytics/multivariate techniques such as conjoint analysis (in particular, discrete choice analysis), as well as several other statistical techniques.

The Difference between Statistics and Data Mining

So, what are the differences between statistics and data mining/predictive analytics/predictive modeling? This question is difficult to answer. First, both disciplines share numerous similar tools and techniques. However, both disciplines are much more than several tools and techniques. The major differences seem to lie in their objectives and processes.

The broadening of the definition of predictive analytics from a collection of statistical techniques to a process is the second point of view of predictive analytics. This broadening reflects the maturity of the discipline.

Berry and Linoff (2004) define data mining as “a business process for exploring a large amount of data to discover meaningful patterns and rules.” The phases of the data mining process are listed in Table 17.1. This process is not necessarily linear. That is, you do not always proceed from one phase to the next listed phase. Many times, if not most of the time, depending on the phase’s results, the data mining project might require you to go back one or more phases. The process is usually iterative.

Table 17.1: The Data Mining Process and the Percentage of Time Spent on Each Phase

Phase                               Time (%)
Project definition                  5
Data collection                     20
Data preparation                    30
Data understanding                  20
Model development and evaluation    20
Implementation                      5

As you can see from Table 17.1, what we have discussed in this book concerns only 20% of the time spent on a data mining project: model development and evaluation. While “data understanding” does require some use of statistics (scatterplots, univariate summary statistics, and the like), easily 50% of the analyst's time will be spent on the mundane and tedious tasks of data preparation and data understanding. You addressed some of these issues in Chapter 3. But note that little is written about data preparation and data understanding, which makes learning about these topics difficult. There is a notable exception, though: the excellent book by Pyle (1999). We recommend it to anyone who wishes to understand the basics of data preparation and understanding.
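The arithmetic behind these statements can be checked in a few lines. The sketch below simply copies the percentages from Table 17.1 into a Python dictionary; the variable names are ours.

```python
# Percentages copied from Table 17.1.
phase_time = {
    "Project definition": 5,
    "Data collection": 20,
    "Data preparation": 30,
    "Data understanding": 20,
    "Model development and evaluation": 20,
    "Implementation": 5,
}

# The phases should account for the whole project.
total = sum(phase_time.values())

# Share of time spent on data preparation and data understanding.
prep_and_understanding = (
    phase_time["Data preparation"] + phase_time["Data understanding"]
)

# Share of time spent on the phase this book concentrates on.
modeling = phase_time["Model development and evaluation"]

print(total, prep_and_understanding, modeling)
```

The phases sum to 100%, data preparation and data understanding together account for 50%, and model development and evaluation accounts for 20%.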

Often in statistical studies, the study’s objectives are well defined, so the project is well focused and directed. The data is collected to answer the study’s specific questions. A major focus of most statistical studies and processes is to draw inferences about the population based on the sample.

By contrast, many predictive analytics projects not only have a significantly large data set but also, in many cases, have data that is the entire population. This makes statistical inference a moot point. The data in a predictive analytics project is rarely collected with a well-defined objective of analysis. It is usually retrieved from several data sources, and it is most likely dirty data. As a result, unlike most statistical studies, the data must be integrated from these different sources and appropriately aggregated. Just as in statistical studies, the data in a data mining project must be cleaned and prepared for analysis. However, due to the numerous sources of data and the usually larger number of variables, this phase of the process is much more labor intensive. Both processes share the same concern: to develop an understanding, description, and summary of the data.

SEMMA

The primary phase of the data mining/predictive analytics modeling process, which many people would define as data mining/predictive analytics (the first point of view), is the model development and evaluation phase. This phase might account for only about 20% of the project’s overall efforts (mainly because of the large amount of effort to integrate and prepare the data).

SAS Institute Inc. developed a systematic approach to this phase of the data mining process called SEMMA (Azevedo and Santos, 2008):

S—Sample. If possible (that is, if you have a large enough data set), extract a sample that contains the significant information, yet is small enough to process quickly. The part of the data set that remains can be used to validate and test the model developed.

E—Explore. Use discovery tools and various data reduction tools to further understand data and search for hidden trends and relationships.

M—Modify. Create, transform, and group variables to enhance the analysis.

M—Model. Choose and apply one or more appropriate data mining techniques.

A—Assess. Build several models using multiple techniques; evaluate; assess the usefulness; and compare the models’ results. If a small portion of the large data set was set aside during the sample stage, validate and test the model.

Once the “best” model is identified, the model is deployed, and the ROI from the data mining process is realized.

The objective of the model development and evaluation phase is to uncover unsuspected but valuable relationships. So you search until you find a model that fits the data set reasonably well, yet is not overly complex and does not overfit the data. Statisticians become concerned with such a data-driven analysis approach to obtain a good fit because they are aware that such a search could lead to relationships that happen purely by chance. Unlike most statistical studies, predictive analytics projects are less focused on statistical significance and more on practical importance, that is, on obtaining answers that will improve decision making. Nevertheless, even though objectives and processes might differ, the bottom line of statistical studies and data mining projects is to learn from the data.
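The risk of fitting the data too closely can be seen in a small Python sketch. It uses a bare-bones k-nearest neighbors predictor on synthetic data; the data, the seed, the 70/30 split, the two values of k, and the function names are all our assumptions for illustration. This is not how JMP implements the technique.

```python
import random
from statistics import mean

rng = random.Random(229)  # fixed seed, an arbitrary choice

# Synthetic noisy data, invented for illustration: y = 2x + 1 plus noise.
xs = [rng.uniform(0, 10) for _ in range(200)]
rows = [(x, 2 * x + 1 + rng.gauss(0, 3)) for x in xs]

# Hold out 30% of the rows for validation (assumed split).
rng.shuffle(rows)
cut = int(0.7 * len(rows))
train, valid = rows[:cut], rows[cut:]

def knn_predict(train_rows, x, k):
    """Average y over the k training rows whose x is closest to the query x."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    return mean(y for _, y in nearest)

def rmse(train_rows, eval_rows, k):
    """Root mean squared error of the k-nearest neighbors predictions."""
    return mean(
        (y - knn_predict(train_rows, x, k)) ** 2 for x, y in eval_rows
    ) ** 0.5

# Compare the error on the rows used for fitting with the error on held-out rows.
for k in (1, 15):
    print(
        f"k={k}: training RMSE={rmse(train, train, k):.3f}, "
        f"validation RMSE={rmse(train, valid, k):.3f}"
    )
```

With k = 1, each training row is its own nearest neighbor, so the training error is exactly zero even though the data is noisy. The error on the held-out rows is not zero, which is the signature of overfitting. A larger k typically narrows the gap between the two errors, although the exact numbers depend on the data.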

We hope this book has provided you with a foundation to conduct a statistical study (or a predictive analytics project) and planted the seeds on how to write a statistical story.

Happy storytelling!

1 http://www.sas.com/news/sascom/2010q1/column_tech.html

2 Remember that we use these terms interchangeably, to mean the same thing.
