Data mining overview

In MicroStrategy, predictive analysis is part of Data Mining Services, so doing predictive analysis well requires knowing the basics of data mining. This section covers those basics.

Purpose of data mining

The purpose of data mining is to discover useful information in data that helps us make better business decisions. Data mining is also about automation, reusable methods, and saving time. We want to deploy a pre-built model in our database and automate the mathematical and statistical processes that predict future outcomes; in short, we want to build an automated predictive mechanism. Examples include instant credit scoring, fraud detection, marketing campaign management, interactive marketing, market basket analysis, failure rate prediction, and so on.

Limitations of data mining

First, data mining reveals patterns, but we are responsible for deciding how valuable these patterns are. Second, data quality affects model quality; poor data will produce poor predictive models. Third, the accuracy of predictions depends on the discovered patterns and trends remaining unchanged. If there is a structural change (a concept you may know from the Chow test in econometrics), large discrepancies between actual values and model predictions should be expected. Fourth, we need to understand data mining tools before using them. Last but not least, data mining may raise legal issues and ethical challenges.

Terminologies

Let us review some data mining terms that are frequently used in MicroStrategy.

Target variable, explanatory variable

Consider a simple linear model:

[Figure: Target variable, explanatory variable]

On the left-hand side, the probability of Response is called the dependent variable, or target variable. On the right-hand side, Age, Gender, and Income are called independent variables, predictors, explanatory variables, predictive variables, or input variables.
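One plausible way to write such a model is sketched below. The coefficient symbols $\beta_0, \dots, \beta_3$ and the error term $\varepsilon$ are notation assumed here for illustration; only the variable names come from the description above.

```latex
\[
P(\text{Response}) = \beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{Gender} + \beta_3\,\text{Income} + \varepsilon
\]
```

Here the left-hand side is the target variable, and each term multiplied by a coefficient on the right-hand side is an explanatory variable.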

Continuous variable, categorical variable

A continuous variable holds numerical values on which we can perform arithmetic operations, for example, price, income, and salary. A categorical variable holds non-numerical values, such as true or false, or numeric values that are just labels, for example, 1 for January and 2 for February, social security numbers, and US zip codes.
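The distinction matters in practice because the two kinds of variables are processed differently. The following is a minimal Python sketch, assuming a few made-up customer records; the field names and the `one_hot` helper are hypothetical and not part of MicroStrategy.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer records, invented for illustration only.
customers = [
    {"income": 52000.0, "zip_code": "10001", "birth_month": 1},
    {"income": 61000.0, "zip_code": "94105", "birth_month": 2},
    {"income": 48000.0, "zip_code": "10001", "birth_month": 1},
]

# Continuous variable: arithmetic such as averaging is meaningful.
average_income = mean(c["income"] for c in customers)

# Categorical variable: averaging zip codes would be meaningless,
# so we count how often each label occurs instead.
zip_counts = Counter(c["zip_code"] for c in customers)


def one_hot(value, categories):
    """Encode a label as 0/1 indicators, one per known category."""
    return [1 if value == category else 0 for category in categories]


# Month numbers are labels, so they are encoded rather than used as quantities.
month_categories = sorted({c["birth_month"] for c in customers})
encoded_months = [one_hot(c["birth_month"], month_categories) for c in customers]

print(average_income)
print(zip_counts.most_common(1))
print(encoded_months)
```

Storing the zip code as a string is a deliberate choice in this sketch: it prevents accidental arithmetic on a value that is only a label.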

Training, validation, modeling

To increase confidence in our models, we often go through training, validation, and modeling stages. Training uses part of the data with known target values, say 80%, to build the model. Validation applies the model to the remaining data with known target values, say 20%, and compares the computed results with the actual results. Modeling applies the model to data whose target values are unknown.
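The three stages can be sketched in plain Python as follows. This is an illustration under stated assumptions, not MicroStrategy's implementation: the data is synthetic, the model is a one-variable least-squares line, and the names `fit_line`, `training`, `validation`, and `new_inputs` are hypothetical.

```python
import random


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Synthetic data with known targets: y = 3 + 2x plus random noise.
data = [(x, 3.0 + 2.0 * x + rng.gauss(0, 1)) for x in range(100)]
rng.shuffle(data)

# Training uses 80% of the rows; validation uses the remaining 20%.
split = int(0.8 * len(data))
training, validation = data[:split], data[split:]

# Training: build the model from rows with known targets.
intercept, slope = fit_line(training)

# Validation: compare computed results with the actual known targets.
validation_error = sum(
    abs(y - (intercept + slope * x)) for x, y in validation
) / len(validation)

# Modeling: apply the model to inputs whose targets are unknown.
new_inputs = [120, 150]
predictions = [intercept + slope * x for x in new_inputs]

print(f"slope={slope:.3f}, validation MAE={validation_error:.3f}")
print(predictions)
```

Shuffling before splitting keeps the validation rows from being systematically different from the training rows, which would otherwise distort the comparison.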

The effectiveness of going through these steps, in my humble opinion, is doubtful, because how accurately our models predict depends on two things and two things only: first, the data quality; second, the discovered patterns and trends remaining unchanged. Even if our model gives fabulous prediction results in the validation stage, its predictions could fail miserably when applied to out-of-sample data, because the patterns and trends in the data we built the model upon may differ from those in the data we apply the model to.

The solution is to ensure the data we built the model upon is representative enough. That is, it has the same patterns and trends as the data we intend to apply our model to. This is a challenging task.
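The concern about changing patterns can be illustrated with a small sketch. It assumes noise-free, made-up data in which the relationship between input and target changes from one period to the next; the slopes, ranges, and function names are hypothetical choices for illustration.

```python
def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_abs_error(points, intercept, slope):
    """Average absolute gap between actual values and model predictions."""
    return sum(abs(y - (intercept + slope * x)) for x, y in points) / len(points)


# Data the model is built upon: y = 10 + 2x.
built_upon = [(x, 10.0 + 2.0 * x) for x in range(20)]

# New data that follows the same pattern.
same_pattern = [(x, 10.0 + 2.0 * x) for x in range(20, 30)]

# New data after a structural change: the slope is now 5 instead of 2.
changed_pattern = [(x, 10.0 + 5.0 * x) for x in range(20, 30)]

intercept, slope = fit_line(built_upon)
error_same = mean_abs_error(same_pattern, intercept, slope)
error_changed = mean_abs_error(changed_pattern, intercept, slope)

print(error_same, error_changed)
```

The model fits the data it was built upon perfectly, yet its error on the changed data is large, which is the failure mode described above.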

Supervised learning, unsupervised learning

According to Pattern Recognition and Machine Learning (Bishop, Springer-Verlag New York), supervised learning applies when the input data comes with known target values, while unsupervised learning applies when no target values are known. The goal in unsupervised learning problems may be to discover groups of similar data points, which is called clustering.
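To make clustering concrete, here is a minimal sketch of k-means on one-dimensional data. The algorithm choice, the spending figures, and the starting centers are assumptions for illustration; the text above does not prescribe a particular clustering method.

```python
def k_means_1d(values, centers, iterations=10):
    """Simple k-means for one-dimensional data with given starting centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical monthly spend per customer; note there is no target variable.
monthly_spend = [12, 15, 14, 11, 95, 102, 98, 99]

centers, clusters = k_means_1d(monthly_spend, centers=[0.0, 50.0])
print(centers)
print(clusters)
```

With these values the procedure settles on two groups, low spenders centered at 13.0 and high spenders centered at 98.5, without ever being told which customer belongs where.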

Classification, prediction

When the target variable is categorical, we often refer to the related data mining techniques as classification, for example, predicting whether a customer will churn or not. Classification techniques include decision trees and logistic regression.

When the target variable is continuous, we often refer to the related data mining techniques as prediction, for example, predicting the monthly revenues for the next year. Prediction techniques include linear regression, time series analysis, and so on.
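The contrast between the two kinds of output can be sketched as follows. The classification part uses a one-level decision tree (a decision stump) and the prediction part uses a naive three-month moving average; both are simplified stand-ins chosen for brevity, and all data and names are hypothetical.

```python
from statistics import mean


def fit_stump(samples):
    """Pick the threshold that maximizes accuracy for the rule
    'label is True when value >= threshold'."""
    best_threshold, best_accuracy = None, -1.0
    for threshold in sorted({value for value, _ in samples}):
        correct = sum((value >= threshold) == label for value, label in samples)
        accuracy = correct / len(samples)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy


# Classification: categorical target (churned or not).
# Each pair is (months since last purchase, churned), invented for illustration.
history = [
    (1, False), (2, False), (3, False), (5, False),
    (6, True), (8, True), (9, True), (12, True),
]
churn_threshold, training_accuracy = fit_stump(history)


def predict_churn(months_inactive):
    """Return a category: True for churn, False otherwise."""
    return months_inactive >= churn_threshold


# Prediction: continuous target (revenue).
# A naive forecast that averages the last three observed months.
monthly_revenue = [100.0, 110.0, 120.0, 130.0]
next_month_forecast = mean(monthly_revenue[-3:])

print(churn_threshold, predict_churn(7), predict_churn(4))
print(next_month_forecast)
```

The classifier returns a label, while the forecast returns a number; that difference in output type is what separates classification from prediction in the sense used above.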

Data mining techniques

Some commonly used techniques include decision trees, regression models, cluster models, and neural networks. The following graph shows the level of prediction accuracy of the different techniques and how hard each is to understand. In general, a decision tree is easy to understand but the least accurate, a neural network is the most accurate but also the most difficult to understand, while regression models and cluster models are somewhere in between:

[Figure: Data mining techniques]
