CHAPTER TWO

An Overview of Data Mining Techniques

SUPERVISED MODELING

In supervised modeling, whether for the prediction of an event or of a continuous numeric outcome, a training dataset with historical data is required. Models learn from past cases. In order for predictive models to associate input data patterns with specific outcomes, they must be presented with cases with known outcomes. This phase is called the training phase. During that phase, the predictive algorithm builds the function that connects the inputs with the target field. Once the relationships are identified and the model is evaluated and proved to be of satisfactory predictive power, the scoring phase follows. New records, for which the outcome values are unknown, are presented to the model and scored accordingly.

Some predictive models such as regression and decision trees are transparent, providing an explanation of their results. Besides prediction, these models can also be used for insight and profiling. They can identify inputs with a significant effect on the target attribute and they can reveal the type and magnitude of the effect. For instance, supervised models can be applied to find the drivers associated with customer satisfaction or attrition. Similarly, supervised models can also supplement traditional reporting techniques in the profiling of the segments of an organization by identifying the differentiating features of each group.

According to the measurement level of the field to be predicted, supervised models are further categorized into:

  • Classification or propensity modeling techniques.
  • Estimation or regression modeling techniques.

A categorical or symbolic field contains discrete values which denote membership of known groups or predefined classes. A categorical field may be a flag (dichotomous or binary) field with Yes/No or True/False values or a set field with more than two outcomes. Typical examples of categorical fields and outcomes include:

  • Accepted a marketing offer. [Yes/No]
  • Good credit risk/bad credit risk.
  • Churned/stayed active.

These outcomes are associated with the occurrence of specific events. When the target is categorical, the use of a classification model is appropriate. These models analyze discrete outcomes and are used to classify new records into the predefined classes. In other words, they predict events. Confidence scores supplement their predictions, denoting the likelihood of a particular outcome.

On the other hand, there are fields with continuous numeric values (range values), such as:

  • The balance of bank accounts
  • The amount of credit card purchases of each card holder
  • The number of total telecommunication calls made by each customer.

In such cases, when analysts want to estimate continuous outcome values, estimation models are applied. These models are also referred to as regression models after the respective statistical technique. Nowadays, though, other estimation techniques are also available.

Another use of supervised models is in the screening of predictors. These models are used as a preparatory step before the development of a predictive model. They assess the predictive importance of the original input fields and identify the significant predictors. Predictors with little or no predictive power are removed from the subsequent modeling steps.

The different uses of supervised modeling techniques are depicted in Figure 2.1.

Figure 2.1 Graphical representation of supervised modeling.

c02_image001.jpg

PREDICTING EVENTS WITH CLASSIFICATION MODELING

As described above, classification models predict categorical outcomes by using a set of input fields and a historical dataset with pre-classified data. The generated models are then used to predict the occurrence of events and classify unseen records. The general idea of classification modeling is illustrated by the following simplified example.

A mobile telephony network operator wants to conduct an outbound cross-selling campaign to promote an Internet service to its customers. In order to optimize the campaign results, the organization is going to offer the incentive of a reduced service cost for the first months of usage. Instead of addressing the offer to the entire customer base, the company decided to target only prospects with an increased likelihood of acceptance. It therefore used data mining to reveal the matching customer profile and identify the right prospects. The company ran a test campaign on a random sample of its existing customers who were not currently using the Internet service. The campaign’s recorded results define the output field. The input fields include all the customer demographics and usage attributes that already reside in the organization’s data mart.

Input and output fields are joined into a single dataset for the purposes of model building. The final form of the modeling dataset, for eight imaginary customers and an indicative list of inputs (gender, occupation category, volume/traffic of voice and SMS usage), is shown in Table 2.1.

The classification procedure is depicted in Figure 2.2.

The data are then mined with a classification model. Specific customer profiles are associated with acceptance of the offer. In this simple, illustrative example, neither of the two contacted women accepted the offer. On the other hand, two out of the five contacted men (40%) were positive toward the offer. Among white-collar men this percentage reaches 67% (two out of three). Additionally, all white-collar men with heavy SMS usage turned out to be interested in the Internet service. These customers comprise the service’s target group. Although oversimplified, the described process shows the way classification algorithms work: they analyze the predictor fields and map input data patterns to specific outcomes.

Table 2.1 The modeling dataset for the classification model.

c02_image002.jpg

Figure 2.2 Graphical representation of classification modeling.

c02_image003.jpg

After identifying the customer profiles associated with acceptance of the offer, the company extrapolated the results to the whole customer base to construct a campaign list of prospective Internet users. In other words, it scored all customers with the derived model and classified customers as potential buyers or non-buyers.

In this naive example, the identification of potential buyers could also be done by simple visual inspection. But imagine a situation with hundreds of candidate predictors and tens of thousands of records or customers. Such complicated but realistic tasks, which human brains cannot handle, can be easily and effectively carried out by data mining algorithms.

What If There Is Not an Explicit Target Field to Predict?

In some cases there is no apparent categorical target field to predict. For example, in the case of prepaid customers in mobile telephony, there is no recorded disconnection event to be modeled. The separation between active and churned customers is not evident. In such cases a target event could be defined with respect to specific customer behavior. This handling requires careful data exploration and co-operation between the data miners and the marketers. For instance, prepaid customers with no incoming or outgoing phone usage within a certain time period could be considered as churners. In a similar manner, certain behaviors or changes in behavior, for instance a substantial decrease in usage or a long period of inactivity, could be identified as signals of specific events and then used for the definition of the respective target. Moreover, the same approach could also be followed when analysts want to act proactively. For instance, even when a churn/disconnection event could be directly identified through a customer’s action, a proactive approach would analyze and model customers before their typical attrition, trying to identify any early signals of defection and not waiting for official termination of the relationship with the customer.

At the heart of all classification models is the estimation of confidence scores. These are scores that denote the likelihood of the predicted outcome. They are estimates of the probability of occurrence of the respective event. The predictions generated by the classification models are based on these scores: a record is classified into the class with the largest estimated confidence. The scores are expressed on a continuous numeric scale and usually range from 0 to 1. Confidence scores are typically translated to propensity scores which signify the likelihood of a particular outcome: the propensity of a customer to churn, to buy a specific add-on product, or to default on a loan. Propensity scores allow for the rank ordering of customers according to the likelihood of an outcome. This feature enables marketers to tailor the size of their campaigns according to their resources and marketing objectives. They can expand or reduce their target lists on the basis of their particular objectives, always targeting those customers with the relatively higher probabilities.
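As a rough illustration of how propensity scores drive campaign sizing, here is a minimal Python sketch. The field names and the scored population are invented for the example; it simply rank-orders customers by their estimated propensity and keeps the top tiles as a target list.

```python
import numpy as np
import pandas as pd

# Hypothetical scored customer base: one propensity score per customer,
# e.g. produced by a trained classification model.
rng = np.random.default_rng(1)
scored = pd.DataFrame({
    "customer_id": np.arange(1, 10001),
    "churn_propensity": rng.uniform(0, 1, 10000),
})

# Rank-order customers by estimated propensity (highest first)
# and keep the top 10% as the campaign target list.
top_fraction = 0.10
campaign_list = (
    scored.sort_values("churn_propensity", ascending=False)
          .head(int(len(scored) * top_fraction))
)
print(len(campaign_list), "customers selected for the campaign")
```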

The purpose of all classification models is to provide insight and help in the refinement and optimization of marketing applications. The first step after model training is to browse the generated results, which may come in different forms according to the model used: rules, equations, graphs. Knowledge extraction is followed by evaluation of the model’s predictive efficiency and by the deployment of the results in order to classify new records according to the model’s findings. The whole procedure is described in Figure 2.3, which is explained further below.

The following modeling techniques are included in the class of classification models:

  • Decision trees: Decision trees operate by recursively splitting the initial population. For each split they automatically select the most significant predictor, that is, the predictor that yields the best separation with respect to the target field. Through successive partitions, their goal is to produce “pure” sub-segments with homogeneous behavior in terms of the output. They are perhaps the most popular classification technique, partly because they produce transparent, easily interpretable results that offer insight into the event under study (a brief code sketch at the end of this list illustrates training and scoring with such models). The produced results can have two equivalent formats. In a rule format, results are represented in plain English as ordinary rules:
IF (PREDICTOR VALUES) THEN (TARGET OUTCOME AND CONFIDENCE SCORE).

For example:

IF (Gender=Male and Profession=White Collar and SMS_Usage > 60 messages per month) THEN Prediction=Buyer and Confidence=0.95.

In a tree format, rules are graphically represented as a tree in which the initial population (root node) is successively partitioned into terminal nodes or leaves of sub-segments with similar behavior in regard to the target field.

Decision tree algorithms provide speed and scalability. Available algorithms include:

– C5.0

– CHAID

– Classification and Regression Trees

– QUEST.

Figure 2.3 An outline of the classification modeling procedure.

c02_image004.jpg
  • Decision rules: These are quite similar to decision trees and produce a list of rules which have the format of human-understandable statements:
IF (PREDICTOR VALUES) THEN (TARGET OUTCOME AND CONFIDENCE SCORE).

Their main difference from decision trees is that they may produce multiple rules for each record. Decision trees generate exhaustive and mutually exclusive rules which cover all records; for each record only one rule applies. In contrast, decision rules may generate an overlapping set of rules: more than one rule, with different predictions, may hold true for a given record. In that case, the rules are evaluated, through an integrated procedure, to determine the one used for scoring. Usually a voting procedure is applied, which combines the individual rules and averages their confidences for each output category. Finally, the category with the highest average confidence is selected as the prediction. Decision rule algorithms include:

– C5.0

– Decision list.

  • Logistic regression: This is a powerful and well-established statistical technique that estimates the probabilities of the target categories. It is analogous to simple linear regression but for categorical outcomes. It uses the generalized linear model and calculates regression coefficients that represent the effect of predictors on the probabilities of the categories of the target field. Logistic regression results are in the form of continuous functions that estimate the probability of membership in each target outcome:
p = 1 / (1 + e^-(b0 + b1X1 + b2X2 + … + bnXn))

where p = probability of an event to happen.

For example:

c02_image006.jpg

To yield optimal results, logistic regression may require special data preparation, including screening and transformation of the predictors. It demands some statistical experience but, provided it is built properly, it can produce stable and understandable results.

  • Neural networks: Neural networks are powerful machine learning algorithms that use complex, nonlinear mapping functions for estimation and classification. They consist of neurons organized in layers. The input layer contains the predictors or input neurons. The output layer includes the target field. These models estimate weights that connect the predictors (input layer) to the output. Models with more complex topologies may also include intermediate hidden layers and neurons. The training procedure is an iterative process: input records with known outcomes are presented to the network and the model’s predictions are evaluated against the observed results. The observed errors are used to adjust and optimize the initial weight estimates. Neural networks are considered opaque or “black box” solutions since they do not provide an explanation of their predictions. They only provide a sensitivity analysis, which summarizes the predictive importance of the input fields. They require minimal statistical knowledge but, depending on the problem, may require a long processing time for training.
  • Support vector machines (SVM): SVM is a classification algorithm that can model highly nonlinear, complex data patterns while avoiding overfitting, that is, the situation in which a model memorizes patterns relevant only to the specific cases analyzed. SVM works by mapping the data to a high-dimensional feature space in which the records become more easily separable (i.e., separable by linear functions) with respect to the target categories. The input training data are appropriately transformed through nonlinear kernel functions, and this transformation is followed by a search for simpler, linear functions that optimally separate the records. Analysts typically experiment with different transformation functions and compare the results. Overall, SVM is an effective yet demanding algorithm in terms of memory resources and processing time. Additionally, it lacks transparency, since the predictions are not explained and only the importance of the predictors is summarized.
  • Bayesian networks: Bayesian models are probability models that can be used in classification problems to estimate the likelihood of occurrences. They are graphical models that provide a visual representation of the attribute relationships, ensuring transparency, and an explanation of the model’s rationale.
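To make the preceding descriptions more concrete, here is a minimal Python sketch, assuming scikit-learn is available. The toy records are invented and only broadly follow the earlier cross-selling example; the sketch trains a small decision tree and a logistic regression on pre-classified records, prints the tree in its rule format, and scores an unseen record with a prediction and a confidence score.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.linear_model import LogisticRegression

# Invented training records: gender, occupation category, and SMS usage as inputs,
# acceptance of the offer (buyer) as the target.
train = pd.DataFrame({
    "male":          [1, 1, 1, 1, 1, 0, 0],
    "white_collar":  [1, 1, 1, 0, 0, 1, 0],
    "sms_per_month": [80, 95, 20, 15, 30, 25, 10],
    "buyer":         [1, 1, 0, 0, 0, 0, 0],
})
X, y = train[["male", "white_collar", "sms_per_month"]], train["buyer"]

# Decision tree: transparent rules of the form IF (predictors) THEN (outcome).
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))

# Logistic regression: coefficients quantify each predictor's effect on the
# probability of the target category.
logit = LogisticRegression().fit(X, y)
print(dict(zip(X.columns, logit.coef_[0].round(2))))

# Scoring an unseen record: predicted class plus a confidence score.
new = pd.DataFrame({"male": [1], "white_collar": [1], "sms_per_month": [75]})
print("Prediction:", tree.predict(new)[0],
      "Confidence:", tree.predict_proba(new)[0].max())
```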

Evaluation of Classification Models

Before applying the generated model in new records, an evaluation procedure is required to assess its predictive ability. The historical data with known outcomes, which were used for training the model, are scored and two new fields are derived: the predicted outcome category and the respective confidence score, as shown in Table 2.2, which illustrates the procedure for the simplified example presented earlier.

In practice, models are never as accurate as in the simple exercise presented here. There are always errors and misclassified records. A comparison of the predicted to the actual values is the first step in evaluating the model’s performance.

Table 2.2 Historical data and model-generated prediction fields.

c02_image007.jpg

This comparison provides an estimate of the model’s future predictive accuracy on unseen cases. In order to make this procedure more valid, it is advisable to evaluate the model in a dataset that was not used for training the model. This is achieved by partitioning the historical dataset into two distinct parts through random sampling: the training and the testing dataset. A common practice is to allocate approximately 70–75% of the cases to the training dataset. Evaluation procedures are applied to both datasets. Analysts should focus mainly on the examination of performance indicators in the testing dataset. A model underperforming in the testing dataset should be re-examined since this is a typical sign of overfitting and of memorizing the specific training data. Models with this behavior do not provide generalizable results. They provide solutions that only work for the particular data on which they were trained.

Some analysts use the testing dataset to refine the model parameters and leave a third part of the data, namely the validation dataset, for evaluation. However, the best approach, which unfortunately is not always employed, would be to test the model’s performance in a third, disjoint dataset from a different time period.

One of the most common performance indicators for classification models is the error rate. It measures the percentage of misclassifications. The overall error rate indicates the percentage of records that were not correctly classified by the model. Since some mistakes may be more costly than others, this percentage is also estimated for each category of the target field. The error rate is summarized in misclassification or coincidence or confusion matrices that have the form given in Table 2.3.

Table 2.3 Misclassification matrix.

c02_image008.jpg
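As a minimal illustration, assuming scikit-learn and invented actual/predicted labels, the overall error rate and a misclassification matrix of the form of Table 2.3 can be computed as follows.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Invented actual and model-predicted classes for ten scored records.
actual    = ["churn", "stay", "stay", "churn", "stay", "stay", "churn", "stay", "stay", "stay"]
predicted = ["churn", "stay", "churn", "stay", "stay", "stay", "churn", "stay", "stay", "stay"]

# Misclassification (confusion) matrix: rows = actual classes, columns = predicted classes.
print(confusion_matrix(actual, predicted, labels=["churn", "stay"]))

# Overall error rate: percentage of records not correctly classified.
error_rate = 1 - accuracy_score(actual, predicted)
print(f"Overall error rate: {error_rate:.0%}")
```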

The gains, response, and lift/index tables and charts are also helpful evaluation tools that can summarize the predictive efficiency of a model with respect to a specific target category. To illustrate their basic concepts and usage we will present the results of a hypothetical churn model that was built on a dichotomous output field which flagged churners.

The first step in the creation of such charts and tables is to select the target category of interest, also referred to as the hit category. Records/customers are then ordered according to their hit propensities and binned into groups of equal size, named quantiles. In our hypothetical example, the target is the category of churners and the hit propensity is the churn propensity; in other words, the estimated likelihood of belonging to the group of churners. Customers have been split into 10 equal groups of 10% each, named deciles. The 10% of customers with the highest churn propensities comprise tile 1 and those with the lowest churn propensities, tile 10. In general, we expect that high estimated hit propensities also correspond to the actual customers of the target category. Therefore, we hope to find large concentrations of actual churners among the top model tiles.

The cumulative table, Table 2.4, evaluates our churn model in terms of the gain, response, and lift measures.

But what exactly do these performance measures represent and how are they used for model evaluation? A brief explanation is as follows:

  • Response %: “How likely is the target category within the examined quantiles?” Response % denotes the percentage (probability) of the target category within the quantiles. In our example, 10.7% of the customers of the top 10% model tile were actual churners, yielding a response % of the same value. Since the overall churn rate was 2.9%, we expect that a random list would also have an analogous churn rate. However, the estimated churn rate for the top model tile was 3.71 times (or 371.4%) higher. This is called the lift. Analysts have achieved results about four times better than randomness in the examined model tile. As we move from the top to the bottom tiles, the model estimated confidences decrease. The concentration of the actual churners is also expected to decrease. Indeed, the first two tiles, which jointly account for the top 20% of customers with the highest estimated churn scores, have a smaller percentage of actual churners (8.2%). This percentage is still 2.8 times higher than randomness, though.

Table 2.4 The gains, response, and lift table.

c02_image009.jpg
  • Gain %: “How many of the target population fall in the quantiles?” Gain % is defined as the percentage of the total target population that belongs in the quantiles. In our example, the top 10% model tile contains 37.1% of all actual churners, yielding a gain % of the same value. A random list containing 10% of the customers would normally capture about 10% of all observed churners. However, the top model tile contains more than a third (37.1%) of all observed churners. Once again we come to the lift concept. The top 10% model tile identifies about four times more target customers than a random list of the same size.
  • Lift: “How much better are the model results compared to randomness?” The lift or index assesses the improvement in predictive ability due to the model. It is defined as the ratio of the response % to the prior probability. In other words, it compares the model quantiles to a random list of the same size in terms of the probability of the target category. Therefore it represents how much a data mining model exceeds the baseline model of random selection.
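The following Python sketch (pandas assumed, scored population simulated) shows one way the cumulative response %, gain %, and lift of a table such as Table 2.4 can be computed from churn propensities; the figures it produces are illustrative, not those of the book's example.

```python
import numpy as np
import pandas as pd

# Simulated scored population: a churn propensity per customer plus the observed outcome.
rng = np.random.default_rng(0)
df = pd.DataFrame({"propensity": rng.uniform(0, 1, 20000)})
# Outcomes are simulated so that higher propensities are more often actual churners.
df["churner"] = rng.uniform(0, 1, 20000) < 0.06 * df["propensity"]

# Bin customers into 10 equal tiles (deciles); tile 1 holds the highest propensities.
df["tile"] = pd.qcut(df["propensity"].rank(method="first", ascending=False),
                     10, labels=list(range(1, 11)))

overall_rate = df["churner"].mean()
cum = df.groupby("tile", observed=True)["churner"].agg(["count", "sum"]).cumsum()
cum["response_pct"] = cum["sum"] / cum["count"]      # churn rate within the top tiles
cum["gain_pct"] = cum["sum"] / df["churner"].sum()   # share of all churners captured
cum["lift"] = cum["response_pct"] / overall_rate     # improvement over random selection
print(cum)
```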

The gain, response, and lift evaluation measures can also be depicted in corresponding charts such as those shown below. The two added reference lines correspond to the top 5% and the top 10% tiles. The diagonal line in the gains chart represents the baseline model of randomness.

The response chart (Figure 2.4) visually illustrates the estimated churn probability among the model tiles. As we move to the left of the X-axis and toward the top tiles, we have increased churn probabilities. These tiles would result in more targeted lists and smaller error rates. Expanding the list to the right of the X-axis, toward the bottom model tiles, would increase the expected false positive error rate by including in the targeting list more customers with no real intention to churn.

Figure 2.4 Response chart.

c02_image010.jpg

According to the gains chart (Figure 2.5), when scoring an unseen customer list, data miners should expect to capture about 40% of all potential churners if they target the customers of the top 10% model tile. Narrowing the list to the top 5% tile decreases the percentage of potential churners to be reached to approximately 25%. As we move to the right of the X-axis, the expected number of total churners to be identified increases. At the same time, though, as we have seen in the response chart, the respective error rate of false positives increases. On the contrary, the left parts of the X-axis lead to smaller but more targeted campaigns.

The lift or index chart (Figure 2.6) directly compares the model’s predictive performance to the baseline model of random selection. The concentration of churners is estimated to be four times higher than randomness among the top 10% customers and about six times higher among the top 5% customers.

By studying these charts marketers can gain valuable insight into the model’s future predictive accuracy on new records. They can then decide on the size of the respective campaign by choosing the tiles to target. They may choose to conduct a small campaign, limited to the top tiles, in order to address only those customers with very high propensities and minimize the false positive cases. Alternatively, especially if the cost of the campaign is small compared to the potential benefits, they may choose to expand their list by including more tiles and more customers with relatively lower propensities.

Figure 2.5 Gains chart.

c02_image011.jpg

In conclusion, these charts can answer questions such as:

  • What response rates should we expect if we target the top n% of customers according to the model-estimated propensities?
  • How many target customers (potential churners or buyers) are we about to identify by building a campaign list based on the top n% of the leads according to the model?

The answers permit marketers to build scenarios on different campaign sizes. The estimated results may include more information than just the expected response rates. Marketers can incorporate cost and revenue information and build profit and ROI (Return On Investment) charts to assess their upcoming campaigns in terms of expected cost and revenue.
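As a sketch of such a scenario analysis, the snippet below combines an assumed cost per offer and revenue per acceptance with cumulative response figures (partly echoing the example above, partly invented) to estimate expected profit and ROI per campaign size.

```python
import pandas as pd

# Cumulative campaign scenarios per model tile depth (figures are illustrative):
# number of contacted customers and expected responders among them.
scenarios = pd.DataFrame({
    "tile_depth": ["top 10%", "top 20%", "top 30%"],
    "contacted":  [10000, 20000, 30000],
    "responders": [1070, 1640, 2050],
})

cost_per_offer = 2.0           # assumed cost of contacting one customer
revenue_per_acceptance = 40.0  # assumed revenue from one positive response

scenarios["cost"] = scenarios["contacted"] * cost_per_offer
scenarios["revenue"] = scenarios["responders"] * revenue_per_acceptance
scenarios["profit"] = scenarios["revenue"] - scenarios["cost"]
scenarios["roi"] = scenarios["profit"] / scenarios["cost"]
print(scenarios)
```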

Figure 2.6 Lift chart.

c02_image012.jpg

The Maximum Benefit Point

An approach often referred to in the literature as a rule of thumb for selecting the optimal size of a targeted marketing campaign list is to examine the gains chart and select all the top tiles up to the point where the distance between the gains curve and the diagonal reference line reaches its maximum value. This is referred to as the maximum benefit point. The reasoning behind this approach is that, from that point on, the model classifies worse than randomness. This approach usually yields large targeting lists. In practice, analysts and marketers should take into consideration the particular business situation, objectives, and resources, and possibly consider the point of lift maximization as a classification threshold instead. If possible, they should also incorporate cost (per offer) and revenue (per acceptance) information into the gains chart and select the cut-point that best serves their specific business needs and maximizes the expected ROI and profit.
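A small sketch of the two thresholds discussed above, using an invented cumulative gains curve: the maximum benefit point is the list depth where the gap between the gains curve and the diagonal baseline is largest, while the alternative threshold is the depth where lift peaks.

```python
import numpy as np

# Cumulative gain % per decile (invented values resembling a typical gains curve)
# versus the diagonal baseline of random selection.
depth = np.arange(0.1, 1.01, 0.1)   # top 10%, 20%, ..., 100% of customers
gain = np.array([0.37, 0.55, 0.66, 0.74, 0.81, 0.87, 0.92, 0.96, 0.99, 1.00])
baseline = depth                    # a random list captures depth% of the targets

# Maximum benefit point: largest gap between the gains curve and the diagonal.
benefit = gain - baseline
print(f"Maximum benefit point at the top {depth[np.argmax(benefit)]:.0%} of customers")

# Alternative threshold: the point where lift (gain / depth) is maximized.
lift = gain / depth
print(f"Lift is maximized at the top {depth[np.argmax(lift)]:.0%} of customers")
```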

Scoring with Classification Models

Once the classification model is trained and evaluated, the next step is to deploy it and use the generated results to develop and carry out direct marketing campaigns. Each model, apart from offering insight through the revealed data patterns, can also be used as a scoring engine. When unseen data are passed through the derived model, they are scored and classified according to their estimated confidence scores.

As we saw above, the procedure for assigning records to the predefined classes may not be left entirely to the model specifications. Analysts can consult the gains charts and intervene in the predictions by setting a classification threshold that best serves their needs and their business objectives. Thus, they can expand or decrease the size of the derived marketing campaign lists according to the expected response rates and the requirements of the specific campaign.

The actual response rates of the executed campaigns should be monitored and evaluated. The results should be recorded in campaign libraries as they could be used for training relevant models in the future.

Finally, an automated and standardized procedure should be established that will enable the updating of the scores and their loading into the existing campaign management systems.

MARKETING APPLICATIONS SUPPORTED BY CLASSIFICATION MODELING

Marketing applications aim at establishing a long-term and profitable relationship with customers throughout their entire lifetime. Classification models can play a significant role here, specifically in the development of targeted marketing campaigns for acquisition, cross/up/deep selling, and retention. Table 2.5 presents a list of these applications along with their business objectives.

All the above applications can be supported by classification modeling. A classification model can be applied to identify the target population and recognize customers with an increased likelihood of churn or of an additional purchase. In other words, the event of interest (acquisition, churn, cross/up/deep selling) is translated into a categorical target field which can then be used as the output of a classification model. Targeted campaigns can then be conducted with contact lists based on data mining models.

Setting up a data mining procedure for the needs of these applications requires special attention and co-operation between data miners and marketers. The most difficult task is usually to decide on the target event and population. The analysts involved should come up with a valid definition that makes business sense and can lead to truly effective and proactive marketing actions. For instance, before starting to develop a churn model we should have an answer to the “what constitutes churn?” question. Even if we build a perfect model, it may turn out to be a business failure if, due to our target definition, it only identifies customers who are already gone by the time the retention campaign takes place.

Table 2.5 Marketing application and campaigns that can be supported by classification modeling.

Business objective and corresponding marketing applications:

Getting customers
  • Acquisition: finding new customers and expanding the customer base with new and potentially profitable customers

Developing customers
  • Cross selling: promoting and selling additional products or services to existing customers
  • Up selling: offering and switching customers to premium products, other products more profitable than the ones they already have
  • Deep selling: increasing usage of the products or services that customers already have

Retaining customers
  • Retention: prevention of voluntary churn, with priority given to presently or potentially valuable customers

Predictive modeling and its respective marketing applications are beyond the scope of this book, which focuses on customer segmentation. Thus, we will not deal with these important methodological issues here. In the next section, though, we will briefly outline an indicative methodological approach for setting up a voluntary churn model.

SETTING UP A VOLUNTARY CHURN MODEL

In this simplified example, the goal of a mobile telephony network operator is to set up a model for the early identification of potential voluntary churners. This model will be the basis for a targeted retention campaign and will predict voluntary attrition three months ahead. Figure 2.7 presents the setup.

Figure 2.7 Setting up a voluntary churn model.

c02_image013.jpg

The model is trained on a six-month historical dataset. The methodological approach is outlined by the following points:

  • The input fields used cover all aspects of the customer relationship with the organization: customer and contract characteristics, usage and behavioral indicators, and so on, providing an integrated customer view also referred to as the customer signature.
  • The model is trained on customers who were active at the end of the historical period (end of the six-month period). These customers comprise the training population.
  • A three-month period is used for the definition of the target event and the target population.
  • The target population consists of those customers who voluntarily churned (applied for disconnection) by the end of the three-month period.
  • The model is trained by identifying the input data patterns (customer characteristics) associated with voluntary churn.
  • The generated model is validated on a disjoint dataset of a different time period, before being deployed for scoring presently active customers.
  • In the deployment or scoring phase, presently active customers are scored according to the model and churn propensities are generated. The model predicts churn three months ahead.
  • The generated churn propensities can then be used for better targeting of an outbound retention campaign. The churn model results can be combined and cross-examined with the present or potential value of the customers so that the retention activities are prioritized accordingly.
  • All input data fields that were used for the model training are required, obviously with refreshed information, in order to update the churn propensities.
  • Two months have been reserved to allow for scoring and preparing the campaign. These two months are shown as gray boxes in the figure and are usually referred to as the latency period.
  • A latency period also ensures that the model is not trained to identify “immediate” churners. Even if we manage to identify those customers, the chances are that by the time they are contacted they may already be gone, or it may be too late to change their minds. The goal of the model should be long term: the recognition of early churn signals and the identification of customers with an increased likelihood to churn in the near but not immediate future, since for them there is still a chance of retention.
  • To build a long-term churn model, immediate churners, namely customers who churned during the two-month latency period, are excluded from the model training.
  • The definition of the target event and the time periods used in this example are purely indicative. A different time frame for the historical or latency period could be used according to the specific task and business situation.
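The following pandas sketch illustrates the mechanics of the target definition described above; all dates and field names are hypothetical. It keeps customers active at the end of the historical period, excludes churners of the latency period, and flags voluntary churn within the three-month outcome window.

```python
import pandas as pd

# Hypothetical reference dates for the windows described above.
obs_end     = pd.Timestamp("2023-06-30")   # end of the six-month historical period
latency_end = pd.Timestamp("2023-08-31")   # end of the two-month latency period
outcome_end = pd.Timestamp("2023-11-30")   # end of the three-month target window

# Customer base with (possibly missing) voluntary disconnection dates.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "churn_date": pd.to_datetime([None, "2023-07-15", "2023-10-02", None]),
})

# Keep only customers active at the end of the historical period.
base = customers[(customers["churn_date"].isna()) |
                 (customers["churn_date"] > obs_end)].copy()

# Exclude "immediate" churners who left during the latency period.
base = base[(base["churn_date"].isna()) | (base["churn_date"] > latency_end)]

# Target flag: voluntary churn within the three-month outcome window.
base["target_churn"] = base["churn_date"].between(latency_end, outcome_end).astype(int)
print(base)
```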

FINDING USEFUL PREDICTORS WITH SUPERVISED FIELD SCREENING MODELS

Another class of supervised modeling techniques includes the supervised field screening models (Figure 2.8). These models usually serve as a preparatory step for the development of classification and estimation models. Having hundreds or even thousands of candidate predictors is not unusual in complicated data mining tasks. Some of these fields, though, may not have an influence on the output field that we want to predict. The role of supervised field screening models is to assess all the available inputs, identify the key predictors, and flag those with marginal or no importance as candidates for removal from the subsequent predictive model.

Some predictive algorithms, including decision trees, integrate screening mechanisms that internally filter out unrelated predictors. Other algorithms, however, cannot efficiently handle a large number of candidate predictors in a reasonable time. Field screening models can efficiently reduce data dimensionality, retaining only those fields relevant to the outcome of interest and allowing data miners to focus on the information that matters.

Figure 2.8 Supervised field screening models.

c02_image014.jpg

Field screening models are usually used in the data preparation phase of a data mining project in order to perform the following tasks:

  • Evaluate the quality of potential predictors. They incorporate specific criteria to identify inadequate predictors: for instance, predictors with an extensive percentage of missing (null) values, continuous predictors which are constant or have little variation, categorical predictors with too many categories or with almost all records falling in a single category.
  • Rank predictors according to their predictive power. The influence of each predictor on the target field is assessed and an importance measure is calculated. Predictors are then sorted accordingly.
  • Filter out unimportant predictors. Predictors unrelated to the target field are identified. Analysts have the option to filter them out, reducing the set of input fields to those related to the target field.
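A minimal sketch of this screening logic, assuming scikit-learn and an invented dataset: fields with too many missing values or no variation are dropped, and the surviving predictors are ranked by an importance measure (mutual information is used here purely as one possible choice).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Hypothetical modeling dataset with a binary target and candidate predictors.
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "target": rng.integers(0, 2, n),
    "mostly_missing": np.where(rng.uniform(size=n) < 0.8, np.nan, 1.0),
    "constant": 1.0,
    "noise": rng.normal(size=n),
})
df["related"] = df["target"] * 2 + rng.normal(size=n)   # predictor related to the target

predictors = df.drop(columns="target")

# Quality screening: drop fields with excessive missing values or (near-)zero variation.
keep = [c for c in predictors.columns
        if predictors[c].isna().mean() < 0.5 and predictors[c].nunique() > 1]

# Importance ranking of the surviving predictors.
X = predictors[keep].fillna(predictors[keep].mean())
scores = mutual_info_classif(X, df["target"], random_state=0)
print(pd.Series(scores, index=keep).sort_values(ascending=False))
```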

PREDICTING CONTINUOUS OUTCOMES WITH ESTIMATION MODELING

Estimation models, also referred to as regression models, deal with continuous numeric outcomes. Using linear or nonlinear functions of the input fields, they estimate the unknown values of a continuous target field.

Estimation techniques can be used to predict attributes like the following:

  • The expected balance of the savings accounts of bank customers in the near future.
  • The estimated volume of traffic for new customers of a mobile telephony network operator.
  • The expected revenue from a customer for the next year.

A dataset with historical data and known values of the continuous output is required for training the model. A mapping function is then identified that associates the available inputs with the output values. These models are also referred to as regression models, after the well-known and established statistical technique of ordinary least squares regression (OLSR), which estimates the line that best fits the data and minimizes the observed errors, the so-called least squares line. OLSR requires some statistical experience and, since it is sensitive to violations of its assumptions, it may require specific data examination and processing before model building. The final model has the intuitive form of a linear function with coefficients denoting the effect of each predictor on the outcome measure. Although transparent, it has inherent limitations that may affect its predictive performance in complex situations of nonlinear relationships and interactions between predictors.

Nowadays, traditional regression is not the only available estimation technique. New techniques, with less stringent assumptions and which also capture nonlinear relationships, can also be employed to handle continuous outcomes. More specifically, neural networks, SVM, and specific types of decision trees, such as Classification and Regression Trees and CHAID, can also be employed for the prediction of continuous measures.

The data setup and the implementation procedure of an estimation model are analogous to those of a classification model. The historical dataset is used for training the model. The model is evaluated with respect to its predictive effectiveness, in a disjoint dataset, preferably of a different time period, with known outcome values. The generated model is then deployed on unseen data to estimate the unknown target values.

The model creates one new field when scoring: the estimated outcome value. Estimation models are evaluated with respect to their observed errors: the deviations, that is, the differences between the predicted and the actual values. Errors are also called residuals.

A large number of residual diagnostic plots and measures are usually examined to assess the model’s predictive accuracy. Error measures typically examined include:

  • Correlation measures between the actual and the predicted values, such as the Pearson correlation coefficient. This coefficient is a measure of the linear association between the observed and the predicted values. Values close to 1 indicate a strong relationship and a high degree of association between what was predicted and what is really happening.
  • The relative error. This measure denotes the ratio of the variance of the observed values from those predicted by the model to the variance of the observed values from their mean. It compares the model with a baseline model that simply returns the mean value as the prediction for all records. Small values indicate better models. Values greater than 1 indicate models less accurate than the baseline model and therefore not useful.
  • Mean error or mean squared error across all examined records.
  • Mean absolute error (MAE).
  • Mean absolute percent error (MAPE).
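For illustration, the sketch below (scikit-learn assumed, data simulated) fits an ordinary least squares regression and computes the error measures listed above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Hypothetical data: predict next-year revenue from current usage and tenure.
rng = np.random.default_rng(7)
n = 500
X = np.column_stack([rng.uniform(10, 100, n), rng.uniform(1, 10, n)])
y = 5 + 1.2 * X[:, 0] + 8 * X[:, 1] + rng.normal(0, 10, n)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

pearson = np.corrcoef(y, pred)[0, 1]                      # actual vs. predicted correlation
relative_error = np.var(y - pred) / np.var(y - y.mean())  # < 1 means better than the mean
mae = mean_absolute_error(y, pred)
mape = mean_absolute_percentage_error(y, pred)
print(f"Pearson r={pearson:.2f}, relative error={relative_error:.2f}, "
      f"MAE={mae:.1f}, MAPE={mape:.1%}")
```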

Examining the Model Errors to Reveal Anomalous or Even Suspect Cases

The examination of deviations of the predicted from the actual values can also be used to identify outlier or abnormal cases. These cases may simply indicate poor model performance or an unusual but acceptable behavior. Nevertheless, they deserve special inspection since they may also be signs of suspect behavior.

For instance, an insurance company can build an estimation model based on the amounts of claims by using the claim application data as predictors. The resulting model can then be used as a tool to detect fraud. Entries that substantially deviate from the expected values could be identified and further examined or even sent to auditors for manual inspection.
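A toy sketch of this idea, with invented claim amounts: residuals are computed and claims deviating strongly from the typical residual (here, more than five times the median absolute deviation, an arbitrary rule chosen for the example) are flagged for inspection.

```python
import pandas as pd

# Hypothetical claim amounts: actual values and model-estimated values.
claims = pd.DataFrame({
    "claim_id": range(1, 9),
    "actual":    [1200, 950, 3000, 800, 15000, 1100, 2500, 900],
    "estimated": [1150, 1000, 2800, 850, 2200, 1050, 2600, 950],
})

# Residuals and a simple robust screen: claims whose residuals deviate strongly
# from the typical residual are flagged for closer (possibly manual) inspection.
claims["residual"] = claims["actual"] - claims["estimated"]
med = claims["residual"].median()
mad = (claims["residual"] - med).abs().median()
claims["suspect"] = (claims["residual"] - med).abs() > 5 * mad

print(claims[claims["suspect"]])
```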

UNSUPERVISED MODELING TECHNIQUES

In the previous sections we briefly presented the supervised modeling techniques. Whether used for classification, estimation, or field screening, their common characteristic is that they all involve a target attribute which must be associated with an examined set of inputs. The model training and data pattern recognition are guided or supervised by a target field. This is not the case in unsupervised modeling, in which only input fields are involved. All inputs are treated equally in order to extract information that can be used, mainly, for the identification of groupings and associations.

Clustering techniques identify meaningful natural groupings of records and group customers into distinct segments with internal cohesion. Data reduction techniques like factor analysis or principal components analysis (PCA) “group” fields into new compound measures and reduce the data’s dimensionality without losing much of the original information. But grouping is not the only application of unsupervised modeling. Association or affinity modeling is used to discover co-occurring events, such as purchases of related products. It has been developed as a tool for analyzing shopping cart patterns and that is why it is also referred to as market basket analysis. By adding the time factor to association modeling we have sequence modeling: in sequence modeling we analyze associations over time and try to discover the series of events, the order in which events happen. And that is not all. Sometimes we are just interested in identifying records that “do not fit well,” that is, records with unusual and unexpected data patterns. In such cases, record screening techniques can be employed as a data auditing step before building a subsequent model to detect abnormal (anomalous) records.

Figure 2.9 Graphical representation of unsupervised modeling.

c02_image015.jpg

Below, we will briefly present all these techniques before focusing on the clustering and data reduction techniques used mainly for segmentation purposes.

The different uses of unsupervised modeling techniques are depicted in Figure 2.9.

SEGMENTING CUSTOMERS WITH CLUSTERING TECHNIQUES

Consider the situation of a social gathering where guests start to arrive and mingle with each other. After a while, guests start to mix in company and groups of socializing people start to appear. These groups are formed according to the similarities of their members. People walk around and join groups according to specific criteria such as physical appearance, dress code, topic and tone of discussion, or past acquaintance. Although the host of the event may have had some initial presumptions about who would match with whom, chances are that at the end of the night some quite unexpected groupings would come up.

Grouping according to proximity or similarity is the key concept of clustering. Clustering techniques reveal natural groupings of “similar” records. In the small stores of old, when shop owners knew their customers by name, they could handle all clients on an individual basis according to their preferences and purchase habits. Nowadays, with thousands or even millions of customers, this is not feasible. What is feasible, though, is to uncover the different customer types and identify their distinct profiles. This constitutes a large step on the road from mass marketing to a more individualized handling of customers. Customers are different in terms of behavior, usage, needs, and attitudes and their treatment should be tailored to their differentiating characteristics. Clustering techniques attempt to do exactly that: identify distinct customer typologies and segment the customer base into groups of similar profiles so that they can be marketed more effectively.

These techniques automatically detect the underlying customer groups based on an input set of fields/attributes. Clusters are not known in advance. They are revealed by analyzing the observed input data patterns. Clustering techniques assess the similarity of the records or customers with respect to the clustering fields and assign them to the revealed clusters accordingly. The goal is to detect groups with internal homogeneity and interclass heterogeneity.

Clustering techniques are quite popular and their use is widespread in data mining and market research. They can support the development of different segmentation schemes according to the clustering attributes used: namely, behavioral, attitudinal, or demographic segmentation.

The major advantage of the clustering techniques is that they can efficiently manage a large number of attributes and create data-driven segments. The created segments are not based on a priori personal concepts, intuitions, and perceptions of the business people. They are induced by the observed data patterns and, provided they are built properly, they can lead to results with real business meaning and value. Clustering models can analyze complex input data patterns and suggest solutions that would not otherwise be apparent. They reveal customer typologies, enabling tailored marketing strategies. In later chapters we will have the chance to present real-world applications from major industries such as telecommunications and banking, which will highlight the true benefits of data mining-derived clustering solutions.

Unlike classification modeling, in clustering there is no predefined set of classes. There are no predefined categories such as churners/non-churners or buyers/non-buyers and there is also no historical dataset with pre-classified records. It is up to the algorithm to uncover and define the classes and assign each record to its “nearest” or, in other words, its most similar cluster. To present the basic concepts of clustering, let us consider the hypothetical case of a mobile telephony network operator that wants to segment its customers according to their voice and SMS usage. The available demographic data are not used as clustering inputs in this case since the objective concerns the grouping of customers according only to behavioral criteria.

The input dataset, for a few imaginary customers, is presented in Table 2.6.

In the scatterplot in Figure 2.10, these customers are positioned in a two-dimensional space according to their voice usage, along the X-axis, and their SMS usage, along the Y-axis.

The clustering procedure is depicted in Figure 2.11, where voice and SMS usage intensity are represented by the corresponding symbols.

Examination of the scatterplot reveals specific similarities among the customers. Customers 1 and 6 appear close together and present heavy voice usage and low SMS usage. They can be placed in a single group which we label as “Heavy voice users.” Similarly, customers 2 and 3 also appear close together but far apart from the rest. They form a group of their own, characterized by average voice and SMS usage. Therefore one more cluster has been disclosed, which can be labeled as “Typical users.” Finally, customers 4 and 5 also seem to be different from the rest by having increased SMS usage and low voice usage. They can be grouped together to form a cluster of “SMS users.”

Table 2.6 The modeling dataset for a clustering model.

Customer ID | Monthly average number of SMS calls | Monthly average number of voice calls
1 | 27 | 144
2 | 32 | 44
3 | 41 | 30
4 | 125 | 21
5 | 105 | 23
6 | 20 | 121

(Both usage fields serve as clustering inputs.)

Figure 2.10 Scatterplot of voice and SMS usage.

c02_image016.jpg

Although quite naive, the above example outlines the basic concepts of clustering. Clustering solutions are based on analyzing similarities among records. They typically use distance measures that assess the records’ similarities and assign records with similar input data patterns, hence similar behavioral profiles, to the same cluster.

Figure 2.11 Graphical representation of clustering.

c02_image017.jpg

Nowadays, various clustering algorithms are available, which differ in their approach for assessing the similarity of records and in the criteria they use to determine the final number of clusters. The whole clustering “revolution” started with a simple and intuitive distance measure, still used by some clustering algorithms today, called the Euclidean distance. The Euclidean distance of two records or objects is a dissimilarity measure calculated as the square root of the sum of the squared differences between the values of the examined attributes/fields. In our example the Euclidean distance between customers 1 and 6 would be:

√((27 − 20)² + (144 − 121)²) = √(49 + 529) = √578 ≈ 24.0

This value denotes the disparity of customers 1 and 6 and is represented in the respective scatterplot by the length of the straight line that connects points 1 and 6. The Euclidean distances for all pairs of customers are summarized in Table 2.7.

A traditional clustering algorithm, named agglomerative or hierarchical clustering, works by evaluating the Euclidean distances between all pairs of records, literally the length of their connecting lines, and begins to group them accordingly in successive steps. Although many things have changed in clustering algorithms since the inception of this algorithm, it is nice to have a graphical representation of what clustering is all about. Nowadays, in an effort to handle large volumes of data, algorithms use more efficient distance measures and approaches which do not require the calculation of the distances between all pairs of records. Even a specific type of neural network is applied for clustering; however, the main concept is always the same – the grouping of homogeneous records. Typical clustering tasks involve the mining of thousands of records and tens or hundreds of attributes. Things are much more complicated than in our simplified exercise. Tasks like this are impossible to handle without the help of specialized algorithms that aim to automatically uncover the underlying groups.

Table 2.7 The proximity matrix of Euclidean distances between all pairs of customers.

c02_image019.jpg
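The proximity matrix of Table 2.7 can be reproduced from the data of Table 2.6 with a few lines of Python (scipy assumed), as sketched below.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# SMS and voice usage of the six customers of Table 2.6.
usage = pd.DataFrame(
    {"sms": [27, 32, 41, 125, 105, 20], "voice": [144, 44, 30, 21, 23, 121]},
    index=[1, 2, 3, 4, 5, 6],
)

# Pairwise Euclidean distances between all customers (the proximity matrix).
distances = pd.DataFrame(squareform(pdist(usage.to_numpy(), metric="euclidean")),
                         index=usage.index, columns=usage.index)
print(distances.round(1))

# For example, the distance between customers 1 and 6:
print(round(np.sqrt((27 - 20) ** 2 + (144 - 121) ** 2), 1))   # approximately 24.0
```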

One thing that should be made crystal clear about clustering is that it groups records according to the observed input data patterns. Thus, the data miners and marketers involved should decide in advance, according to the specific business objective, the segmentation level and the segmentation criteria – in other words, the clustering fields. For example, if we want to segment bank customers according to their product balances, we must prepare a modeling dataset with balance information at a customer level. Even if our original input data are in a transactional format or stored at a product account level, the selected segmentation level requires a modeling dataset with a unique record per customer and with fields that would summarize their product balances.

In general, clustering algorithms provide an exhaustive and mutually exclusive solution. They automatically assign each record to one of the uncovered groups. They produce disjoint clusters and generate a cluster membership field that denotes the group of each record, as shown in Table 2.8.

In our illustrative exercise we have discovered the differentiating characteristics of each cluster and labeled them accordingly. In practice, this process is not so easy and may involve many different attributes, even those not directly participating in the cluster formation. Each clustering solution should be thoroughly examined and the profiles of the clusters outlined. This is usually accomplished by simple reporting techniques, but it can also include the application of supervised modeling techniques such as classification techniques, aiming to reveal the distinct characteristics associated with each cluster.

Table 2.8 The cluster membership field.

c02_image020.jpg

This profiling phase is an essential step in the clustering procedure. It can provide insight on the derived segmentation scheme and it can also help in the evaluation of the scheme’s usefulness. The derived clusters should be evaluated with respect to the business objective they were built to serve. The results should make sense from a business point of view and should generate business opportunities. The marketers and data miners involved should try to evaluate different solutions before selecting the one that best addresses the original business goal.

Available clustering models include the following:

  • Agglomerative or hierarchical: Although quite outdated nowadays, we present this algorithm since in a way it is the “mother” of all clustering models. It is called hierarchical or agglomerative because it starts with a solution in which each record comprises a cluster and gradually groups records up to the point where all of them fall into one supercluster. In each step it calculates the distances between all pairs of records and groups the most similar ones. A table (agglomeration schedule) or a graph (dendrogram) summarizes the grouping steps and the respective distances. The analyst should consult this information, identify the point where the algorithm starts to group disjoint cases, and then decide on the number of clusters to retain. This algorithm cannot effectively handle more than a few thousand cases, so it cannot be directly applied in most business clustering tasks. A usual workaround is to use it on a sample of the clustering population. However, with numerous other efficient algorithms that can easily handle millions of records, clustering through sampling is not considered an ideal approach.
  • K-means: This is an efficient and perhaps the fastest clustering algorithm, able to handle both long (many records) and wide (many data dimensions and input fields) datasets. It is a distance-based clustering technique and, unlike the hierarchical algorithm, it does not need to calculate the distances between all pairs of records. The number of clusters to be formed is predetermined and specified by the user in advance. Usually a number of different solutions should be tried and evaluated before the most appropriate one is approved. It is best suited to continuous clustering fields (a brief sketch at the end of this list applies K-means to the toy data of Table 2.6).
  • TwoStep cluster: As its name implies, this scalable and efficient clustering model, included in IBM SPSS Modeler (formerly Clementine), processes records in two steps. The first step of pre-clustering makes a single pass through the data and assigns records to a limited set of initial subclusters. In the second step, initial subclusters are further grouped, through hierarchical clustering, into the final segments. It suggests a clustering solution by automatic clustering: the optimal number of clusters can be automatically determined by the algorithm according to specific criteria.
  • Kohonen network/Self-Organizing Map (SOM): Kohonen networks are based on neural networks and typically produce a two-dimensional grid or map of the clusters, hence the name self-organizing maps. Kohonen networks usually take a longer time to train than the K-means and TwoStep algorithms, but they provide a different view on clustering that is worth trying.
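As a brief illustration of K-means in practice, the sketch below (scikit-learn assumed) standardizes the voice and SMS fields of Table 2.6 and requests the three clusters suggested by the scatterplot; the resulting cluster membership field is analogous to Table 2.8.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# The six customers of Table 2.6 (SMS and voice usage).
usage = pd.DataFrame({"sms": [27, 32, 41, 125, 105, 20],
                      "voice": [144, 44, 30, 21, 23, 121]},
                     index=[1, 2, 3, 4, 5, 6])

# Standardize the inputs so that both fields contribute comparably to the distances,
# then ask K-means for the three clusters suggested by the scatterplot.
X = StandardScaler().fit_transform(usage)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Cluster membership field, analogous to Table 2.8.
usage["cluster"] = kmeans.labels_
print(usage)
```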

Apart from segmentation, clustering techniques can also be used for other purposes, for example, as a preparatory step for optimizing the results of predictive models. Homogeneous customer groups can be revealed by clustering and then separate, more targeted predictive models can be built within each cluster. Alternatively, the derived cluster membership field can also be included in the list of predictors in a supervised model. Since the cluster field combines information from many other fields, it often has significant predictive power. Another application of clustering is in the identification of unusual records. Small or outlier clusters could contain records with increased significance that are worth closer inspection. Similarly, records far apart from the majority of the cluster members might also indicate anomalous cases that require special attention.

The clustering techniques are further explained and presented in detail in the next chapter.

REDUCING THE DIMENSIONALITY OF DATA WITH DATA REDUCTION TECHNIQUES

As their name implies, data reduction techniques aim at effectively reducing the data’s dimensions and removing redundant information. They do so by replacing the initial set of fields with a core set of compound measures which simplify subsequent modeling while retaining most of the information of the original attributes.

Factor analysis and PCA are among the most popular data reduction techniques. They are unsupervised, statistical techniques which deal with continuous input attributes. These attributes are analyzed and mapped to representative fields, named factors or components. The procedure is illustrated in Figure 2.12.

Factor analysis and PCA are based on the concept of linear correlation. If certain continuous fields/attributes tend to covary then they are correlated. If their relationship is expressed adequately by a straight line then they have a strong linear correlation. The scatterplot in Figure 2.13 depicts the monthly average SMS and MMS (Multimedia Messaging Service) usage for a group of mobile telephony customers.

As seen in the scatterplot, most customer points cluster around a straight line with a positive slope that slants upward to the right. Customers with increased SMS usage also tend to be MMS users as well. These two services are related in a linear manner and present a strong, positive linear correlation, since high values of one field tend to correspond to high values of the other. However, in negative linear correlations, the direction of the relationship is reversed. These relationships are described by straight lines with a negative slope that slant downward. In such cases high values of one field tend to correspond to low values of the other. The strength of linear correlation is quantified by a measure named the Pearson correlation coefficient. It ranges from –1 to +1. The sign of the coefficient reveals the direction of the relationship. Values close to +1 denote strong positive correlation and values close to –1 negative correlation. Values around 0 denote no discernible linear correlation, yet this does not exclude the possibility of nonlinear correlation.
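
As a small numeric illustration of the coefficient itself (the SMS and MMS values below are made up for the example, not taken from Figure 2.13), the Pearson correlation can be computed directly:

import numpy as np

# Made-up monthly SMS and MMS counts for six customers (illustrative only).
sms = np.array([5, 12, 30, 45, 60, 80], dtype=float)
mms = np.array([0, 2, 5, 8, 11, 15], dtype=float)

r = np.corrcoef(sms, mms)[0, 1]   # Pearson coefficient, ranging from -1 to +1
print(round(r, 3))                # a value close to +1: strong positive linear correlation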

Figure 2.12 Data reduction techniques.

c02_image021.jpg

Figure 2.13 Linear correlation between two continuous measures.

c02_image022.jpg

Factor analysis and PCA examine the correlations between the original input fields and identify latent data dimensions. In a way they “group” the inputs into composite measures, named factors or components, that can effectively represent the original attributes, without sacrificing much of their information. The derived components and factors have the form of continuous numeric scores and can be subsequently used as any other fields for reporting or modeling purposes.

Data reduction is also widely used in marketing research. The views, perceptions, and preferences of the respondents are often recorded through a large number of questions that investigate all the topics of interest in detail. These questions often have the form of a Likert scale, where respondents are asked to state, on a scale of 1–5, the degree of importance, preference, or agreement on specific issues. The answers can be used to identify the latent concepts that underlie the respondents’ views.

To further explain the basic concepts behind data reduction techniques, let us consider the simple case of a few customers of a mobile telephony operator. SMS, MMS, and voice call traffic, specifically the number of calls by service type and the minutes of voice calls, were analyzed by principal components. The modeling dataset and the respective results are given in Table 2.9.

The PCA model analyzed the associations among the original fields and identified two components. More specifically, the SMS and MMS usage appear to be correlated and a new component was extracted to represent the usage of those services. Similarly, the number and minutes of voice calls were also correlated. The second component represents these two fields and measures the voice usage intensity. Each derived component is standardized, with an overall population mean of 0 and a standard deviation of 1. The component scores denote how many standard deviations above or below the overall mean each record stands. In simple terms, a positive score in component 1 indicates high SMS and MMS usage while a negative score indicates below-average usage. Similarly, high scores on component 2 denote high voice usage, in terms of both frequency and duration of calls. The generated scores can then be used in subsequent modeling tasks.

Table 2.9 The modeling dataset for principal components analysis and the derived component scores.

c02_image023.jpg
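
A rough sketch of the same procedure with open-source tooling (scikit-learn, not the software used for Table 2.9) is given below; the toy usage values and field names are assumptions, and the scores are rescaled to unit variance so that they read as standard deviations from the overall mean, as described above.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy usage data per customer: [SMS count, MMS count, voice calls, voice minutes].
X = np.array([[50, 10, 20,  60],
              [45,  8, 80, 300],
              [ 5,  1, 90, 350],
              [60, 12, 15,  40],
              [10,  2, 70, 280]], dtype=float)

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)
scores /= scores.std(axis=0)        # standardize the component scores

print(pca.components_.round(2))     # loadings: which original fields drive each component
print(scores.round(2))              # one row of standardized component scores per customer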

The interpretation of the derived components is an essential part of the data reduction procedure. Since the derived components will be used in subsequent tasks, it is important to fully understand the information they convey. Although there are many formal criteria for selecting the number of factors to be retained, analysts should also examine their business meaning and only keep those that comprise interpretable and meaningful measures.

Simplicity is the key benefit of data reduction techniques, since they drastically reduce the number of fields under study to a core set of composite measures. Some data mining techniques may run too slowly, or not at all, if they have to handle a large number of inputs. Situations like these can be avoided by using the derived component scores instead of the original fields. An additional advantage of data reduction techniques is that they can produce uncorrelated components. This is one of the main reasons for applying a data reduction technique as a preparatory step before other models. Many predictive modeling techniques can suffer from the inclusion of correlated predictors, a problem referred to as multicollinearity. By substituting the correlated predictors with the extracted components we can eliminate collinearity and substantially improve the stability of the predictive model. Additionally, clustering solutions can also be biased if the inputs are dominated by correlated “variants” of the same attribute. By using a data reduction technique we can unveil the true data dimensions and ensure that they are of equal weight in the formation of the final clusters.
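
As a brief sketch of this preparatory use (a hypothetical scikit-learn pipeline, not a prescription), correlated inputs can be replaced by their principal components before a classification model is trained:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

model = make_pipeline(
    StandardScaler(),                  # put all inputs on a comparable scale
    PCA(n_components=0.9),             # keep components explaining about 90% of the variance
    LogisticRegression(max_iter=1000)  # the classifier now works on uncorrelated scores
)
# model.fit(X_train, y_train) would then train on the reduced, collinearity-free inputs,
# where X_train and y_train stand for a hypothetical training dataset.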

In the next chapter, we will revisit data reduction techniques and present PCA in detail.

FINDING “WHAT GOES WITH WHAT” WITH ASSOCIATION OR AFFINITY MODELING TECHNIQUES

When browsing a bookstore on the Internet you may have noticed recommendations that pop up and suggest additional, related products for you to consider: “Customers who have bought this book have also bought the following books.” Most of the time these recommendations are quite helpful, since they take into account the recorded preferences of past customers. Usually they are based on association or affinity data mining models.

These models analyze past co-occurrences of events, purchases, or attributes and detect associations. They associate a particular outcome category, for instance a product, with a set of conditions, for instance a set of other products. They are typically used to identify purchase patterns and groups of products purchased together.

In the e-bookstore example, by browsing through past purchases, association models can discover other popular books among the buyers of the particular book viewed. They can then generate individualized recommendations that match the indicated preference.

Association modeling techniques generate rules of the following general format:

IF (ANTECEDENTS) THEN CONSEQUENT

For example:

IF (product A and product C and product E and ...) → product B

More specifically, a rule referring to supermarket purchases might be:

IF EGGS & MILK & FRESH FRUIT → VEGETABLES

This simple rule, derived by analyzing past shopping carts, identifies associated products that tend to be purchased together: when eggs, milk, and fresh fruit are bought, then there is an increased probability of also buying vegetables. This probability, referred to as the rule’s confidence, denotes the rule’s strength and will be further explained in what follows.

The left, or IF, part of the rule consists of the antecedents or conditions: a set of circumstances which, when true, means that the rule applies and the consequent shows an increased occurrence rate. In other words, the antecedent part contains the product combinations that usually lead to some other product. The right part of the rule is the consequent or conclusion: what tends to be true when the antecedents hold true. The rule’s complexity depends on the number of antecedents linked to the consequent.

These models aim at:

  • Providing insight on product affinities: Understand which products are commonly purchased together. This, for instance, can provide valuable information for advertising, for effectively reorganizing shelves or catalogues, and for developing special offers for bundles of products or services.
  • Providing product suggestions: Association rules can act as a recommendation engine. They can analyze shopping carts and help in direct marketing activities by producing personalized product suggestions, according to the customer’s recorded behavior.

This type of analysis is also referred to as market basket analysis since it originated from point-of-sale data and the need to understand consumer shopping patterns. Its application was extended to also cover any other “basket-like” problem from various other industries. For example:

  • In banking, it can be used for finding common product combinations owned by customers.
  • In telecommunications, for revealing services that usually go together.
  • In web analysis, for finding web pages accessed in single visits.

Association models are considered unsupervised techniques since they do not involve a single output field to be predicted. They analyze product affinity tables: that is, multiple fields that denote product/service possession. These fields are treated simultaneously as inputs and outputs; each product is predicted and, at the same time, acts as a predictor for the rest of the products.

According to the business scope and the selected level of analysis, association models can be applied to:

  • Transaction or order data – data summarizing purchases at a transaction level, for instance what is bought in a single store visit.
  • Aggregated information at a customer level – what is bought during a specific time period by each customer or what is the current product mix of each (bank) customer.

Product Groupings

In general, these techniques are rarely applied directly to product codes. They are usually applied to product groups. A taxonomy level, also referred to as a hierarchy or grouping level, is selected according to the defined business objective and the data are grouped accordingly. The selected product grouping will also determine the type of generated rules and recommendations.

A typical modeling dataset for an association model has the tabular format shown in Table 2.10. These tables, also known as basket or truth tables, contain categorical, flag (binary) fields which denote the presence or absence of specific items or events of interest, for instance purchased products. The fields denoting product purchases, or in general event occurrences, are the content fields. The analysis ID field, here the transaction ID, is used to define the unit or level of the analysis. In other words, whether the revealed purchase patterns refer to transactions or customers. In tabular data format, the dataset should contain aggregated content/purchase information at the selected analysis level.

Table 2.10 The modeling dataset for association modeling – a basket table.

c02_image024.jpg

In the above example, the goal is to analyze purchase transactions and identify rules which describe the shopping cart patterns. We also assume that products are grouped into four supercategories.
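
A minimal sketch of how such a basket table can be derived from raw transaction lines is given below (Python with pandas is assumed; the records shown mirror the first three transactions of Table 2.11):

import pandas as pd

transactions = pd.DataFrame({
    "transaction_id": [101, 101, 102, 103, 103, 103],
    "product": ["Product 1", "Product 3", "Product 2",
                "Product 1", "Product 3", "Product 4"],
})

# Pivot the transactional records into a basket/truth table of 0-1 purchase flags.
basket = (pd.crosstab(transactions["transaction_id"], transactions["product"])
            .gt(0).astype(int))
print(basket)     # one row per transaction, one 0/1 column per product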

Analyzing Raw Transactional Data with Association Models

Besides basket tables, specific algorithms, like the Apriori association model, can also directly analyze transactional input data. This format requires the presence of two fields: a content field denoting the associated items and an analysis ID field that defines the level of analysis. Multiple records are linked by having the same ID value. The transactional modeling dataset for the simple example presented above is listed in Table 2.11.

Table 2.11 A transactional modeling dataset for association modeling.

Analysis ID field      Content field (input–output field)
Transaction ID         Products
101 Product 1
101 Product 3
102 Product 2
103 Product 1
103 Product 3
103 Product 4
104 Product 1
104 Product 4
105 Product 1
105 Product 4
106 Product 2
106 Product 3
107 Product 1
107 Product 3
107 Product 4
108 Product 3
108 Product 4
109 Product 1
109 Product 3
109 Product 4

By setting the transaction ID field as the analysis ID we require the algorithm to analyze the purchase patterns at a transaction level. If the customer ID had been selected as the analysis ID, the purchase transactions would have been internally aggregated and analyzed at a customer level.

Two of the derived rules are listed in Table 2.12.

Usually all the extracted rules are described and evaluated with respect to three main measures:

  • The support: This assesses the rule’s coverage, or to how many records the rule applies. It denotes the percentage of records that match the antecedents.
  • The confidence: This assesses the strength and the predictive ability of the rule. It indicates “how likely the consequent is, given the antecedents.” It denotes the percentage, or probability, of the consequent within the records that match the antecedents.
  • The lift: This assesses the improvement in predictive ability when using the derived rule compared to randomness. It is defined as the ratio of the rule confidence to the prior confidence of the consequent. The prior confidence is the overall percentage of the consequent within all the analyzed records.

Table 2.12 Rules of an association model.

c02_image025.jpg

In the presented example, Rule 2 associates product 1 with product 4 with a confidence of 71.4%. In plain English, it states that 71.4% of the baskets containing product 1, the antecedent, also contain product 4, the consequent. Additionally, the baskets containing product 1 comprise 77.8% of all the baskets analyzed; this measure is the support of the rule. Since six out of the nine baskets contain product 4, the prior confidence of a basket containing product 4 is 6/9 or 67%, slightly lower than the rule confidence. Rule 2 therefore outperforms randomness, achieving a confidence about 7% higher in relative terms, which corresponds to a lift of 1.07. Thus, by using the rule, the chances of correctly identifying a product 4 purchase are improved by about 7%.

Rule 4 is more complicated since it contains two antecedents. It has a lower coverage (44.4%) but yields a higher confidence (75%) and lift (1.13). In plain English this rule states that baskets with products 1 and 3 present a strong chance (75%) of also containing product 4. Thus, there is a business opportunity to promote product 4 to all customers who check out with products 1 and 3 and have not bought product 4.
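
These figures can be reproduced by hand. The short sketch below (plain Python, with the baskets copied from Table 2.11) recomputes support, confidence, and lift for Rule 4, products 1 and 3 → product 4:

# The nine baskets of Table 2.11, written as sets of products.
baskets = [
    {"P1", "P3"}, {"P2"}, {"P1", "P3", "P4"}, {"P1", "P4"}, {"P1", "P4"},
    {"P2", "P3"}, {"P1", "P3", "P4"}, {"P3", "P4"}, {"P1", "P3", "P4"},
]
antecedents, consequent = {"P1", "P3"}, "P4"

matches = [b for b in baskets if antecedents <= b]                   # baskets with all antecedents
support = len(matches) / len(baskets)                                # 4/9, about 44.4%
confidence = sum(consequent in b for b in matches) / len(matches)    # 3/4, 75%
prior = sum(consequent in b for b in baskets) / len(baskets)         # 6/9, about 66.7%
lift = confidence / prior                                            # about 1.13

print(round(support, 3), round(confidence, 3), round(lift, 2))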

The rule development procedure can be controlled according to model parameters that analysts can specify. Specifically, analysts can define in advance the required threshold values for rule complexity, support, confidence, and lift in order to guide the rule growth process according to their specific requirements.

Unlike decision trees, association models generate rules that overlap; therefore, multiple rules may apply to each customer. The rules applicable to each customer are sorted according to a selected performance measure, for instance lift or confidence, and a specified number of rules, for instance the top three, are retained. The retained rules indicate the top n product suggestions, currently not in the basket, that best match each customer’s profile. In this way, association models can support cross-selling activities, as they can provide tailored product recommendations for each customer. As in every data mining task, the derived rules should also be evaluated with respect to their business meaning and “actionability” before deployment.
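
A tiny sketch of this ranking step (the rule list below is hypothetical) sorts the rules that apply to one customer by confidence and keeps the top three suggestions:

# Hypothetical applicable rules for one customer: (suggested product, confidence).
applicable_rules = [("P4", 0.75), ("P2", 0.40), ("P5", 0.62), ("P6", 0.55)]

top_suggestions = [product for product, conf in
                   sorted(applicable_rules, key=lambda rule: rule[1], reverse=True)[:3]]
print(top_suggestions)    # the customer's next best product suggestions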

Association vs. Classification Models for Product Suggestions

As described above, association models can be used for cross-selling and for identifying the next best offer for each customer. Although useful, association modeling is not the ideal approach for next-best-product campaigns, mainly because it does not take into account customer evolution and possible changes in the product mix over time.

A recommended approach would be to analyze the profile of customers before the uptake of a product to identify the characteristics that have caused the event and are not the result of the event. This approach is feasible by using either test campaign data or historical data. For instance, an organization might conduct a pilot campaign among a sample of customers not owning a specific product that it wants to promote, and mine the collected results to identify the profile of customers most likely to respond to the product offer. Alternatively, it can use historical data, and analyze the profile of those customers who recently acquired the specific product. Both these approaches require the application of a classification model to effectively estimate acquisition propensities.

Therefore, a set of separate classification models for each product and a procedure that would combine the estimated propensities into a next best offer strategy are a more efficient approach than a set of association rules.

Most association models analyze categorical, and specifically binary (flag or dichotomous), fields, which typically denote product possession or purchase. We can also include supplementary fields, such as demographics, in order to enhance the antecedent part of the rules and enrich the results. These fields must also be categorical, although specific algorithms, like GRI (Generalized Rule Induction), can also handle continuous supplementary fields. The Apriori algorithm is perhaps the most widely used association modeling technique.
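
For readers who want to experiment, the sketch below shows how Apriori-style rules can be mined with the open-source mlxtend package (an assumption of this example; it is not the implementation referred to above). It reuses the nine baskets of Table 2.11 as a 0–1 table; note that mlxtend reports the support of the whole itemset (antecedents plus consequent), a slightly different convention from the antecedent coverage described earlier.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# The baskets of Table 2.11 as a boolean basket table (rows: transactions 101-109).
basket = pd.DataFrame(
    [[1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 1, 1], [1, 0, 0, 1], [1, 0, 0, 1],
     [0, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 1], [1, 0, 1, 1]],
    columns=["P1", "P2", "P3", "P4"],
).astype(bool)

frequent = apriori(basket, min_support=0.3, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])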

DISCOVERING EVENT SEQUENCES WITH SEQUENCE MODELING TECHNIQUES

Sequence modeling techniques are used to identify associations of events/purchases/attributes over time. They take into account the order of events and detect sequential associations that lead to specific outcomes. They generate rules analogous to those of association models, but with one difference: the antecedents form an ordered sequence of events that precedes the consequent. In other words, when certain things happen in a specific order, a specific event has an increased probability of occurring next.

Sequence modeling techniques analyze paths of events in order to detect common sequences. Their origin lies in web mining and click stream analysis of web pages. They began as a way to analyze weblog data in order to understand the navigation patterns in web sites and identify the browsing trails that end up in specific pages, for instance purchase checkout pages. The use of these techniques has been extended and nowadays can be applied to all “sequence” business problems.

The techniques can also be used as a means for predicting the next expected “move” of the customers or the next phase in a customer’s lifecycle. In banking, they can be applied to identify a series of events or customer interactions that may be associated with discontinuing the use of a credit card; in telecommunications, for identifying typical purchase paths that are highly associated with the purchase of a particular add-on service; and in manufacturing and quality control, for uncovering signs in the production process that lead to defective products.

The rules generated by sequence models include antecedents or conditions and a consequent or conclusion. When the antecedents occur in a specific order, it is likely that they will be followed by the occurrence of the consequent. Their general format is:

IF (ANTECEDENTS with a specific order) THEN CONSEQUENT

or:

IF (product A and THEN product F and THEN product C and THEN ...) THEN product D

For example, a rule referring to bank products might be:

IF SAVINGS & THEN CREDIT CARD & THEN SHORT-TERM DEPOSIT → STOCKS

This rule states that bank customers who start their relationship with the bank as savings account customers, and subsequently acquire a credit card and a short-term deposit product, present an increased likelihood to invest in stocks. The likelihood of the consequent, given the antecedents, is expressed by the confidence value. The confidence value assesses the rule’s strength. Support and confidence measures, which were presented in detail for association models, are also applicable in sequence models.

The generated sequence models, when used for scoring, provide a set of predictions denoting the n, for instance the three, most likely next steps, given the observed antecedents. Predictions are sorted in terms of their confidence and may indicate for example the top three next product suggestions for each customer according to his or her recorded path of product purchasing to date.

Sequence models require the presence of an ID field to monitor the events of the same individual over time. The sequence data could be tabular or transactional, in a format similar to the one presented for association modeling. The fields required for the analysis are: content field(s), an analysis ID field, and a time field. Content fields denote the occurrence of the events of interest, for instance purchased products or web pages viewed during a visit to a site. The analysis ID field determines the level of analysis, for instance whether the revealed sequence patterns will refer to customers, transactions, or web visits, based on appropriately prepared weblog files. The time field records the time of the events and is required so that the algorithm can track the order of occurrence. A typical transactional modeling dataset, recording customer purchases over time, is given in Table 2.13.

A derived sequence rule is displayed in Table 2.14.

Table 2.13 A transactional modeling dataset for sequence modeling.

Analysis ID field      Time field             Content field (input–output field)
Customer ID            Acquisition time       Products
101 30 June 2007 Product 1
101 12 August 2007 Product 3
101 20 December 2008 Product 4
102 10 September 2008 Product 3
102 12 September 2008 Product 5
102 20 January 2009 Product 5
103 30 January 2009 Product 1
104 10 January 2009 Product 1
104 10 January 2009 Product 3
104 10 January 2009 Product 4
105 10 January 2009 Product 1
105 10 February 2009 Product 5
106 30 June 2007 Product 1
106 12 August 2007 Product 3
106 20 December 2008 Product 4
107 30 June 2007 Product 2
107 12 August 2007 Product 1
107 20 December 2008 Product 3

Table 2.14 Rule of a sequence detection model.

c02_image026.jpg

The support value represents the percentage of units of the analysis, here unique customers, that had a sequence of the antecedents. In the above example the support rises to 57.1%, since four out of seven customers purchased product 3 after buying product 1. Three of these customers purchased product 4 afterward. Thus, the respective rule confidence figure is 75%. The rule simply states that after acquiring product 1 and then product 3, customers have an increased likelihood (75%) of purchasing product 4 next.
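
The same figures can be recomputed with a short sketch (plain Python; the ordered purchase histories are taken from Table 2.13, keeping the listed order for purchases that share an acquisition date):

# Products per customer, ordered by acquisition time (from Table 2.13).
histories = {
    101: ["P1", "P3", "P4"], 102: ["P3", "P5", "P5"], 103: ["P1"],
    104: ["P1", "P3", "P4"], 105: ["P1", "P5"], 106: ["P1", "P3", "P4"],
    107: ["P2", "P1", "P3"],
}

def occurs_in_order(history, sequence):
    # True if the items of `sequence` appear in `history` in that order (not necessarily adjacent).
    position = 0
    for item in history:
        if item == sequence[position]:
            position += 1
            if position == len(sequence):
                return True
    return False

with_antecedents = [h for h in histories.values() if occurs_in_order(h, ["P1", "P3"])]
support = len(with_antecedents) / len(histories)                        # 4/7, about 57.1%
confidence = (sum(occurs_in_order(h, ["P1", "P3", "P4"]) for h in with_antecedents)
              / len(with_antecedents))                                  # 3/4, 75%
print(round(support, 3), round(confidence, 3))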

DETECTING UNUSUAL RECORDS WITH RECORD SCREENING MODELING TECHNIQUES

Record screening modeling techniques are applied to detect anomalies or outliers. The techniques try to identify records with odd data patterns that do not “conform” to the typical patterns of “normal” cases.

Unsupervised record screening modeling techniques can be used for:

  • Data auditing, as a preparatory step before applying subsequent data mining models.
  • Discovering fraud.

Valuable information is not hidden only in general data patterns. Sometimes rare or unexpected data patterns reveal situations that merit special attention or require immediate action. For instance, in the insurance industry, unusual claim profiles may indicate fraudulent cases. Similarly, odd money transfer transactions may suggest money laundering. Credit card transactions that do not fit the general usage profile of the owner may also indicate signs of suspicious activity.

Record screening modeling techniques can provide valuable help in revealing fraud by identifying “unexpected” data patterns and “odd” cases. The unexpected cases are not always suspicious; they may just indicate unusual, yet acceptable, behavior. They do, however, require further investigation before being classified as suspicious or not.

Record screening models can also play another important role. They can be used as a data exploration tool before the development of another data mining model. Some models, especially those with a statistical origin, can be affected by the presence of abnormal cases which may lead to poor or biased solutions. It is always a good idea to identify these cases in advance and thoroughly examine them before deciding on their inclusion in subsequent analysis.

Modified standard data mining techniques, like clustering models, can be used for the unsupervised detection of anomalies. Outliers can often be found among cases that do not fit well in any of the resulting clusters or that fall in sparsely populated clusters. Thus, the usual tactic for uncovering anomalous records is to develop an explorative clustering solution and then further investigate the results. A specialized technique in the field of unsupervised record screening is IBM SPSS Modeler’s Anomaly Detection. It is an exploratory technique based on clustering. It provides a quick, preliminary data investigation and suggests a list of records with odd data patterns for further investigation. It evaluates each record’s “normality” in a multivariate context, assessing all the inputs together rather than on a per-field basis. More specifically, it identifies peculiar cases by deriving a cluster solution and then measuring the distance of each record from its cluster central point, the centroid. An anomaly index is then calculated that represents the proximity of each record to the other records in its cluster. Records can be sorted according to this measure and then flagged as anomalous according to a user-specified threshold value. What is interesting about this algorithm is that it provides the reasoning behind its results. For each anomalous case it displays the fields with the unexpected values that do not conform to the general profile of the record.
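
The general idea, using the distance from a record's cluster centroid as an anomaly score, can be sketched as follows (scikit-learn K-means stands in here for the proprietary algorithm, and the 5% flagging threshold is an arbitrary assumption):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical input data: 500 records with four continuous fields.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))

X_std = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=5, n_init=10, random_state=1).fit(X_std)

# Distance of each record from the centroid of its own cluster acts as a simple anomaly index.
centroids = km.cluster_centers_[km.labels_]
distance = np.linalg.norm(X_std - centroids, axis=1)

threshold = np.quantile(distance, 0.95)      # flag the most distant 5% of records
suspects = np.where(distance > threshold)[0]
print(len(suspects), "records flagged for closer inspection")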

Supervised and Unsupervised Models for Detecting Fraud

Unsupervised record screening techniques can be applied for fraud detection even in the absence of recorded fraudulent events. If past fraudulent cases are available, analysts can try a supervised classification model to identify the input data patterns associated with the target suspicious activities. The supervised approach has strengths, since it works in a more targeted way than unsupervised record screening. However, it also has specific disadvantages. Since fraudsters’ behaviors may change and evolve over time, supervised models trained on past cases may soon become outdated and fail to capture new tricks and new types of suspicious patterns. Additionally, the list of past fraudulent cases, which is necessary for training the classification model, is often biased and partial, since it depends on the specific rules and criteria in use. The existing list may not cover all types of potential fraud and may need to be supplemented with the results of random audits. In conclusion, both the supervised and unsupervised approaches to detecting fraud have pros and cons; a combined approach usually yields the best results.

MACHINE LEARNING/ARTIFICIAL INTELLIGENCE VS. STATISTICAL TECHNIQUES

According to their origin and the way they analyze data patterns, the data mining models can be grouped into two classes:

  • Machine learning/artificial intelligence models
  • Statistical models.

Statistical models include algorithms such as ordinary least squares regression (OLSR), logistic regression, and factor analysis/PCA, among others. Techniques like decision trees, neural networks, association rules, and self-organizing maps are machine learning models.

With the rapid developments in IT in recent years, there has been rapid growth in machine learning algorithms, expanding analytical capabilities in terms of both efficiency and scalability. Nevertheless, one should never underestimate the predictive power of “traditional” statistical techniques, whose robustness and reliability have been established and proven over the years.

Faced with the growing volume of stored data, analysts started to look for faster algorithms that could overcome potential time and size limitations. Machine learning models were developed as an answer to the need to analyze large amounts of data in a reasonable time. New algorithms were also developed to overcome certain assumptions and restrictions of statistical models and to provide solutions to newly arisen business problems like the need to analyze affinities through association modeling and sequences through sequence models.

Trying many different modeling solutions is the essence of data mining. There is no particular technique or class of techniques which yields superior results in all situations and for all types of problems. However, in general, machine learning algorithms perform better than traditional statistical techniques in terms of speed and capacity for analyzing large volumes of data. Some traditional statistical techniques may fail to efficiently handle wide (high-dimensional) or long (many records) datasets. For instance, in the case of a classification project, a logistic regression model would demand more resources and processing time than a decision tree model. Similarly, a hierarchical or agglomerative clustering algorithm will fail to analyze more than a few thousand records, whereas some of the most recently developed clustering algorithms, like the IBM SPSS Modeler TwoStep model, can handle millions without sampling. Within machine learning algorithms we can also note substantial differences in terms of speed and required resources, with neural networks, including SOMs for clustering, among the most demanding techniques.

Another advantage of machine learning algorithms is that they make less stringent assumptions about the data. Thus they are friendlier and simpler to use for those with little experience in the technical aspects of model building. Statistical algorithms, in contrast, usually require considerable effort to build: analysts should spend time on data considerations, since merely feeding raw data into these algorithms will probably yield poor results. They may require special data processing and transformations before they produce results comparable to, or even better than, those of machine learning algorithms.

Another aspect that data miners should take into account when choosing a modeling technique is the insight provided by each algorithm. In general, statistical models yield transparent solutions. In contrast, some machine learning models, like neural networks, are opaque, conveying little information about the underlying data patterns and customer behaviors. They may provide reliable customer scores and achieve satisfactory predictive performance, but they offer little or no reasoning for their predictions. However, among the machine learning algorithms there are models, like decision trees, that do provide an explanation of the derived results. Their results are presented in an intuitive and self-explanatory format, allowing an understanding of the findings. Since most data mining software packages allow for fast and easy model development, developing one model for insight and a different model for scoring and deployment is not unusual.

SUMMARY

In the previous sections we presented a brief introduction to the main concepts of data mining modeling techniques. Models can be grouped into two main classes: supervised and unsupervised.

Supervised modeling techniques are also referred to as directed or predictive because their goal is prediction. Models automatically detect or “learn” the input data patterns associated with specific output values. Supervised models are further grouped into classification and estimation models, according to the measurement level of the target field. Classification models deal with the prediction of categorical outcomes. Their goal is to classify new cases into predefined classes. Classification models can support many important marketing applications that are related to the prediction of events, such as customer acquisition, cross/up/deep selling, and churn prevention. These models estimate event scores or propensities for all the examined customers, which enable marketers to efficiently target their subsequent campaigns and prioritize their actions. Estimation models, on the other hand, aim at estimating the values of continuous target fields. Supervised models require a thorough evaluation of their performance before deployment. There are many evaluation tools and methods which mainly include the cross-examination of the model’s predicted results with the observed actual values of the target field.

In Table 2.15 we present a list summarizing the supervised modeling techniques, in the fields of classification and estimation. The table is not meant to be exhaustive but rather an indicative listing which presents some of the most well-known and established algorithms.

While supervised models aim at prediction, unsupervised models are mainly used for grouping records or fields and for the detection of events or attributes that occur together. Data reduction techniques are used to narrow the data’s dimensions, especially in the case of wide datasets with correlated inputs. They identify related sets of original fields and derive compound measures that can effectively replace them in subsequent tasks. They simplify subsequent modeling or reporting jobs without sacrificing much of the information contained in the initial list of fields.

Table 2.15 Supervised modeling techniques.

Classification techniques:
  • Logistic regression
  • Decision trees:
     • C5.0
     • CHAID
     • Classification and Regression Trees
     • QUEST
  • Decision rules:
     • C5.0
     • Decision list
  • Discriminant analysis
  • Neural networks
  • Support vector machine
  • Bayesian networks

Estimation/regression techniques:
  • Ordinary least squares regression
  • Neural networks
  • Decision trees:
     • CHAID
     • Classification and Regression Trees
  • Support vector machine
  • Generalized linear models

Table 2.16 Unsupervised modeling techniques.

Clustering techniques:
  • K-means
  • TwoStep cluster
  • Kohonen network/self-organizing map

Data reduction techniques:
  • Principal components analysis
  • Factor analysis

Clustering models automatically detect natural groupings of records. They can be used to segment customers. All customers are assigned to one of the derived clusters according to their input data patterns and their profiles. Although an explorative technique, clustering also requires the evaluation of the derived clusters before selecting a final solution. The revealed clusters should be understandable, meaningful, and actionable in order to support the development of an effective segmentation scheme.

Association models identify events/products/attributes that tend to co-occur. They can be used for market basket analysis and in all other “affinity” business problems related to questions such as “what goes with what?” They generate IF... THEN rules which associate antecedents to a specific consequent. Sequence models are an extension of association models that also take into account the order of events. They detect sequences of events and can be used in web path analysis and in any other “sequence” type of problem.

Table 2.16 lists unsupervised modeling techniques in the fields of clustering and data reduction. Once again the table is not meant to be exhaustive but rather an indicative listing of some of the most popular algorithms.

One last thing to note about data mining models: they should not be viewed as stand-alone procedures but rather as steps in a well-designed process. Model results depend greatly on the preceding steps of the process (business understanding, data understanding, and data preparation) and on decisions and actions taken before the actual model training. Although most data mining models automatically detect patterns, they also depend on the skills of the people involved. Technical skills are not enough; they should be complemented with business expertise in order to yield meaningful rather than trivial or ambiguous results. Finally, a model can only be considered effective if its results, after being evaluated as useful, are deployed and integrated into the organization’s everyday business operations.

Since the book focuses on customer segmentation, a thorough presentation of supervised algorithms is beyond its scope. In the next chapter we will introduce only the key concepts of decision trees, as this is a technique that is often used in the framework of a segmentation project for scoring and profiling. We will, however, present in detail in that chapter those data reduction and clustering techniques that are widely used in segmentation applications.
