Customers are the most important asset of an organization. That’s why an organization should plan and employ a clear strategy for customer handling. Customer relationship management (CRM) is the strategy for building, managing, and strengthening loyal and long-lasting customer relationships. CRM should be a customer-centric approach based on customer insight. Its scope should be the “personalized” handling of the customers as distinct entities through the identification and understanding of their differentiated needs, preferences, and behaviors.
CRM aims at two main objectives:
Data mining can provide customer insight which is vital for these objectives and for establishing an effective CRM strategy. It can lead to personalized interactions with customers and hence increased satisfaction and profitable customer relationships through data analysis. It can offer individualized and optimized customer management throughout all the phases of the customer life cycle, from acquisition and establishment of a strong relationship to attrition prevention and win-back of lost customers. Marketers strive to get a greater market share and a greater share of their customers. In plain words, they are responsible for getting, developing, and keeping the customers. Data mining can help them in all these tasks, as shown in Figure 1.1.
More specifically, the marketing activities that can be supported with the use of data mining include:
Segmentation is the process of dividing the customer base into distinct and homogeneous groups in order to develop differentiated marketing strategies according to their characteristics. There are many different segmentation types according to the specific criteria/attributes used for segmentation. In behavioral segmentation, customers are grouped based on behavioral and usage characteristics. Although behavioral segments can be created using business rules, this approach has inherent disadvantages: it can handle only a few segmentation fields, and its objectivity is questionable as it is based on the personal perceptions of a business expert. Data mining, on the other hand, can create data-driven behavioral segments. Clustering algorithms can analyze behavioral data, identify the natural groupings of customers, and suggest a grouping founded on observed data patterns. Provided it is properly built, a clustering solution can uncover groups with distinct profiles and characteristics and lead to rich, actionable segmentation schemes with business meaning and value.
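As an illustration of data-driven behavioral segmentation, the sketch below runs a plain k-means clustering, coded from scratch, on two hypothetical behavioral fields (monthly transactions and average basket value, both invented for the example); a real project would of course use a proper data mining tool rather than this minimal implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means: assign each case to its nearest centroid, then
    recompute the centroids, until the solution stabilizes."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distance of every case to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical behavioral fields: [monthly transactions, average basket value]
customers = np.array([
    [2.0, 15.0], [3.0, 18.0], [2.5, 14.0],     # low-activity profile
    [20.0, 60.0], [22.0, 55.0], [19.0, 65.0],  # heavy-usage profile
])
labels, centroids = kmeans(customers, k=2)
```

Each customer ends up in one of two data-driven segments; profiling the resulting centroids is what turns the clusters into an actionable segmentation scheme.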
Data mining can also be used for the development of segmentation schemes based on the current or expected/estimated value of the customers. These segments are necessary in order to prioritize the customer handling and the marketing interventions according to the importance of each customer.
Marketers carry out direct marketing campaigns to communicate a message to their customers through mail, Internet, e-mail, telemarketing (phone), and other direct channels in order to prevent churn (attrition) and drive customer acquisition and purchase of add-on products. More specifically, acquisition campaigns aim at drawing new and potentially valuable customers from the competition. Cross/deep/up-selling campaigns are rolled out to sell additional products, more of the same product, or alternative but more profitable products to the existing customers. Finally, retention campaigns aim at preventing valuable customers from terminating their relationship with the organization.
Although potentially effective, these campaigns, when not properly refined, can lead to a huge waste of resources and annoy customers with unsolicited communication. Data mining, and classification (propensity) models in particular, can support the development of targeted marketing campaigns: they analyze customer characteristics and recognize the profiles of the target customers. New cases with similar profiles are then identified, assigned a high propensity score, and included in the target lists. Table 1.1 summarizes the use of data mining models in direct marketing campaigns.
Table 1.1 Data mining models and direct marketing campaigns
Source: Tsiptsis and Chorianopoulos (2009).
Business objective | Marketing campaign | Data mining models
Getting customers | Acquisition campaigns: draw new and potentially valuable customers from the competition | Classification (propensity) models for acquisition
Developing customers | Cross/deep/up-selling campaigns: sell additional products, more of the same product, or alternative but more profitable products to existing customers | Classification (propensity) models for cross/deep/up-selling
Retaining customers | Retention campaigns: prevent valuable customers from terminating their relationship with the organization | Classification (propensity) models for churn prevention
When properly built, propensity models can identify the right customers to contact and lead to campaign lists with increased concentrations of target customers. They outperform random selections as well as predictions based on business rules and personal intuitions.
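To make the idea concrete, here is a minimal, self-contained sketch of a propensity model: a logistic regression trained by gradient descent on a toy campaign-response history (the fields and figures are invented for illustration). The trained model then scores new, unseen customers, whose propensities would drive the campaign target list.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, n_iter=2000):
    """Fit a simple logistic regression by gradient descent on the log-loss;
    the learned weights map inputs to a propensity score."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * (p - y).mean()
    return w, b

# Hypothetical campaign-response data: [tenure_years, monthly_usage]
X_train = np.array([[1.0, 5.0], [2.0, 6.0], [1.5, 4.0],
                    [8.0, 30.0], [9.0, 28.0], [7.5, 35.0]])
y_train = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # 1 = responded in the past

# Standardize the inputs so that gradient descent behaves well
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
w, b = train_logistic((X_train - mu) / sd, y_train)

# Scoring: propensity of new, unseen customers to respond
X_new = np.array([[1.2, 5.5], [8.5, 31.0]])
scores = sigmoid(((X_new - mu) / sd) @ w + b)
```

The second new customer resembles past responders and receives a high score; ranking all customers by this score yields a campaign list with an increased concentration of likely responders.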
Data mining, and association models in particular, can be used to identify related products that are typically purchased together. These models can be used for market basket analysis and for revealing bundles of products/services that can be sold together. Sequence models take into account the order of actions/purchases and can identify sequences of events.
The modeling phase is just one phase in the implementation process of a data mining project. Steps of critical importance precede and follow the model building and have a significant effect in the success of the project. An outline of the basic phases in the development of a data mining project, according to the Cross Industry Standard Process for Data Mining (CRISP-DM) process model, is presented in Table 1.2.
Table 1.2 The CRISP-DM phases
Source: Tsiptsis and Chorianopoulos (2009). Reproduced with permission from Wiley.
1. Business understanding | 2. Data understanding | 3. Data preparation
4. Modeling | 5. Model evaluation | 6. Deployment
Data mining projects are not simple. They may end in business failure if the engaged team is not guided by a clear methodological framework. The CRISP-DM process model charts the steps that should be followed for successful data mining implementations. These steps are:
The aforementioned phases present strong dependencies, and the outcomes of a phase may lead to revisiting and reviewing the results of preceding phases. The nature of the process is cyclical since the data mining itself is a never-ending journey and quest, demanding continuous reassessment and update of completed tasks in the context of a rapidly changing business environment.
This book contains two chapters dedicated to the methodological framework of classification and behavioral segmentation modeling. In these chapters, the recommended approach for these applications is elaborated and presented as a step-by-step guide.
Data mining models employ statistical or machine-learning algorithms to identify useful data patterns and understand and predict behaviors. They can be grouped into two main classes according to their goal:
In supervised, also referred to as predictive, directed, or targeted, modeling, the goal is to predict an event or estimate the values of a continuous numeric attribute. In these models, there are input fields and an output or target field. Inputs are also called predictors because they are used by the algorithm for the identification of a prediction function for the output. We can think of predictors as the “X” part of the function and the target field as the “Y” part, the outcome.
The algorithm associates the outcome with input data patterns. Pattern recognition is “supervised” by the target field. Relationships are established between the inputs and the output. An input–output “mapping function” is generated by the algorithm that associates predictors with the output and permits the prediction of the output values, given the values of the inputs.
In unsupervised or undirected models, there is no output, just inputs. The pattern recognition is undirected; it is not guided by a specific target field. The goal of the algorithm is to uncover data patterns in the set of inputs and identify groups of similar cases, groups of correlated fields, frequent itemsets, or anomalous records.
Models learn from past cases. In order for predictive algorithms to associate input data patterns with specific outcomes, it is necessary to present them with cases with known outcomes. This phase is called the training phase. During that phase, the predictive algorithm builds the function that connects the inputs with the target. Once the relationships are identified and the model is evaluated and proven to have satisfactory predictive power, the scoring phase follows: new records, for which the outcome values are unknown, are presented to the model and scored accordingly.
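The training/scoring distinction can be illustrated with the simplest possible predictive model, a one-split "decision stump"; the churn-by-tenure data below is hypothetical.

```python
def train_stump(tenure, churned):
    """Training phase: scan candidate thresholds on a single input and keep
    the one that best reproduces the known outcomes (fewest errors)."""
    best_t, best_err = None, None
    for t in sorted(set(tenure)):
        errors = sum(int(x <= t) != y for x, y in zip(tenure, churned))
        if best_err is None or errors < best_err:
            best_t, best_err = t, errors
    return best_t

def score(tenure_new, threshold):
    """Scoring phase: apply the learned rule to records with unknown outcome."""
    return [int(x <= threshold) for x in tenure_new]

# Known outcomes (hypothetical): short-tenure customers churned
tenure  = [1.0, 1.5, 2.0, 6.0, 7.0, 9.0]
churned = [1,   1,   1,   0,   0,   0]
threshold = train_stump(tenure, churned)
predictions = score([1.2, 8.0], threshold)   # new records, outcome unknown
```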
Some predictive algorithms such as regression and Decision Trees are transparent, providing an explanation of their results. Besides prediction, these algorithms can also be used for insight and profiling. They can identify inputs with a significant effect on the target attribute, for example, drivers of customer satisfaction or attrition, and they can reveal the type and the magnitude of their effect.
According to their scope and the measurement level of the field to be predicted, supervised models are further categorized into:
Classification or propensity models predict categorical outcomes. Their goal is to classify new cases to predefined classes, in other words to predict an event. The classification algorithm estimates a propensity score for each new case. The propensity score denotes the likelihood of occurrence of the target event.
Estimation models are similar to classification models, with one big difference: they are used for predicting the value of a continuous output based on the observed values of the inputs.
These models are used as a preparation step preceding the development of a predictive model. Feature selection algorithms assess the predictive importance of the inputs and identify the significant ones. Predictors with trivial predictive power are discarded from the subsequent modeling steps.
Classification models predict categorical outcomes by using a set of inputs and a historical dataset with preclassified data. Generated models are then used to predict event occurrence and classify unseen records. Typical examples of target categorical fields include:
At the heart of all classification models is the estimation of confidence scores. These scores denote the likelihood of the predicted outcome: they are estimates of the probability of occurrence of the respective event, typically ranging from 0 to 1. Confidence scores can be translated into propensity scores which signify the likelihood of a particular target class: the propensity of a customer to churn, to buy a specific add-on product, or to default on a loan. Propensity scores allow for the rank ordering of customers according to this likelihood. This feature enables marketers to target their lists and optimally tailor their campaign sizes according to their resources and marketing objectives. They can expand or narrow their target lists on the basis of their particular objectives, always targeting the customers with the relatively higher probabilities.
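Rank-ordering by propensity is straightforward once the scores exist. In the sketch below the scores are hypothetical, standing in for a classification model's output:

```python
# Hypothetical propensity scores, as produced by a classification model
scores = {"cust_01": 0.82, "cust_02": 0.15, "cust_03": 0.67,
          "cust_04": 0.91, "cust_05": 0.34}

# Rank-order customers by propensity, highest first
ranked = sorted(scores, key=scores.get, reverse=True)

# Tailor the list to the available campaign resources
campaign_size = 3
target_list = ranked[:campaign_size]
```

Expanding or narrowing the campaign is then just a matter of moving the cutoff down or up the ranked list.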
Popular classification algorithms include:
For example:
In a tree format, rules are graphically represented as a tree in which the initial population (root node) is successively partitioned into terminal (leaf) nodes with similar behavior with respect to the target field.
Decision Tree algorithms are fast and scalable. Available algorithms include:
ln(pj / pk) = b0 + b1X1 + b2X2 + … + bnXn

where pj is the probability of the target class j, pk the probability of the reference target class k, Xi the predictors, bi the regression coefficients, and b0 the intercept of the model. The regression coefficients represent the effect of the predictors.
For example, in the case of a binary target denoting churn,
In order to yield optimal results, logistic regression may require special data preparation, including potential screening and transformation (optimal binning) of the predictors. It demands some statistical experience; yet, provided it is built properly, it can produce stable and understandable results.
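To illustrate the formula, the fragment below evaluates a hypothetical fitted binary churn model: the coefficients are invented for the example, but the mechanics — summing the weighted inputs into the log-odds and inverting the logit — are exactly those of logistic regression.

```python
import math

# Hypothetical fitted binary churn model:
#   ln(p_churn / p_stay) = b0 + b1*complaints + b2*tenure_months
b0, b1, b2 = -2.0, 0.8, -0.05   # invented coefficients, for illustration only

def churn_probability(complaints, tenure_months):
    log_odds = b0 + b1 * complaints + b2 * tenure_months
    return 1.0 / (1.0 + math.exp(-log_odds))   # invert the logit
```

The coefficients are directly interpretable: each additional complaint multiplies the churn odds by exp(b1), while longer tenure lowers them.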
Estimation models, also referred to as regression models, deal with continuous numeric outcomes. By using linear or nonlinear functions, they use the input fields to estimate the unknown values of a continuous target field.
Estimation algorithms can be used to predict attributes like the following:
A dataset with historical data and known values of the continuous output is required for the model training. A mapping function is then identified that associates the available inputs to the output values. These models are also referred to as regression models, after the well-known and established statistical algorithm of ordinary least squares regression (OLSR). OLSR estimates the line that best fits the data and minimizes the observed errors, the so-called least squares line. It requires some statistical experience, and since it is sensitive to possible violations of its assumptions, it may require specific data examination and processing before model building. The final model has the intuitive form of a linear function with coefficients denoting the effect of the predictors on the outcome. Although transparent, it has inherent limitations that may affect its performance in complex situations of nonlinear relationships and interactions between predictors.
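A minimal OLSR sketch with NumPy follows; the revenue data is synthetic and noise-free (generated as 2×usage + 5×tenure), so the estimated coefficients recover the generating ones exactly.

```python
import numpy as np

# Synthetic, noise-free data: revenue = 2*usage + 5*tenure
X = np.array([[10.0, 1.0], [20.0, 2.0], [30.0, 1.0], [40.0, 3.0]])
y = np.array([25.0, 50.0, 65.0, 95.0])

A = np.column_stack([np.ones(len(X)), X])     # prepend an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least squares solution
b0, b_usage, b_tenure = coef

# Estimate revenue for a new customer with usage 25 and tenure 2
estimate = b0 + b_usage * 25 + b_tenure * 2
```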
Nowadays, traditional regression is not the only available estimation algorithm. New techniques, with less stringent assumptions, which also capture nonlinear relationships, can also be employed to handle continuous outcomes. More specifically, polynomial regression, neural networks, SVM, and regression trees such as CART can also be employed for the prediction of continuous attributes.
The feature selection (field screening) process is a preparation step for the development of classification and estimation (regression) models. Having hundreds of candidate predictors is not an unusual situation in complicated data mining tasks. Some of these fields, though, may not have an influence on the output that we want to predict.
The basic idea of feature selection is to use basic statistical measures to assess and quantify the relationship of the inputs to the output. More specifically, feature selection is used to:
Some predictive algorithms, including Decision Trees, integrate screening mechanisms that internally filter out the unrelated predictors. A preprocessing feature selection step is also available in Data Mining for Excel, and it can be invoked when building a predictive model. Feature selection can efficiently reduce data dimensionality, retaining only a subset of significant inputs so that the training time is reduced with no or insignificant loss of accuracy.
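A simple correlation-based screening, in the spirit of such preprocessing steps (the field names and data are synthetic), can be sketched as:

```python
import numpy as np

def screen_features(X, y, names, k):
    """Rank candidate predictors by the absolute Pearson correlation of each
    input with the target and keep the k strongest; inputs with trivial
    association are discarded before modeling."""
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    order = np.argsort(corrs)[::-1]
    return [names[j] for j in order[:k]]

# Synthetic data: the target depends on usage and tenure, not on noise_field
rng = np.random.default_rng(1)
n = 200
usage = rng.normal(size=n)
tenure = rng.normal(size=n)
noise_field = rng.normal(size=n)   # unrelated candidate predictor
y = 3 * usage - 2 * tenure + rng.normal(scale=0.1, size=n)

X = np.column_stack([usage, tenure, noise_field])
kept = screen_features(X, y, ["usage", "tenure", "noise_field"], k=2)
```

The unrelated field is dropped, shrinking the input space before the (typically more expensive) model training.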
In unsupervised modeling, only input fields are involved. The scope is the identification of groupings and associations. Unsupervised models include:
In cluster models, the groups are not known in advance. Instead, the algorithms analyze the input data patterns and identify the natural groupings of instances/cases. When new cases are scored by the generated cluster model, they are assigned into one of the revealed clusters.
Association and sequence models also belong to the class of unsupervised algorithms. Association models do not involve direct prediction of a single field. In fact, all fields have a double role, since they act as inputs and outputs at the same time. Association algorithms detect associations between discrete events, products, and attributes. Sequence algorithms detect associations over time.
Dimensionality reduction algorithms “group” fields into new compound measures and reduce the dimensions of data without sacrificing much of the information of the original fields.
Cluster models automatically detect the underlying groups of cases, the clusters. The clusters are not known in advance; they are revealed by analyzing the observed input data patterns. Clustering algorithms assess the similarity of the records/customers with respect to the clustering fields and assign them to the revealed clusters accordingly. Their goal is to detect groups with internal homogeneity and interclass heterogeneity.
Clustering algorithms are quite popular, and their use is widespread from data mining to market research. They can support the development of different segmentation schemes according to the clustering attributes used: behavioral, attitudinal, or demographical segmentation.
The major advantage of clustering algorithms is that they can efficiently manage a large number of attributes and create data-driven segments. The revealed segments are not based on the personal concepts, intuitions, and perceptions of business people; they are induced by the observed data patterns and, provided they are properly built, can lead to results with real business meaning and value. Clustering models can analyze complex input data patterns and suggest solutions that would not otherwise be apparent. They reveal customer typologies, enabling tailored marketing strategies.
Nowadays, various clustering algorithms are available which differ in their approach for assessing the similarity of the cases. According to the way they work and their outputs, clustering algorithms can be categorized into two classes: hard and soft clustering algorithms. Hard clustering algorithms assess the distances (dissimilarities) of the instances; the revealed clusters do not overlap, and each case is assigned to a single cluster.
Hard clustering algorithms include:
Soft clustering techniques, on the other hand, use probabilistic measures to assign the cases to clusters. The clusters can overlap, and the instances can belong to more than one cluster with certain, estimated probabilities. The most popular probabilistic clustering algorithm is Expectation Maximization (EM) clustering.
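The sketch below implements a minimal EM for a mixture of two one-dimensional Gaussians, to show what "soft" assignment means in practice; the usage values are invented for the example.

```python
import math

def em_two_gaussians(xs, n_iter=50):
    """Minimal EM for a mixture of two 1-D Gaussians. The E-step computes the
    probability of each case belonging to each cluster (the soft assignment);
    the M-step re-estimates means, variances, and mixing weights."""
    mu = [min(xs), max(xs)]          # crude but deterministic initialization
    var = [1.0, 1.0]
    pi = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(n_iter):
        # E-step: membership probabilities (responsibilities)
        resp = []
        for x in xs:
            w = [pi[j] * pdf(x, mu[j], var[j]) for j in range(2)]
            s = sum(w)
            resp.append([wj / s for wj in w])
        # M-step: update parameters from the soft assignments
        for j in range(2):
            nj = sum(r[j] for r in resp)
            pi[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, xs)) / nj
    return mu, resp

# Hypothetical monthly usage: light users around 2, heavy users around 10
xs = [1.8, 2.0, 2.2, 9.8, 10.0, 10.2]
mu, resp = em_two_gaussians(xs)
```

Each case receives a probability of membership in each cluster rather than a single label; with well-separated groups, as here, the soft assignments end up close to 0 or 1.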
Association models analyze past co-occurrences of events and detect associations and frequent itemsets. They associate a particular outcome category with a set of conditions. They are typically used to identify purchase patterns and groups of products often purchased together.
Association algorithms generate rules of the following general format:
For example:
More specifically, a rule referring to supermarket purchases might be:
This simple rule, derived by analyzing past shopping carts, identifies associated products that tend to be purchased together: when eggs, milk, and fresh fruit are bought, then there is an increased probability of also buying vegetables. This probability, referred to as the rule’s confidence, denotes the rule’s strength.
The left or IF part of the rule consists of the antecedents or conditions: a situation that, when it holds true, means the rule applies and the consequent shows increased occurrence rates. In other words, the antecedent part contains the product combinations that usually lead to some other product. The right part of the rule is the consequent or conclusion: what tends to be true when the antecedents hold. The rule's complexity depends on the number of antecedents linked to the consequent.
These models aim at:
This type of analysis is referred to as market basket analysis, since it originated from point-of-sale data and the need to understand consumer shopping patterns. Its application has been extended, though, to cover any other “basketlike” problem from various other industries. For example:
Association models are unsupervised since they do not involve a single output field to be predicted. They analyze product affinity tables: multiple fields that denote product/service possession. These fields are at the same time considered as inputs and outputs. Thus, all products are predicted and act as predictors for the rest of the products.
Usually, all the extracted rules are described and evaluated with respect to three main measures: the support, the confidence, and the lift of the rule.
The Apriori and the FP-growth algorithms are popular association algorithms.
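The rule measures can be computed directly from a handful of toy baskets, using the eggs–milk–fresh fruit rule from the text (the baskets themselves are invented):

```python
# Toy shopping baskets (hypothetical)
baskets = [
    {"eggs", "milk", "fresh fruit", "vegetables"},
    {"eggs", "milk", "fresh fruit", "vegetables"},
    {"eggs", "milk", "fresh fruit"},
    {"milk", "bread"},
    {"vegetables", "bread"},
]

antecedent = {"eggs", "milk", "fresh fruit"}   # the IF part of the rule
consequent = "vegetables"                      # the THEN part

n = len(baskets)
n_ante = sum(antecedent <= b for b in baskets)                     # IF part holds
n_both = sum(antecedent <= b and consequent in b for b in baskets)
n_cons = sum(consequent in b for b in baskets)

support = n_both / n              # how often the whole rule occurs
confidence = n_both / n_ante      # P(consequent | antecedent): rule strength
lift = confidence / (n_cons / n)  # strength relative to the baseline rate
```

A lift above 1 means the antecedent genuinely raises the chance of the consequent, rather than the consequent simply being popular on its own.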
Sequence algorithms analyze paths of events in order to detect common sequences. They are used to identify associations of events/purchases/attributes over time. They take into account the order of events and detect sequential associations that lead to specific outcomes. Sequence algorithms generate rules analogous to association rules, with one difference: a sequence of antecedent events is strongly associated with the occurrence of a consequent. In other words, when certain things happen in a specific order, a specific event has an increased probability of following. Their general format is:
Or, for example, a rule referring to bank products might be:
This rule states that bank customers who start their relationship with the bank as savings customers and subsequently acquire a credit card and a short-term deposit present increased likelihood to invest in stocks. The support and confidence measures are also applicable in sequence models.
The origin of sequence modeling lies in web mining and click stream analysis of web pages: it started as a way to analyze web log data in order to understand the navigation patterns in web sites and identify the browsing trails that end up in specific pages, for instance, purchase checkout pages. The use of these algorithms has been extended, and nowadays, they can be applied to all “sequence” business problems. They can be used as a means of predicting the next expected “move” of the customers or the next phase in a customer’s life cycle. In banking, they can be applied to identify a series of events or customer interactions that may be associated with discontinuing the use of a product; in telecommunications, to identify typical purchase paths that are highly associated with the purchase of a particular add-on service; and in manufacturing and quality control, to uncover signs in the production process that lead to defective products.
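A minimal sketch of evaluating a sequence rule, using the savings → credit card → deposit → stocks example from the text on hypothetical event histories:

```python
def contains_in_order(events, pattern):
    """True if all pattern items appear in events in the given order,
    not necessarily consecutively."""
    it = iter(events)
    return all(step in it for step in pattern)

# Hypothetical customer event histories, oldest event first
histories = [
    ["savings", "credit card", "deposit", "stocks"],
    ["savings", "credit card", "deposit", "stocks"],
    ["savings", "credit card", "deposit"],
    ["credit card", "savings", "loan"],
]

pattern = ["savings", "credit card", "deposit"]
antecedent_count = sum(contains_in_order(h, pattern) for h in histories)
rule_count = sum(contains_in_order(h, pattern + ["stocks"]) for h in histories)
confidence = rule_count / antecedent_count  # P(stocks follow | pattern seen)
```

Unlike plain association rules, the order matters here: a customer who acquired a credit card before opening a savings account does not match the antecedent.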
As their name implies, dimensionality reduction models aim at effectively reducing the data dimensions and removing redundant information. They identify the latent data dimensions and replace the initial set of inputs with a core set of compound measures which simplify subsequent modeling while retaining most of the information of the original attributes.
Factor, Principal Components Analysis (PCA), and Independent Component Analysis (ICA) are among the most popular data reduction algorithms. They are unsupervised, statistical algorithms which analyze and substitute a set of continuous inputs with representative compound measures of lower dimensionality.
Simplicity is the key benefit of data reduction techniques, since they drastically reduce the number of fields under study to a core set of composite measures. Some data mining techniques may run too slow or may fail to run if they have to handle a large number of inputs. Situations like these can be avoided by using the derived component scores instead of the original fields.
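A minimal PCA sketch in NumPy, on synthetic data in which three correlated inputs are driven by a single latent factor, so one component retains almost all of the information:

```python
import numpy as np

# Synthetic inputs: three correlated fields driven by one latent factor
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 1))
X = np.hstack([2.0 * latent, -1.0 * latent, 0.5 * latent])
X += rng.normal(scale=0.1, size=X.shape)     # small independent noise

Xc = X - X.mean(axis=0)                      # center the inputs
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()          # variance share per component
scores = Xc @ eigvecs[:, :1]                 # replace 3 inputs with 1 component
```

Subsequent models can use the single component score in place of the three original fields, with almost no loss of information.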
Record screening models are applied for anomaly or outlier detection. They try to identify records with odd data patterns that do not “conform” to the typical patterns of the “normal” cases.
Unsupervised record screening models can be used for:
Valuable information is not only hidden in general data patterns. Sometimes rare or unexpected data patterns can reveal situations that merit special attention or require immediate actions. For instance, in the insurance industry, unusual claim profiles may indicate fraudulent cases. Similarly, odd money transfer transactions may suggest money laundering. Credit card transactions that do not fit the general usage profile of the owner may also indicate signs of suspicious activity.
Record screening algorithms can provide valuable help in fraud discovery by identifying the “unexpected” data patterns and the “odd” cases. The unexpected cases are not always suspicious. They may just indicate an unusual yet acceptable behavior. For sure though, they require further investigation before being classified as suspicious or not.
Record screening models can also play another important role: they can be used as a data exploration tool before the development of another data mining model. Some models, especially those with a statistical origin, can be affected by the presence of abnormal cases, which may lead to poor or biased solutions. It is always a good idea to identify these cases in advance and thoroughly examine them before deciding on their inclusion in the subsequent analysis.
The examination of record distances as well as standard data mining techniques, such as clustering, can be applied for anomaly detection. Anomalous cases can often be found among cases distant from their “neighbors” or among cases that do not fit well in any of the emerged clusters or lie in sparsely populated clusters.
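A distance-based screening sketch: score each record by its distance to its kth nearest neighbour and flag the most isolated one (the records are synthetic):

```python
import numpy as np

def kth_neighbor_distance(X, k):
    """Score each record by the distance to its kth nearest neighbour;
    isolated records that do not 'conform' get large scores."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d_sorted = np.sort(d, axis=1)   # column 0 is the zero self-distance
    return d_sorted[:, k]

# Synthetic records: four similar cases and one far-away outlier
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.9],
              [9.0, 9.0]])
outlier_score = kth_neighbor_distance(X, k=3)
flagged = int(np.argmax(outlier_score))      # record that merits investigation
```

As stressed above, a high score only marks the record as unexpected; whether it is fraudulent, erroneous, or merely unusual requires further investigation.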
The success of a data mining project strongly depends on the breadth and quality of the available data; this is why the data preparation phase is typically the most time-consuming phase of the project. Data mining applications should not be considered one-off projects but rather ongoing processes, integrated into the organization’s marketing strategy. Data mining has to be “operationalized”: derived results should be made available to marketers to guide them in their everyday marketing activities, and they should also be loaded into the organization’s frontline systems in order to enable “personalized” customer handling. This approach requires setting up well-organized data mining procedures, designed to serve specific business goals, instead of occasional attempts which merely cover sporadic needs.
In order to achieve this and become a “predictive enterprise,” an organization should focus on the data to be mined. Since the goal is to turn data into actionable knowledge, a vital step in this “mining quest” is to build the appropriate data infrastructure. Ad hoc data extractions and queries which just provide answers to a particular business problem may soon end up in a huge mess of unstructured information. The proposed approach is to design and build a central mining datamart that will serve as the main data repository for the majority of the data mining applications.
All relevant information should be taken into account in the datamart design. Useful information from all available data sources, including internal sources such as transactional, billing, and operational systems, and external sources such as market surveys and third-party lists, should be collected and consolidated in the datamart framework. After all, this is the main idea of the datamart: to combine all important blocks of information in a central repository that can enable the organization to have a complete view of each customer.
The mining data mart should:
Table 1.3 presents an indicative, minimum list of required information that should be loaded and available in the mining datamart of retail banking.
Table 1.3 The minimum required data for the mining datamart of retail banking
- Product mix and product utilization: ownership and balances
  - Product ownership and balances per product group/subgroup
- Frequency (number) and volume (amount) of transactions
  - Transactions by transaction type
  - Transactions by transaction channel
- Product (account) openings/terminations
- For the specific case of credit cards: frequency and volume of purchases by type (one-off, installments, etc.) and merchant category
- Credit score and arrears history
- Information on the profitability of customers
- Customer status history (active, inactive, dormant, etc.) and core segment membership (retail, corporate, private banking, affluent, mass, etc.)
- Registration and sociodemographical information of customers
Table 1.4 lists the minimum blocks of information that should reside in the datamart of a mobile telephony operator (for residential customers).
Table 1.4 The minimum required data for the mining datamart of mobile telephony (residential customers)
- Phone usage: number of calls, minutes of calls, traffic
  - Usage by call direction and network type
  - Usage by core service type
  - Usage by origination/destination operator
  - Usage by call day/time
- Customer communities: distinct telephone numbers with calls from or to the customer
  - Information by call direction (incoming/outgoing) and operator (on-net/mobile telephony competitor operator/fixed telephony operator)
- Top-up history (for prepaid customers): frequency, value, and recency of top-ups
  - Information by top-up type
- Rate plan history (openings, closings, migrations)
- Billing, payment, and credit history (average number of days till payment, number of times in arrears, average time remaining in arrears, etc.)
- Financial information such as profit and cost for each customer (ARPU, MARPU)
- Status history (active, suspended, etc.) and core segmentation information (consumer postpaid/contractual, consumer prepaid customers)
- Registration and sociodemographical information of customers
Table 1.5 lists the minimum information blocks that should be loaded in the datamart of retailers.
Table 1.5 The minimum required data for the mining datamart of retailers
- RFM attributes: recency, frequency, and monetary value, overall and per product category
  - Time since the last transaction (purchase event) or since the most recent visit day
  - Frequency (number) of transactions or number of days with purchases
  - Value of total purchases
- Spending: purchase amount and number of visits/transactions
  - Relative spending by product hierarchy and private labels
  - Relative spending by store/department
  - Relative spending by date/time
  - Relative spending by channel of purchase
- Usage of vouchers and reward points
- Payments by type (e.g., cash, credit card)
- Registration data collected during the loyalty scheme application process
Most data mining applications require a one-dimensional, flat table, typically at a customer level. A recommended approach is to consolidate the mining datamart information, often spread across a set of database tables, into one table which should be designed to cover the key mining as well as marketing and reporting needs. This table, also referred to as the marketing reference table, should integrate all the important customer information, including demographics, usage, and revenue data, providing a customer “signature”: a unified view of the customer. Through extensive data processing, the data retrieved from the datamart should be enriched with derived attributes and informative KPIs that summarize all the aspects of the customer’s relationship with the organization.
The reference table typically includes data at a customer level, though the data model of the mart should support the creation of custom reference tables of different granularities. It should be updated on a regular basis to provide the most recent view of each customer, and it should be stored to track the customer view over time. The key idea is to provide a good starting point for the fast creation of a modeling file for all key future analytical applications.
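A minimal sketch of the consolidation step: rolling transaction-level rows up into a one-row-per-customer signature with a derived KPI (the field names and values are hypothetical):

```python
from collections import defaultdict

# Hypothetical transaction-level rows retrieved from the mining datamart
transactions = [
    {"customer": "C1", "channel": "branch", "amount": 120.0},
    {"customer": "C1", "channel": "web",    "amount": 80.0},
    {"customer": "C2", "channel": "web",    "amount": 40.0},
    {"customer": "C1", "channel": "web",    "amount": 100.0},
]

# Consolidate into one row per customer: the customer "signature"
signature = defaultdict(lambda: {"n_txn": 0, "total": 0.0, "web_txn": 0})
for t in transactions:
    row = signature[t["customer"]]
    row["n_txn"] += 1
    row["total"] += t["amount"]
    row["web_txn"] += int(t["channel"] == "web")

# Derived KPI enriching the signature: share of web-channel transactions
for row in signature.values():
    row["web_share"] = row["web_txn"] / row["n_txn"]
```

In practice this aggregation runs inside the datamart (SQL or an ETL tool), but the principle is the same: one flat row per customer, enriched with derived KPIs.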
Typical data management operations required to construct and maintain the marketing reference tables include:
The derived measures in the marketing reference table should include:
In this chapter, we’ve presented a brief overview of the main concepts of data mining. We’ve outlined how data mining can help an organization better address its CRM objectives and achieve “individualized” and more effective customer management through customer insight. We’ve introduced the main types of data mining models and algorithms, as well as a process model, a methodological framework for designing and implementing successful data mining projects. We’ve also presented the importance of the mining datamart and provided indicative lists of the required information per industry.
Chapters 2 and 3 are dedicated to the detailed explanation of the methodological steps for classification and segmentation modeling. After clarifying the roadmap, we describe the tools, the algorithms: Chapters 4 and 5 explain in plain words some of the most popular and powerful data mining algorithms for classification and clustering. The second part of the book is the “hands-on” part, in which the knowledge of the previous chapters is applied in real-world applications from various industries using three different data mining tools. A lesson worth noting that I’ve learned after concluding my journey in this book is that it is not the tool that matters the most but the roadmap. Once you have an unambiguous business and data mining goal and a clear methodological approach, you’ll most likely reach your target, yielding comparable results, regardless of the data mining software used.