Chapter 3

Using Analytics


Analytics has a degree of mystery surrounding it, almost like a magic box that takes in large amounts of data and, voila, business insights jump out. This chapter will demystify analytics by first explaining, through several examples, the specific problems for which analytics can be used, rather than relying on the generic statement that “analytics finds insights that business can employ.” The patterns or insights have to be found in a given context, a problem statement, which determines the relevant data that is needed. Any attempt to blindly point an analytics tool at a large data set may or may not deliver results; it can lead to aimless wandering in data without getting anywhere. That type of approach (also known as data discovery) may work in a classic academic research and development environment, but not for commercial organizations, where the entire initiative may get terminated after a while. It is advisable not to undertake an analytics project where the problem domain or context is not properly established. The examples in this chapter will help.

Based on the Information Continuum, where analytics was put in the context of other data and information delivery mechanisms, this chapter presents a variety of problems from numerous industries to see where analytics can be applied. The purpose of these examples is twofold:

1. To illustrate the variety of problems that analytics can solve.

2. To illustrate common themes across these problems, allowing you to find similar problems within your area of responsibility.

For each of these problems, the analytics technique used will be identified, along with some idea of the sample data used and the business value of the analytics output. Once you go through these examples, which are presented in a simplistic way that does not require specific industry knowledge, you should start to think along these themes and will easily find opportunities within your organization where analytics models can be tried as a pilot to illustrate their value.

Each of the following examples is a legitimate business problem and is presented here in a structured layout. First, the problem statement is described; in some cases multiple examples are taken from one industry. Then the analytics model and its selection are discussed: which technique is appropriate for the problem statement and why. This part of the example also covers some details of the analytics solution, such as sample data. Lastly, the third section within each example covers how business value is derived from applying analytics to the specific problem statement. None of these examples comes from a formal case study, and they are very simplistic representations of the industry and the problem.

Healthcare

The healthcare industry deals with patients and providers on one side, where disease, diagnosis, and treatment are important, while on the other side it deals with pharmaceutical and healthcare manufacturers trying to solve challenges in disease and lifestyle.

Emergency Room Visit

A patient visits the emergency room (ER), is thoroughly checked, diagnosed, and treated. What is the probability that the patient will be back in the ER in the next three months? This prediction is used to track the treatment efficacy of the ER department.

Analytics Solution

A predictive model is needed that will review the detailed records of the patients who returned to the ER within three months of their original visits and of the ones who did not return. It will use a predictive variable called ER_Return and set it to 1 for all patients who did return and 0 for all patients who did not return. The data preparation would require historical ER data going back three to five years. The grain of the record will be the patient visit, meaning each patient visit will have one record. The variables in the record will be age, gender, profession, marital status, diagnosis, procedure 1, procedure 2, date and time, vitals 1–5, previous diagnosis, previous procedure, current medication, last visit to the ER, insurance coverage, etc.

Note that the data preparation is important here because the vital readings, such as blood pressure, temperature, weight, pulse, etc., all have to be built in such a way that one visit gets one record. This is a requirement for the predictive model. Also, the predictive modeling tool is not being blindly or aimlessly applied to Big Data; rather, the problem statement determines the grain and structure of the data.

The predictive model will take the historical data set, look at all the records that have a 1 in the predictive variable, and find some common patterns of variables. It will then look at the records that have the predicted variable as 0 and find some common patterns of variables. Next it will remove the variables that are common to both and identify the variables that stand out in determining the 1 versus the 0. This is the power of discrimination, and each variable gets a metric in terms of its power to discriminate 1 versus 0. All of the variables with their discriminatory powers combined become the predictive model.

Once the model is ready from the historical training data set, it will be tested. We will deal with testing in subsequent sections, but basically 90% of the records can be used to train or build the model and the remaining 10% to test it. The model is run on the 10% of the data, but the predictive variable is not provided; the model is required to assign the 1 or 0 using the knowledge of the variables acquired from the 90% of the data. The model assigns a 1 or 0 to the 10% of the data, and then the results are compared with the actual outcomes, which are known but were withheld when the test data was submitted to the model. For simplicity’s sake, we will assume that if the model assigned 70% of the values correctly, the model is in good shape.
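
For readers who want to see the mechanics, the following is a minimal sketch of this 90/10 train-and-test workflow in Python using pandas and scikit-learn. The file name, the column names, and the choice of a decision-tree classifier are illustrative assumptions only; any predictive modeling tool would follow the same general steps.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical historical ER data: one record per patient visit, with
# ER_Return = 1 if the patient returned within three months and 0 otherwise.
visits = pd.read_csv("er_visits_history.csv")

# One-hot encode categorical variables (diagnosis, gender, etc.) and keep
# ER_Return aside as the predicted variable.
X = pd.get_dummies(visits.drop(columns=["ER_Return"]))
y = visits["ER_Return"]

# Hold out 10% of the records for testing; train on the remaining 90%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42)

model = DecisionTreeClassifier(max_depth=5)   # any classifier could stand in here
model.fit(X_train, y_train)

# Score the withheld 10% and compare against the outcomes that were held back.
predictions = model.predict(X_test)
print("Share of test records classified correctly:", accuracy_score(y_test, predictions))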

Once the model is tested, as a new patient is about to be released from the ER, his or her record is run through the predictive model. The model will assign a 1 or a 0 and return the record. This determination of a 1 or 0 comes with a degree of certainty, i.e., a probability; the degree of certainty will be addressed in detail in subsequent chapters. A 1 means the patient is expected to return to the ER in the next three months, and a 0 means otherwise. If a 1 is assigned, the decision strategy kicks in and an outpatient clinic nurse is assigned the case depending on the disease and treatment. The nurse would be responsible for following up with the patient regularly to encourage healthy behavior and discipline in dietary and medication schedules. The nurse may also schedule the patient for an examination at an outpatient clinic after one month. All of this ensures the patient does not overburden the healthcare system with another ER visit if it can be avoided by managed care.

Patients with the Same Disease

A different problem within healthcare is the analysis of a disease to understand the common patterns among patients who suffered from the disease. This is used by drug manufacturers and disease control and prevention departments. The problem is to identify common patterns among a large group of patients who suffered from that same disease.

Analytics Solution

In this scenario all the patients under consideration contracted the disease, so we cannot use the predictive variable with the 1 and 0 approach; that is, we are not trying to compare two sets of patients, one with the disease and one without. This requires a clustering solution. We would build the patient data set for the ones who contracted the disease, including variables like age, gender, economic status, zip code, presence of children, presence of pets, other medical conditions, vitals, etc. The grain of the data prepared for this will be at the patient level, meaning one record will represent one patient. The clustering software will take this data, create clusters of patients, and assign a unique identifier to each cluster (the name or label of the cluster). In addition to the name, it will provide the details of the variables and their value ranges within the cluster. For example, cluster 1 has ages from 26 to 34, while cluster 2 has ages from 35 to 47; cluster 1 is 60% male and 40% female, while cluster 3 is 30% male and 70% female.

Ten is a common number of clusters that the clustering software is asked to build, but sometimes it cannot find 10 clusters, which means either more variables need to be added or the required number of clusters should be reduced. Clusters where variables have overlapping ranges (say, cluster 6 has the age range 24–34) indicate a strong relationship; here, cluster 1 and cluster 6 would be strongly related.
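
A minimal sketch of this clustering step follows, using scikit-learn's KMeans on a hypothetical patient file; the column names, the choice of 10 clusters, and KMeans itself are assumptions standing in for whatever clustering software is actually used.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical data: one record per patient who contracted the disease.
patients = pd.read_csv("disease_patients.csv")
features = pd.get_dummies(patients[["age", "gender", "economic_status", "num_children"]])

# Scale the variables so no single one dominates the distance calculation.
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
patients["cluster"] = kmeans.fit_predict(scaled)

# Profile each cluster: value ranges per variable, e.g., ages 26-34 in cluster 1.
print(patients.groupby("cluster")["age"].agg(["min", "max", "count"]))
print(patients.groupby("cluster")["gender"].value_counts(normalize=True))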

The purpose of building clusters is to break a problem down into manageable chunks so that different types of approaches can be employed on different clusters. Clustering is most useful when a clear line of attack on a problem is not identifiable. In this scenario, various types of tests and treatments will be applied to different clusters and the impact analyzed. Also, if the patient population is quite large, treatment research requires some mechanism to break the population down to a manageable size. Instead of selecting random patients or the first 100 on a list sorted by age, clustering is a better mechanism, as it finds likeness among the population of a cluster and therefore the variation in the data is evened out.

Customer Relationship Management

One of the most common and most frequently quoted applications of analytics is within the CRM space. Direct marketing or target marketing requires analyzing customers and then offering incentives for additional sales, usually coupons such as 20% off or buy one get one free. There is a cost associated with these promotions, and therefore they cannot be offered to all customers. There are two distinct problem statements within CRM: segmenting the customers into clusters so specific incentives can be offered, and then offering those incentives only to customers who have a high propensity to use them for additional sales.

Customer Segmentation

Break down the entire customer database into customer segments based on the similarity of their profile.

Analytics Solution

The grain of the data prepared for this problem will be the customer, meaning one record will represent one customer. The sales histories of individual customers have to be structured in such a way that all prior sales history is captured in a set of variables. Additional variables can be age, gender, presence of children, income, luxury car owner, zip code, number of products purchased in a year, average amount spent on products in a year, distance from the nearest store, online shopper flag, etc. The clustering software will create 10 clusters and provide the specific values and value ranges for the variables used in each cluster. It may turn out that a cluster is found of frequent buyers who live in the same zip code as a big shopping mall. This is a finding from the clustering that may not have been evident to the marketing team without it. They can now design campaigns in the areas surrounding malls where there is a store.

The value here is the identification of an interesting pattern and then the exploitation of that pattern for a creative campaign. Direct marketing campaigns have a fixed budget, and if an interesting cluster has a population larger than the budget can afford, more variables may get added to find smaller clusters through an iterative process. It is also possible that the cluster variables and their ranges are not very useful, and therefore more iterations are needed with increased historical data, such as going from one year to three years, or with additional variables added to the data set. Typically, once good clusters are found that can be successfully exploited, the marketing teams rely on those clusters again and again and may not need additional clusters for seasonal campaigns.

Propensity to Buy

Once a coupon is sent, the direct marketing teams are required to track how many coupons were actually utilized and what the overall benefit of the campaign was. Higher coupon utilization means a more successful campaign. It is, therefore, desirable to send coupons only to customers who have a higher propensity to use them. The problem statement therefore becomes: What is the probability that the customer will buy the product upon receiving the coupon?

Analytics Solution

This is a predictive modeling problem: from the entire population that was sent the coupon, some used it and some didn’t. Let’s assign a 1 to the people who used it and a 0 to the people who didn’t, and feed the data into a predictive modeling engine. The grain of the data would be customer and coupon (whether used or unused), and sample variables would be customer age, demographics, year-to-date purchases, departments and products purchased, coupon date, coupon type, coupon delivery method, etc. The modeling engine will use 90% of the data, look at the records with a 1 and try to find the variables and their common patterns, then look at the 0 records and find the variables and common patterns. Next it will combine the two sets of variables, try to determine the variables with high discriminatory power, and come up with a fully trained predictive model. The model will get tested using the remaining 10% of the records; testing and validation will be covered in detail in Chapter 4. The model will take one record at a time, assign a 1 or a 0, and return the predicted value. It can also return a probability as an actual percentage. Internally, the model always calculates the probability; it is up to the modeler to set a cutoff, for example, every record with higher than a 70% probability is assigned a 1 and 0 otherwise.

The value of the predictive model here is the reduction in coupon cost, since coupons are only sent to customers with a higher probability to buy. If the model returns a 1 or 0, the coupons are simply sent to the customers with a 1. If it turns out that coupon budget is left over or the customers with a 1 are small in number, then the probability threshold can be reduced. This is where decision strategy comes into play. In this scenario, it is better to get the probability as an output from the model and then let the strategy decide what to do based on the available budget. Maybe the customers with a higher probability of use (say, greater than 75%) will get a 20% off coupon and customers with a probability of use between 55% and 75% will get a 30% discount coupon.
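
Such a decision strategy can be a few lines of code sitting on top of the model's probability output; the cutoffs below simply restate the illustrative numbers above and are not a recommendation.

def coupon_decision(probability_to_buy):
    """Map the model's probability output to a coupon offer (illustrative cutoffs only)."""
    if probability_to_buy > 0.75:
        return "20% off coupon"        # likely buyers need a smaller incentive
    elif probability_to_buy >= 0.55:
        return "30% off coupon"        # on-the-fence customers get a deeper discount
    else:
        return "no coupon"             # low propensity: not worth the campaign cost

# Example: probabilities returned by the propensity model for three customers.
for p in (0.82, 0.61, 0.30):
    print(p, "->", coupon_decision(p))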

Human Resource

The use of analytics in HR departments is not as widespread as in industries like finance or healthcare, but that needs to change (Davenport, 2010). In the recruitment department, a predictive model can predict which employees are likely to leave, so the recruiters can start soliciting more candidates through their staffing provider network. An excellent case study on this topic shows how Xerox Corporation is hiring (Walker, 2012). In the benefits department, a predictive model can predict whether certain benefits will get utilized by employees, and the ones not likely to be used can be dropped. Similarly, employee satisfaction or feedback surveys also provide interesting insights using clustering to see how employees are similar, so that compensation, benefits, and other policies can be designed according to the clusters. This is an open field of investigation and has a lot of room for innovation and growth.

Employee Attrition

What is the probability that a new employee will leave the organization in the first three months of hiring?

Analytics Solution

This is a predictive model problem, since every employee who left the organization within three months will get a 1 and every employee who stayed beyond three months will get a 0. The grain of the data prepared for this model will be at the employee level, meaning one employee gets one record. The variables used can be the employee’s personal profile and demographics, educational background, interview process, referring firm or staffer, last job details, reasons for leaving the last job, interest level in the new position, hiring manager, hiring department, new compensation, old compensation, etc. It is as much a science as an art to work out what variables can be made available; for example, “interest level in the new position” is a tricky abstract concept to convert into a variable with discrete values. Chapter 4 shows how this can be done. Again, 90% of the data will be used to build the model.

The predictive modeling software will look for common patterns of variables with records that have a 1 in the predicted variable and the same process for records with a 0. Next, it will combine the two sets of variables and try to determine the variables with high discriminatory power and come up with a predictive model fully trained. The model will get tested using the remaining 10% of the records.

A critical mass of historical employee data has to be available for this to work. An organization with fewer than 100 employees, for example, may not be able to benefit from this approach. In that scenario, models built in larger organizations can be used in smaller organizations, provided there are certain similarities in their business models, cultures, and employee profiles.

Once the model is tested and provides reasonably accurate results, all new employees will be run through the model. This can be a scheduled exercise as new employees get hired on a regular basis in large organizations. As the predictive model assigns a probability of leaving to an employee, the HR staffer works with the hiring manager to ensure employee concerns are properly addressed and taken care of, ensuring the employee has a rewarding experience and becomes a valuable contributor to the team.

Resumé Matching

Another interesting HR problem that can be solved with classification is resumé matching. As defined in Chapter 1, text mining is a special type of data mining where a document is run through text mining software to get a prediction of the class of that document. A text mining model learns from historical documents where a classification has been assigned, such as legal, proposal, fiction, etc., and then uses that learning to assign classes to new incoming documents. In the case of HR, the problem statement is: What is the probability that this resumé is from a good Java programmer?

Analytics Solution

Going through the troves of documents with assigned classifications, the text mining software learns the patterns common to each class of documents. Then a new document is submitted and the text mining software tries to determine its class. It predicts the class and its probability; sometimes it will return multiple classes, each with an assigned probability, depending on how the software has been configured. For this particular HR problem, the existing known resumés in the HR recruitment database should be labeled according to their class, such as business analyst, database developer, etc. Sometimes multiple classes can be assigned to the same resumé. Once the model is trained, all incoming resumés are assigned a class, and in some cases multiple classes. Next, we take job descriptions and run them through the text mining software so it can learn the job descriptions and their labels. A decision strategy then looks for the highest matching classes between a resumé and a job description and returns the results.
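
As a toy illustration of the classification step only, the sketch below uses a TF-IDF representation and a logistic regression classifier from scikit-learn in place of dedicated text mining software; the resumé snippets and labels are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled resumés (classes assigned by the recruitment team).
resumes = ["java spring hibernate microservices rest apis",
           "sql server etl data warehouse ssis reporting",
           "requirements stakeholder workshops uml process maps"]
labels = ["java developer", "database developer", "business analyst"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(resumes, labels)

# Classify a new incoming resumé and return each class with its probability.
new_resume = ["java concurrency jvm tuning rest services"]
for cls, prob in zip(model.classes_, model.predict_proba(new_resume)[0]):
    print(f"{cls}: {prob:.2f}")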

In the absence of any standards for job descriptions or resumé styles, this is a fairly simplistic description of the problem, but it illustrates an innovative use of text mining. This approach is superior to the keyword-based matching widely used today. It can be tuned and, depending on the capability of the software, matches and degrees of match can be returned with assigned probabilities. Fuzzy matching (which works on “sounds like”) can also be used to standardize terminology in the text before a text mining model is trained. This saves the recruitment staff considerable time sifting through poorly aligned resumés that rank high simply because they use the same keywords. A preprocessing step can be introduced to format and structure the resumés and the job descriptions and separate out sections like education, address and contact, references, summaries, trainings, etc., and then run text mining to classify the similar sections from job ads and resumés.

Consumer Risk

Consumer risk is perhaps the area with the most mature adoption of analytics for business decisions within the banking industry. The consumer lending business includes products like auto loans, credit cards, personal loans, and mortgages. As these products are offered to consumers, the expectation of the lending organization is to make money on interest as the consumer pays the loan back. However, if the consumer fails to pay back the loan, it becomes expensive to recover the loan, and in some cases a loss is incurred. Analytics is widely used to predict the possibility of a consumer defaulting before a loan is approved: a consumer applies for a loan and the lender carries out some due diligence on the consumer and his or her profile. For consumer risk, regression has been the dominant analytics method for almost four decades (Bátiz-Lazo et al., 2010). Later developments in data mining provided an alternative way to achieve the same goals. The following example uses data mining as an option to solve the consumer default risk problem.

Borrower Default

What is the probability that a customer will default on a loan in the next 12 months?

Analytics Solution

The data preparation will assign a 1 to all customer accounts that defaulted in the first 12 months after getting the loan, and a 0 to all customer accounts that did not default. The grain of the data will be at the loan account level, meaning one record will represent one loan account. The variables will be the customer’s personal profile, demographic profile, type of account (credit card, auto loan, etc.), loan disbursed amount, term of loan, missed payments, installment amount, credit history from the credit bureau, etc. The predictive modeling software will look for common patterns of variables in records that have a 1 in the predicted variable and follow the same process for records with a 0 in 90% of the data. Next, it will combine the two sets of variables, try to determine the variables with high discriminatory power, and come up with a fully trained predictive model. The model will get tested using the remaining 10% of the records. The output of the model will be a probability of potential default. Since this is a mature industry for analytics, the probability is typically converted into a score: the higher the score, the lower the probability of default. A lot of lending organizations do not invest in the technology and the human resources required to build models; in the North American consumer lending markets they simply rely on the industry-standard FICO (2012) score.
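
The probability-to-score conversion is commonly done with a log-odds scaling; the sketch below uses illustrative base-score, base-odds, and points-to-double-the-odds values and is not the proprietary scaling used by FICO or any particular lender.

import math

def probability_to_score(p_default, base_score=600, base_odds=50, pdo=20):
    """Convert a probability of default into a score where a higher score
    means a lower probability of default (illustrative scaling only)."""
    odds_good = (1 - p_default) / p_default          # odds of the customer NOT defaulting
    factor = pdo / math.log(2)                       # points needed to double the odds
    offset = base_score - factor * math.log(base_odds)
    return round(offset + factor * math.log(odds_good))

for p in (0.02, 0.10, 0.30):
    print(f"P(default)={p:.2f} -> score {probability_to_score(p)}")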

In addition to consumer lending, a very similar process is used to predict defaults on investment-grade securities. Models built and used by companies like S&P, Moody’s, and Fitch use the same concept to assign a risk rating to a security, and buyers or investors of that security rely on the risk rating.

In this particular case, when the predictive model starts delivering a score or a probability percentage, the decision strategy kicks in. If the lender has a higher appetite for risk, they will increase their tolerance and lend to lower scores as well. Not only does this allow lenders to reduce their risk of default, but analytics also opens up cross-selling opportunities. If a loan applicant comes in at a very low risk of default, additional products can be readily sold to that customer, and a specific strategy can be designed, driven by the predictive model.

Insurance

The insurance industry uses analytics for all sorts of insurance products, such as life, property and casualty, healthcare, unemployment, etc. The actuarial scientists employed at insurance companies have been calculating the probability of potential claims since the 17th century in life insurance. In 1662 in London, John Graunt showed that there were predictable patterns of longevity and death in a defined group, or cohort, of people, despite the uncertainty about the future longevity or mortality of any one individual (Graunt, 1662). This is very similar to today’s clustering and classification approach of data mining tools to solve the same problem. The purpose of predicting the potential for a claim is to determine the premium to be charged for the insured entity, or to deny the insurance policy application altogether. A customer fills out an application for an insurance policy, and until this predictive model is run, the insurance company cannot provide a rate quote for the premium.

Probability of a Claim

Insurance companies make money from premiums paid by their customer base assuming that not all of the customers would make a claim at the same time. The premiums coming in from millions of customers are used to pay off the claims of a smaller number of customers. That is the simplified profitability model of an insurance company. The insurance company wants to know the probability of a potential claim so it can assess the costs and risks associated with the policy and calculate the premium accordingly. So the problem statement would be: What is the probability that a policy will incur a claim within the first three years of issuing the policy?

Analytics Solution

Similar to what we have seen so far, this is a predictive modeling problem: policies that incurred a claim within the first three years will be marked with a predictive variable of 1 and the others with a 0. The data preparation would be at the policy level, meaning one record represents one insurance policy, and the data would include the policyholder’s personal profile, policy type, open date, maturity date, premium amount, payout amount, sales representative, commission percentage, etc. Again, 90% of the data will be used to build the model (also known as training the model) and 10% will be used to test and validate it. The predictive modeling software will look for common patterns of variables in records that have a 1 in the predicted variable and follow the same process for records with a 0. Next it will combine the two sets of variables, try to determine the variables with high discriminatory power, and come up with a fully trained predictive model.

The banking and insurance industries have been using analytics perhaps the longest. However, even they can benefit from the analytics approach presented in this book, which relies on open-source or built-in data mining tools to offset some of their human resource costs and to work with a very large number of variables, which humans typically find difficult to manage using conventional mathematical (linear algebra) or statistical (regression) techniques. In this case, again, the degree of claim probability determines the premium: the higher the probability, the higher the premium. The insurance company can build decision strategies where certain ranges of probabilities have a fixed premium, and more analysis by an underwriter is warranted on probabilities that fall between that threshold and the refusal cutoff.
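
Such a strategy can again be expressed as a simple set of bands over the model's probability output; the cutoffs below are purely illustrative.

def premium_decision(p_claim):
    """Route a policy application based on the modeled claim probability
    (cutoffs are illustrative only)."""
    if p_claim < 0.10:
        return "standard premium"
    elif p_claim < 0.25:
        return "standard premium plus risk surcharge"
    elif p_claim < 0.40:
        return "refer to underwriter for manual review"
    else:
        return "decline application"

for p in (0.05, 0.18, 0.33, 0.55):
    print(p, "->", premium_decision(p))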

Telecommunication

In the telecommunication space, designing free-minute offers and attractive plans for families of all sizes is a challenge that clustering can help solve. Cellular telecoms try to understand the usage patterns of their customers so that appropriate capacity, sales, and marketing can be planned.

Call Usage Patterns

The cellular companies will build clusters of their customers and their usage patterns so that appropriate service, pricing, and packaging of plans can be worked out and managed accordingly. The problem statement is to build clusters of customers based on their similarities and provide the specific variables and their value ranges in each cluster.

Analytics Solution

The grain of the data will be at the customer level, so one record represents one customer. This means that their calls, SMS messages, data utilization, and bill payments all have to be rolled up into performance variables like monthly minutes used, total minutes used, average call duration, count of unique phone numbers dialed, average billing amount, number of calls during the daytime, and number of calls during the nighttime, along with the customer profile, such as age, income, length of relationship, multiple-lines indicator, etc.
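
A sketch of this roll-up from raw call detail records to one record per customer follows, using pandas; the file and column names are hypothetical, and the performance variables shown are only a few of those listed above.

import pandas as pd

# Hypothetical call detail records: one row per call.
calls = pd.read_csv("call_detail_records.csv", parse_dates=["call_start"])
calls["is_daytime"] = calls["call_start"].dt.hour.between(8, 19)

# Roll the calls up to the customer grain: one record per customer.
usage = calls.groupby("customer_id").agg(
    total_minutes=("duration_min", "sum"),
    avg_call_duration=("duration_min", "mean"),
    unique_numbers_dialed=("dialed_number", "nunique"),
    daytime_calls=("is_daytime", "sum"),
    total_calls=("duration_min", "count"),
)
usage["nighttime_calls"] = usage["total_calls"] - usage["daytime_calls"]

# These performance variables can now be joined to the customer profile
# (age, income, length of relationship, etc.) and fed to the clustering software.
print(usage.head())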

The breakdown of customers into clusters is important to understand the customer groups so specific plans can be built for these groups based on their usage patterns determined from the values and ranges of data in variables used for clustering.

Higher Education

Higher education is not known for using analytics for efficiency and innovation. There is limited use of analytics in admissions departments, where Ivy League schools competing for the top talent build a propensity model to predict whether a student, when offered admission, will accept and join the school. They would rather offer admission to students who are likely to prefer their school over others, because if the students granted admission end up going to other universities, that usually leaves a space open, since students not admitted in the first round may have already accepted admission elsewhere.

Admission and Acceptance

What is the probability that upon granting admission to a student, he or she will accept it?

Analytics Solution

This is again a classification or prediction problem where applicants rejecting the admission offer will be assigned a 1 and those accepting the admission will be assigned a 0. The grain of the data will be the student application, and some of the variables can be student aptitude test scores, essay score, interview score, economic background, ethnicity, personal profile, high school, school distance from home, siblings, financial aid requested, faculty applied to, etc. As before, 90% of the data will be used to train the model and 10% will be used to test and validate it. The predictive modeling software will look for common patterns of variables in records that have a predicted variable of 1 and follow the same process for records with a 0. Next, it will combine the two sets of variables, try to determine the variables with high discriminatory power, and come up with a fully trained predictive model.

The value of analytics here is that the best available talent interested in the school is actually given the opportunity to pursue their education there. A simple cutoff can be built to determine what range of acceptance probability becomes the cutoff point for offering admission.

Manufacturing

In the manufacturing space, forecasting and decision optimization methods are widely used to manage the supply chain. Forecasting is used on the demand side of the product, and decision optimization is used to maximize the value from the entire supply-chain execution. The use of analytics in manufacturing is destined for hypergrowth (Markillie, 2012) because of several driving forces, such as:

■ Three-dimensional printing that is revolutionizing engineering designs.

■ Global supply chains with a large choice of suppliers and materials.

■ Volatility in commodities and raw material pricing. Procurement and storage decisions are increasingly complex.

■ Customization in manufacturing product specifications for better customer experiences. Customization requires adjustments to design, materials, engineering, and manufacturing on short notice.

While manufacturing will create newer uses of analytics to manage this third industrial revolution (Markillie, 2012) within purchasing, pricing, commodity, and raw material trading, we will use a well-defined problem to demonstrate the application of analytics using data mining. The manufacturing process contains engineering, where the product is designed; production, where the product is built; sales and distribution; and then after-sales service, where the customer interaction takes place for product support. Within the customer support function there is warranty and claims management. It is important for a manufacturing organization to understand warranty claims down to the very specific details of the product, its design, its production, its raw materials, and other parts and assemblies from suppliers. There is a two-part problem here: one is to predict which product sales will result in a claim, and the other is to understand the common patterns of claims so that remedial actions in engineering design and supply-chain operation can be implemented.

Predicting Warranty Claims

What is the probability that the next product off the assembly line would lead to a warranty claim?

Analytics Solution

The solution would require both a predictive model and a clustering model. The predictive model will use 90% of the data set to train the model and then 10% to test it. It will assign a 1 in the predictive variable to records where a warranty claim was paid, and a 0 to all other products sold. The grain for the predictive model is actually quite challenging, because it will need the product specification, the production schedule, the employees who worked the production line, and customer details for those who bought the product. The grain would be at the product level but will have a complex data structure. Another challenge here is that for products where a warranty claim was paid there is usually good detailed data about the defect and about the customer, but where a customer never called for service, the model may not have the pertinent detail on the customer. It is important to have comparable data sets available for any kind of modeling: both the 1 and 0 records should have identical variables, and at least 50% of the values should be populated rather than blank or null. Variables may have to be left out if they are available for the 1 records but not for the 0 records.
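
One way to enforce this comparability is sketched below with pandas: keep only the variables that are populated for at least half of both the claim (1) and no-claim (0) records. The file name, the claim_flag column, and the 50% threshold are assumptions for illustration.

import pandas as pd

def usable_variables(df, target="claim_flag", min_filled=0.5):
    """Return the variables populated in at least `min_filled` of the records
    for BOTH the 1 (claim paid) and 0 (no claim) groups."""
    keep = []
    for col in df.columns:
        if col == target:
            continue
        filled_by_group = df.groupby(target)[col].apply(lambda s: s.notna().mean())
        if (filled_by_group >= min_filled).all():
            keep.append(col)
    return keep

# Hypothetical product/claim data set: one record per product sold.
products = pd.read_csv("products_with_claims.csv")
print("Variables usable for modeling:", usable_variables(products))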

The predictive modeling software will look for common patterns of variables with records that have a 1 in the predicted variable and the same process for records with a 0 in the data. Next, it will combine the two sets of variables and try to determine the variables with high discriminatory power and come up with a predictive model fully trained.

Once the model is ready, new products rolling off the line would get a probability of a claim. The products with a higher probability may need to be moved to a red-tag area and investigated further. Value from such a predictive model is not easily achieved, because on an assembly line thousands of products roll out that are identical in every respect; therefore, a predictive model may not find variables that are different or that have good discriminatory power. This is where art and science come into play, and the list of variables has to be extended by building performance variables or other derived variables using component manufacturers, batch or lot of the raw material, sourcing methods, etc., to find variables able to distinguish the products that had a claim filed from the ones that didn’t. Chapter 4 goes into more detail on performance variables and model development.

Analyzing Warranty Claims

Analyze the warranty claims and build clusters of similar claims to break down the defect problem into manageable chunks.

Analytics Solution

All the claims data will be submitted to clustering software to find similar claims clusters. The grain of the data will be at the claim level, meaning one record per claim, and the variables can include product details, product specifications, customer details, production schedules and line workers’ details, sales channel details, and finally the defect details. The software will try to build 10 clusters (by default) and provide the variables and their values in each cluster. If 10 clusters are not found and only 2 or 3 are, then more variables need to be added. Adding variables, creating variables, and deriving performance variables are covered in Chapter 4.

Depending on the properties of the cluster, it may get assigned to a production-line operations analyst or to the product engineering team to further investigate the similar characteristics of the cluster and try to assess the underlying weakness in the manufacturing or engineering process. This is a classic use of clustering where we are not sure what will be found as a common pattern across the thousands or hundreds of thousands of claims.

Energy and Utilities

Temperature, load, and pricing models are becoming the lifeblood of power companies in today’s deregulated electric utility and generation market. An electric utility used to have its own power generation, distribution, billing, and customer care capabilities. With deregulation in the power sector, power generation is now separate from distribution (the utility), and increasingly the customer care and billing are handled by yet another entity called the energy services company. The utility is now only responsible for ensuring the delivery of power to the customer and charging back the energy services company, which in turn sends a bill to the customer. The energy services company is also responsible for purchasing bulk energy from generation companies. The price of energy is quoted in megawatt-hours (MWh) and keeps changing based on market conditions and trading activities. A power generation company wants to get the highest price for each MWh that it produces, and the energy services company wants the price to be as low as it can be. Usually, a customer pays a fixed rate to the energy services company, but during peak-load periods the cost of generation goes up and so does the price, yet energy services companies have to buy and supply energy to the utility so their customers are not without power. This is presented in Figure 3.1.


Figure 3.1 Deregulated energy markets. Source: Used with permission of Nexus Energy Software, copyright ©2002 ENERGYguide.com, all rights reserved.

The New Power Management Challenge

The utility itself is a neutral party in this new environment, but the generation companies, energy services companies, and customers all have a significant stake, so analytics will be a critical piece of success in this arena. Turning the power off is not an option, and that is covered by the regulators. With these two constants (the utility and the regulators), the other three parties try to tilt the equation in their favor. Power consumption is generally consistent because factories, commercial buildings, and even households have very predictable power-consumption patterns. This consistent consumption pattern is disrupted by weather. Extreme heat will force every customer to crank up their air-conditioning and start consuming more power. If the energy services company doesn’t have the contracted power capacity to handle the additional load, it may have to buy from generators on the open market. The generators (or other entities holding excess power units) will charge higher tariffs, since they may have to tap into additional generation capacity, such as alternative energy sources (wind, solar, etc.), which typically cost more. If the energy services company anticipated the rise in temperature, it may have locked in lower rates already, and the generator picks up the additional cost. The customer may also be getting charged a higher rate for peak times as per their agreement with the energy services firm, generating a nice profit.

Analytics Solution

This is a very complex landscape, and it is evolving as deregulation takes hold in the marketplace and these scenarios become well understood. Different players in this space will use different analytics methods to maximize their value from this equation. The first item is weather forecasting, which of course uses forecasting techniques to track trending and shifting weather down to the next hour. Organizations may rely on the National Weather Service, but usually they need a more granular reading, for example, down to a city block where they may have a couple of large commercial buildings. Next, they have to correlate the weather (in degrees) to load (in MWh). They have to analyze historical data to see the degree of correlation and anticipate the load. Both the power generation and the energy services companies have to do this.
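
A minimal sketch of this weather-to-load correlation step follows, assuming a hypothetical file of hourly temperature and load readings; a simple least-squares line stands in for the more sophisticated forecasting models these companies would actually use.

import pandas as pd
import numpy as np

# Hypothetical hourly history: temperature (degrees) and metered load (MWh).
history = pd.read_csv("hourly_temperature_load.csv")

# How strongly does load track temperature?
correlation = history["temperature_deg"].corr(history["load_mwh"])
print(f"Correlation between temperature and load: {correlation:.2f}")

# A simple linear fit gives a first-cut load estimate for a forecast temperature.
slope, intercept = np.polyfit(history["temperature_deg"], history["load_mwh"], deg=1)
forecast_temp = 98.0   # say, tomorrow afternoon's forecast
expected_load = slope * forecast_temp + intercept
print(f"Expected load at {forecast_temp} degrees: {expected_load:.1f} MWh")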

The generation company has to build a predictive model to assess the probability of having to fire up its expensive power generation plants. It has to manage its costs, raw material, and staffing shifts in line with that probability. The energy services company has to build predictive models of additional factories, commercial clients, and retail customers coming online at the same time the peak load hits, and it has to price its contracts with the generation companies accordingly. It also has to build clustering models to group its customers and their power consumption into “like” groups so pricing plans can be designed and offered tailored to customer needs; this problem is almost identical to telecommunications pricing with peak and off-peak minute utilization plans. Lastly, looking at the load forecast, decision optimization and pricing algorithms have to be used to maximize the profit for energy services companies. The pricing that needs to be optimized is both on the customer side and on the generation side. The generation company has to use decision optimization to factor in oil and coal prices in the global markets along with the cost of plant operation to see what price point works best for it. Both energy services firms and generation firms now have to build a negotiation strategy into their pricing mix.

The new energy sector is a wide-open space for analytics and its applications. As we have seen, multiple models are at work, and the decision strategies are the key component that brings together the outputs from the multiple models and tries to execute an operational decision in near real-time to bid on, price, and buy power contracts. Sometimes excess contracts may have to be sold, as power is a commodity that cannot be stored.

Fraud Detection

Fraud comes in a wide variety of forms, and we will illustrate two examples that specifically use classification (prediction) and clustering techniques to detect fraud. One example is from the public sector, where benefits are provided to citizens and some citizens abuse those benefits. The other example is from banking, with credit card fraud.

Benefits Fraud

Governments provide all sorts of benefits to their citizens, from food stamps to subsidized housing to medical insurance and childcare benefits. A benefits program typically has an application process where citizens apply for the benefits, followed by an eligibility screening where the benefits agency reviews the application and determines eligibility. Once eligibility is established, the payment process kicks in. Fraud occurs at two levels: (1) the eligibility is genuine but the payment transactions are fraudulent, such as someone having two children but receiving payments for four, or (2) the citizen is not eligible because of a higher income level and is fraudulently misrepresenting his or her income. Let’s take the problem statement where fraudulent applicants are identified and that information is used to build a predictive model. So what is the probability that this application is fraudulent?

Analytics Solution

This problem is very similar to the loan application and detecting probability of default. This is a benefits eligibility application and we are trying to find the probability of fraud. The solution would work with all known or investigated fraudulent applications assigned a 1 in the predictive variable and a 0 otherwise. Again, 90% of the data will be used to train the predictive model and 10% of the data to validate the model. The grain of the data would be at the application level, so one record is one application. The predictive modeling software will look for common patterns of variables with records that have a predicted variable of 1 and the same process for records with a 0. Next, it will combine the two sets of variables and try to determine the variables with high discriminatory power and come up with a predictive model fully trained.

As the applications come in, they are run through the model and the model assigns a probability of the application being fraudulent. A decision strategy then determines what to do with that probability, whether it is in the acceptable threshold, in the rejection threshold, or if it needs to be referred to a case manager.

Credit Card Fraud

A lot of us may have experienced a declined credit card transaction where the merchant informs the customer that he or she needs to call the credit card company. This happens because the credit card company maintains profiles of all customers and their spending behavior, and whenever a customer tries to do something that violates the expected behavior, the system declines the transaction. This type of fraud detection is also used in anti-money-laundering approaches, where typical transactional behavior (debits and credits in a bank account) is established and whenever someone violates that behavior, such as by depositing an unusually large amount, an alert is generated. The problem statement is to build a “typical” behavior profile of the credit card customer.

Analytics Solution

This is a clustering problem: the clustering software builds clusters of usage for groups of customers based on their behavior. The grain of the data is at the customer level, but all the purchases on the card are aggregated and stored with the customer record. Variables like average spending on gas, shopping, dining out, and travel; full payment or minimum payment; percentage of limit utilized; geography of spending; etc., are built and assigned to the customer’s profile. Science and art come into play, as you can be very creative in building interesting performance variables like gender-based purchases, shopping for kids’ items, typical driving distances between purchases, etc.

The clustering software will build clusters and assign each customer to its “like” cluster based on the patterns in the data. Once a cluster is identified and assigned to a customer, its variables and their values are established as thresholds in the credit card transaction system. Whenever a transaction comes in that breaks a threshold by more than a certain percentage, a fraud alert is generated, the transaction is declined, and the account is flagged to prevent subsequent use.
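
A sketch of how a cluster profile might be turned into transaction-time thresholds follows; the profile values, the 25% tolerance, and the variable names are all illustrative.

# Illustrative thresholds taken from the customer's assigned cluster profile.
cluster_profile = {
    "max_single_purchase": 400.0,     # typical upper bound for one transaction
    "home_region": "TX",              # usual geography of spending
}

def check_transaction(amount, region, profile, tolerance=0.25):
    """Flag a transaction that breaks the cluster thresholds by more than the tolerance."""
    over_amount = amount > profile["max_single_purchase"] * (1 + tolerance)
    out_of_region = region != profile["home_region"]
    if over_amount or out_of_region:
        return "decline and raise fraud alert"
    return "approve"

print(check_transaction(180.0, "TX", cluster_profile))   # approve
print(check_transaction(900.0, "NV", cluster_profile))   # decline and raise fraud alert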

The value of analytics comes from preventing fraud from happening, protecting customers, and minimizing credit card fraud losses. This type of solution cannot function without integration with the live transaction system, and the clusters also offer insight into the customer’s behavior that can be exploited to offer additional products.

Patterns of Problems

After going through the preceding examples, readers should start thinking about problems they may be able to attack within their own areas of responsibility using built-in or open-source data mining tools that allow solutions to be built economically and efficiently. Each of the problems presented in this chapter is, in reality, far more complex, and commercial software built by specialized companies may bring a lot more sophistication to solving it. Generic models for specific problems are also available from vendors and do not require a training data set, so smaller or newer firms can make use of them. As these smaller firms mature and accumulate significant data volumes along with enough variety in their data sets, a bespoke model can be trained.

The easiest pattern that can be derived and understood from the preceding examples is the prediction problem. Breaking a problem down into a 1 or 0 allows predictive modeling software to work very effectively. If any business problem or opportunity can be broken into a 1-or-0 problem (i.e., the event occurred or the event didn’t occur) and data for both sets is available, predictive modeling works really well and it is easier to demonstrate business value. Another pattern is tied to a future event’s occurrence: if advance knowledge of an event can significantly improve the business’s ability to exploit it, then analytics should be a candidate for solving that problem. However, any problem identified in the business should be worked out all the way to decision strategies, as merely building a model is not going to be enough if it is unaccompanied by innovative ideas on changing the business by adopting the model’s output.

How Much Data

The size of the data available to perform this type of model building is a challenge. How much data is enough? While there is no harm in having more data (actually, the more the better), having insufficient data is a problem. Data insufficiency comes in two parts: not having enough records, and not having enough values in the variables. For the latter, 50% is a good benchmark; that is, at least 50% of the values in a variable should be non-null. The number of records is a trickier problem. As a rule of thumb, with no scientific basis, for a pilot program on customer propensity predictive modeling, data for at least 10,000–15,000 customers should be available, with at least 12–15% of the records being 1s, meaning the event we are trying to predict (the customer bought in response to a promotion) occurred. In the case of web traffic data, even tens of millions of records may not be enough. The data should have enough spread and representation from various population samples, and the data set shouldn’t be skewed in favor of a particular segment.
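
These rules of thumb can be checked mechanically before any modeling starts. The sketch below assumes a pandas data set with a hypothetical bought_flag column as the 1/0 event indicator; the thresholds simply restate the rough guidance above.

import pandas as pd

def data_sufficiency_report(df, target="bought_flag"):
    """Apply the rough sufficiency checks discussed above (thresholds are rules of thumb)."""
    n_records = len(df)
    event_rate = df[target].mean()                       # share of records with a 1
    fill_rates = df.drop(columns=[target]).notna().mean()

    print(f"Records: {n_records} (want roughly 10,000-15,000 or more for a pilot)")
    print(f"Event rate: {event_rate:.1%} (want roughly 12-15% or more)")
    print("Variables below 50% populated:")
    print(fill_rates[fill_rates < 0.5])

# Hypothetical campaign response data set, one record per customer.
customers = pd.read_csv("campaign_history.csv")
data_sufficiency_report(customers)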

Performance or Derived Variables

The number of variables also matters to the quality of a model’s performance. Derived variables can be as simple as converting continuous data into discrete data, or as sophisticated as aggregations that produce entirely new variables. This is covered in more detail in Chapter 4.
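
As a small example of the simplest kind of derived variable, a continuous age can be bucketed into discrete bands with pandas; the band edges here are arbitrary.

import pandas as pd

ages = pd.Series([23, 31, 38, 45, 52, 67])

# Convert a continuous variable into a discrete one by binning.
age_band = pd.cut(ages, bins=[0, 25, 35, 50, 65, 120],
                  labels=["<=25", "26-35", "36-50", "51-65", "65+"])
print(age_band)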
