Chapter 9

Analytics Implementation Methodology


This chapter provides an overview of the methodology that needs to be followed for the successful implementation of an analytics project. Consistent with the theme throughout this book, analytics is not a one-time activity but rather a new business lifestyle: as more and more processes are innovated and customized to customer, product, and operational needs, an analytics-driven impetus into those processes will be a key requirement. Analytics projects therefore have to be built and managed with a structured, methodical approach that is reliable and repeatable. Reliability matters even more for analytics because daily business decisions are made relying on its output.

Analytics projects, analytics technology, people, and the problem space are very different from traditional IT projects. The main differentiating factor is the definition of the problem. In traditional IT development, or even within data warehousing, the problem space is defined by existing processes and existing analyses, metrics, and reports. That provides enough direction for solutions and newer technology applications to address existing issues and accommodate the innovative requirements needed to improve processes. For example, new smartphone- and tablet-based applications that enable field staff to get things done while on the road are a new business process within the same problem space.

In traditional development the problem statement comes either from an existing issue, from business innovation, or from IT demonstrating the possibilities of newer technology. If analytics solutions worked the same way—that is, if analytics were just a new technology for an existing problem space—the following would be some possible scenarios to initiate an analytics project:

■ The business can come forward with a specific need for an analytics solution to an existing obstacle. This would be the traditional way of IT and business engagement, but it is rare outside some well-established problem domains like direct marketing, customer profiling, risk management, supply-chain optimization, and financial trading.

■ The analytics team, learning from a wide variety of problems presented in Chapter 3 or similar case studies and other industry publications, reaches out to business with ideas and tries to form a problem statement.

■ The IT analytics and business subject matter experts play with the existing data and try to find a problem that can be solved using analytics methods.

All three are legitimate ways of initiating analytics projects, but an organization’s approach for democratization of analytics will heavily influence how these scenarios play out.

Centralized versus Decentralized

The first issue that needs to be tackled head-on is whether the analytics approach for any organization is centralized (top-down) or decentralized (bottom-up).

Centralized Approach

In a centralized strategy, all the analytics assets, such as the software, hardware, skilled resources, and other information assets like metadata, business relationships, and project management capabilities, sit under one organization. This approach has many benefits: knowledge sharing, consistency of implementation and support, visibility and clear lines of communication, and, above all, reduced cost, since hardware and software assets are shared across projects. These benefits may be tempting and make a lot of sense, but getting to this centralized state throws up an almost insurmountable set of challenges related to business case and budget approvals.

The premise is that the hardware and software should be procured first; then the team hired and trained; and then the engagement, project management, and development methodology built, so the overall architectural framework is in place before real projects start and deliver value. It is very difficult to find sponsoring executives willing to fight for these kinds of approvals and budgets, and it would be very tricky for an IT executive to pull this off with internal resources alone. So the centralized approach tends to remain stuck in business case and budgetary approval processes.

In addition, the lack of business insight and specialization in the particular business areas that can most benefit from analytics can also become a problem with a centralized approach. Pitching analytics-driven improvement ideas to business executives requires in-depth understanding of the underlying processes and data, and it is not possible for a centralized team to build that capability across the board. Therefore, pretty much the only scenario in which the analytics team can deliver value is when the business comes with a defined problem statement. That limits the usefulness of a centralized team and investment.

Decentralized Approach

A decentralized approach implies that each department or business unit is fully aware of its needs and, understanding how analytics would help, buys the software, procures the hardware, and hires a team to build its solution. The team is owned by the business unit or department and is dedicated to building and improving the analytics-based decisions for that department. Consulting firms and specialists in the business domain may also be hired. This is the most common approach for marketing departments, risk management departments, and other specialized areas within industries like manufacturing, healthcare, banking, and insurance.

The main benefit of this approach is specialization in a business domain and adoption of industry-leading practices for well-defined problems. But all along the intent of this book has been to show how an entire organization can benefit from analytics, not just one or two specialized areas. The decentralized approach cannot be applied to the wider organization because that would imply each department having its own hardware, software, and specialized teams. Departments may benefit from a specialized team and software, but for untested problem areas and new, innovative uses of analytics, this approach will not work because of a lack of specialists in the market and a lack of problem-specific software, not to mention the cost overhead of every team managing its own technology and resources, which it may not be able to fully utilize.

A Hybrid Approach

It should be obvious that a hybrid approach is needed for democratization of analytics, and therefore the methodology being proposed is based on that conclusion. The strength of a centralized technical team, a single enterprise instance of analytics software deployment, and shared hardware infrastructure make a strong case for following a centralized approach. However, business domain–specific analytics business analysts and analytics specialists, closely aligned with the respective business units and departments, are also needed. A matrix approach with dotted-line reporting for these two roles (analytics analyst and analytics specialist) will allow the business units to understand the capabilities and possibilities, while the dual-reporting roles will work closely with the implementation teams to deliver solutions.

Building on the Data Warehouse

The centralized parts of the hybrid approach dealing with technology and data processing skills are available within the data warehouse teams. Looking back to the Information Continuum, a robust data warehouse infrastructure (including its hardware, software, and skilled teams) is a prerequisite before an organization jumps into this hybrid approach for analytics. A data warehouse has the following working pieces already in place:

■ Knowledge of source systems, business processes, and data.

■ Integration with various source systems.

■ An integrated data structure with all the data loaded in one place.

■ Business-specific datamarts where relevant data for each business unit or department is available.

■ Dashboards, metrics, KPIs, and historical perspective of important data being used for decision making per business function.

■ Hundreds of reports that have been designed, developed, tested, and run on a regular basis, so there is a good overall idea of what data is important.

■ Alignment with power users across business functions who are the champions and promoters of increased and innovative information use for running the business.

■ The reports are consumed by all parts of the organization, so there is a clear idea, by department, of who is consuming what.

Figure 9.1 shows the capabilities of a good data warehouse program.


Figure 9.1 DW capabilities.

If we look at analytics projects' data requirements, most of the information they need should already be available with the data warehouse team. The project team does not need to reinvent the wheel; it just aligns with the data warehouse group. Additionally, the data warehouse team has tremendous capability in moving data around, applying data cleansing techniques, aggregating and summarizing data, managing histories, etc., and has a robust and scalable infrastructure in terms of CPU and storage. It may also have data privacy and security controls in place across its data supply chain. Analytics needs all of that; Figure 9.2 is a snapshot of the overlap between the data warehouse capabilities (see Figure 9.1) and the capability needed for an analytics program.


Figure 9.2 Analytics project needs.

Methodology

The analytics implementation methodology, therefore, relies on the data warehouse infrastructure, processes, and technology, and introduces an advanced layer of tools and human skills in analytics modeling and decision strategy as per the Information Continuum. The methodology presented in Figure 9.3 follows a traditional waterfall approach, yet it contrasts sharply with traditional software development. Individual situations may demand adjusting this methodology for specific organizational needs or for specific projects.


Figure 9.3 Analytics project life cycle.

Requirements

Chapter 8 is devoted entirely to requirements gathering, which emphasizes the importance of requirements in an analytics project. The four parts of the requirements gathering, or extraction, process result in a requirements set that includes the following:

■ Problem statement and goal

■ Data requirements

■ Model and decision strategy requirements

■ Operational integration requirements

These categories make up the requirements section of the methodology, and since the details have been covered in Chapter 8, we will not elaborate further here.

Analysis

The purpose of analysis is to describe the requirements in greater detail within the context of existing business processes and data utilization, and to cover all the moving parts and components needed for delivering on the requirements. When gaps are identified and clarifications are needed, the business community is engaged and clarification is sought. The system, process, and requirements analysis allows for a holistic understanding of the overall problem and its potential solution, so the boundaries for design and development can be set. The analysis is also responsible for identifying the easier and the more challenging parts of the project, so appropriate staffing or consultative help can be sought, and general timelines can be established once the analysis is concluded.

The following categories of analysis need further elaboration:

■ Problem statement and goal—analysis

■ Profiling and data—analysis

■ Model and decision strategy—analysis

■ Operational integration—analysis

■ Audit and control—analysis

Problem Statement and Goal—Analysis

This is the most important part of the analysis stage of the methodology, as it determines whether what the business is trying to achieve is even possible given the available technology and skills. The analyst should have a solid handle on the various analytics techniques that will be used to achieve the stated goal. A review of that goal against the technology is also critical: sufficient historical data may not be available, the analytics method to be used may not be present in the technology stack, or the skills and resources may have a gap. The analytics analyst, therefore, has to validate, clarify, and confirm the following.

Can the problem statement and goal in fact be solved by the available analytics methods? Are the methods available in the technology suite that the analytics team has and is there enough experience available within the team to undertake the analytics project? At times, incorrect expectations or misinformation result in broad assumptions about the capabilities of the analytics technology.

Profiling and Data—Analysis

Data requirements are also covered in detail in Chapter 8. The data generated by the business processes may already be available in the data warehouse. In the data analysis stage, data profiling of that data will be carried out both in the operational (source) system and in the data warehouse system. There are two categories of data profiling activity: syntactic and semantic.

Syntactic Profiling

This type of profiling deals with one field at a time, and profiles are established on each field. While it can be carried out on all fields in scope, that may be overwhelming; therefore, this profiling should be limited to data fields of interest. The data fields of interest are those that the business routinely deals with in reports, KPIs, and other performance metrics. Good candidates are low cardinality fields like types, statuses, and codes, or cities, countries, and currencies. Low cardinality means that the total number of possible values in a field is relatively small compared with its occurrences in records. For example, we may have 100,000 customer records, but the field named "Gender" may have only four possible values (Male, Female, Unknown, and Null); therefore, there are many repeats across the 100,000 records. High cardinality fields are usually bad for profiling unless they are numeric and can be aggregated into averages, minimums, and maximums, like volumes, amounts, and counts. The sales amount per sales transaction is a good field for profiling, but a VIN number is a bad one, along with various types of account numbers and IDs.

It is important to understand that profiling is not being done for data-quality or analytical reasons. It is being done to develop better understanding of the data. The quality component should already be covered within the data warehouse program when the data was acquired. This understanding will come in handy when candidate variables are being identified for analytics model development and used in decision strategy designs. The information gathered as part of syntactic data profiling is presented in the following list; not every profiling attribute is collected for every type of field. The profiling can also be done on a sample of the total records and not necessarily on multibillion row tables. Most database systems, reporting and analytical systems, and data mining systems provide some kind of automated capability of data profiling. Advanced profiling tools can also be acquired for this purpose or custom built for Big Data challenges.

1. Minimum. The minimum possible value in the data set for that field. This is not applicable to character type (string) fields but numbers and dates should have this.

2. Maximum. The maximum possible value in the data set for that field. This is not applicable to character type (string) fields but numbers and dates should have this.

3. Mean. Similar to minimum and maximum, the mean is the average and will only apply to numeric values.

4. Median. The value that lies in the middle of a list of possible values. The values have to be sorted and counted, and the median is the value at the center of that list.

5. Mode. The highest repeating value in a list of possible values.

6. Standard deviation. A good measure of how aligned the overall values are to the mean. Higher standard deviation means the values are spread out and the field has high dispersion, while small standard deviation means that the values are closer to the mean.

7. Frequency. Frequency is the count of the number of occurrences of the same value. For a field called Ticket Status, if there are 100 records and 40 are booked, 55 have not been purchased yet, and 5 have been purchased and cancelled, the frequency of the field Ticket Status would look like the following:

Value        Frequency
Booked       40
Not Sold     55
Cancelled    5
Unknown      0

The value for Booked has the highest frequency. For high cardinality fields, the top five values are sufficient for frequency calculation. In this example, we had 100 records and all of them are accounted for since the frequency sums up to 100. If we were looking for the top two frequencies, we would get Booked and Not Sold and that would’ve represented 95% of the data.

Frequency is not limited to any particular data type. Also included are:

1. Distinct values count. The count of distinct (unique) values stored in a field. In the preceding example, the distinct values count for the field Ticket Status is 4. This is equally applicable to numeric, date, and string fields.

2. Maximum length. This only applies to character (string) fields and it is the size of the values in the field, not necessarily of the field itself. A field may have a data type of character (20), but upon reviewing the data it turns out that no value is over 10 characters long. The minimum length is typically not a useful profiling attribute.

3. Null. This covers how many values are actually null, meaning the value is not available. Null counts are also applicable to all data types.

4. Zero. Zeros apply to numeric fields only and counts how many times zero appears as a value.

5. Time series. This is a yes/no profiling attribute and indicates if a field has some time-series value. Sales Amount is a meaningful time-series field since it can be trended over time. Currency Code is not a meaningful time-series field.
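As a minimal illustration of how these syntactic attributes could be collected, the pandas sketch below profiles one field at a time. The function name, the sample Ticket Status data, and the choice of pandas are assumptions for illustration; in practice, the automated profiling features of the database, reporting, or data mining tools already in place would do this work.

import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Compute basic syntactic profiling attributes for a single field."""
    profile = {
        "distinct_values": int(series.nunique(dropna=True)),
        "null_count": int(series.isna().sum()),
        "top_frequencies": series.value_counts(dropna=True).head(5).to_dict(),
    }
    if pd.api.types.is_numeric_dtype(series):
        profile.update({
            "minimum": series.min(),
            "maximum": series.max(),
            "mean": series.mean(),
            "median": series.median(),
            "mode": series.mode().iloc[0] if not series.mode().empty else None,
            "std_dev": series.std(),
            "zero_count": int((series == 0).sum()),
        })
    else:
        # Maximum length of the stored values, not of the declared field size
        profile["max_length"] = int(series.dropna().astype(str).str.len().max())
    return profile

# Hypothetical sample: the Ticket Status example from the frequency discussion above
tickets = pd.Series(["Booked"] * 40 + ["Not Sold"] * 55 + ["Cancelled"] * 5, name="ticket_status")
print(profile_column(tickets))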

The preceding information should be recorded in a formal document that will be used in design input. In addition to profiling attributes for each field, a detailed definition should also be captured that should cover:

■ The business definition of the field.

■ How the field is created as part of the business process.

■ How the field is modified/deleted as part of the business process.

■ How the field is used in the Information Continuum.

Semantic Profiling

The semantic profiling of fields tries to establish their interdependencies and correlations in a business context. Consider a data set with a hierarchy such as Customer → Account → Order → Transactions: the customer has accounts, those accounts are used to place orders, and the orders are then paid for by financial transactions. This type of hierarchy, with counts at each level in the source or data warehouse system, should be tracked. For example, there are a total of 1 million customers in the source system; a total of 3 million accounts (an average of three accounts per customer) have placed 12 million orders in the last three years with a total sales amount of 120 million. This provides interesting correlations within the data, and a relationship is built among these data entities. Then, if one of them is forecasted through analytics, the others can be easily estimated.
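As a rough sketch of how these volume-based relationships can be captured and reused, the snippet below restates the example just given; the forecast number of customers is purely hypothetical and only illustrates how one forecasted entity lets the others be estimated from the recorded ratios.

# Volume-based hierarchy captured during semantic profiling (numbers from the example above)
customers = 1_000_000
accounts = 3_000_000
orders = 12_000_000            # placed over the last three years
sales_amount = 120_000_000

accounts_per_customer = accounts / customers     # 3.0
orders_per_account = orders / accounts           # 4.0
avg_sale_per_order = sales_amount / orders       # 10.0

# If a model forecasts 1.2 million customers (hypothetical figure), the related
# entities can be estimated from the same ratios.
forecast_customers = 1_200_000
est_accounts = forecast_customers * accounts_per_customer
est_orders = est_accounts * orders_per_account
est_sales = est_orders * avg_sale_per_order
print(est_accounts, est_orders, est_sales)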

The organization structure of a corporation is another example of hierarchy, where the corporate legal entity sits at the top, then business units and departments underneath, then functional teams, and eventually the workforce. In this type of hierarchy, just the relationships are important as to which teams are part of which business units, and not necessarily volumes. So there can be a volume-based hierarchy or a structure-based hierarchy; both establish dependencies between fields, which is part of semantic profiling. A brief about historical data availability is also a good profiling attribute that can come in handy particularly when building forecasting models.

Model and Decision Strategy—Analysis

During the problem statement and goal analysis, we covered the analysis of the analytics method and its applicability to the problem. The model and decision strategy requirements are analyzed in this stage in light of the data analysis covered in the previous section. As a general rule, every candidate field to be used in the model has to be reviewed against its data profile. Fields with skewed values are poor candidates; for example, if 90% of the records in the Gender field contain the same value (Male), the field is not a good candidate for model development. Although it may be a field extensively used in decision making within the business community today, the analysis now reveals that it cannot be used in the analytics model.

Similarly, the decision strategies may rely on decision variables (see Chapter 5) to break down the decisions and spread the workload across a larger workforce, and if the decision variable (the field on which segmentation is performed) is skewed, then most of the workload may go to the same queue. Let's use the example of a loan collection system, as sketched below. A predictive model can assign a probability of "collectability" to all accounts in collection. The business decides that every account over 70% collectability will be assigned to the internal collection staff, while the lower collectability accounts will be handed over to a third-party collections agency. If the collection department has 20 analysts, the cases with over 70% collectability need to be further segmented on additional variables. Usually this input comes from the business, but after the data profiling it has to be analyzed again to see how the decision variables are spreading the data.
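A minimal sketch of this collections strategy follows, assuming a hypothetical scored-accounts table where the model output is a collectability probability and the outstanding balance serves as the additional decision variable; the 70% cutoff comes from the example, while the balance threshold is an assumption the data profiling would need to validate.

import pandas as pd

def assign_collections_segment(row: pd.Series) -> str:
    """Segment scored collection accounts: low scores go to an outside agency,
    high scores are split further on a decision variable (balance) so the
    internal workload does not all land in one queue."""
    if row["collectability"] <= 0.70:
        return "third_party_agency"
    if row["balance"] >= 10_000:      # hypothetical threshold from data profiling
        return "internal_high_balance"
    return "internal_standard"

# Hypothetical scored accounts coming out of the predictive model
accounts = pd.DataFrame({
    "account_id": ["A1", "A2", "A3"],
    "collectability": [0.91, 0.55, 0.74],
    "balance": [12_500, 800, 3_200],
})
accounts["segment"] = accounts.apply(assign_collections_segment, axis=1)
print(accounts)

If the profiling shows that the balance itself is skewed, a different decision variable or different cut points would be needed, which is exactly the review this analysis step performs.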

Nulls and zeros are also important, as they have limited use in model and strategy decisions and have to be taken back to the business to explain and seek alternatives.

Operational Integration—Analysis

The analysis around operational integration is very similar to a traditional system integration analysis where systems, their interfaces, their technologies, and their timing are analyzed, understood, and documented. Typical requests from the business are for real-time availability of information in other systems, but real-time integration has a cost associated with it; therefore, the business has to be probed on why real-time integration is necessary. In the case of the preceding collections example, a nightly integration from the analytics system into the collections workflow and queuing system may be sufficient, since workers come in the next morning to preloaded work queues.

For decision strategy integration into the operational system, the following key points have to be analyzed and reviewed with IT and business:

■ What is the event that would trigger the analytics module and the decision strategy to come into action? A new insurance policy request, a new phone connection, a missed payment, an insurance claim submission, a new order booked, etc. are all business process events that can trigger an analytics module and a decision strategy.

■ What is the data that will be passed to the analytics module? How complete is the data going to be at that time? It may be that waiting a day on the data allows for more complete information to be available in the system, for example, a field staff enters data manually toward the end of their trip.

■ Is a final decision being passed back after the strategy or just the analytics output? In case of FICO scores for consumer lending, typically a score is received from the credit bureau (e.g., Experian) and then the decision strategy is built into the operational system on that score.

■ Decisions received or computed by the operational system have to be stored somewhere; the analysis will highlight this aspect of the analytics solution. The storage of the results can be within the operational system or outside it, but the analysis has to document the details for designers.

Audit and Control—Analysis

This brings us to the last aspect of analysis: auditability and control. The business will require some ability to audit the historical decisions recommended by the analytics solution. How exactly will that data be recorded, where will it be kept, and how will the auditors get to it? Does it fulfill internal control requirements or industry-specific requirements as well as those of regulators? Who exactly updates the audit information and how is it made tamper-proof? Since the analytics-based decisions may need to be reviewed from an efficiency perspective (how effective is the analytics module) as well as from an audit perspective, should this data be stored in one place or two? This is the line of questioning the analyst has to follow. It may be tough to get answers to these questions, depending on how mature an organization is in using analytics for its day-to-day decisions. The analyst, therefore, will have to educate and build consensus across departments about how to audit the automated decisions taken on the output of analytics models.

Design

Similar to the analysis phase, all the various categories that were analyzed now need to be designed. The design of an analytics solution has six components:

1. Data warehouse extension

2. Analytics variables

3. Analytics datamart

4. Decision strategy

5. Operational integration

6. Analytics governance—audit and control

Data Warehouse Extension

Based on the data requirements and the analysis of the data fields, one of the gaps to be designed for is data that is not yet available in the data warehouse. The design for this additional set of data from various source systems should follow exactly the process and methodology already in place for data warehousing. The data warehouse team brings in thousands of fields from various source systems and has the technology and skill to do the same for additional data of interest to analytics. The new data adds to the existing data structure, and the analytics project team will not have to go through the learning curve of accessing source systems, staging the extracted data, scheduling nightly batch jobs, or learning the ETL tools, technology, and database infrastructure. The data warehouse extension design will have a data modeling piece and an ETL piece that will be responsible for bringing the data in and keeping it up to date on a scheduled basis as part of the overall data warehouse maintenance process.

Analytics Variables

The variables are where the art and science come together. What creative variables can you invent that may have a high analytic value? There is a difference between variables and fields. In data warehouse systems, when an analyst goes to a source system to look for data that is needed by the business for reporting and analysis, two best-practice rules are followed:

■ The first rule deals with trying to get the most detailed level or atomic data possible. If the business has asked for daily sales data per customer per store, the source system usually has the line-item details of what the customer bought and how much it was. The most granular data would be at the line-item level in the basket of the customer, and that is recommended to be sourced and pulled into the data warehouse.

■ The second rule deals with a principle called triage when sourcing data. The data that is needed by the data warehouse driven from business requirements is priority 1. Then there are operationally important fields like CHNG_UID, which is a field that captures the user ID of the user who last changed a record; these are priority 3 fields. Everything in between is priority 2. The best practice is to pick up everything with priority 1 and 2 while you are in there building an extraction process. It may come in handy later.

These rules are why the data warehouse is likely to have more fields than seem necessary for analytics use. Going back later to get additional fields is far more difficult than keeping the additional data around just in case. The analytics project can actually get lost in this large river of fields and not know which ones would be valuable. Following are the four kinds of variables that help sort through the list of fields, along with an explanation of how to use them, because their treatment changes through the project life cycle.

Base Variables

Base variables are the important fields present in the data warehouse within the analytics project’s scope. If the project’s scope is to build a sales forecasting model, then the business context is sales, and therefore all the fields in the data warehouse within or linked to sales are potentially base variables. If the sales department’s users access certain dimensions and facts for their normal business activities, then everything in those dimensions and facts is potentially a base variable. Examples are store types, customer age, product weight, price, cash register type, employee code, etc.

Performance Variables

Performance variables are specialized variables created for analytics models. If the project's problem statement is a predictive model that calculates the probability that a customer will use a coupon to buy something (propensity modeling), or the probability that an order will ship on time, then it will need the base variables (formatted and transformed) as well as some new, innovative variables that may have high predictive power for the problem statement. The base variables are raw data and usually have continuous values, like the age of a customer. Performance variables are preferably coded variables. So if the customer records have ages such as 28, 33, 21, 67, 76, 45, 55, 68, 23, etc., then coded values would replace the age values as Code 1 (ages less than 21), Code 2 (ages from 21 to 38), and so on, with the last one being Code n (age greater than 100), as sketched below. This way the age distribution, its frequency, and its role in the predictive model can be analyzed and tuned.
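A minimal sketch of coding the continuous age field into a performance variable follows; the bin edges mirror the ranges in the text but are assumptions that would be tuned against the data profile.

import pandas as pd

# Raw base variable: customer ages as they appear in the data warehouse
ages = pd.Series([28, 33, 21, 67, 76, 45, 55, 68, 23], name="age")

# Replace the continuous values with coded bands (Code 1, Code 2, ...);
# the edges here are illustrative and would be tuned from the data profile.
age_code = pd.cut(
    ages,
    bins=[0, 20, 38, 55, 75, 100, 150],
    labels=["Code 1", "Code 2", "Code 3", "Code 4", "Code 5", "Code 6"],
)
print(pd.concat([ages, age_code.rename("age_code")], axis=1))
print(age_code.value_counts())        # distribution of the coded performance variable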

Other performance variables could be as follows:

■ Total sales of grocery items

■ Total number of year-to-date transactions

■ Percentage when credit card is used for payments

■ Online user (yes/no)

■ At least one purchase of >$100 (yes/no)

These performance variables are designed by looking at the problem statement. There is no well-defined method for what should be a performance variable. In established problem domains like customer analytics (within marketing) and consumer credit (risk management), the analytics experts know which performance variables are going to be important, but in other domains like education, shipping and logistics, and state and local government, the performance variables have to be worked out through trial and error over time. Some may become useful and some may not have any predictive value. Chapter 11 deals with this in greater detail. However, they have to be designed so they can be implemented, and therefore they are covered in the design stage.

Model Characteristics

The variables (base plus performance) that end up being used in the actual model get a promotion and are labeled characteristics, as they will get weights, scores, and probabilities assigned to them in the model. While at design time it may not be known exactly which variables will become characteristics, there should be a provision to store some characteristics as input and to store the model's output alongside them. This is important for tuning and tracking the results of the model. The design has to accommodate the characteristics, their creation from transactional data, and the analytic output (predicted, forecasted, optimized, or clustered).

Decision Variables

The decision variables were covered in detail in Chapter 5. These variables are used in decision strategies once the model output has been assigned to a record. In the case of manufacturing and prediction of warranty claims, if the model assigns a 63% chance of a warranty claim to a product just being shipped, there may be 10,000 products at 63% or higher, so not all can be kept from shipping and not all can be inspected. Therefore, additional filtering is needed to separate the 10,000 products into smaller, more meaningful chunks against which actionable strategies can be carried out. For example, if the product is part of a larger shipment, then let it go as long as the prediction is greater than 63% but not greater than 83%. This additional filtering is called segmentation in a decision strategy, and the variables used to apply it are called decision variables. A key part of working with decision variables is establishing their thresholds for actions. In this example we have said "if the product is part of a larger shipment," but what does "larger" really mean: 100, 500, or 1,200? These thresholds cannot be set arbitrarily, and there is some structure to setting these segmentation values, as sketched below.
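One way to give these thresholds some structure is to anchor them to the observed distribution of the candidate decision variable rather than picking a number out of the air; in the sketch below, the shipment sizes and quantile choices are hypothetical.

import pandas as pd

# Hypothetical shipment sizes for products flagged by the warranty-claim model
shipment_size = pd.Series([40, 75, 120, 260, 510, 640, 880, 1_150, 1_400, 2_300])

# Rather than guessing what "larger shipment" means (100, 500, or 1,200?),
# derive candidate cut points from the data itself.
thresholds = shipment_size.quantile([0.50, 0.80]).to_dict()

def shipment_segment(size: float) -> str:
    if size >= thresholds[0.80]:
        return "large"
    if size >= thresholds[0.50]:
        return "medium"
    return "small"

print(thresholds)
print(shipment_size.map(shipment_segment).value_counts())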

Analytics Datamart

The analytics datamart is a specialized database designed to handle all types of variables, their histories, their revisions (in business rules and calculations), and their structural dependencies (parent–child). The input into the model, test data, training data, output and validation results, and the model generation are all maintained in this datamart. The label “datamart” should not be confused with a dimensional star schema type mart (Kimball, 2002); datamart is used to refer to a specialized subject-area focused collection of relevant data. Since it hangs off the data warehouse, it has been labeled as a datamart. Its structure is very different from that of a traditional datamart and more innovative and out-of-the-box type designs may be necessary to meet its functional purpose. An experienced data modeler should be able to fully understand the purpose of the data in this collection and model to meet the requirements.

Decision Strategy

The decision strategy design is covered in detail in Chapter 5; therefore, only a brief guide to its design is covered here. If the decision strategy requirements are detailed enough and all metrics, decision variables, thresholds, filtering conditions, and actions against strategies are clearly articulated, then there is really not much to design. If a strategy design tool is available, then even the implementation part is simplified. The design step has to ensure that the business rules are defined quantitatively: if statements like "if the price is too high…" appear in the requirements, the design step will force the business to pick a number for how much is "too high." If the business doesn't know, then the data profiling carried out during the data analysis will have to be shared so they can determine a quantitative equivalent of "too high" (e.g., greater than 4,000).

Operational Integration

The design of the operational integration has three pieces:

1. Strategy firing event

2. Input data

3. Output decision

Strategy Firing Event

This is the event that will cause the strategy to be fired, meaning executed on a data set. There are two types of strategy events:

1. Business process event. These are real-world business process events that can fire a strategy. An example is a loan application. An online loan application is submitted from a bank’s website and the data is passed to the analytics model for default prediction. Based on the probability of default, the strategy is fired to see what kind of pricing, fees, and other terms can be offered if the case is approved at all. There is no controlling when such an event occurs and strategy execution is therefore in real time, driven from the loan application event generated by the customer who visited the bank’s website.

2. Scheduled events. These are batch schedules where large numbers of transactions are passed through the analytics model and the decision strategy, and all of them are assigned an action and activity (the decision). An example would be a business scenario where transactions come in throughout the day but processing and work assignment start the next day. Take college applications, where the admissions staff wants to focus on applications with a higher probability of acceptance: applications pour in on the last day of the deadline, a nightly schedule assigns each of them a probability, and the decision strategy breaks the applications into buckets that are reviewed the next morning by the staff. The same goes for insurance claims, where a fraud probability is assigned to incoming cases and senior claims processors deal with the higher fraud probability claims the next morning, after a decision strategy decides how to assign the case workload.

Both of these scenarios need to be understood and the integration is designed accordingly with the operational system.
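The sketch below contrasts the two trigger types; score_and_decide() is a hypothetical stand-in for the analytics model plus decision strategy, and in a real deployment these hooks would sit behind whatever messaging, service bus, or batch scheduling infrastructure is already in place.

def score_and_decide(transaction: dict) -> dict:
    """Hypothetical stand-in for the analytics model plus decision strategy."""
    probability = 0.63                      # placeholder model output
    decision = "inspect" if probability >= 0.60 else "ship"
    return {"probability": probability, "decision": decision}

# 1. Business process event: invoked as soon as the event occurs,
#    e.g., an online loan application or a newly booked order.
def on_business_event(transaction: dict) -> dict:
    return score_and_decide(transaction)

# 2. Scheduled event: a nightly batch scores everything that accumulated
#    during the day and preloads the next morning's work queues.
def nightly_batch(transactions: list) -> list:
    return [score_and_decide(t) for t in transactions]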

Input Data

The input data is the definition, format, and layout of the data that is passed to the analytics solution (model and decision strategy) by the operational system where the event-based or schedule-based transactional data resides. Careful attention has to be paid while mapping the variables in the model and decision strategy back to the data in the operational system. There will certainly be a need for transformation of the data coming from the operational system, and it is up to the designer to perform the data preparation (from fields to input variables and characteristics) either in the analytics solution or in the operational system before the data is sent. The data warehouse processes should be leveraged and tightly aligned as much as possible instead of reinventing the wheel.

Output Decision

An output decision is an actionable code that translates into some activity in the operational system. What to do with the output has to be coded in the operational system and, therefore, operational system engineers should be part of this design discussion. When the analytics solution sends an “approved” code back for an insurance claim or a loan application, what should be done within the operational system on that information is a design question requiring close business process input as well as operational system changes. Without this design piece, the decisions may get assigned but never actually carried out.

Analytics Governance—Audit and Control

The audit and control design covers what information is to be stored: what data was sent in, what probabilities or other analytics output (cluster, forecast values, etc.) were assigned, what decisions were determined to be applicable, and how the decisions were carried out. This is not a simple design, and the audit framework has to be understood before attempting to build a transparent analytics solution. Chapter 6 covers this in detail; therefore, it is only covered briefly here. For the most part, once the model and strategy are in production, there is limited manual intervention and oversight of the transactions passing through the solution. The audit function is therefore the only place where things can be tracked, monitored, and thresholds set up so automated decisions can be stopped or investigated. The same information is also needed for strategy and model tuning, so the design for audit and control should accommodate both sets of requirements as far as storing the information is concerned, even though the data may be used differently by different sets of users. Even if the data is kept separately, its structure and processing should be handled consistently within the same design.
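A minimal sketch of the kind of audit record the design could standardize on is shown below; the field names are assumptions, and whether the record lives inside or outside the operational system is exactly the design decision discussed above.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class DecisionAuditRecord:
    """One row per automated decision, usable for both audit and model tuning."""
    transaction_id: str
    input_snapshot: dict                  # data passed to the analytics module
    model_output: float                   # probability, forecast, cluster id, etc.
    strategy_segment: str                 # which branch of the decision strategy fired
    decision: str                         # actionable code returned to the operational system
    action_taken: Optional[str] = None    # filled in once the operational system acts
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Hypothetical insurance-claim example
record = DecisionAuditRecord(
    transaction_id="CLM-1042",
    input_snapshot={"claim_amount": 4_800, "prior_claims": 2},
    model_output=0.78,
    strategy_segment="high_fraud_probability",
    decision="route_to_senior_processor",
)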

Implementation

Implementation deals with the development of the software programs that will process the data. The bulk of the implementation is actually ETL, meaning taking data from one place, manipulating it, and loading it into another place. ETL programs are data-centric and do not involve development of workflows, screens, or user interactions. They are usually SQL and other data cleansing, formatting, and aggregation programs, written in a wide variety of supported languages, that are implemented and operated like a black box. ETL implementation includes sourcing data from source systems or from the data warehouse all the way through to the analytics datamart, as well as the implementation of audit and control data recording, pulling from the integration inputs, outputs, and decision strategy segmentation.

The implementation of the analytics datamart should be like that of any other datamart within the data warehouse, and no specialized tools or technologies should be needed. Therefore, the first two pieces of implementation are the analytics model and the decision strategy; the third one deals with governance. The analytics model is implemented in a data mining tool or in data mining libraries available within the database engine. Specific knowledge of that tool is essential for the implementation, and specialized training is required. Ordinary software development resources will find these analytics programs very difficult to understand and manage. The terminology, the concepts, and the model development steps, along with testing, validation, and performance evaluation, are all very specialized areas, and people are required to have formal training in data mining implementations through undergraduate or graduate coursework in advanced programming and artificial intelligence. Advanced knowledge of statistics can also be very useful but is not required.

The implementation of a decision strategy has several options. It can be implemented in traditional programming environments like Java and .NET; it can be implemented using specialized strategy management tools available from vendors like FICO and Experian; or it can be implemented with business process management (BPM) tools. Even ETL tools are capable of implementing decision strategies at a basic level. The second implementation piece is the integration of the analytics solution and the operational system, and it will be carried out using any supported environments already in place, such as a messaging or service bus integrating the operational systems. The last implementation piece is that of audit and control data access. The audit and control data is loaded using ETL, but then some reporting and analysis is required to set up thresholds. This can be done either in a database development environment or with specialized components available in OLAP tools already in place with the data warehouse, such as a dashboard.

Deployment

As part of the implementation, the analytics model is validated and the business starts to believe in the model's ability to help make business decisions. At this point the model is ready for deployment, so transactions can be passed to it and strategies can be executed on its output. This integration is part of the deployment phase. The model implementation is tackled separately from the decision strategy implementation because of the difference in the technology and toolsets used for each; no data mining tool or library has built-in capabilities for running a strategy.

Once the two are integrated, the combined effect has to be tested and validated on actual data. Since a strategy-based business process activity is usually a new paradigm for the business, it takes a while and a lot of trial and error before strategies perform at a level where business sees value in their output. This step can be quite complicated for larger user base deployments—that is, where a larger number of users are awaiting their work queues to be adjusted based on analytics model output. Areas like price optimization may have an easier time since not many people need to run various pricing models and evaluate output. Once the integrated deployment of the analytics model and strategy is tested, validated, and approved, the operational integration is hooked in for complete end-to-end integration testing of the analytics solution, including the audit and control data recording and review.

Execution and Monitoring

The last step of the analytics implementation methodology is execution and monitoring. Once the solution goes live and automated decisions are carried out using the analytics model and strategies, close monitoring is essential to see when the model may need tuning or decision thresholds need adjusting. Also important is the champion–challenger approach to new decision strategies: the existing strategy is the champion, and a new strategy, the challenger, is tested to see if it improves results. If after some observation the challenger strategy is performing better, it replaces the champion strategy.
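A rough sketch of the champion–challenger comparison is shown below, assuming a performance metric per strategy (for example, amount collected per account) is already being recorded; the minimum lift figure is an assumption the business would set.

def compare_strategies(champion_results: list,
                       challenger_results: list,
                       min_lift: float = 0.05) -> str:
    """Promote the challenger only if it beats the champion by a minimum lift
    over the observation period; otherwise the champion stays in place."""
    champion_avg = sum(champion_results) / len(champion_results)
    challenger_avg = sum(challenger_results) / len(challenger_results)
    lift = (challenger_avg - champion_avg) / champion_avg
    return "promote_challenger" if lift >= min_lift else "keep_champion"

# Hypothetical per-account collection amounts observed under each strategy
print(compare_strategies([120.0, 95.0, 140.0], [150.0, 130.0, 160.0]))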

This is an ongoing process, and the performance data of expired strategies has to be retained so the basis on which each was expired can be reviewed. As organizations get better at building challenger strategies, more innovative and specialized improvements are introduced into the decisions, and business processes become smarter, reacting to specific customer behavioral changes, economic and market shifts, industry and regulatory changes, as well as internal cost and efficiency drivers.
