Chapter 15
IN THIS CHAPTER
Examining your data
Exploring your data's potential problems
Dealing with outliers and fitting data to a curve
Evaluating your technical analysis
Describing the limitations of a predictive model
In the quest for building a predictive model, you'll need to make decisions every step of the way — and some decisions are harder than others. Informed decisions — and an awareness of the common mistakes that most analysts make while building their predictive models — give you your best shot at success.
This chapter offers insights on the issues that could arise when you embark on a journey toward the effective use of predictive analytics. At the outset, consider this general definition:
A predictive model is a system that can predict the next possible outcome, and assign a realistic probability to that outcome.
As you build your predictive model, you're likely to run into problems in two areas: the data and the analysis. This chapter delves into both types of problems to help you strengthen your safeguards and stay on top of your project.
How well you handle both the data and the analysis at the core of your predictive model defines the success of your predictive analytics project. Data issues are more prominent now because big data (massive amounts of analyzable data generated online) is all the rage — and only getting bigger, thanks to explosive growth of data in the digital and social media worlds.
The more data you have, however, the more diverse the cases that generate it. Your job as a modeler is harder if the data you're using contains outliers (extreme cases), because your model must take those extreme cases into account.
Modeling requires that you choose which variables to report on — and that you understand not only the variables but also their impact on the business (for example, how an unusually active storm season might affect a fishery). Right off the bat, you need to consult someone with domain knowledge (expertise in the business) so you can identify those variables accurately. After all, the model should reflect the real world in which the business is operating. The model's predictions should be accurate enough to improve return on investment for the business — by helping its decision-makers answer tough questions and make bold, informed decisions.
Data mining is more than just gathering, generating, or retrieving data. It's a first step in treating data as a resource for your business, and it raises more issues:
Data is at the center of predictive analytics. Data quality is crucial to the effectiveness of the predictions. As a data analyst, you have to get familiar enough with the data to examine its limitations (if any). Here's a quick list of questions to ask about the data:
Data preparation (see Chapter 9 for more details) is tedious and time-consuming. At this stage of developing your predictive model, expertise and tools in data preparation can be essential to the success of your project.
Data preparation can take up to 80 percent of the time needed to build a predictive analytics model. Expect to spend most of your time getting your data ready before you can apply any analysis. Collecting relevant variables, cleansing the data, and preprocessing it (generating derived variables, integrating data from multiple sources, and data mapping) will take the majority of your time.
As with many aspects of any business system, data is apt to have some limits on its usability when you first obtain it. Here's an overview of some limitations you're likely to encounter:
The data could be incomplete. Missing values, even the lack of a section or a substantial part of the data, could limit its usability. Missing data means empty fields for some of the variables you're analyzing. Data that spans multiple seasons may need extra attention: Even if it has no missing values, if it covers only a particular season or phase and doesn't represent the others, it isn't complete. For example, your data might cover only one or two conditions of a larger set that you're trying to model — as when a model built to analyze stock-market performance has data available only from the past 5 years, which skews both the data and the model toward the assumption of a bull market. The moment the market undergoes a correction that leads to a bear market, the model fails to adapt — simply because it wasn't trained and tested on data representing a bear market.
Make sure you're looking at a time frame that gives you a complete picture of the natural fluctuations of your data; your data shouldn't be limited by seasonality.
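As a quick illustration, here's a minimal sketch in Python of checking a dataset for both kinds of incompleteness — empty fields and missing seasons. The records, field names, and dates are invented for the example:

```python
from datetime import date

# Hypothetical daily records; None marks a missing reading.
records = [
    {"day": date(2020, 1, 5), "price": 51.2},
    {"day": date(2020, 1, 6), "price": None},
    {"day": date(2020, 7, 8), "price": 49.8},
]

# Count empty fields for the variable you're analyzing.
missing_values = sum(1 for r in records if r["price"] is None)

# Check seasonal coverage: which months are represented at all?
months_covered = {r["day"].month for r in records}
months_missing = sorted(set(range(1, 13)) - months_covered)

print(missing_values)  # 1 empty field
print(months_missing)  # ten months with no data at all
```

Both checks matter: a dataset can pass the missing-values test with flying colors and still fail the coverage test.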
To determine the limitations of your data, be sure to:
Even after all this care and attention, don't be surprised if your data still needs preprocessing before you can analyze it accurately. In practice, you'll find yourself doing at least one or two of the tasks listed below. Preprocessing often takes a long time and significant effort because it has to address several issues related to the original data, including:
Your data may contain outliers — extreme values that are often significant but atypical. You'll have to deal with these outliers — maintaining data integrity without affecting your predictive model negatively. It's a key challenge.
Part of your data preparation is to determine whether the values and data types of your data are what you expected. Check for
Be sure to check carefully for outliers before they influence your analysis. Outliers can distort both the data and the data analysis. For example, any statistical analysis done with the outliers left in place ends up skewing the means and variances. Some techniques, such as decision trees, are more robust to outliers than others.
Unchecked or misinterpreted outliers may lead to false conclusions. Say your data shows a stock that traded above $50 for a whole year — except for a few minutes during which it traded at $20. The $20 price — an obvious exception — is the outlier in this dataset.
Now you have to decide whether to include the $20 stock price in your analysis; if you do, it has ramifications for the overall model. But what do you consider normal? Was the “flash crash” that took the stock market by surprise on May 6, 2010 a normal event or an exception? During that brief time, the stock market experienced a sharp decline in prices across the board — which knocked the sample stock price down from $50 to $20, but had less to do with the stock than with wider market conditions. Does your model need to take the larger fluctuations of the stock market into account?
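A common first pass at flagging a price like that $20 print is a z-score test. Here's a minimal sketch with made-up prices; the 3-standard-deviation threshold is a widely used convention, not a hard rule:

```python
import statistics

# Simplified trade prices: a stock trading above $50 all year,
# plus one anomalous $20 print.
prices = [51, 52, 50, 53, 51, 52, 50, 51, 52, 51, 50, 52, 51, 50, 53, 51, 20]

mean = statistics.mean(prices)
stdev = statistics.stdev(prices)

# Flag anything more than 3 standard deviations from the mean.
outliers = [p for p in prices if abs(p - mean) > 3 * stdev]

print(outliers)  # [20]
```

Note that flagging the value is the easy part; deciding what to do with it is the judgment call discussed here.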
Anyone who's lost money during brief moments of market free-fall considers those few minutes real and normal (even if they felt like an eternity). A portfolio that diminishes in milliseconds due to a rapid, albeit short-lived, decline is clearly real. Yet the flash crash is an anomaly, an outlier that poses a problem for the model.
Regardless of what's considered normal (which can change anyway), data sometimes contains values that don't fit the expected values. This is especially true in the stock market, where virtually any event may send the market flying or plunging. You don't want your model to fail when the reality changes suddenly — but a model and a reality are two different things. As a data scientist, you should strive to build models that account for the unexpected and be able to address it in a way that strengthens the business.
When we rely on technology or instrumentation to perform a task, a glitch here or there can cause the instruments to register extreme or unusual values. Sensor readings that fail to meet basic quality-control standards produce real disruptions that show up in the data.
Someone performing data entry, for example, can easily add an extra 0 at the end of a value by mistake, taking the entry out of range and producing an outlier. If you're looking at observational data collected by a water sensor installed in Baltimore Harbor — and it reports a water height of 20 feet above mean sea level — you've got an outlier. The sensor is obviously wrong unless Baltimore is completely covered by water.
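Errors like these are usually easy to catch with a simple range check. In this sketch, the plausible bounds are illustrative assumptions, not real harbor specifications:

```python
# Plausible physical bounds for water height (feet above mean sea level).
# These limits are assumptions chosen for illustration.
MIN_PLAUSIBLE, MAX_PLAUSIBLE = -5.0, 10.0

readings = [1.2, 1.4, 20.0, 1.3, 1.1]  # one entry error or sensor glitch

# Flag any reading outside the plausible physical range.
flagged = [r for r in readings if not (MIN_PLAUSIBLE <= r <= MAX_PLAUSIBLE)]
print(flagged)  # [20.0]
```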
Data can end up having outliers because of external events or an error by a person or an instrument. If a real event such as a flash crash is traced to an error in the system, its consequences are still real — but if you know the source of the problem, you may conclude that a flaw in the data, not your model, was to blame if your model didn't predict the event. Such problems can be addressed outside the model — such as by instituting risk management for your portfolio in the case of financial data.
There's no one-size-fits-all answer when you're deciding whether to include or disregard extreme data that isn't an error or glitch. Your response depends on the nature of the analysis you're doing — and on the type of the model you're building. In a few cases, the way to deal with those outliers is straightforward:
In general, if the outcome of an event normally considered an outlier can have a significant impact on your business, consider how to deal with those events in your analysis. Keep these general points in mind about outliers:
Deciding to include outliers in the analysis — or to exclude them — will have implications for your model.
Looking for outliers, identifying them, and assessing their impact should be part of data analysis and preprocessing. Business domain experts can provide insight and help you decide what to do with unusual cases in your analysis. Although sometimes common sense is all you need to deal with outliers, often it's helpful to ask someone who knows the ropes.
Outliers can be a great source of information. Deviating from the norm could be a signal of suspicious activity, breaking news, or an opportunistic or catastrophic event. You may need to develop models that help you identify outliers and assess the risks they signify.
Data smoothing is, essentially, trying to find the “signal” in the “noise” by discarding data points that are considered “noisy”. The idea is to sharpen the patterns in the data and highlight trends the data is pointing to.
Figure 15-1 shows a typical graph that results from data smoothing.
The implication behind data smoothing is that the data consists of two parts: one part (the core data points) that signifies real overall trends, and another part that consists mostly of deviations (noise) — fluctuating points that result from volatility in the data. Data smoothing seeks to eliminate that second part.
Data smoothing operates on several assumptions:
Noise in data tends to be random; its fluctuations shouldn't affect the overall trends drawn from examining the rest of the data. So reducing or eliminating noisy data points can clarify real trends and patterns in the data — in effect, improving the data's “signal-to-noise ratio.”
Provided you've identified the noise correctly and then reduced it, data smoothing can help you predict the next observed data point simply by following the major trends you've detected within the data. Data smoothing concerns itself with the majority of the data points, their positions in a graph, and what the resulting patterns predict about the general trend of (say) a stock price, whether its general direction is up, down, or sideways. This technique won't accurately predict the exact price of the next trade for a given stock — but predicting a general trend can yield more powerful insights than knowing the actual price or its fluctuations.
Data smoothing focuses on establishing a fundamental direction for the core data points by
In a numerical time series, data smoothing serves as a form of filtering.
Make sure that the model fits the data effectively.
For details of the model-fitting process, see Chapter 12.
Data smoothing can use any of the following methods:
What these smoothing methods all have in common is that they carry out some kind of averaging process on several data points. Such averaging of adjacent data points is the essential way to zero in on underlying trends or patterns.
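The simplest of these averaging processes is a moving average. Here's a sketch in plain Python, using made-up values:

```python
def moving_average(series, window=3):
    """Replace each point with the average of itself and its neighbors."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

noisy = [50, 54, 49, 55, 51, 56, 52]
smoothed = moving_average(noisy)

# The smoothed series fluctuates far less than the original,
# making any underlying drift easier to see.
print(smoothed)
```

Notice that the smoothed series is shorter than the original — each output point consumes a full window of inputs, one of the trade-offs of the technique.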
The advantages of data smoothing are
But everything has a downside. The disadvantages of data smoothing are
If data smoothing does no more than give the data a mere facelift, it can lead to a fundamentally wrong path in the following ways:
How seriously data smoothing affects your data depends on the nature of the data at hand and on which smoothing technique was applied to it. For example, if the original data has many peaks, smoothing will shift those peaks significantly in the smoothed graphs — most likely a distortion.
By applying your professional judgment and your knowledge of the business, you can use data smoothing effectively. Removing noise from your data — without negatively affecting the accuracy and usefulness of the original data — is at least as much an art as a science.
Curve fitting, as mentioned earlier, is a process distinct from data smoothing: Here the goal is to create a curve that depicts the mathematical function that best fits the actual (original) data points in a data series. In general, you smooth time series data, and curve fit regression relationships.
The curve can either pass through every data point or stay within the bulk of the data, ignoring some data points in hopes of drawing trends from the data. In either case, one single mathematical function is assigned to the entire body of data, with the goal of fitting all data points into a curve that delineates trends and aids predictions.
Figure 15-2 shows a typical graph that results from curve-fitting a body of data.
Curve fitting can be achieved in one of three ways:
Curve fitting can be used to fill in possible data points to replace missing values or help analysts visualize the data.
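For instance, a least-squares straight line — the simplest curve fit — can be computed directly and then used to estimate a missing data point. This sketch uses fabricated points that roughly follow y = 2x + 1:

```python
# Fabricated data points, roughly following y = 2x + 1 with some noise.
xs = [0, 1, 2, 3, 4, 5]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0]

# Least-squares fit: choose slope and intercept to minimize the sum of
# squared vertical distances between the line and the points.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# The fitted curve can fill in a missing data point, e.g. at x = 2.5:
estimate = slope * 2.5 + intercept
```

The recovered slope and intercept land close to the "true" 2 and 1, and the fill-in value at x = 2.5 falls sensibly between its neighbors.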
In essence, overfitting a model is what happens when you train the model to represent only your sample data — which isn't a good representation of the data as a whole. Without a more realistic dataset to go on, the model can then be plagued with errors and risks when it goes operational — and the consequences to your business can be serious.
Overfitting a model is a common trap because people want to create models that work — and so are tempted to keep tweaking variables and parameters until the model performs perfectly — on too little data. To err is human. Fortunately, it's also human to create realistic solutions.
Another best practice is to make sure that your data represents the larger population of the domain you're modeling for. All that an overtrained model knows is the specific features of the sample dataset it's trained for. If you train the model only on (say) snowshoe sales in winter, don't be surprised if it fails miserably when it's run again on data from any other season.
It's worth repeating: Too much tweaking of the model is apt to result in overfitting. One such tweak is including too many variables in the analysis. Keep those variables to a minimum. Start by including variables that you see as absolutely necessary — those you believe will make a significant difference to the outcome. This insight comes from intimate knowledge of the business domain you're in. That's where the expertise of domain experts can help keep you from falling into the trap of overfitting. In addition, we can obtain such insight through the modeling process; the data more often than not manages to surprise us.
Here's a checklist of best practices to help you avoid overfitting your model:
In the stock market, for example, a classic analytical technique is back-testing — running a model against historical data to look for the best trading strategy. Suppose that, after running his new model against data generated by a recent bull market, and tweaking the number of variables used in his analysis, the analyst creates what looks like an optimal trading strategy — one that would yield the highest returns if he could go back and trade only during the year that produced the test data. Unfortunately, he can't. If he tries to apply that model in a current bear market, look out below: He'll incur losses by applying a model too optimized for a narrow period of time and set of conditions that doesn't fit current realities. (So much for hypothetical profits.) The model worked only for that vanished bull market because it was overtrained, bearing the earmarks of the context that produced the sample data — complete with its specifics, outliers, and shortcomings. All the circumstances surrounding that dataset probably won't be repeated in the future, or in a true representation of the whole population — but they all showed up in the overfitted model.
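The back-testing trap is easy to reproduce in miniature. The sketch below uses fabricated "returns" and a deliberately overfit model — one that simply memorizes its training data — to show how a perfect in-sample score can hide a useless out-of-sample one:

```python
# Hypothetical daily returns: a bull-market period for training,
# a bear-market period for testing.
train = [("mon", 0.5), ("tue", 0.7), ("wed", 0.6), ("thu", 0.8)]
test = [("mon", -0.4), ("tue", -0.6), ("wed", -0.5), ("thu", -0.7)]

# The "overfit" model memorizes the training data outright.
memorized = dict(train)

def mean_abs_error(data):
    """Average forecast error of the memorizing model on a dataset."""
    return sum(abs(memorized[day] - ret) for day, ret in data) / len(data)

train_error = mean_abs_error(train)  # 0.0: perfect on the data it saw
test_error = mean_abs_error(test)    # large: conditions have changed
```

The zero training error is exactly the seductive "optimal strategy" from the back-test; the large test error is the bear market arriving.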
In spite of everything we've all been told about assumptions causing trouble, a few assumptions remain at the core of any predictive analytics model. Those assumptions show up in the variables selected and considered in the analysis — and those variables directly affect the accuracy of the final model's output. Therefore your wisest precaution at the outset is to identify which assumptions matter most to your model — and to keep them to an absolute minimum.
Creating a predictive model that works well in the real world requires an intimate knowledge of the business. Your model starts out knowing only the sample data — in practical terms, almost nothing. So start small and keep on enhancing the model as necessary. Probing possible questions and scenarios can lead to key discoveries and/or can shed more light on the factors at play in the real world. This process can identify the core variables that could affect the outcome of the analysis. In a systematic approach to predictive analysis, this phase — exploring “what-if” scenarios — is especially interesting and useful. Here's where you change the model inputs to measure the effects of one variable or another on the output of the model; what you're really testing is its forecasting capability.
Improving the model's assumptions — by testing how they affect the model's output, probing to see how sensitive the model is to them, and paring them down to the minimum — will help you guide the model toward a more reliable predictive capability. Before you can optimize your model, you have to know the predictive variables — features that have a direct impact on its output.
You can derive those decision variables by running multiple simulations of your model — while changing a few parameters with each run — and recording the results, especially the accuracy of the model's forecasts. Usually you can trace variations in accuracy back to the specific parameters you changed.
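Here's a sketch of that procedure: a toy forecasting model with one tunable parameter, run repeatedly over a grid of parameter values while each run's accuracy is recorded. The price history and the blend-weight model are fabricated for illustration:

```python
# Fabricated price history.
history = [50, 52, 51, 53, 54, 53, 55, 56]
long_run_mean = sum(history) / len(history)

def forecast_error(weight):
    """Mean absolute error of one-step-ahead forecasts, where each
    forecast blends today's price with the long-run mean."""
    errors = [
        abs((weight * today + (1 - weight) * long_run_mean) - tomorrow)
        for today, tomorrow in zip(history, history[1:])
    ]
    return sum(errors) / len(errors)

# One simulation per parameter value; record each run's accuracy.
results = {w / 10: forecast_error(w / 10) for w in range(11)}
best_weight = min(results, key=results.get)
```

Tracing the `results` table back to the parameter that changed — here, `weight` — is exactly the record-and-compare loop described above.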
At this point, the twenty-first century can turn to the fourteenth for help. William of Ockham, an English Franciscan friar and scholastic philosopher who lived in the 1300s, developed the research principle we know as Occam's Razor: You should cut away unnecessary assumptions until your theory has as few of them as possible. Then it's likeliest to be true.
Too many assumptions weigh down your model's forecasts with uncertainties and inaccuracies. Eliminating unnecessary variables leads to a more robust model, but it isn't easy to decide which variables to include in the analysis — and those decisions directly affect the performance of the model. But here's where the analyst can run into a dilemma: Including unnecessary factors can skew or distort the output of the model, but excluding a relevant variable leaves the model incomplete. So when it comes time to select those all-important decision variables, call in your domain knowledge experts. When you have an accurate, reality-based set of decision variables, you don't have to make too many assumptions — and the result can be fewer errors in your predictive model.
Predictive modeling is gaining popularity as a tool for managing many aspects of business. Ensuring that data analysis is done right will boost confidence in the models employed — which, in turn, can generate the needed buy-in for predictive analytics to become part of your organization's standard toolkit.
Perhaps this increased popularity comes from the ways in which a predictive analytics project can support decision-making by creating models that describe datasets, discover possible new patterns and trends (as indicated by the data), and predict outcomes with greater reliability.
To accomplish this goal, a predictive analytics project must deliver a model that best fits the data by selecting the decision variables correctly and efficiently. Some vital questions must be answered in pursuit of that goal:
Once again, you can call the voice of experience to the rescue: Domain knowledge experts can discuss these questions, interpret any results that show hidden patterns in the data, and help verify and validate the model's output. They can also help you navigate the tricky aspects of predictive analytics described in the upcoming sections of this chapter.
In supervised analytics, both input and historical output are part of the training data. The model is presented with the correct results as part of its learning process. Such supervised learning assumes pre-classified examples: The goal is to get the model to learn from the previously known classification so it can correctly label the next unknown data point based on what it has learned.
When the model's training is complete, a mathematical function is inferred by examining the training data. That function will be used to label new data points.
For this approach to work correctly, the training data — along with the test data — must be carefully selected. The trained model should be able to predict the correct label for a new data point quickly and precisely, based on the data type(s) the model has seen in the training data.
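Here's a minimal sketch of that idea: a one-nearest-neighbor classifier trained on pre-classified examples. The features (price change, volume ratio) and labels are invented for illustration:

```python
# Pre-classified training examples: ([feature1, feature2], label).
# The features and labels here are hypothetical.
training = [
    ([2.0, 1.5], "buy"),
    ([1.5, 1.2], "buy"),
    ([-1.8, 0.8], "sell"),
    ([-2.2, 0.6], "sell"),
]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(point):
    """Label a new point with the label of its nearest training example."""
    _, label = min(training, key=lambda ex: distance(ex[0], point))
    return label

print(classify([1.7, 1.3]))   # "buy"
print(classify([-2.0, 0.7]))  # "sell"
```

The "inferred function" in this case is the nearest-neighbor rule itself: every new point is labeled by its resemblance to the previously classified examples.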
Supervised analytics offer some distinct advantages:
The flip side of these advantages is an equally distinct set of potential disadvantages:
As you probably guessed, predictive analytics isn't a one-size-fits-all activity — nor are its results once-and-for-all. For the technique to work correctly, you have to apply it again and again over time — so you'll need an overall approach that fits your business well. The success of your predictive analytics project depends on multiple factors:
The approach you choose will influence the model's output, the process of analyzing its results, and the interpretation of its forecasts. And choosing an approach is no walk in the park. There are many things that can go wrong, many traps that you can fall into, and misleading paths you can take.
Happily, you can defend against these pitfalls by adopting a couple of wise practices early on:
Use multiple models to identify as many relevant patterns as possible in your data.
Any predictive analytics model has certain limitations based on the algorithms it employs and the dataset it runs on. You should be aware of those limitations and make them work to your advantage; those related to the algorithms include
To overcome the limitations of your model, use sound cross-validation techniques to test your models. Start by dividing your data into training and test datasets, and run the model against each of those datasets separately to evaluate and score the predictions of the model.
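A minimal sketch of the hold-out version of this idea, with a fabricated dataset and a deliberately simple threshold model:

```python
import random

# Fabricated labeled data: even numbers 0..98, labeled by whether
# they exceed 50.
data = [(x, "high" if x > 50 else "low") for x in range(0, 100, 2)]

# Divide the data: shuffle, then hold out 30 percent for testing.
random.seed(7)
random.shuffle(data)
split = int(len(data) * 0.7)
train, test = data[:split], data[split:]

# "Train" a trivial model: use the mean of the training features
# as the decision threshold.
threshold = sum(x for x, _ in train) / len(train)

def predict(x):
    return "high" if x > threshold else "low"

# Score the model on data it never saw during training.
accuracy = sum(predict(x) == label for x, label in test) / len(test)
```

Scoring on the held-out portion, rather than on the training portion, is what keeps the evaluation honest; k-fold cross-validation repeats this split several times and averages the scores.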
No model can produce 100-percent accurate forecasts; any model has the potential to produce inaccurate results. Be on the lookout for any significant variation between the forecasts your model produces and the observed data — especially if the model's outputs contradict common sense. If it looks too good, bad, or extreme to be true, then it probably isn't true (to reality, anyway).
In the evaluation process, thoroughly examine the outputs of the models you're testing and compare them to the input variables. Your model's forecast capability should answer all stated business goals that drove its creation in the first place.
If errors or biases crop up in your model's output, try tracing them back to
The preceding factors directly influence the accuracy of your model. If your data isn't appropriate for the model you're building, the model is bound to fail to answer its business needs. If the data is mostly good but flawed — say, it lacks representation of a certain type of data because of seasonality, or it isn't valid or reliable — it won't account for all the cases the model should recognize and address when similar situations arise in the future.
As part of building the model, the data scientists or business stakeholders may make assumptions about the data used to create the model, or about the business environment in which the model will run. If those assumptions are wrong, the model will be less accurate, or outright inaccurate.
The decision to include a variable in the analysis, or to exclude it, has a direct impact on the outcome of the model. Some variables are effective, and their predictive powers come into play, only in the presence of other variables. Deciding whether to include or exclude a variable is at the core of building predictive analytics models. The experience of business stakeholders and data scientists, good tools, and quality data should all be of tremendous help in building successful predictive analytics models.
The model must be easy to interpret, and its interpretation must make sense to the business stakeholders. That interpretation may be new, but you should be able to explain why this variable, that variable, or a combination of variables allows your business to make an accurate prediction.
When you're building a model, always keep scalability in mind. Always check the performance, accuracy, and reliability of the model at various scales. Your model should be able to change its scale — and scale up as big as necessary — without falling apart or outputting bad predictions.
Scalability is quite a challenge. Predictive models can take a long time to build and to run. In the past, the datasets the models ran on were small, and the data was expensive to collect, store, and search. But that was all in the “pre-big data” era.
Today big data is cheap, plentiful, and growing. In fact, another potential problem looms: The formidable volume and velocity of the incoming data, both current and projected, may negatively affect the model and degrade its performance, outdating the model in a relatively short period of time. Properly implemented, scalability can help “future-proof” your model.
The future isn't the only threat. Even in the present online era, streamed data can overwhelm a model — especially if the streams of data increase to a flood.
If the model refreshes at a rate slower than the data arrives, then the model's predictions are almost obsolete by the time they're made: The universe they represent has already changed, and the model has to play catch-up to address the new reality.
Data volume alone can cause the decision variables and predicting factors to grow to giant numbers that require continuous updating to the model. So yes, your model had better be scalable — rapidly scalable.
When analyzing the quality of a predictive model, you'll want to measure its accuracy. The more accurate the model's forecasts, the more useful the model is to the business, and that's an indication of its quality. This is all good — except when the predicted event is rare. In that case, the high accuracy of the predictive model may be meaningless.
For example, if the probability of the rare event occurring is 5 percent, a model that simply answers “no” all the time when asked whether the rare event has occurred will be right 95 percent of the time. But how useful would such a model be? If your business is interested in predicting and handling rare events, don't rely on accuracy alone as a measure of your model's reliability.
In such a case, evaluate the efficacy and quality of the predictive model in light of how likely the rare event is to take place. A useful practice is to specify which types of errors you can accept from the model and which you cannot.
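The accuracy trap is easy to demonstrate in a few lines. In this sketch, a "model" that always answers "no event" scores 95 percent accuracy yet catches none of the rare events — which is why a metric such as recall tells you more:

```python
# 1,000 observations; the rare event occurs 5 percent of the time.
actual = [1] * 50 + [0] * 950

# A useless "model" that always predicts "no event."
predicted = [0] * len(actual)

# Accuracy: fraction of all predictions that are correct.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Recall: of the events that actually occurred, how many were caught?
caught = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
recall = caught / sum(actual)

print(accuracy)  # 0.95 -- looks impressive
print(recall)    # 0.0  -- the model never catches the event
```

Precision (how many predicted events were real) is the natural companion metric, and which of the two matters more depends on which type of error your business can tolerate.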
Here's a quick list of other ways to evaluate your model:
Approach the analysis of your predictive model with care, starting with this quick checklist: