Chapter 15
IN THIS CHAPTER
Examining your data
Exploring your data's potential problems
Dealing with outliers and fitting data to a curve
Evaluating your technical analysis
Describing the limitations of a predictive model
In the quest for building a predictive model, you'll need to make decisions every step of the way — and some decisions are harder than others. Informed decisions — and an awareness of the common mistakes that most analysts make while building their predictive models — give you your best shot at success.
This chapter offers insights on the issues that could arise when you embark on a journey toward the effective use of predictive analytics. At the outset, consider this general definition:
A predictive model is a system that can predict the next possible outcome, and assign a realistic probability to that outcome.
As you build your predictive model, you're likely to run into problems in two areas: the data and the analysis. This chapter delves into both types of problems to help you strengthen your safeguards and stay on top of your project.
How well you handle both the data and the analysis at the core of your predictive model defines the success of your predictive analytics project. Data issues are more prominent now because big data (massive amounts of analyzable data generated online) is all the rage — and only getting bigger, thanks to explosive growth of data in the digital and social media worlds.
The more data you have, however, the more diverse the cases that generate it. Your job as a modeler is harder if the data you're using contains outliers (extreme cases), because your model must take those extreme cases into account.
Modeling requires that you choose which variables to report on — and that you understand not only the variables but also their impact on the business (for example, how an unusually active storm season might affect a fishery). Right off the bat, you need to consult someone with domain knowledge (expertise in the business) so you can identify those variables accurately. After all, the model should reflect the real world in which the business is operating. The model's predictions should be accurate enough to improve return on investment for the business — by helping its decision-makers answer tough questions and make bold, informed decisions.
Data mining is more than just gathering, generating, or retrieving data. It's a first step in treating data as a resource for your business, and it raises more issues:
Data is at the center of predictive analytics. Data quality is crucial to the effectiveness of the predictions. As a data analyst, you have to get familiar enough with the data to examine its limitations (if any). Here's a quick list of questions to ask about the data:
Data preparation (see Chapter 9 for more details) is tedious and time-consuming. At this stage of developing your predictive model, expertise and tools in data preparation can be essential to the success of your project.
Data preparation can take up to 80 percent of the time needed to build a predictive analytics model. Expect to spend most of your time getting your data ready before you can apply any analysis. Collecting relevant variables, cleansing the data, and preprocessing it (generating derived variables, integrating data from multiple sources, and data mapping) will take the majority of your time.
As with many aspects of any business system, data is apt to have some limits on its usability when you first obtain it. Here's an overview of some limitations you're likely to encounter:
The data could be incomplete. Missing values, even the lack of a section or a substantial part of the data, could limit its usability. Missing data means empty fields for some of the variables you're analyzing. Data that spans multiple seasons may need extra attention: Even if it has no missing values, if it covers only a particular season or phase and doesn't represent the others, it isn't complete. For example, your data might cover only one or two conditions of a larger set that you're trying to model — as when a model built to analyze stock-market performance has data available only from the past 5 years, which skews both the data and the model toward the assumption of a bull market. The moment the market undergoes a correction that leads to a bear market, the model fails to adapt — simply because it wasn't trained and tested on data representing a bear market.
Make sure you're looking at a time frame that gives you a complete picture of the natural fluctuations of your data; your data shouldn't be limited by seasonality.
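As a quick illustration, here's a minimal sketch in Python of checking a dataset for both kinds of incompleteness — empty fields and missing seasons. The records, field names, and dates are invented for the example:

```python
from datetime import date

# Hypothetical daily records; None marks a missing reading.
records = [
    {"day": date(2020, 1, 5), "price": 51.2},
    {"day": date(2020, 1, 6), "price": None},
    {"day": date(2020, 7, 8), "price": 49.8},
]

# Count empty fields for the variable you're analyzing.
missing_values = sum(1 for r in records if r["price"] is None)

# Check seasonal coverage: which months are represented at all?
months_covered = {r["day"].month for r in records}
months_missing = sorted(set(range(1, 13)) - months_covered)

print(missing_values)  # 1 empty field
print(months_missing)  # ten months with no data at all
```

Both checks matter: a dataset can pass the missing-values test with flying colors and still fail the coverage test.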
To determine the limitations of your data, be sure to:
Even after all this care and attention, don't be surprised if your data still needs preprocessing before you can analyze it accurately. In practice, you'll find yourself doing at least one or two of the tasks listed below. Preprocessing often takes a long time and significant effort because it has to address several issues related to the original data, including:
Your data may contain outliers — extreme values that are often significant but atypical. You'll have to deal with these outliers — maintaining data integrity without affecting your predictive model negatively. It's a key challenge.
Part of your data preparation is to determine whether the values and data types of your data are what you expected. Check for
Be sure to check carefully for outliers before they influence your analysis. Outliers can distort both the data and the data analysis. For example, any statistical analysis done with the outliers left in place ends up skewing the means and variances. Some techniques, such as decision trees, are more robust to outliers than others.
Unchecked or misinterpreted outliers may lead to false conclusions. Say your data shows a stock that traded above $50 for a whole year — except for a few minutes during which it traded at $20. The $20 price — an obvious exception — is the outlier in this dataset.
Now you have to decide whether to include the $20 stock price in your analysis; if you do, it has ramifications for the overall model. But what do you consider normal? Was the “flash crash” that took the stock market by surprise on May 6, 2010 a normal event or an exception? During that brief time, the stock market experienced a sharp decline in prices across the board — which knocked the sample stock price down from $50 to $20, but had less to do with the stock than with wider market conditions. Does your model need to take the larger fluctuations of the stock market into account?
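A common first pass at flagging a price like that $20 print is a z-score test. Here's a minimal sketch with made-up prices; the 3-standard-deviation threshold is a widely used convention, not a hard rule:

```python
import statistics

# Simplified trade prices: a stock trading above $50 all year,
# plus one anomalous $20 print.
prices = [51, 52, 50, 53, 51, 52, 50, 51, 52, 51, 50, 52, 51, 50, 53, 51, 20]

mean = statistics.mean(prices)
stdev = statistics.stdev(prices)

# Flag anything more than 3 standard deviations from the mean.
outliers = [p for p in prices if abs(p - mean) > 3 * stdev]

print(outliers)  # [20]
```

Note that flagging the value is the easy part; deciding what to do with it is the judgment call discussed here.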
Anyone who's lost money during brief moments of market free-fall considers those few minutes real and normal (even if they felt like an eternity). A portfolio that diminishes in milliseconds due to a rapid, albeit short-lived, decline is clearly real. Yet the flash crash is an anomaly, an outlier that poses a problem for the model.
Regardless of what's considered normal (which can change anyway), data sometimes contains values that don't fit the expected values. This is especially true in the stock market, where virtually any event may send the market flying or plunging. You don't want your model to fail when the reality changes suddenly — but a model and a reality are two different things. As a data scientist, you should strive to build models that account for the unexpected and be able to address it in a way that strengthens the business.
When we rely on technology or instrumentation to perform a task, a glitch here or there can cause the instruments to register extreme or unusual values. Sensor readings that fail to meet basic quality-control standards produce real disruptions that show up in the data.
Someone performing data entry, for example, can easily add an extra 0 at the end of a value by mistake, taking the entry out of range and producing an outlier. If you're looking at observational data collected by a water sensor installed in Baltimore Harbor — and it reports a water height of 20 feet above mean sea level — you've got an outlier. The sensor is obviously wrong unless Baltimore is completely covered by water.
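Errors like these are usually easy to catch with a simple range check. In this sketch, the plausible bounds are illustrative assumptions, not real harbor specifications:

```python
# Plausible physical bounds for water height (feet above mean sea level).
# These limits are assumptions chosen for illustration.
MIN_PLAUSIBLE, MAX_PLAUSIBLE = -5.0, 10.0

readings = [1.2, 1.4, 20.0, 1.3, 1.1]  # one entry error or sensor glitch

# Flag any reading outside the plausible physical range.
flagged = [r for r in readings if not (MIN_PLAUSIBLE <= r <= MAX_PLAUSIBLE)]
print(flagged)  # [20.0]
```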
Data can end up having outliers because of external events or an error by a person or an instrument. If a real event such as a flash crash is traced to an error in the system, its consequences are still real — but if you know the source of the problem, you may conclude that a flaw in the data, not your model, was to blame if your model didn't predict the event. Such problems can be addressed outside the model — such as by instituting risk management for your portfolio in the case of financial data.
There's no one-size-fits-all answer when you're deciding whether to include or disregard extreme data that isn't an error or glitch. Your response depends on the nature of the analysis you're doing — and on the type of the model you're building. In a few cases, the way to deal with those outliers is straightforward:
In general, if the outcome of an event normally considered an outlier can have a significant impact on your business, consider how to deal with those events in your analysis. Keep these general points in mind about outliers:
Deciding to include outliers in the analysis — or to exclude them — will have implications for your model.
Looking for outliers, identifying them, and assessing their impact should be part of data analysis and preprocessing. Business domain experts can provide insight and help you decide what to do with unusual cases in your analysis. Although sometimes common sense is all you need to deal with outliers, often it's helpful to ask someone who knows the ropes.
Outliers can be a great source of information. Deviating from the norm could be a signal of suspicious activity, breaking news, or an opportunistic or catastrophic event. You may need to develop models that help you identify outliers and assess the risks they signify.
Data smoothing is, essentially, trying to find the “signal” in the “noise” by discarding data points that are considered “noisy”. The idea is to sharpen the patterns in the data and highlight trends the data is pointing to.
Figure 15-1 shows a typical graph that results from data smoothing.
The implication behind data smoothing is that the data consists of two parts: one part (the core data points) that signifies real overall trends, and another part that consists mostly of deviations (noise) — fluctuating points that result from volatility in the data. Data smoothing seeks to eliminate that second part.
Data smoothing operates on several assumptions:
Noise in data tends to be random; its fluctuations shouldn't affect the overall trends drawn from examining the rest of the data. So reducing or eliminating noisy data points can clarify real trends and patterns in the data — in effect, improving the data's “signal-to-noise ratio.”
Provided you've identified the noise correctly and then reduced it, data smoothing can help you predict the next observed data point simply by following the major trends you've detected within the data. Data smoothing concerns itself with the majority of the data points, their positions in a graph, and what the resulting patterns predict about the general trend of (say) a stock price, whether its general direction is up, down, or sideways. This technique won't accurately predict the exact price of the next trade for a given stock — but predicting a general trend can yield more powerful insights than knowing the actual price or its fluctuations.
Data smoothing focuses on establishing a fundamental direction for the core data points by
In a numerical time series, data smoothing serves as a form of filtering.
Make sure that the model fits the data effectively.
For details of the model-fitting process, see Chapter 12.
Data smoothing can use any of the following methods:
What these smoothing methods all have in common is that they carry out some kind of averaging process on several data points. Such averaging of adjacent data points is the essential way to zero in on underlying trends or patterns.
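The simplest of these averaging processes is a moving average. Here's a sketch in plain Python, using made-up values:

```python
def moving_average(series, window=3):
    """Replace each point with the average of itself and its neighbors."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

noisy = [50, 54, 49, 55, 51, 56, 52]
smoothed = moving_average(noisy)

# The smoothed series fluctuates far less than the original,
# making any underlying drift easier to see.
print(smoothed)
```

Notice that the smoothed series is shorter than the original — each output point consumes a full window of inputs, one of the trade-offs of the technique.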
The advantages of data smoothing are
But everything has a downside. The disadvantages of data smoothing are
If data smoothing does no more than give the data a mere facelift, it can lead to a fundamentally wrong path in the following ways:
How seriously data smoothing affects your data depends on the nature of the data at hand and on which smoothing technique was applied to it. For example, if the original data has many peaks, smoothing will shift those peaks significantly in the smoothed graphs — most likely a distortion.
By applying your professional judgment and your knowledge of the business, you can use data smoothing effectively. Removing noise from your data — without negatively affecting the accuracy and usefulness of the original data — is at least as much an art as a science.
Curve fitting, as mentioned earlier, is a process distinct from data smoothing: Here the goal is to create a curve that depicts the mathematical function that best fits the actual (original) data points in a data series. In general, you smooth time series data, and curve fit regression relationships.
The curve can either pass through every data point or stay within the bulk of the data, ignoring some data points in hopes of drawing trends from the data. In either case, one single mathematical function is assigned to the entire body of data, with the goal of fitting all data points into a curve that delineates trends and aids predictions.
Figure 15-2 shows a typical graph that results from curve-fitting a body of data.
Curve fitting can be achieved in one of three ways:
Curve fitting can be used to fill in possible data points to replace missing values or help analysts visualize the data.
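For instance, a least-squares straight line — the simplest curve fit — can be computed directly and then used to estimate a missing data point. This sketch uses fabricated points that roughly follow y = 2x + 1:

```python
# Fabricated data points, roughly following y = 2x + 1 with some noise.
xs = [0, 1, 2, 3, 4, 5]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0]

# Least-squares fit: choose slope and intercept to minimize the sum of
# squared vertical distances between the line and the points.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# The fitted curve can fill in a missing data point, e.g. at x = 2.5:
estimate = slope * 2.5 + intercept
```

The recovered slope and intercept land close to the "true" 2 and 1, and the fill-in value at x = 2.5 falls sensibly between its neighbors.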
In essence, overfitting a model is what happens when you train the model to represent only your sample data — which isn't a good representation of the data as a whole. Without a more realistic dataset to go on, the model can then be plagued with errors and risks when it goes operational — and the consequences to your business can be serious.
Overfitting a model is a common trap because people want to create models that work — and so are tempted to keep tweaking variables and parameters until the model performs perfectly — on too little data. To err is human. Fortunately, it's also human to create realistic solutions.
Another best practice is to make sure that your data represents the larger population of the domain you're modeling for. All that an overtrained model knows is the specific features of the sample dataset it's trained for. If you train the model only on (say) snowshoe sales in winter, don't be surprised if it fails miserably when it's run again on data from any other season.
It's worth repeating: Too much tweaking of the model is apt to result in overfitting. One such tweak is including too many variables in the analysis. Keep those variables to a minimum. Start by including variables that you see as absolutely necessary — those you believe will make a significant difference to the outcome. This insight comes from intimate knowledge of the business domain you're in. That's where the expertise of domain experts can help keep you from falling into the trap of overfitting. In addition, we can obtain such insight through the modeling process; the data more often than not manages to surprise us.
Here's a checklist of best practices to help you avoid overfitting your model:
In the stock market, for example, a classic analytical technique is back-testing — running a model against historical data to look for the best trading strategy. Suppose that, after running his new model against data generated by a recent bull market, and tweaking the number of variables used in his analysis, the analyst creates what looks like an optimal trading strategy — one that would yield the highest returns if he could go back and trade only during the year that produced the test data. Unfortunately, he can't. If he tries to apply that model in a current bear market, look out below: He'll incur losses by applying a model too optimized for a narrow period of time and set of conditions that doesn't fit current realities. (So much for hypothetical profits.) The model worked only for that vanished bull market because it was overtrained, bearing the earmarks of the context that produced the sample data — complete with its specifics, outliers, and shortcomings. All the circumstances surrounding that dataset probably won't be repeated in the future, or in a true representation of the whole population — but they all showed up in the overfitted model.
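The back-testing trap is easy to reproduce in miniature. The sketch below uses fabricated "returns" and a deliberately overfit model — one that simply memorizes its training data — to show how a perfect in-sample score can hide a useless out-of-sample one:

```python
# Hypothetical daily returns: a bull-market period for training,
# a bear-market period for testing.
train = [("mon", 0.5), ("tue", 0.7), ("wed", 0.6), ("thu", 0.8)]
test = [("mon", -0.4), ("tue", -0.6), ("wed", -0.5), ("thu", -0.7)]

# The "overfit" model memorizes the training data outright.
memorized = dict(train)

def mean_abs_error(data):
    """Average forecast error of the memorizing model on a dataset."""
    return sum(abs(memorized[day] - ret) for day, ret in data) / len(data)

train_error = mean_abs_error(train)  # 0.0: perfect on the data it saw
test_error = mean_abs_error(test)    # large: conditions have changed
```

The zero training error is exactly the seductive "optimal strategy" from the back-test; the large test error is the bear market arriving.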
In spite of everything we've all been told about assumptions causing trouble, a few assumptions remain at the core of any predictive analytics model. Those assumptions show up in the variables selected and considered in the analysis — and those variables directly affect the accuracy of the final model's output. Therefore your wisest precaution at the outset is to identify which assumptions matter most to your model — and to keep them to an absolute minimum.
Creating a predictive model that works well in the real world requires an intimate knowledge of the business. Your model starts out knowing only the sample data — in practical terms, almost nothing. So start small and keep on enhancing the model as necessary. Probing possible questions and scenarios can lead to key discoveries and/or can shed more light on the factors at play in the real world. This process can identify the core variables that could affect the outcome of the analysis. In a systematic approach to predictive analysis, this phase — exploring “what-if” scenarios — is especially interesting and useful. Here's where you change the model inputs to measure the effects of one variable or another on the output of the model; what you're really testing is its forecasting capability.
Improving the model's assumptions — by testing how they affect the model's output, probing to see how sensitive the model is to them, and paring them down to the minimum — will help you guide the model toward a more reliable predictive capability. Before you can optimize your model, you have to know the predictive variables — features that have a direct impact on its output.
You can derive those decision variables by running multiple simulations of your model — while changing a few parameters with each run — and recording the results, especially the accuracy of the model's forecasts. Usually you can trace variations in accuracy back to the specific parameters you changed.
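Here's a sketch of that procedure: a toy forecasting model with one tunable parameter, run repeatedly over a grid of parameter values while each run's accuracy is recorded. The price history and the blend-weight model are fabricated for illustration:

```python
# Fabricated price history.
history = [50, 52, 51, 53, 54, 53, 55, 56]
long_run_mean = sum(history) / len(history)

def forecast_error(weight):
    """Mean absolute error of one-step-ahead forecasts, where each
    forecast blends today's price with the long-run mean."""
    errors = [
        abs((weight * today + (1 - weight) * long_run_mean) - tomorrow)
        for today, tomorrow in zip(history, history[1:])
    ]
    return sum(errors) / len(errors)

# One simulation per parameter value; record each run's accuracy.
results = {w / 10: forecast_error(w / 10) for w in range(11)}
best_weight = min(results, key=results.get)
```

Tracing the `results` table back to the parameter that changed — here, `weight` — is exactly the record-and-compare loop described above.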
At this point, the twenty-first century can turn to the fourteenth for help. William of Ockham, an English Franciscan friar and scholastic philosopher who lived in the 1300s, developed the research principle we know as Occam's Razor: You should cut away unnecessary assumptions until your theory has as few of them as possible. Then it's likeliest to be true.
Too many assumptions weigh down your model's forecasts with uncertainties and inaccuracies. Eliminating unnecessary variables leads to a more robust model, but it isn't easy to decide which variables to include in the analysis — and those decisions directly affect the performance of the model. But here's where the analyst can run into a dilemma: Including unnecessary factors can skew or distort the output of the model, but excluding a relevant variable leaves the model incomplete. So when it comes time to select those all-important decision variables, call in your domain knowledge experts. When you have an accurate, reality-based set of decision variables, you don't have to make too many assumptions — and the result can be fewer errors in your predictive model.
Predictive modeling is gaining popularity as a tool for managing many aspects of business. Ensuring that data analysis is done right will boost confidence in the models employed — which, in turn, can generate the needed buy-in for predictive analytics to become part of your organization's standard toolkit.
Perhaps this increased popularity comes from the ways in which a predictive analytics project can support decision-making by creating models that describe datasets, discover possible new patterns and trends (as indicated by the data), and predict outcomes with greater reliability.
To accomplish this goal, a predictive analytics project must deliver a model that best fits the data by selecting the decision variables correctly and efficiently. Some vital questions must be answered in pursuit of that goal:
Once again, you can call the voice of experience to the rescue: Domain knowledge experts can discuss these questions, interpret any results that show hidden patterns in the data, and help verify and validate the model's output. They can also help you navigate the tricky aspects of predictive analytics described in the upcoming sections of this chapter.
In supervised analytics, both input and historical output are part of the training data. The model is presented with the correct results as part of its learning process. Such supervised learning assumes pre-classified examples: The goal is to get the model to learn from the previously known classification so it can correctly label the next unknown data point based on what it has learned.
When the model's training is complete, a mathematical function is inferred by examining the training data. That function will be used to label new data points.
For this approach to work correctly, the training data — along with the test data — must be carefully selected. The trained model should be able to predict the correct label for a new data point quickly and precisely, based on the data type(s) the model has seen in the training data.
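Here's a minimal sketch of that idea: a one-nearest-neighbor classifier trained on pre-classified examples. The features (price change, volume ratio) and labels are invented for illustration:

```python
# Pre-classified training examples: ([feature1, feature2], label).
# The features and labels here are hypothetical.
training = [
    ([2.0, 1.5], "buy"),
    ([1.5, 1.2], "buy"),
    ([-1.8, 0.8], "sell"),
    ([-2.2, 0.6], "sell"),
]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(point):
    """Label a new point with the label of its nearest training example."""
    _, label = min(training, key=lambda ex: distance(ex[0], point))
    return label

print(classify([1.7, 1.3]))   # "buy"
print(classify([-2.0, 0.7]))  # "sell"
```

The "inferred function" in this case is the nearest-neighbor rule itself: every new point is labeled by its resemblance to the previously classified examples.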
Supervised analytics offer some distinct advantages:
The flip side of these advantages is an equally distinct set of potential disadvantages:
As you probably guessed, predictive analytics isn't a one-size-fits-all activity — nor are its results once-and-for-all. For the technique to work correctly, you have to apply it again and again over time — so you'll need an overall approach that fits your business well. The success of your predictive analytics project depends on multiple factors:
The approach you choose will influence the model's output, the process of analyzing its results, and the interpretation of its forecasts. And choosing an approach is no walk in the park. There are many things that can go wrong, many traps that you can fall into, and misleading paths you can take.
Happily, you can defend against these pitfalls by adopting a couple of wise practices early on:
Use multiple models to identify as many relevant patterns as possible in your data.
Any predictive analytics model has certain limitations based on the algorithms it employs and the dataset it runs on. You should be aware of those limitations and make them work to your advantage; those related to the algorithms include
To overcome the limitations of your model, use sound cross-validation techniques to test your models. Start by dividing your data into training and test datasets, and run the model against each of those datasets separately to evaluate and score the predictions of the model.
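A minimal sketch of the hold-out version of this idea, with a fabricated dataset and a deliberately simple threshold model:

```python
import random

# Fabricated labeled data: even numbers 0..98, labeled by whether
# they exceed 50.
data = [(x, "high" if x > 50 else "low") for x in range(0, 100, 2)]

# Divide the data: shuffle, then hold out 30 percent for testing.
random.seed(7)
random.shuffle(data)
split = int(len(data) * 0.7)
train, test = data[:split], data[split:]

# "Train" a trivial model: use the mean of the training features
# as the decision threshold.
threshold = sum(x for x, _ in train) / len(train)

def predict(x):
    return "high" if x > threshold else "low"

# Score the model on data it never saw during training.
accuracy = sum(predict(x) == label for x, label in test) / len(test)
```

Scoring on the held-out portion, rather than on the training portion, is what keeps the evaluation honest; k-fold cross-validation repeats this split several times and averages the scores.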
No model can produce 100-percent accurate forecasts; any model has the potential to produce inaccurate results. Be on the lookout for any significant variation between the forecasts your model produces and the observed data — especially if the model's outputs contradict common sense. If it looks too good, bad, or extreme to be true, then it probably isn't true (to reality, anyway).
In the evaluation process, thoroughly examine the outputs of the models you're testing and compare them to the input variables. Your model's forecast capability should answer all stated business goals that drove its creation in the first place.
If errors or biases crop up in your model's output, try tracing them back to
The preceding factors directly influence the accuracy of your model. If your data isn't appropriate for the model you're building, the model is bound to fail to answer its business needs. If the data is mostly good but flawed — say, it lacks representation of a certain type of data because of seasonality, or it isn't valid or reliable — it won't account for all the cases the model should recognize and address when similar situations arise in the future.
As part of building the model, the data scientists or business stakeholders may make assumptions about the data used to create the model, or about the business environment in which the model will run. If those assumptions are wrong, the model will be less accurate, or outright inaccurate.
The decision to include a variable in the analysis, or to exclude it, has a direct impact on the outcome of the model. Some variables are effective, and their predictive powers come into play, only in the presence of other variables. Deciding whether to include or exclude a variable is at the core of building predictive analytics models. The experience of business stakeholders and data scientists, good tools, and quality data should all be of tremendous help in building successful predictive analytics models.
The model must be easy to interpret, and its interpretation must make sense to the business stakeholders. That interpretation may be new, but you should be able to explain why this variable, that variable, or a combination of variables allows your business to make an accurate prediction.
When you're building a model, always keep scalability in mind. Always check the performance, accuracy, and reliability of the model at various scales. Your model should be able to change its scale — and scale up as big as necessary — without falling apart or outputting bad predictions.
Scalability is quite a challenge. Predictive models can take a long time to build and to run. In the past, the datasets the models ran on were small, and the data was expensive to collect, store, and search. But that was all in the “pre-big data” era.
Today big data is cheap, plentiful, and growing. In fact, another potential problem looms: The formidable volume and velocity of the incoming data, both current and projected, may negatively affect the model and degrade its performance, outdating the model in a relatively short period of time. Properly implemented, scalability can help “future-proof” your model.
The future isn't the only threat. Even in the present online era, streamed data can overwhelm a model — especially if the streams of data increase to a flood.
If the model refreshes at a rate slower than the data arrives, then the model's predictions are almost obsolete by the time they're made: The universe they represent has already changed, and the model has to play catch-up to address the new reality.
Data volume alone can cause the decision variables and predicting factors to grow to giant numbers that require continuous updating to the model. So yes, your model had better be scalable — rapidly scalable.
When analyzing the quality of a predictive model, you'll want to measure its accuracy. The more accurate the model's forecasts, the more useful the model is to the business, and that's an indication of its quality. This is all good — except when the predicted event is rare. In that case, the high accuracy of the predictive model may be meaningless.
For example, if the probability of the rare event occurring is 5 percent, a model that simply answers “no” all the time when asked whether the rare event has occurred will be right 95 percent of the time. But how useful would such a model be? If your business is interested in predicting and handling rare events, don't rely on accuracy alone as a measure of your model's reliability.
In such a case, evaluate the efficacy and quality of the predictive model in light of how likely the rare event is to take place. A useful practice is to specify which types of errors you can accept from the model and which you cannot.
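The accuracy trap is easy to demonstrate in a few lines. In this sketch, a "model" that always answers "no event" scores 95 percent accuracy yet catches none of the rare events — which is why a metric such as recall tells you more:

```python
# 1,000 observations; the rare event occurs 5 percent of the time.
actual = [1] * 50 + [0] * 950

# A useless "model" that always predicts "no event."
predicted = [0] * len(actual)

# Accuracy: fraction of all predictions that are correct.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Recall: of the events that actually occurred, how many were caught?
caught = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
recall = caught / sum(actual)

print(accuracy)  # 0.95 -- looks impressive
print(recall)    # 0.0  -- the model never catches the event
```

Precision (how many predicted events were real) is the natural companion metric, and which of the two matters more depends on which type of error your business can tolerate.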
Here's a quick list of other ways to evaluate your model:
Approach the analysis of your predictive model with care, starting with this quick checklist: