Chapter 9
IN THIS CHAPTER
Documenting your business objectives
Processing your data
Sampling your data
Transforming your data
Extracting features
Selecting features
The roadmap to building a successful predictive model involves defining business objectives, preparing the data, and then building and deploying the model. This chapter delves into data preparation, which involves processing, sampling, and transforming your data, and then extracting and selecting the features your model will use.
Data is a four-letter word. It's amazing that such a small word can describe trillions of gigabytes of information: customer names, addresses, products, discounted versus original prices, store codes, times of purchase, supplier locations, run rates for print advertising, the color of your delivery vans. And that's just for openers. Data is, or can be, literally everything.
Not every source or type of data will be relevant to the business question you're trying to answer. Predictive analytics models are built from multiple data sources, and one of the first critical steps is to determine which sources to include in your model. If you're trying to determine (for example) whether customers who subscribe to e-magazines in the spring are more likely to purchase hardcover print books in the fall, you may decide to omit the January paperback sales records. Then you have to vet the specific records and attributes of each possible source for format, quantity, and quality. Data may be a small word, but it brings on a lot of big tasks.
At this stage, presumably, you've already sat down with the business managers and collected the objectives they're after. Now you have to go into detail, evaluate what information sources will help achieve the objectives, and choose the variables you'll analyze for operational use.
Understanding what the stakeholders really want from the project is challenging; you may encounter multiple competing needs as well as limitations on what can be realistically done.
For this stage, you and the intended recipient of the results have to roll up your sleeves and brainstorm potential information sources. The goal is to determine what information, from which sources, will be relevant to reaching the type of concrete outcome that will provide true value for both the business and the customer. Without this in-the-trenches activity, your results may be no more than academic — of little practical value to your client. You may uncover fascinating inferences from (say) the accessory division's second-quarter sales records — and discover just how likely cross-dressers who wear flat shoes are to purchase faux leather purses — but this will fall on deaf ears if the accessories division is discontinuing its faux leather product line next quarter.
A business objective can be quantifiable and objective — for example, “identify two currently unknown major customer groups with a greater-than-50-percent likelihood of churning in the next six months” or “identify three supplier groups in Asia with decreasing delivery timeframes over the next five years.” You might also list more subjective goals such as “provide valuable insights into the effectiveness of customer rewards incentives programs.”
Typically, there will be many subsets of business questions that the customer would like to address — any and all of which can provide insights into the main question. For example, the primary business goal might be to identify unhappy customers before they churn (move to a competing product). Related business questions could be: “How many times does a customer abandon a shopping cart online before purchasing from another online retailer?” “Does decreasing the threshold for free shipping from $100 to $75 prevent churn?” Table 9-1 shows some handy examples of primary and secondary business questions.
TABLE 9-1 Primary and Secondary Business Questions
Primary | Secondary
How do we increase print book sales? | What percentage of people who bought or downloaded a fiction e-book in FY12 then bought a print fiction paperback in FY13?
How do we predict the effect of health-based decisions on fitness-related products more accurately? | If customers are buying fewer French fries this year, will they buy more or fewer yoga mats next year?
How will a new tablet affect sales of existing digital products? | Are iPad users less likely to purchase laptops?
Okay, suppose the high-level objectives have been documented and now you're moving into the details. What are the project requirements and timeframes you'll need to fulfill and follow? What are the requirements for your business, project, system, models, and data?
In order to avoid mismatched expectations, project managers should meet with all relevant groups in the client department. For marketing, this could include social media marketing managers, marketing analytics specialists, or database marketing managers. Information sources to specify can include (for example) customer lists, budgets, schedules and other logistics.
Thorough documentation — and key management sign-off — are critical to ensure that everyone is embarking on the coming intensive effort with the same understanding, commitment, and expectations.
Don't be surprised if preparing your data to be fed into the predictive model is as tedious a task as it is crucial. Understanding data quality, availability, sources, and any existing constraints will have a direct effect on the successful implementation of your predictive analytics project.
The raw data usually has to be cleaned — and possibly integrated, combined with other datasets, and used to derive new data variables. Hence data quality and quantity should be carefully and thoroughly examined across all data sources used to build the model.
In this exploration phase, you'll gain intimate knowledge of your data — which in turn will help you choose the relevant variables to analyze. This understanding will also help you evaluate the results of your model.
For your analytics project, you'll need to identify appropriate sources of data, pool data from those sources, and put it in a structured, well-organized format. These tasks can be very challenging and will likely require careful coordination among different data stewards across your organization.
You'll also need to select the variables you're going to analyze. This process must take data constraints, project constraints, and business objectives into consideration.
Expect to spend considerable time on this phase of the project. Data collection, data analysis, and the process of addressing data content, quality, and structure can add up to a time-consuming to-do list.
During the process of data identification, it helps to understand your data and its properties; this knowledge will help you choose which algorithm to use to build your model. For example, time series data can be analyzed by regression algorithms; classification algorithms can be used to analyze discrete data.
Variable selection is affected by how well you understand the data. Don't be surprised if you have to look at and evaluate hundreds of variables, at least at first. Fortunately, as you work with those variables and start gaining key insights, you start narrowing them down to a few dozen. Also, expect variable selection to change as your understanding of the data changes throughout the project.
You'll need to make sure that the data is clean of extraneous stuff before you can use it in your model. This includes finding and correcting any records that contain erroneous values, and attempting to fill in any missing values. You'll also need to decide whether to include duplicate records (two customer accounts, for example). The overall goal is to ensure the integrity of the information you're using to build your predictive model. Pay particular attention to the completeness, correctness, and timeliness of the data.
Missing data may simply mean that particular information was never recorded. In such cases, you can attempt to fill in as much as you can; suitable defaults can easily be added to fill in the blanks of certain fields. For example, for patients in a hospital maternity ward whose gender field is missing a value, the application can simply fill it in as female. Similarly, for any male admitted to a hospital with a missing value for pregnancy status, that record can be filled in as not applicable. A missing zip code for an address can often be inferred from the street name and city provided in that address.
Where the information is unknown and can't be inferred, you need to use a value other than a blank space to indicate that the data is missing without affecting the correctness of the analysis. A blank in the data can mean multiple things, most of them not good or useful. Whenever you can, specify the nature of the blank with a meaningful placeholder. For example, if a numeric field consists entirely of small positive numbers (values between 0 and 100), you can define -999.99 as the placeholder for missing data.
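As a minimal pandas sketch of both ideas, here is a rule-based default for a missing categorical field plus a sentinel placeholder for missing numeric values. The table, its column names, and the -999.99 sentinel are illustrative assumptions, not a prescribed layout:

```python
import pandas as pd

# Hypothetical patient records; column names are illustrative only.
df = pd.DataFrame({
    "ward":   ["maternity", "maternity", "cardiology"],
    "gender": ["F", None, "M"],
    "score":  [72.5, None, 88.0],   # known to lie between 0 and 100
})

# Rule-based default: maternity-ward patients with no recorded gender
# can safely be filled in as female.
mask = (df["ward"] == "maternity") & (df["gender"].isna())
df.loc[mask, "gender"] = "F"

# For numeric data known to be small and positive, mark values that
# remain unknown with an impossible sentinel such as -999.99.
df["score"] = df["score"].fillna(-999.99)
```

Because -999.99 can never occur naturally in a 0-to-100 field, downstream analysis can always distinguish "missing" from a genuine measurement.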
Just as it's possible to define a rose in a cornfield as a weed, outliers can mean different things to different analyses. Some models are built solely to track outliers and flag them: fraud-detection and criminal-activity-monitoring models are keenly interested in outliers, which in those cases indicate something unwanted taking place. In such cases, keeping the outliers in the dataset is recommended. However, when outliers are anomalies within the data that will only skew the analyses and lead to erroneous results, remove them from your data. Otherwise the model may try so hard to predict the outliers that it ends up failing to predict anything else.
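As a quick illustration, here's one common way to flag outliers in pandas using the interquartile-range rule. The purchase amounts are made-up sample data, and the 1.5 * IQR threshold is a rule of thumb rather than a universal standard:

```python
import pandas as pd

# Illustrative daily purchase amounts with one obvious outlier.
amounts = pd.Series([21.0, 19.5, 22.3, 20.1, 18.9, 950.0, 20.7])

# Flag values outside 1.5 * IQR of the middle 50 percent of the data,
# a common, robust rule of thumb for spotting outliers.
q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
is_outlier = (amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)

# A fraud-detection model would keep (and study) the flagged rows;
# a model of typical purchasing behavior might drop them instead.
outliers = amounts[is_outlier]
cleaned = amounts[~is_outlier]
```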
Duplication in the data can also be useful or a nuisance. Some duplication is necessary, indicates value, and reflects an accurate state of the data. For example, a customer with multiple accounts can be represented by multiple entries that are (technically, anyway) duplicates of the same record. Another example is a customer who has both a work phone and a personal phone with the same company, with the bill going to the same address; that's something valuable to know. By the same token, when duplicate records don't contribute value to the analysis and aren't necessary, removing them can be of tremendous value. This is especially true for large datasets, where removing duplicate records can simplify the data and reduce the time needed for analysis.
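When the duplicates really are redundant, removing them is a one-liner in pandas. The customer table here is hypothetical:

```python
import pandas as pd

# Hypothetical customer table where one customer appears twice.
customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "name":        ["Ann", "Raj", "Raj", "Mei"],
    "city":        ["Oslo", "Pune", "Pune", "Kyoto"],
})

# Fully identical rows usually add no value on large datasets;
# dropping them shrinks the data and speeds up analysis.
deduped = customers.drop_duplicates()
```

Note that this removes only rows that are identical in every column; a customer with two genuinely different accounts would be preserved.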
Derived attributes are entirely new records constructed from one or more existing attributes. An example would be creating records that identify books that are bestsellers at book fairs. Raw data may not capture such records, but for modeling purposes those derived records can be important. Price-to-earnings ratio and the 200-day moving average are two examples of derived data heavily used in financial applications.
Derived attributes can be obtained from a simple calculation, such as deducing age from birth date. They can also be computed by summarizing information from multiple records. For example, converting a table of customers and their purchased books into a summary table can enable you to track the number of books sold via a recommender system, through targeted marketing, and at a book fair, and to identify the demographics of the customers who bought those books.
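Both kinds of derived attributes can be sketched in a few lines of pandas. The purchase records, the channel names, and the reference year 2024 are illustrative assumptions:

```python
import pandas as pd

# Hypothetical purchase records; channel names are made up.
purchases = pd.DataFrame({
    "customer":   ["Ann", "Ann", "Raj", "Mei", "Raj"],
    "channel":    ["book fair", "recommender", "book fair",
                   "targeted marketing", "book fair"],
    "birth_year": [1980, 1980, 1992, 1975, 1992],
})

# Simple calculation: derive age from birth year (2024 assumed current).
purchases["age"] = 2024 - purchases["birth_year"]

# Summarization: count books sold per sales channel.
books_per_channel = purchases.groupby("channel").size()
```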
Generating such additional attributes brings additional predictive power to the analysis; in fact, many such attributes are created precisely to probe their potential predictive power. Some predictive models may use more derived attributes than attributes in their raw state. If some derived attributes prove especially predictive, it makes sense to automate the process that generates them.
The data used in predictive models is usually pooled from multiple sources. Your analysis can draw from data scattered across multiple data formats, files, and databases, or multiple tables within the same database. Pooling the data together and combining it into an integrated format for the data modelers to use is essential.
If your data contains any hierarchical content, it may need to be flattened. Some data has hierarchical characteristics such as parent-child relationships, or records that are made up of other records. For example, a product such as a car may have multiple makers; flattening the data, in this case, means including each maker as an additional feature of the record you're analyzing. Another example: a single customer can have multiple transactions.
Flattening is essential when data merged from multiple related records must form a single, clearer picture. For example, analyzing adverse events for several drugs made by several companies may require flattening the data at the substance level. Doing so removes the one-to-many relationships (in this case, many makers and many substances for one product) that cause excessive duplication by repeating the product and maker information in every substance entry.
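Here's a small pandas sketch of flattening a one-to-many relationship into one row per product, with one indicator column per maker. The drug and maker names are made up:

```python
import pandas as pd

# Hypothetical one-to-many data: each drug has several makers,
# so each drug is repeated once per maker.
rows = pd.DataFrame({
    "drug":  ["D1", "D1", "D2", "D2", "D2"],
    "maker": ["Acme", "Zenith", "Acme", "Nova", "Zenith"],
})

# Flatten to one row per drug, with an indicator column per maker.
flat = pd.crosstab(rows["drug"], rows["maker"]).clip(upper=1)
```

The result has one row per drug and one 0/1 column per maker, so the duplicated drug entries disappear while the maker information survives as features.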
Flattening forces you to think about reducing the dimensionality of the data, which is represented by the number of features a record or an observation has. For example, a customer can have the following features: name, age, address, items purchased. When you start your analysis, you may find yourself evaluating records with many features, only some of which are important to the analysis. So you should eliminate all but the very few features that have the most predictive power for your specific project.
Reducing the dimensionality of the data can be achieved by putting all the data in a single table that uses multiple columns to represent attributes of interest. At the beginning of the analysis, of course, you have to evaluate a large number of columns, but that number can be narrowed down as the analysis progresses. The process can be aided by reconstituting the fields; for example, by grouping the data into categories that have similar characteristics.
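As a sketch of reconstituting a field by grouping values into coarser categories with similar characteristics, pandas can bin a numeric column into labeled ranges. The ages, bin edges, and labels here are illustrative choices:

```python
import pandas as pd

# Illustrative customer ages, regrouped into coarser categories.
ages = pd.Series([22, 35, 47, 68, 15, 53])
groups = pd.cut(ages, bins=[0, 18, 40, 60, 120],
                labels=["minor", "young adult", "middle-aged", "senior"])
```

A model can then work with four categories instead of a continuum of raw ages, which often simplifies the analysis without losing the signal.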
The resultant dataset — the cleaned dataset — is usually put in a separate database for the analysts to use. During the modeling process, this data should be easily accessed, managed, and kept up to date.
Principal component analysis (PCA) is a valuable technique that is widely used in data science. It studies a dataset to learn the most relevant variables responsible for the highest variation in that dataset. PCA is mostly used as a data reduction technique.
While building predictive models, you may need to reduce the number of features describing your dataset. It's very useful to reduce this high dimensionality of data through approximation techniques, at which PCA excels. The approximated data summarizes all the important variations of the original data.
For example, the feature set of data about stocks may include stock prices, daily highs and lows, trading volumes, 200-day moving averages, price-to-earnings ratios, relative strength to other markets, interest rates, and strength of currencies.
Finding the most important predictive variables is at the core of building a predictive model. A common way of doing it is brute force: start with as many relevant variables as you can, and then use a funnel approach to eliminate features that have no impact or no predictive value. Intelligence and insight come from engaging the business stakeholders, who may have hunches about which variables will have the biggest impact on the analysis. The experience of the data scientists on the project also matters in knowing which variables to work with and which algorithms to use for a specific data type or domain-specific problem.
To help with the process, data scientists employ many predictive analytics tools that make it easier and faster to run multiple permutations and analyses on a dataset in order to measure the impact of each variable on that dataset.
When there is a large amount of data to work with, you can employ PCA for help.
Certainly, you could find a correlation between the weather in a given country and the performance of its stock market, or between the color of a person's shoes, the route he or she takes to the office, and the performance of his or her portfolio for that day. However, including such variables in a predictive model is more than just overfitting; it's misleading, and it leads to false predictions.
PCA uses a mathematically sound approach to determine the subset of your dataset that includes the most important features; by building your model on that smaller dataset, you get a model that has predictive value for the overall, bigger dataset you're working with. In short, PCA helps you make sense of your variables by identifying the subset of variables responsible for the most variation within your original dataset. It helps you spot redundancy; it helps you discover when two (or more) variables are telling you the same thing.
Moreover, principal component analysis takes your multidimensional dataset and produces a new dataset whose variables are linear combinations of the variables in the original dataset. The variables of the output dataset are mutually uncorrelated and are ordered by variance: the first principal component has the largest variance, the second the next largest, and so on. In this regard, PCA can also be considered a technique for constructing features.
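A minimal NumPy sketch of PCA, computed via SVD of the mean-centered data, illustrates these properties: the components come out ordered by variance, and projecting onto the first few reduces dimensionality. The synthetic dataset deliberately contains two highly correlated features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two strongly correlated features plus one noise feature.
x = rng.normal(size=100)
data = np.column_stack([x,
                        2 * x + 0.05 * rng.normal(size=100),
                        rng.normal(size=100)])

# PCA via SVD of the mean-centered data matrix.
centered = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Fraction of variance explained by each component, largest first.
explained = s**2 / np.sum(s**2)

# Project onto the first two components: dimensionality 3 -> 2.
reduced = centered @ Vt[:2].T
```

Because the first two columns are nearly redundant, the first principal component alone captures most of the variation, which is exactly the redundancy-spotting behavior described above.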
To preserve the performance of the model, you may need to carefully evaluate the effectiveness of each variable, measuring its usefulness in the shaping of the final model.
PCA is especially useful when the variables within a dataset are highly correlated; a dataset of uncorrelated predictive variables only complicates the task of reducing the dimensionality of multivariate data. Many other techniques can be used here in addition to PCA, such as forward feature selection and backward feature elimination (covered in this chapter).
Singular value decomposition (SVD) represents a dataset by eliminating the less important parts and generating an accurate approximation of the original dataset. In this regard, SVD and PCA are methods of data reduction.
SVD will take a matrix as an input and decompose it into a product of three simpler matrices.
An m by n matrix M can be represented as a product of three other matrices as follows:
M = U * S * V^T
where U is an m by r matrix, V is an n by r matrix, and S is an r by r diagonal matrix; r is the rank of the matrix M. The * represents matrix multiplication, and the superscript T indicates matrix transposition.
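A small NumPy example, using a made-up ratings-style matrix, shows the decomposition and a low-rank approximation that keeps only the top concepts:

```python
import numpy as np

# A small made-up ratings-style matrix (rows: reviewers, columns: books).
M = np.array([[5.0, 5.0, 0.0, 1.0],
              [4.0, 5.0, 0.0, 0.0],
              [0.0, 1.0, 5.0, 4.0],
              [0.0, 0.0, 4.0, 5.0]])

# Decompose M into U, S (as a vector of singular values), and V^T.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keeping all singular values reconstructs M exactly.
exact = U @ np.diag(s) @ Vt

# Keeping only the top-2 "concepts" yields a close approximation.
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

The two dominant singular values correspond to the two blocks of similar rating behavior, so the rank-2 approximation stays close to the original matrix while discarding the less important parts.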
When fewer concepts can describe the data in a matrix, or can relate the matrix's columns to its rows, SVD is a very useful tool for extracting those concepts. For example, a dataset might contain book ratings, where the reviews are the rows and the books are the columns. The books can be grouped by type or domain, such as literature and fiction, history, biographies, or children's and teen books. Those groupings are the concepts that SVD can help extract.
These concepts must be meaningful and conclusive. If we keep only a few concepts or dimensions to describe a larger dataset, our approximation will be less accurate; that's precisely why it's important to eliminate only those concepts that are less important and less relevant to the overall dataset.
Latent semantic indexing is a data-mining and natural language processing technique used in document retrieval and word similarity. It employs SVD to group documents into concepts made up of the different words found in those documents. The universe of words can be very large, and various words can be grouped into a single concept. SVD helps reduce the noisy correlation between words and their documents, and it gives us a representation of that universe using far fewer dimensions than the original dataset.
SVD and PCA have been used in classification and clustering (see Chapter 6 and Chapter 7). Generating those concepts is just a form of classification and grouping the data. Both have also been used for collaborative filtering (see Chapter 2).
When you have a dataset, selecting the most relevant features is what makes or breaks the model. The more predictive your features are, the more successful your model will be.
Algorithms and tools can help with feature selection and feature extraction, and you may even need to rank your features by importance. Relying solely on brute force is always an option; some scientists adopt a funnel approach, going through their set of features one by one and selecting the most relevant ones. However, this is time consuming, risks missing important combinations, and becomes more complex when features are highly dependent on one another.
Very often, you won't be sure which features to include and which to disregard. If you follow a trial-and-error approach, adding or removing one feature at a time, you'll see that such additions and subtractions can have a major impact on the model you're building. The result can vary substantially depending on whether you include one feature or another, and the approach becomes even more complex if one feature is relevant only in the presence of another. Such an approach is especially challenging when the features are highly correlated: a feature can have a big impact on the analysis when grouped with another feature, yet have no effect by itself; its effect manifests only in combination with other features and vanishes without them.
Say you're building a decision tree as your model. That tree can grow or shrink according to the features you include or exclude. Furthermore, often you won't be sure which model is better, especially if your dataset is small and you don't have enough data to test with or to make an informed decision about the outcome. Beyond the importance of spending the time necessary to get this part of the process right, it's here that experience and tools make a difference. It's also at this step that predictive analytics shows itself to be a discipline that is partly art and partly science.
Expect the data to be overwhelming. Only a handful of projects won't have enough data to build accurate models; most suffer from an abundance of data. Nowadays we're experiencing exponential growth in data, and this abundance applies as much to the dimensionality of the data as it does to the sample size. As a result, the data may include a lot of noise, and differentiating the signal from the noise is at the core of what data scientists do.
In some applications, such as bioinformatics or document classification, it's common for a dataset to have thousands of features. Not all features are important for all problems. Feature selection and extraction are two methods that can help reduce the dimensionality of the data set and identify relevant features to work with.
Both feature extraction and feature selection will improve the predictive power of your model and speed up its performance.
Feature selection is the process of selecting a subset of features from the original features. The subset is selected without undergoing any transformation and while keeping the properties of the original features intact. For example, a scientist examining multiple proteins and their effects on a disease is looking to identify which proteins are most relevant in the analysis. For a loan application, your credit score is probably the most important deciding factor.
In a classification problem (see Chapter 7) where the training data is already labeled and the classes are known (like spam and non-spam emails), selecting the most important features determining whether an email is spam can be iterative. As long as the features you're selecting still produce the correct class, you're heading in the right direction.
Feature selection for classification aims at selecting a subset of the original features without impacting the accuracy of the classifier. The subset of features should be as good a predictor of the classification as the full set of available features.
Feature selection is very complex and the degree of the difficulty will vary with the dimensionality of the data, the correlation level among features, whether they are highly dependent or independent, and the structure of the data.
There are several widely used approaches to feature selection; they're commonly grouped into filter methods, wrapper methods, and embedded methods.
Feature extraction transforms your original features and creates a small set of new ones, resulting in much lower dimensionality. As shown in the preceding sections, dimensionality reduction can help you get rid of redundant features and noise in your data. Feature extraction projects and maps your features onto a new set of features that is much smaller than the original.
We've discussed the idea of creating concepts when analyzing books: generating meaningful groups such as fiction and literature, history, or biographies, and then using those new concepts to analyze the data. This transformation from individual books to concepts, or logical groupings, is a kind of transformation that results in dimensionality reduction. However, the new features produced by feature extraction still need further analysis before you can fully make sense of the data and eventually build your predictive model.
Another example of feature extraction, often used in text analytics, is the capability to transform text into a numerical representation, such as a plain term frequency (word count) or term frequency-inverse document frequency (TFIDF).
TFIDF is often used to adjust for the fact that some words occur more frequently than others overall; a word's term frequency within a document is offset by that word's frequency across all documents.
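A pure-Python sketch of the computation follows. The three toy documents are made up, and real implementations differ in smoothing and normalization details:

```python
import math

# Three tiny made-up documents.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)         # term frequency in this doc
    df = sum(word in d for d in tokenized)  # documents containing the word
    idf = math.log(len(tokenized) / df)     # inverse document frequency
    return tf * idf

# "the" appears in two of three documents, so its weight is discounted
# relative to a rarer, more informative word such as "cat".
weights = {w: tf_idf(w, tokenized[0]) for w in tokenized[0]}
```

Even though "the" occurs twice in the first document and "cat" only once, "cat" ends up with the higher TFIDF weight, which is exactly the adjustment described above.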
Feature selection keeps the original set of features; it only lowers the number of the original features. Feature selection aims at eliminating redundancy and maximizing relevance.
When working with features, it's natural to think of ranking them. Don't you want to know which feature is the most important in your dataset? Which feature or set of features is the definitive indicator of a given class or label? For a biologist in her lab, it makes perfect sense to zero in on the one gene, or subset of genes, responsible for a biological condition. The model can then simply watch for the presence or expression of that gene to predict the expected behavior.
Ranking methods help select features and reduce the dimensionality of the dataset. The algorithms used to rank features can be divided into two broad categories.
In a decision-tree model, the importance of an attribute is measured using an entropy-based approach: the information gain an attribute provides in determining a given classification is used to select the features for the model. Through this process, a decision tree focuses on the relevant features that lead to a given decision.
The forward selection algorithm takes a search-based approach to selecting features: at each iteration, it looks for the feature that, joined with those already chosen, provides the highest accuracy. Feature selection algorithms such as forward selection and backward elimination are widely used despite their relatively high computational cost.
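Here's a sketch of greedy forward selection in NumPy, scoring each candidate subset with the R-squared of an ordinary least-squares fit. The synthetic data, the scoring function, and the 0.01 stopping threshold are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: only features 0 and 2 actually drive the target.
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.1 * rng.normal(size=200)

def r2(features):
    """R-squared of an ordinary least-squares fit on a feature subset."""
    A = np.column_stack([X[:, list(features)], np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid.var() / y.var()

# Greedy forward selection: at each iteration add the feature that
# improves the score most; stop when the improvement is negligible.
selected, best = [], 0.0
for _ in range(X.shape[1]):
    gains = {j: r2(selected + [j]) for j in range(X.shape[1])
             if j not in selected}
    j, score = max(gains.items(), key=lambda kv: kv[1])
    if score - best < 0.01:
        break
    selected.append(j)
    best = score
```

On this data the loop picks exactly the two informative features and then stops, illustrating both the iterative search and why the method is expensive: every iteration refits a model for each remaining candidate.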
Raw data is a potential resource, but it can't be usefully analyzed until it's been given a consistent structure. Data residing in multiple systems has to be collected and transformed to get it ready for analysis. The collected data should reside in a separate system so it won't interfere with the live production system. While building your model, split your dataset into a training dataset to train the model, and a test dataset to validate the model.
After it's initially collected, data is usually in a dispersed state; it resides in multiple systems or databases. Before you can use it for a predictive analytics model, you have to consolidate it into one place. Also, you don't want to work on data that resides in operational systems — that's asking for trouble. Instead, place a portion of it somewhere where you can work on it freely without affecting operations. ETL (extract, transform and load) is the process that achieves that desirable state.
Many organizations have multiple databases; your predictive model will likely utilize data from all of them. ETL is the process that collects all the information needed and places it in a separate environment where you can run your analysis. ETL is not, however, a once-and-for-all operation; usually it's an ongoing process that refreshes the data and keeps it up to date. Be sure you run your ETL processes at night or at other times when the load on the operational system is low.
You should follow a systematic approach to build your ETL processes to fulfill the business requirements. It's a good practice to keep a copy of the original data in a separate area so you can always go back to it in case an error disrupts the transformation or the loading steps of the processes. The copy of the original data serves as a backup that you can use to rebuild the entire dataset employed by your analysis if necessary. The goal is to head off Murphy's Law and get back on your feet quickly if you have to rerun the entire ETL process from scratch.
Your ETL process should incorporate modularity — separating the tasks and accomplishing the work in stages. This approach has advantages in case you want to reprocess or reload the data, or if you want to use some of that data for a different analysis or to build different predictive models. The design of your ETL should be able to accommodate even major business requirement changes — with only minimal changes to your ETL process.
After the loading step of ETL, after you get your data into that separate database, data mart, or warehouse, you'll need to keep the data fresh so the modelers can rerun previously built models on new data.
Implementing a data mart for the data you want to analyze and keeping it up to date will enable you to refresh the models. You should, for that matter, refresh the operational models regularly after they're deployed; new data can increase the predictive power of your models. New data can allow the model to depict new insights, trends, and relationships.
Having a separate environment for the data also allows you to achieve better performance for the systems used to run the models. That's because you aren't overloading operational systems with the intensive queries or analysis required for the models to run.
Data keeps on coming — more of it, faster, and in greater variety all the time. Implementing automation and the separation of tasks and environments can help you manage that flood of data and support the real-time response of your predictive models.
When your data is ready and you're about to start building your predictive model, it's useful to outline your testing methodology and draft a test plan. Testing should be driven by the business goals you've gathered and documented, and for which you've collected all the necessary data.
Right off the bat, you should devise a method to test whether a business goal has been attained successfully. Predictive analytics measures the likelihood of a future outcome, and the only way to prepare such a test is to train your model on past data and then see what it can do when it's up against future data. Of course, you can't risk running an untried model on real future data, so you need to use existing data to simulate future data realistically. To do so, split the data you're working on into training and test datasets.
When you split your data into test and training datasets, you're effectively avoiding the overfitting that can arise from overtraining the model on the entire dataset, where the model picks up noise patterns or specific features that belong only to the sample dataset and don't apply to other datasets. (See Chapter 15 for more on the pitfalls of overfitting.)
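A minimal NumPy sketch of such a split: shuffle the record indices, then hold out a portion as the test set. The 1,000 records and the 25 percent holdout are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)

# 1,000 hypothetical records: shuffle, then hold out 25% for testing.
n = 1000
indices = rng.permutation(n)
n_test = n // 4
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Train only on train_idx rows; evaluate only on test_idx rows.
# The held-out test set simulates "future" data the model never saw.
```

Shuffling before splitting matters: if the records are stored in some order (by date or by customer, say), taking the last 25 percent without shuffling can give a test set that isn't representative of the whole.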