Chapter 9
IN THIS CHAPTER
Documenting your business objectives
Processing your data
Sampling your data
Transforming your data
Extracting features
Selecting features
The roadmap to building a successful predictive model involves defining business objectives, preparing the data, and then building and deploying the model. This chapter delves into data preparation, which involves processing, sampling, and transforming your data, and then extracting and selecting the features your model will use.
Data is a four-letter word. It's amazing that such a small word can describe trillions of gigabytes of information: customer names, addresses, products, discounted versus original prices, store codes, times of purchase, supplier locations, run rates for print advertising, the color of your delivery vans. And that's just for openers. Data is, or can be, literally everything.
Not every source or type of data will be relevant to the business question you're trying to answer. Predictive analytics models are built from multiple data sources, and one of the first critical steps is to determine which sources to include in your model. If you're trying to determine (for example) whether customers who subscribe to e-magazines in the spring are more likely to purchase hardcover print books in the fall, you may decide to omit the January paperback sales records. Then you have to vet the specific records and attributes of each possible source for format, quantity, and quality. Data may be a small word, but it brings on a lot of big tasks.
At this stage, presumably, you've already sat down with the business managers and collected the objectives they're after. Now you have to go into detail, evaluate what information sources will help achieve the objectives, and choose the variables you'll analyze for operational use.
Understanding what the stakeholders really want from the project is challenging; you may encounter multiple competing needs as well as limitations on what can be realistically done.
For this stage, you and the intended recipient of the results have to roll up your sleeves and brainstorm potential information sources. The goal is to determine what information, from which sources, will be relevant to reaching the type of concrete outcome that will provide true value for both the business and the customer. Without this in-the-trenches activity, your results may be no more than academic — of little practical value to your client. You may uncover fascinating inferences from (say) the accessory division's second-quarter sales records — and discover just how likely cross-dressers who wear flat shoes are to purchase faux leather purses — but this will fall on deaf ears if the accessories division is discontinuing its faux leather product line next quarter.
A business objective can be quantifiable and objective — for example, “identify two currently unknown major customer groups with a greater-than-50-percent likelihood of churning in the next six months” or “identify three supplier groups in Asia with decreasing delivery timeframes over the next five years.” You might also list more subjective goals such as “provide valuable insights into the effectiveness of customer rewards incentives programs.”
Typically, there will be many subsets of business questions that the customer would like to address — any and all of which can provide insights into the main question. For example, the primary business goal might be to identify unhappy customers before they churn (move to a competing product). Related business questions could be: “How many times does a customer abandon a shopping cart online before purchasing from another online retailer?” “Does decreasing the threshold for free shipping from $100 to $75 prevent churn?” Table 9-1 shows some handy examples of primary and secondary business questions.
TABLE 9-1 Primary and Secondary Business Questions
Primary | Secondary
How do we increase print book sales? | What percentage of people who bought or downloaded a fiction e-book in FY12 then bought a print fiction paperback in FY13?
How do we predict the effect of health-based decisions on fitness-related products more accurately? | If customers are buying fewer French fries this year, will they buy more or fewer yoga mats next year?
How will a new tablet affect sales of existing digital products? | Are iPad users less likely to purchase laptops?
Okay, suppose the high-level objectives have been documented and now you're moving into the details. What are the project requirements and timeframes you'll need to fulfill and follow? What are the requirements for your business, project, system, models, and data?
In order to avoid mismatched expectations, project managers should meet with all relevant groups in the client department. For marketing, this could include social media marketing managers, marketing analytics specialists, or database marketing managers. Information sources to specify can include (for example) customer lists, budgets, schedules and other logistics.
Thorough documentation — and key management sign-off — are critical to ensure that everyone is embarking on the coming intensive effort with the same understanding, commitment, and expectations.
Don't be surprised if preparing your data to be fed into the predictive model is as tedious a task as it is crucial. Understanding data quality, availability, sources, and any existing constraints will have a direct effect on the successful implementation of your predictive analytics project.
The raw data usually has to be cleaned — and possibly integrated, combined with other datasets, and used to derive new data variables. Hence data quality and quantity should be carefully and thoroughly examined across all data sources used to build the model.
In this exploration phase, you'll gain intimate knowledge of your data — which in turn will help you choose the relevant variables to analyze. This understanding will also help you evaluate the results of your model.
For your analytics project, you'll need to identify appropriate sources of data, pool data from those sources, and put it in a structured, well-organized format. These tasks can be very challenging and will likely require careful coordination among different data stewards across your organization.
You'll also need to select the variables you're going to analyze. This process must take data constraints, project constraints, and business objectives into consideration.
Expect to spend considerable time on this phase of the project. Data collection, data analysis, and the process of addressing data content, quality, and structure can add up to a time-consuming to-do list.
During the process of data identification, it helps to understand your data and its properties; this knowledge will help you choose which algorithm to use to build your model. For example, time series data can be analyzed by regression algorithms; classification algorithms can be used to analyze discrete data.
Variable selection is affected by how well you understand the data. Don't be surprised if you have to look at and evaluate hundreds of variables, at least at first. Fortunately, as you work with those variables and start gaining key insights, you start narrowing them down to a few dozen. Also, expect variable selection to change as your understanding of the data changes throughout the project.
You'll need to make sure that the data is clean of extraneous stuff before you can use it in your model. This includes finding and correcting any records that contain erroneous values, and attempting to fill in any missing values. You'll also need to decide whether to include duplicate records (two customer accounts, for example). The overall goal is to ensure the integrity of the information you're using to build your predictive model. Pay particular attention to the completeness, correctness, and timeliness of the data.
Missing data may simply mean that particular information was never recorded. In such cases, you can attempt to fill in as much as you can; suitable defaults can easily be added to fill in the blanks of certain fields. For example, for patients in a hospital maternity ward whose gender field is missing a value, the application can simply fill it in as female. Similarly, for any male admitted to a hospital with a missing value for pregnancy status, that record can be filled in as not applicable. A missing zip code for an address can often be inferred from the street name and city provided in that address.
Where the information is unknown and can't be inferred, you need to use a value other than a blank space to indicate that the data is missing without affecting the correctness of the analysis. A blank in the data can mean multiple things, most of them not good or useful. Whenever you can, specify the nature of the blank with a meaningful placeholder. For example, if a numeric field consists entirely of small positive numbers (values between 0 and 100), you can define -999.99 as the placeholder for missing data.
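As a minimal pandas sketch of both ideas, here is a rule-based default for a missing categorical field plus a sentinel placeholder for missing numeric values. The table, its column names, and the -999.99 sentinel are illustrative assumptions, not a prescribed layout:

```python
import pandas as pd

# Hypothetical patient records; column names are illustrative only.
df = pd.DataFrame({
    "ward":   ["maternity", "maternity", "cardiology"],
    "gender": ["F", None, "M"],
    "score":  [72.5, None, 88.0],   # known to lie between 0 and 100
})

# Rule-based default: maternity-ward patients with no recorded gender
# can safely be filled in as female.
mask = (df["ward"] == "maternity") & (df["gender"].isna())
df.loc[mask, "gender"] = "F"

# For numeric data known to be small and positive, mark values that
# remain unknown with an impossible sentinel such as -999.99.
df["score"] = df["score"].fillna(-999.99)
```

Because -999.99 can never occur naturally in a 0-to-100 field, downstream analysis can always distinguish "missing" from a genuine measurement.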
Just as it's possible to define a rose in a cornfield as a weed, outliers can mean different things to different analyses. Some models are built solely to track outliers and flag them: fraud-detection and criminal-activity-monitoring models are keenly interested in outliers, which in those cases indicate something unwanted taking place. In such cases, keeping the outliers in the dataset is recommended. However, when outliers are anomalies within the data that will only skew the analyses and lead to erroneous results, remove them from your data. Otherwise the model may try so hard to predict the outliers that it ends up failing to predict anything else.
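As a quick illustration, here's one common way to flag outliers in pandas using the interquartile-range rule. The purchase amounts are made-up sample data, and the 1.5 * IQR threshold is a rule of thumb rather than a universal standard:

```python
import pandas as pd

# Illustrative daily purchase amounts with one obvious outlier.
amounts = pd.Series([21.0, 19.5, 22.3, 20.1, 18.9, 950.0, 20.7])

# Flag values outside 1.5 * IQR of the middle 50 percent of the data,
# a common, robust rule of thumb for spotting outliers.
q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
is_outlier = (amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)

# A fraud-detection model would keep (and study) the flagged rows;
# a model of typical purchasing behavior might drop them instead.
outliers = amounts[is_outlier]
cleaned = amounts[~is_outlier]
```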
Duplication in the data can also be useful or a nuisance. Some duplication is necessary, indicates value, and reflects an accurate state of the data. For example, a customer with multiple accounts can be represented by multiple entries that are (technically, anyway) duplicates of the same record. Another example is a customer who has both a work phone and a personal phone with the same company, with the bill going to the same address; that's something valuable to know. By the same token, when duplicate records don't contribute value to the analysis and aren't necessary, removing them can be of tremendous value. This is especially true for large datasets, where removing duplicate records can simplify the data and reduce the time needed for analysis.
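When the duplicates really are redundant, removing them is a one-liner in pandas. The customer table here is hypothetical:

```python
import pandas as pd

# Hypothetical customer table where one customer appears twice.
customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "name":        ["Ann", "Raj", "Raj", "Mei"],
    "city":        ["Oslo", "Pune", "Pune", "Kyoto"],
})

# Fully identical rows usually add no value on large datasets;
# dropping them shrinks the data and speeds up analysis.
deduped = customers.drop_duplicates()
```

Note that this removes only rows that are identical in every column; a customer with two genuinely different accounts would be preserved.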
Derived attributes are entirely new records constructed from one or more existing attributes. An example would be creating records that identify books that are bestsellers at book fairs. Raw data may not capture such records, but for modeling purposes those derived records can be important. Price-to-earnings ratio and the 200-day moving average are two examples of derived data heavily used in financial applications.
Derived attributes can be obtained from a simple calculation, such as deducing age from birth date. They can also be computed by summarizing information from multiple records. For example, converting a table of customers and their purchased books into a summary table can enable you to track the number of books sold via a recommender system, through targeted marketing, and at a book fair, and to identify the demographics of the customers who bought those books.
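Both kinds of derived attributes can be sketched in a few lines of pandas. The purchase records, the channel names, and the reference year 2024 are illustrative assumptions:

```python
import pandas as pd

# Hypothetical purchase records; channel names are made up.
purchases = pd.DataFrame({
    "customer":   ["Ann", "Ann", "Raj", "Mei", "Raj"],
    "channel":    ["book fair", "recommender", "book fair",
                   "targeted marketing", "book fair"],
    "birth_year": [1980, 1980, 1992, 1975, 1992],
})

# Simple calculation: derive age from birth year (2024 assumed current).
purchases["age"] = 2024 - purchases["birth_year"]

# Summarization: count books sold per sales channel.
books_per_channel = purchases.groupby("channel").size()
```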
Generating such additional attributes brings additional predictive power to the analysis; in fact, many such attributes are created precisely to probe their potential predictive power. Some predictive models may use more derived attributes than attributes in their raw state. If some derived attributes prove especially predictive, it makes sense to automate the process that generates them.
The data used in predictive models is usually pooled from multiple sources. Your analysis can draw from data scattered across multiple data formats, files, and databases, or multiple tables within the same database. Pooling the data together and combining it into an integrated format for the data modelers to use is essential.
If your data contains any hierarchical content, it may need to be flattened. Some data has hierarchical characteristics such as parent-child relationships, or records that are made up of other records. For example, a product such as a car may have multiple makers; flattening the data, in this case, means including each maker as an additional feature of the record you're analyzing. Another example: a single customer can have multiple transactions.
Flattening is essential when data merged from multiple related records must form a single, clearer picture. For example, analyzing adverse events for several drugs made by several companies may require flattening the data at the substance level. Doing so removes the one-to-many relationships (in this case, many makers and many substances for one product) that cause excessive duplication by repeating the product and maker information in every substance entry.
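Here's a small pandas sketch of flattening a one-to-many relationship into one row per product, with one indicator column per maker. The drug and maker names are made up:

```python
import pandas as pd

# Hypothetical one-to-many data: each drug has several makers,
# so each drug is repeated once per maker.
rows = pd.DataFrame({
    "drug":  ["D1", "D1", "D2", "D2", "D2"],
    "maker": ["Acme", "Zenith", "Acme", "Nova", "Zenith"],
})

# Flatten to one row per drug, with an indicator column per maker.
flat = pd.crosstab(rows["drug"], rows["maker"]).clip(upper=1)
```

The result has one row per drug and one 0/1 column per maker, so the duplicated drug entries disappear while the maker information survives as features.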
Flattening forces you to think about reducing the dimensionality of the data, which is represented by the number of features a record or an observation has. For example, a customer can have the following features: name, age, address, items purchased. When you start your analysis, you may find yourself evaluating records with many features, only some of which are important to the analysis. So you should eliminate all but the very few features that have the most predictive power for your specific project.
Reducing the dimensionality of the data can be achieved by putting all the data in a single table that uses multiple columns to represent attributes of interest. At the beginning of the analysis, of course, you have to evaluate a large number of columns, but that number can be narrowed down as the analysis progresses. The process can be aided by reconstituting the fields; for example, by grouping the data into categories that have similar characteristics.
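As a sketch of reconstituting a field by grouping values into coarser categories with similar characteristics, pandas can bin a numeric column into labeled ranges. The ages, bin edges, and labels here are illustrative choices:

```python
import pandas as pd

# Illustrative customer ages, regrouped into coarser categories.
ages = pd.Series([22, 35, 47, 68, 15, 53])
groups = pd.cut(ages, bins=[0, 18, 40, 60, 120],
                labels=["minor", "young adult", "middle-aged", "senior"])
```

A model can then work with four categories instead of a continuum of raw ages, which often simplifies the analysis without losing the signal.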
The resultant dataset — the cleaned dataset — is usually put in a separate database for the analysts to use. During the modeling process, this data should be easily accessed, managed, and kept up to date.
Principal component analysis (PCA) is a valuable technique that is widely used in data science. It studies a dataset to learn the most relevant variables responsible for the highest variation in that dataset. PCA is mostly used as a data reduction technique.
While building predictive models, you may need to reduce the number of features describing your dataset. It's very useful to reduce this high dimensionality of data through approximation techniques, at which PCA excels. The approximated data summarizes all the important variations of the original data.
For example, the feature set of data about stocks may include stock prices, daily highs and lows, trading volumes, 200-day moving averages, price-to-earnings ratios, relative strength to other markets, interest rates, and strength of currencies.
Finding the most important predictive variables is at the core of building a predictive model. A common way of doing it is brute force: start with as many relevant variables as you can, and then use a funnel approach to eliminate features that have no impact or no predictive value. Intelligence and insight come from engaging the business stakeholders, who may have hunches about which variables will have the biggest impact on the analysis. The experience of the data scientists on the project also matters in knowing which variables to work with and which algorithms to use for a specific data type or domain-specific problem.
To help with the process, data scientists employ many predictive analytics tools that make it easier and faster to run multiple permutations and analyses on a dataset in order to measure the impact of each variable on that dataset.
When there is a large amount of data to work with, you can employ PCA for help.
Certainly, you could find a correlation between the weather in a given country and the performance of its stock market, or between the color of a person's shoes, the route he or she takes to the office, and the performance of his or her portfolio for that day. However, including such variables in a predictive model is more than just overfitting; it's misleading, and it leads to false predictions.
PCA uses a mathematically sound approach to determine the subset of your dataset that includes the most important features; by building your model on that smaller dataset, you get a model that has predictive value for the overall, bigger dataset you're working with. In short, PCA helps you make sense of your variables by identifying the subset of variables responsible for the most variation within your original dataset. It helps you spot redundancy; it helps you discover when two (or more) variables are telling you the same thing.
Moreover, principal component analysis takes your multidimensional dataset and produces a new dataset whose variables are linear combinations of the variables in the original dataset. The variables of the output dataset are mutually uncorrelated and are ordered by variance: the first principal component has the largest variance, the second the next largest, and so on. In this regard, PCA can also be considered a technique for constructing features.
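A minimal NumPy sketch of PCA, computed via SVD of the mean-centered data, illustrates these properties: the components come out ordered by variance, and projecting onto the first few reduces dimensionality. The synthetic dataset deliberately contains two highly correlated features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two strongly correlated features plus one noise feature.
x = rng.normal(size=100)
data = np.column_stack([x,
                        2 * x + 0.05 * rng.normal(size=100),
                        rng.normal(size=100)])

# PCA via SVD of the mean-centered data matrix.
centered = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Fraction of variance explained by each component, largest first.
explained = s**2 / np.sum(s**2)

# Project onto the first two components: dimensionality 3 -> 2.
reduced = centered @ Vt[:2].T
```

Because the first two columns are nearly redundant, the first principal component alone captures most of the variation, which is exactly the redundancy-spotting behavior described above.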
To preserve the performance of the model, you may need to carefully evaluate the effectiveness of each variable, measuring its usefulness in the shaping of the final model.
PCA is especially useful when the variables within a dataset are highly correlated; a dataset of uncorrelated predictive variables only complicates the task of reducing the dimensionality of multivariate data. Many other techniques can be used here in addition to PCA, such as forward feature selection and backward feature elimination (covered in this chapter).
Singular value decomposition (SVD) represents a dataset by eliminating the less important parts and generating an accurate approximation of the original dataset. In this regard, SVD and PCA are methods of data reduction.
SVD will take a matrix as an input and decompose it into a product of three simpler matrices.
An m by n matrix M can be represented as a product of three other matrices as follows:
M = U * S * V^T
where U is an m by r matrix, V is an n by r matrix, and S is an r by r diagonal matrix; r is the rank of the matrix M. The * represents matrix multiplication, and the superscript T indicates matrix transposition.
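A small NumPy example, using a made-up ratings-style matrix, shows the decomposition and a low-rank approximation that keeps only the top concepts:

```python
import numpy as np

# A small made-up ratings-style matrix (rows: reviewers, columns: books).
M = np.array([[5.0, 5.0, 0.0, 1.0],
              [4.0, 5.0, 0.0, 0.0],
              [0.0, 1.0, 5.0, 4.0],
              [0.0, 0.0, 4.0, 5.0]])

# Decompose M into U, S (as a vector of singular values), and V^T.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keeping all singular values reconstructs M exactly.
exact = U @ np.diag(s) @ Vt

# Keeping only the top-2 "concepts" yields a close approximation.
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

The two dominant singular values correspond to the two blocks of similar rating behavior, so the rank-2 approximation stays close to the original matrix while discarding the less important parts.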
When fewer concepts can describe the data in a matrix, or can relate the matrix's columns to its rows, SVD is a very useful tool for extracting those concepts. For example, a dataset might contain book ratings, where the reviews are the rows and the books are the columns. The books can be grouped by type or domain, such as literature and fiction, history, biographies, or children's and teen books. Those groupings are the concepts that SVD can help extract.
These concepts must be meaningful and conclusive. If we keep only a few concepts or dimensions to describe a larger dataset, our approximation will be less accurate; that's precisely why it's important to eliminate only those concepts that are less important and less relevant to the overall dataset.
Latent semantic indexing is a data-mining and natural language processing technique used in document retrieval and word similarity. It employs SVD to group documents into concepts made up of the different words found in those documents. The universe of words can be very large, and various words can be grouped into a single concept. SVD helps reduce the noisy correlation between words and their documents, and it gives us a representation of that universe using far fewer dimensions than the original dataset.
SVD and PCA have been used in classification and clustering (see Chapter 6 and Chapter 7). Generating those concepts is just a form of classification and grouping the data. Both have also been used for collaborative filtering (see Chapter 2).
When you have a dataset, selecting the most relevant features is what makes or breaks the model. The more predictive your features are, the more successful your model will be.
Algorithms and tools can help with feature selection and feature extraction, and you may even need to rank your features by importance. Relying solely on brute force is always an option; some scientists adopt a funnel approach, going through their set of features one by one and selecting the most relevant ones. However, this is time consuming, risks missing important combinations, and becomes more complex when features are highly dependent on one another.
Very often, you won't be sure which features to include and which to disregard. If you follow a trial-and-error approach, adding or removing one feature at a time, you'll see that such additions and subtractions can have a major impact on the model you're building. The result can vary substantially depending on whether you include one feature or another, and the approach becomes even more complex if one feature is relevant only in the presence of another. Such an approach is especially challenging when the features are highly correlated: a feature can have a big impact on the analysis when grouped with another feature, yet have no effect by itself; its effect manifests only in combination with other features and vanishes without them.
Say you're building a decision tree as your model. That tree can grow or shrink according to the features you include or exclude. Furthermore, often you won't be sure which model is better, especially if your dataset is small and you don't have enough data to test with or to make an informed decision about the outcome. Beyond the importance of spending the time necessary to get this part of the process right, it's here that experience and tools make a difference. It's also at this step that predictive analytics shows itself to be a discipline that is partly art and partly science.
Expect the data to be overwhelming. Only a handful of projects won't have enough data to build accurate models; most suffer from an abundance of data. Nowadays we're experiencing exponential growth in data, and this abundance applies as much to the dimensionality of the data as it does to the sample size. As a result, the data may include a lot of noise, and differentiating the signal from the noise is at the core of what data scientists do.
In some applications, such as bioinformatics or document classification, it's common for a dataset to have thousands of features. Not all features are important for all problems. Feature selection and extraction are two methods that can help reduce the dimensionality of the data set and identify relevant features to work with.
Both feature extraction and feature selection will improve the predictive power of your model and speed up its performance.
Feature selection is the process of selecting a subset of features from the original features. The subset is selected without undergoing any transformation and while keeping the properties of the original features intact. For example, a scientist examining multiple proteins and their effects on a disease is looking to identify which proteins are most relevant in the analysis. For a loan application, your credit score is probably the most important deciding factor.
In a classification problem (see Chapter 7) where the training data is already labeled and the classes are known (like spam and non-spam emails), selecting the most important features determining whether an email is spam can be iterative. As long as the features you're selecting still produce the correct class, you're heading in the right direction.
Feature selection for classification aims at selecting a subset of the original features without impacting the accuracy of the classifier. The subset of features should be as good a predictor of the classification as the full set of available features.
Feature selection is very complex and the degree of the difficulty will vary with the dimensionality of the data, the correlation level among features, whether they are highly dependent or independent, and the structure of the data.
There are several widely used approaches to feature selection; they're commonly grouped into filter methods, wrapper methods, and embedded methods.
Feature extraction transforms your original features and creates a small set of new ones, resulting in much lower dimensionality. As shown in the preceding sections, dimensionality reduction can help you get rid of redundant features and noise in your data. Feature extraction projects and maps your features onto a new set of features that is much smaller than the original.
We've discussed the idea of creating concepts when analyzing books: generating meaningful groups such as fiction and literature, history, or biographies, and then using those new concepts to analyze the data. This transformation from individual books to concepts, or logical groupings, is a kind of transformation that results in dimensionality reduction. However, the new features produced by feature extraction still need further analysis before you can fully make sense of the data and eventually build your predictive model.
Another example of feature extraction, often used in text analytics, is the capability to transform text into a numerical representation, such as a plain term frequency (word count) or term frequency-inverse document frequency (TFIDF).
TFIDF is often used to adjust for the fact that some words occur more frequently than others overall; a word's term frequency within a document is offset by that word's frequency across all documents.
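A pure-Python sketch of the computation follows. The three toy documents are made up, and real implementations differ in smoothing and normalization details:

```python
import math

# Three tiny made-up documents.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)         # term frequency in this doc
    df = sum(word in d for d in tokenized)  # documents containing the word
    idf = math.log(len(tokenized) / df)     # inverse document frequency
    return tf * idf

# "the" appears in two of three documents, so its weight is discounted
# relative to a rarer, more informative word such as "cat".
weights = {w: tf_idf(w, tokenized[0]) for w in tokenized[0]}
```

Even though "the" occurs twice in the first document and "cat" only once, "cat" ends up with the higher TFIDF weight, which is exactly the adjustment described above.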
Feature selection keeps the original set of features; it only lowers the number of the original features. Feature selection aims at eliminating redundancy and maximizing relevance.
When working with features, it's natural to think of ranking them. Don't you want to know which feature is the most important in your dataset? Which feature or set of features is the definitive indicator of a given class or label? For a biologist in her lab, it makes perfect sense to zero in on the one gene, or subset of genes, responsible for a biological condition. The model can then simply watch for the presence or expression of that gene to predict the expected behavior.
Ranking methods help select features and reduce the dimensionality of the dataset. The algorithms used to rank features can be divided into two broad categories.
In a decision-tree model, the importance of an attribute is measured using an entropy-based approach: the information gain an attribute provides in determining a given classification is used to select the features for the model. Through this process, a decision tree focuses on the relevant features that lead to a given decision.
The forward selection algorithm takes a search-based approach to selecting features: at each iteration, it looks for the feature that, joined with those already chosen, provides the highest accuracy. Feature selection algorithms such as forward selection and backward elimination are widely used despite their relatively high computational cost.
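Here's a sketch of greedy forward selection in NumPy, scoring each candidate subset with the R-squared of an ordinary least-squares fit. The synthetic data, the scoring function, and the 0.01 stopping threshold are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: only features 0 and 2 actually drive the target.
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.1 * rng.normal(size=200)

def r2(features):
    """R-squared of an ordinary least-squares fit on a feature subset."""
    A = np.column_stack([X[:, list(features)], np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid.var() / y.var()

# Greedy forward selection: at each iteration add the feature that
# improves the score most; stop when the improvement is negligible.
selected, best = [], 0.0
for _ in range(X.shape[1]):
    gains = {j: r2(selected + [j]) for j in range(X.shape[1])
             if j not in selected}
    j, score = max(gains.items(), key=lambda kv: kv[1])
    if score - best < 0.01:
        break
    selected.append(j)
    best = score
```

On this data the loop picks exactly the two informative features and then stops, illustrating both the iterative search and why the method is expensive: every iteration refits a model for each remaining candidate.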
Raw data is a potential resource, but it can't be usefully analyzed until it's been given a consistent structure. Data residing in multiple systems has to be collected and transformed to get it ready for analysis. The collected data should reside in a separate system so it won't interfere with the live production system. While building your model, split your dataset into a training dataset to train the model, and a test dataset to validate the model.
After it's initially collected, data is usually in a dispersed state; it resides in multiple systems or databases. Before you can use it for a predictive analytics model, you have to consolidate it into one place. Also, you don't want to work on data that resides in operational systems — that's asking for trouble. Instead, place a portion of it somewhere where you can work on it freely without affecting operations. ETL (extract, transform and load) is the process that achieves that desirable state.
Many organizations have multiple databases; your predictive model will likely utilize data from all of them. ETL is the process that collects all the information needed and places it in a separate environment where you can run your analysis. ETL is not, however, a once-and-for-all operation; usually it's an ongoing process that refreshes the data and keeps it up to date. Be sure you run your ETL processes at night or at other times when the load on the operational system is low.
You should follow a systematic approach to build your ETL processes to fulfill the business requirements. It's a good practice to keep a copy of the original data in a separate area so you can always go back to it in case an error disrupts the transformation or the loading steps of the processes. The copy of the original data serves as a backup that you can use to rebuild the entire dataset employed by your analysis if necessary. The goal is to head off Murphy's Law and get back on your feet quickly if you have to rerun the entire ETL process from scratch.
Your ETL process should incorporate modularity — separating the tasks and accomplishing the work in stages. This approach has advantages in case you want to reprocess or reload the data, or if you want to use some of that data for a different analysis or to build different predictive models. The design of your ETL should be able to accommodate even major business requirement changes — with only minimal changes to your ETL process.
After the loading step of ETL, after you get your data into that separate database, data mart, or warehouse, you'll need to keep the data fresh so the modelers can rerun previously built models on new data.
Implementing a data mart for the data you want to analyze and keeping it up to date will enable you to refresh the models. You should, for that matter, refresh the operational models regularly after they're deployed; new data can increase the predictive power of your models. New data can allow the model to depict new insights, trends, and relationships.
Having a separate environment for the data also allows you to achieve better performance for the systems used to run the models. That's because you aren't overloading operational systems with the intensive queries or analysis required for the models to run.
Data keeps on coming — more of it, faster, and in greater variety all the time. Implementing automation and the separation of tasks and environments can help you manage that flood of data and support the real-time response of your predictive models.
When your data is ready and you're about to start building your predictive model, it's useful to outline your testing methodology and draft a test plan. Testing should be driven by the business goals you've gathered and documented, and for which you've collected all the necessary data.
Right off the bat, you should devise a method to test whether a business goal has been attained successfully. Predictive analytics measures the likelihood of a future outcome, and the only way to prepare such a test is to train your model on past data and then see what it can do when it's up against future data. Of course, you can't risk running an untried model on real future data, so you need to use existing data to simulate future data realistically. To do so, split the data you're working on into training and test datasets.
When you split your data into test and training datasets, you're effectively avoiding the overfitting that can arise from overtraining the model on the entire dataset, where the model picks up noise patterns or specific features that belong only to the sample dataset and don't apply to other datasets. (See Chapter 15 for more on the pitfalls of overfitting.)
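A minimal NumPy sketch of such a split: shuffle the record indices, then hold out a portion as the test set. The 1,000 records and the 25 percent holdout are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)

# 1,000 hypothetical records: shuffle, then hold out 25% for testing.
n = 1000
indices = rng.permutation(n)
n_test = n // 4
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Train only on train_idx rows; evaluate only on test_idx rows.
# The held-out test set simulates "future" data the model never saw.
```

Shuffling before splitting matters: if the records are stored in some order (by date or by customer, say), taking the last 25 percent without shuffling can give a test set that isn't representative of the whole.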