Which features should be used to create a predictive model is not only a vital question but also a difficult one, and answering it may require deep knowledge of the problem domain. It is possible to automatically select those features in the data that are most useful or most relevant for the problem at hand. Considering these questions, this chapter covers feature engineering in detail, explaining the reasons to apply it along with some best practices.
In addition, we will provide theoretical descriptions and examples of feature extraction, transformation, and selection as applied in large-scale machine learning, using both the Spark MLlib and Spark ML APIs. Furthermore, this chapter also covers the basic idea of advanced feature engineering (also known as extreme feature engineering).
Please note that you will need to have R and RStudio installed on your machine before proceeding with this chapter, since an example of exploratory data analysis will be shown using R.
In a nutshell, the following topics will be covered throughout this chapter:
Even though feature engineering is an informal topic, it is considered an essential part of applied machine learning. Andrew Ng, one of the leading scientists in the area of machine learning, defined the term feature engineering in his work Machine Learning and AI via Brain simulations (see also: https://en.wikipedia.org/wiki/Andrew_Ng) as follows:
Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering.
Based on the preceding definition, we can argue that feature engineering is actually human intelligence, not artificial intelligence. Moreover, we will explain what feature engineering is from other perspectives. Feature engineering can also be defined as the process of converting raw data into useful features (also often called feature vectors). These features help you represent the underlying problem better to the predictive models, so that predictive modeling can be applied to new data types and achieve high predictive accuracy.
Alternatively, we can define the term feature engineering as a software engineering process of using or reusing advanced domain knowledge about the underlying problem and the available data to create features that make the machine learning algorithms work with ease.
This is how we define the term feature engineering. If you read it carefully, you will see four dependencies in these definitions:
Now, based on these four dependencies, we can derive a workflow. First, you have to understand your problem itself; then you have to know your data and whether it is in good order; if not, process your data to find certain patterns or features so that you can build your model.
Once you have identified the features, you need to know which category your problem falls under. In other words, you have to be able to identify whether it is a classification, clustering, or regression problem based on the features. Finally, you will build the model to make predictions on the test set or validation set using a well-known method such as random forest or Support Vector Machines (SVMs).
Throughout this chapter, you will see that feature engineering is an art that deals with uncertain and often unstructured data. It is also true that there are many well-defined procedures for applying classification, clustering, or regression models, or methods such as SVMs, that are both methodical and provable; however, the data is variable and often comes with a variety of characteristics at different times.
With practice and empirical apprenticeship, you will learn when and how to decide which procedures to follow. The main tasks involved in feature engineering are:
When the ultimate goal is to achieve the most accurate and reliable results from a predictive model, you have to invest your best in what you have. The best investment, in this case, comprises three parameters: time and patience, data and its availability, and the best algorithm. However, how do you get the most valuable treasure out of your data for predictive modeling? That is the problem that the process and practice of feature engineering solves.
In fact, the success of most machine learning algorithms depends on how properly and intelligently you utilize and present your data. It is often agreed that the hidden treasure (that is, features or patterns) in your data will directly influence the results of the predictive model.
Therefore, better features (that is, what you extract and select from the datasets) mean better results (that is, the results you will achieve from the model). However, before you generalize this statement for your machine learning model, remember that you need great features that truly describe the structures inherent in your data.
In summary, better features signify three pros: flexibility, tuning, and better results:
We also suggest that readers not be overconfident with features alone. The preceding statements are often true; however, sometimes they are misleading, so let us clarify them further. If you receive the best predictive results from a model, it is actually due to three factors: the model you selected, the data you had, and the features you prepared.
Therefore, if you have enough time and computational resources, always try the standard model first, since simplicity does not necessarily imply lower accuracy. Nonetheless, better features will contribute the most of these three factors. One thing you should know is that, unfortunately, even if you master feature engineering through extensive hands-on practice and research into what others are doing at the state of the art, some machine learning projects ultimately fail.
Very often, an intelligent choice of both training and test samples based on better features leads to better solutions. In the previous section, we argued that there are two main tasks in feature engineering: feature extraction from the raw data, and feature selection. However, there is no definite or fixed path for feature engineering.
Conversely, the whole feature engineering process is very much directed by the available raw data. If the data is well structured, you can consider yourself lucky. Nonetheless, the reality is often that raw data comes from diverse sources in multiple formats. Therefore, exploring this data is very important before proceeding to feature extraction and feature selection.
We suggest that you figure out the data skewness and kurtosis using a histogram, spot outliers using a box plot, and bootstrap the data using the Data Sidekick technique (introduced by Abe Gong; see: https://curiosity.com/paths/abe-gong-building-for-resilience-solid-2014-keynote-oreilly/#abe-gong-building-for-resilience-solid-2014-keynote-oreilly).
The following questions need to be answered and known by means of data exploration before applying the feature engineering:
You can use simple visualization tools such as density plots for doing this, as explained in the following example.
Example 1. Suppose you are interested in fitness walking and you walked at a sports ground or in the countryside over the last four weeks (excluding the weekends). You spent the following times (in minutes) to finish a 4 km walking track: 15, 16, 18, 17.16, 16.5, 18.6, 19.0, 20.4, 20.6, 25.15, 27.27, 25.24, 21.05, 21.65, 20.92, 22.61, 23.71, 35, 39, and 50. Now let's compute and interpret the skewness and kurtosis of these values using R.
We will show how to configure and work with SparkR in Chapter 10, Configuring and Working with External Libraries, and how to execute the same code on SparkR. The reason is that some plotting packages, such as ggplot2, are not yet implemented for the version of Spark used by SparkR directly. However, ggplot2 is available as a combined package named ggplot2.SparkR on GitHub, which can be installed and configured using the following command:
devtools::install_github("SKKU-SKT/ggplot2.SparkR")
However, there are numerous dependencies that need to be ensured before and during the configuration process; therefore, we will resolve this issue in a later chapter. For the time being, we assume you have basic knowledge of R; if you have R installed and configured on your computer, use the following steps. A step-by-step example of how to install and configure SparkR using RStudio will be shown in Chapter 10, Configuring and Working with External Libraries.
Now just copy the following code snippets and execute them to make sure you get the correct values of skewness and kurtosis.
Install the moments package for calculating skewness and kurtosis:
install.packages("moments")
Use the moments package:
library(moments)
Make a vector for the time you have taken during the workout:
time_taken <- c(15, 16, 18, 17.16, 16.5, 18.6, 19.0, 20.4, 20.6, 25.15, 27.27, 25.24, 21.05, 21.65, 20.92, 22.61, 23.71, 35, 39, 50)
Convert the times into a data frame:
df <- data.frame(time_taken)
Now calculate the skewness:
skewness(df)
[1] 1.769592
Now calculate the kurtosis:
kurtosis(df)
[1] 5.650427
Interpretation of the results: the skewness of your workout time is 1.769592, which means your data is skewed to the right, or positively skewed. The kurtosis, on the other hand, is 5.650427, which means the distribution of the data is leptokurtic. Now, to check for outliers or heavy tails, look at the following plot. Again, for simplicity, we will use R to draw the density plot that interprets your workout times.
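As a cross-check, the same moment-based definitions used by the R moments package (skewness = m3 / m2^1.5, kurtosis = m4 / m2^2, with population central moments m_k) can be sketched in plain Python:

```python
# Cross-check of the moment-based skewness and kurtosis computed in R above.
time_taken = [15, 16, 18, 17.16, 16.5, 18.6, 19.0, 20.4, 20.6, 25.15,
              27.27, 25.24, 21.05, 21.65, 20.92, 22.61, 23.71, 35, 39, 50]

def moment(xs, k):
    """k-th central moment with the population (1/n) denominator."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** k for x in xs) / len(xs)

skew = moment(time_taken, 3) / moment(time_taken, 2) ** 1.5
kurt = moment(time_taken, 4) / moment(time_taken, 2) ** 2

print(round(skew, 6))  # ~1.769592, matching the R output
print(round(kurt, 6))  # ~5.650427
```

This confirms that the values reported by R follow directly from the central moments of the sample.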
Install the ggplot2 package for plotting the density curve:
install.packages("ggplot2")
Use the ggplot2 package:
library(ggplot2)
Now plot the density curve using the ggplot() function of ggplot2:
ggplot(df, aes(x = time_taken)) + stat_density(geom="line", col= "green", size = 1, bw = 4) + theme_bw()
The interpretation of the distribution of data (workout times) presented in Figure 2 shows that the density plot is skewed to the right and hence leptokurtic. Besides the density plot, you can also look at a box plot for each individual feature. The box plot displays the data distribution based on the five-number summary: minimum, first quartile, median, third quartile, and maximum, as shown in Figure 3, where we can look for outliers beyond three (3) Inter-Quartile Ranges (IQR):
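As a rough sketch of that box-plot logic, the five-number summary and IQR fences can be computed in plain Python (note that quartile conventions differ slightly between tools, so the exact cut points may not match R's):

```python
import statistics

time_taken = [15, 16, 18, 17.16, 16.5, 18.6, 19.0, 20.4, 20.6, 25.15,
              27.27, 25.24, 21.05, 21.65, 20.92, 22.61, 23.71, 35, 39, 50]

# Five-number summary: min, Q1, median, Q3, max.
q1, med, q3 = statistics.quantiles(time_taken, n=4)  # default 'exclusive' method
summary = (min(time_taken), q1, med, q3, max(time_taken))

# Flag points beyond three IQRs from the quartiles as outliers.
iqr = q3 - q1
lower, upper = q1 - 3 * iqr, q3 + 3 * iqr
outliers = [x for x in time_taken if x < lower or x > upper]

print(summary)
print(outliers)  # the 50-minute walk is flagged
```

Even with the 3 x IQR fence, the slowest walk stands out as an outlier, consistent with the long right tail seen in the density plot.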
Bootstrapping the datasets also sometimes offers insights into outliers. If the data volume is too large (that is, big data), the Data Sidekick approach to evaluations and predictions is also useful. The idea of Data Sidekick is to use a small part of the available data to figure out what insights can be drawn from the datasets; it is commonly described as using small data to multiply the value of big data.
It is very useful for large-scale text analytics. For example, suppose you have a huge corpus of text, and of course you can use a small portion of it to test various sentiment analysis models and choose the one which gives the best results in terms of performance (computation time, memory usage, scalability, and throughput).
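That workflow can be sketched minimally; the corpus contents and sample size below are purely hypothetical stand-ins:

```python
import random

random.seed(42)

# Hypothetical stand-in for a huge text corpus.
corpus = [f"document-{i}" for i in range(100_000)]

# Prototype the expensive analysis on a small random sample first;
# only the winning approach is then run over the full corpus.
pilot = random.sample(corpus, k=500)
print(len(pilot))
```

The sample is drawn without replacement, so the pilot set contains 500 distinct documents from the corpus.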
Now we would like to draw your attention to other aspects of feature engineering. For example, converting continuous variables into categorical variables (with a certain combination of features) can result in better predictor variables.
In statistical language, a variable in your data represents measurements either on some continuous scale or on some categorical or discrete characteristic. For example, the weight, height, and age of an athlete would represent continuous variables. Alternatively, survival or failure in terms of time is also considered a continuous variable. A person's gender, occupation, or marital status, on the other hand, is a categorical or discrete variable. Statistically, some variables could be considered either way. For example, a viewer's rating of a movie on a 10-point scale may be considered a continuous variable, or we may consider it a discrete variable with 10 categories. Time series data or real-time streaming data are usually collected for continuous variables until a certain time.
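As an illustration of converting a continuous variable into a categorical one, ages can be bucketed into discrete categories with the standard library; the bin edges and labels here are hypothetical:

```python
import bisect

# Hypothetical bin edges and category labels for an age feature.
boundaries = [18, 30, 45, 60]
labels = ["junior", "young", "middle", "senior", "veteran"]

def bin_age(age):
    """Map a continuous age to its categorical bucket."""
    return labels[bisect.bisect_right(boundaries, age)]

print(bin_age(25))  # "young"
print(bin_age(17))  # "junior"
```

Note that bisect_right places a value equal to a boundary (for example, exactly 60) into the upper bucket; using bisect_left would flip that convention.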
In parallel, considering the square or cube of features, or even using non-linear models, can also provide better insights. Also, use forward selection or backward selection wisely, since both of them are computationally expensive.
Finally, when the number of features becomes significantly large, it is a wise decision to use Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) techniques to find the right combination of features.
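To see why PCA helps when features are correlated, here is a minimal two-feature sketch on toy data, computing the eigenvalues of the 2x2 covariance matrix in closed form (a real application would use a library routine on many features):

```python
import math

# Toy 2-feature dataset: x2 is nearly a linear function of x1, so one
# principal component should capture almost all of the variance.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.0, 9.9]

def cov(a, b):
    """Sample covariance with the (n - 1) denominator."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

a, b, c = cov(x1, x1), cov(x1, x2), cov(x2, x2)

# Eigenvalues of the 2x2 covariance matrix [[a, b], [b, c]].
mid = (a + c) / 2
d = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = mid + d, mid - d

# Fraction of total variance explained by the first principal component.
explained = lam1 / (lam1 + lam2)
print(round(explained, 4))  # very close to 1.0: one component suffices
```

Because the two features are almost perfectly correlated, the first eigenvalue dominates, and a single component replaces both features with very little loss of variance.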
Feature extraction is the automatic construction of new features from the raw data you have or will be collecting. During the feature extraction process, the dimensionality of complex raw data is usually reduced by automatically transforming the observations into a much smaller set that can be modeled in later stages. Projection methods such as PCA and unsupervised clustering methods are used for tabular data in TXT, CSV, TSV, or RDB formats. However, feature extraction from other data formats is very complex. In particular, parsing data formats such as XML and SDRF is a tedious process if the number of fields to extract is huge.
For multimedia data such as images, the most common techniques include line or edge detection and image segmentation. However, subject to the domain, image, video, and audio observations lend themselves to many of the same types of Digital Signal Processing (DSP) methods, where typically the analogue observations are stored in digital formats.
The most positive aspect of, and the key to, feature extraction is that the available methods are automatic; therefore, they can solve the problem of unmanageably high-dimensional data. As we stated in Chapter 3, Understanding the Problem by Understanding the Data, more data exploration and better feature extraction eventually increase the performance of your ML model (since feature extraction also involves feature selection). The reality is that more data will eventually provide more insights into the performance of the predictive models. However, the data has to be useful, and collecting unwanted data will waste your valuable time; therefore, think about the meaning of this statement before collecting your data.
There are several steps involved in the feature extraction process, including data transformation and feature transformation. As we have stated several times, a machine learning model is likely to provide a better result if it is well trained with better features out of the raw data. A key characteristic of good data is that it is optimized for learning and generalization. Therefore, the process of putting the data into this optimal format is achieved through data processing steps such as cleaning, handling missing values, and intermediate transformations such as converting a text document into words.
The group of methods that help to create new features as predictor variables is called feature transformation. Feature transformation is essential for dimension reduction. Usually, when the transformed features have a descriptive dimension, they are likely to be better ordered than the original features.
Therefore, less descriptive features can be dropped from the training or test samples when building the machine learning models. The most common tasks included in feature transformation are non-negative matrix factorization, principal component analysis, and factor analysis, using scaling, decomposition, and aggregation operations.
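For instance, one of the scaling operations mentioned above, standardization to zero mean and unit variance, can be sketched as follows (the height values are made up):

```python
import statistics

# Hypothetical raw feature: athlete heights in centimetres.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]

mean = statistics.fmean(heights)
std = statistics.pstdev(heights)  # population standard deviation

# Standardize: subtract the mean, divide by the standard deviation.
scaled = [(h - mean) / std for h in heights]

print(round(sum(scaled), 10))  # mean of the scaled feature is ~0
```

After this transformation, features measured on very different scales (say, height in centimetres and weight in kilograms) become directly comparable, which many learning algorithms implicitly assume.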
Examples of feature extraction include the extraction of contours in images, extraction of diagrams from a text, extraction of phonemes from the recording of spoken text, and so on. Feature extraction involves a transformation of the features, which is often not reversible because some information is lost eventually in the process of dimensionality reduction.
Feature selection is a process for preparing the training datasets or validation dataset for predictive modeling and analytics. Feature selection has practical implication in most of the machine learning problem types including classification, clustering, dimensionality reduction, collaborative filtering, regression, and so on.
Therefore, the ultimate goal is to select a subset of features from the large collection in the original data set; often, dimensionality reduction algorithms such as Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) are applied.
An interesting power of the feature selection technique is that a minimal feature set can represent the maximum amount of variance in the available data. In other words, a minimal subset of the features is enough to train your machine learning model quite efficiently.
This subset of features is used to train the model. There are two main types of feature selection techniques: forward selection and backward selection. Forward selection starts with the strongest feature and keeps adding more features. Backward selection, on the contrary, starts with all the features and removes the weakest ones. However, both techniques are computationally expensive.
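A minimal sketch of forward selection follows, using the absolute Pearson correlation with the target as a stand-in score; the feature values are made up, and a real implementation would score candidate subsets with a trained model and stop when adding a feature no longer improves performance:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical features and target.
features = {
    "f1": [1, 2, 3, 4, 5],   # strongly related to the target
    "f2": [2, 1, 4, 3, 5],   # weakly related
    "f3": [5, 3, 1, 4, 2],   # mostly noise
}
target = [2, 4, 6, 8, 10]

# Forward selection: repeatedly add the highest-scoring remaining feature.
selected, remaining = [], set(features)
while remaining:
    best = max(remaining, key=lambda f: abs(pearson(features[f], target)))
    selected.append(best)
    remaining.remove(best)

print(selected[0])  # "f1" is picked first: it correlates perfectly
```

Backward selection would run the same loop in reverse, starting from the full set and repeatedly discarding the lowest-scoring feature.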
Since not all features are equally important, you will find some features that matter more than others for making the model accurate, while the remaining attributes can be treated as irrelevant to the problem. As a result, you need to remove those features before preparing the training and test sets. Sometimes, the same technique might be applied to the validation sets.
In parallel to importance, you will always find some features that are redundant in the context of other features. Feature selection does not only involve removing irrelevant or redundant features; it also serves other purposes that are important for increasing the model's accuracy, as stated here:
Using feature selection techniques, it is quite possible to reduce the number of features by selecting certain features in the dataset; later on, the subset is used to train the model. However, the entire process usually cannot be used interchangeably with the term dimensionality reduction.
The reality is that the feature selection methods are used to extract a subset from the total set in the data without changing their underlying properties.
In contrast, the dimensionality reduction method employs already engineered features and transforms the original features into corresponding feature vectors by reducing the number of variables under certain considerations and requirements of the machine learning problem.
Thus, it actually modifies the underlying data, extracts the original features from raw and noisy features by compressing the data, but maintains the original structure and most of the time is irreversible. Typical examples of dimensionality reduction methods include Principal Component Analysis (PCA), Canonical Correlation Analysis (CCA), and Singular Value Decomposition (SVD).
Other feature selection techniques use filter-based, wrapper, or embedded methods, which perform feature selection by evaluating the correlation between each feature and the target attribute in a supervised context. Methods that apply some statistical measure to assign a score to each feature are known as filter methods.
The features are then ranked based on the scoring system, which can help to eliminate specific features. Examples of such techniques are information gain, correlation coefficient scores, and the Chi-squared test. An example of wrapper methods, which treat feature selection as a search problem, is the recursive feature elimination algorithm. On the other hand, Least Absolute Shrinkage and Selection Operator (LASSO), Elastic Net, and Ridge Regression are typical examples of embedded methods of feature selection, which are also known as regularization methods.
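As a toy illustration of a filter method, the Chi-squared statistic between a categorical feature and a categorical target can be computed by hand; the feature and target values below are made up, and a higher score suggests stronger dependence (hence a more useful feature):

```python
from collections import Counter
from itertools import product

# Hypothetical categorical feature and target (perfectly dependent here).
feature = ["a", "a", "a", "b", "b", "b", "a", "b"]
target  = ["y", "y", "y", "n", "n", "n", "y", "n"]

def chi2(f, t):
    """Chi-squared statistic over the feature/target contingency table."""
    n = len(f)
    joint = Counter(zip(f, t))
    fc, tc = Counter(f), Counter(t)
    stat = 0.0
    for fv, tv in product(fc, tc):
        expected = fc[fv] * tc[tv] / n  # counts expected under independence
        observed = joint.get((fv, tv), 0)
        stat += (observed - expected) ** 2 / expected
    return stat

print(chi2(feature, target))  # 8.0: maximal for a perfectly dependent 2x2 table
```

A filter-based selector would compute this score for every feature, rank them, and keep only the top-scoring ones before any model is trained.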
The current implementation of Spark MLlib provides support for dimensionality reduction on the RowMatrix class only, for SVD and PCA. On the other hand, the typical steps from raw data collection to feature selection are feature extraction, feature transformation, and feature selection.
Interested readers are suggested to read the API documentation for the feature selection and dimensionality reduction at: http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html.