Chapter 4. Extracting Knowledge through Feature Engineering

Which features should be used to create a predictive model is not only a vital question but also a difficult one, and answering it may require deep knowledge of the problem domain. It is possible to automatically select the features in your data that are most useful or most relevant for the problem you are working on. With these questions in mind, this chapter covers feature engineering in detail, explaining why to apply it along with some best practices in feature engineering.

In addition to this, we will provide theoretical descriptions and examples of feature extraction, transformation, and selection applied to large-scale machine learning, using both the Spark MLlib and Spark ML APIs. Furthermore, this chapter also covers the basic idea of advanced feature engineering (also known as extreme feature engineering).

Please note that you will need to have R and RStudio installed on your machine before proceeding with this chapter, since an example of exploratory data analysis will be shown using R.

In a nutshell, the following topics will be covered throughout this chapter:

  • The state of the art of feature engineering
  • Best practices in feature engineering
  • Feature engineering with Spark
  • Advanced feature engineering

The state of the art of feature engineering

Even though feature engineering is an informal topic, it is considered an essential part of applied machine learning. Andrew Ng, one of the leading scientists in the area of machine learning, defined the term feature engineering in Machine Learning and AI via Brain simulations (see also https://en.wikipedia.org/wiki/Andrew_Ng) as follows:

Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering.

Based on the preceding definition, we can argue that feature engineering is actually human intelligence, rather than artificial intelligence. Moreover, we can explain what feature engineering is from other perspectives. Feature engineering can also be defined as the process of converting raw data into useful features (often called feature vectors). The features help you represent the underlying problem to the predictive models better, so that predictive modeling can be applied to new data with high predictive accuracy.

Alternatively, we can define feature engineering as the software engineering process of using or reusing advanced domain knowledge about the underlying problem and the available data to create features that make the machine learning algorithms work with ease.

This is how we define the term feature engineering. If you read these definitions carefully, you will see four dependencies:

  • The problem itself
  • The raw data you will be working with to find out useful patterns or features
  • The type of the machine learning problem or classes
  • The predictive models you'll be using

Based on these four dependencies, we can derive a workflow. First, you have to understand the problem itself; then you have to know your data and whether it is in good order and, if not, process your data to find certain patterns or features so that you can build your model.

Once you have identified the features, you need to know which category your problem falls into. In other words, you have to be able to identify whether it is a classification, clustering, or regression problem based on the features. Finally, you will build the model to make predictions on the test set or validation set using a well-known method such as random forest or Support Vector Machines (SVMs).

Throughout this chapter, you will see that feature engineering is an art of dealing with uncertain and often unstructured data. It is also true that there are many well-defined procedures for applying classification, clustering, or regression models, or methods such as SVMs, that are both methodical and provable; however, data is variable and often comes with a variety of characteristics at different times.

Feature extraction versus feature selection

You will learn when and how to decide which procedures to follow through practice and empirical apprenticeship. The main tasks involved in feature engineering are:

  • Data exploration and feature extraction: This is the process of uncovering the hidden treasure in the raw data. Generally, this process does not vary much with the algorithms that will consume the features. However, hands-on experience, understanding of the business domain, and intuition play a vital role here.
  • Feature selection: This is the process of deciding which features to select based on the machine learning problem you are dealing with. You can use diverse techniques for selecting features; however, the choice may vary with the algorithms and with how the features will be used.

Importance of feature engineering

When the ultimate goal is to achieve the most accurate and reliable results from a predictive model, you have to invest the best of what you have. The best investment, in this case, consists of three parameters: time and patience, data and its availability, and the best algorithm. However, how do you get the most valuable treasure out of your data for predictive modeling? That is the problem that the process and practice of feature engineering solves.

In fact, the success of most machine learning algorithms depends on how properly and intelligently you utilize and present your data. It is often agreed that the hidden treasure (that is, the features or patterns) in your data will directly influence the results of the predictive model.

Therefore, better features (that is, what you extract and select from the datasets) mean better results (that is, the results you will achieve from the model). However, before you generalize this statement for your machine learning model, remember one thing: you need great features that truly describe the structures inherent in your data.

In summary, better features signify three pros: flexibility, tuning, and better results:

  • Better features (better flexibility): If you are successful in extracting and selecting better features, you will get better results even if you choose a non-optimal or wrong model. In fact, optimal or most suitable models can be chosen based on the good structure of the original data you have. In addition, good features will eventually allow you to use less complex but efficient, faster, more easily understandable, and easier-to-maintain models.
  • Better features (better tuning): As we already stated, if you do not choose your machine learning model intelligently, or if your features are not in good shape, you are more likely to get worse results from the ML model. However, even if you choose some wrong parameters while building the model, well-engineered features will still let you expect better results. Furthermore, you don't need to worry as much, or work as hard, to choose the most optimal models and related parameters. The reason is simple: with good features, you have actually understood the problem well and are working with a representation of the data that characterizes the problem itself.
  • Better features (better results): You are most likely to get better results if you spend most of your effort in feature engineering on better feature selection.

We also suggest that readers do not become overconfident relying on features alone. The preceding statements are often true; however, sometimes they are misleading. We would like to clarify them further. Actually, if you get the best predictive results from a model, it is the result of three factors: the model you selected, the data you had, and the features you prepared.

Therefore, if you have enough time and computational resources, always try the standard models first, since simplicity does not always imply better accuracy. Nonetheless, better features will contribute the most out of these three factors. One thing you should know is that, unfortunately, even if you master feature engineering through many hands-on practices and by researching what others are doing well in the state of the art, some machine learning projects still fail in the end.

Feature engineering and data exploration

Very often, an intelligent choice of both training and test samples based on better features leads to better solutions. In the previous section, we argued that there are two tasks in feature engineering: feature extraction from the raw data and feature selection. However, there is no definite or fixed path for feature engineering.

Conversely, the whole feature engineering process is very much directed by the available raw data. If the data is well structured, you should feel lucky. Nonetheless, the reality is often that the raw data comes from diverse sources in multiple formats. Therefore, exploring this data is very important before you proceed to feature extraction and feature selection.

Tip

We suggest you figure out the skewness and kurtosis of the data using a histogram, find outliers using a box plot, and bootstrap the data using the Data Sidekick technique (introduced by Abe Gong; see https://curiosity.com/paths/abe-gong-building-for-resilience-solid-2014-keynote-oreilly/#abe-gong-building-for-resilience-solid-2014-keynote-oreilly).

The following questions need to be answered through data exploration before applying feature engineering:

  • What is the percentage of the data that is present, that is, without null or missing values, for each of the available fields? Then try to handle those missing values and interpret them well without losing the data's semantics.
  • What is the correlation between the fields? What is the correlation of each field with the predicted variable? What values do they take (that is, categorical or non-categorical, numerical or alphanumeric, and so on)?
  • Then find out whether the data distribution is skewed or not. You can identify skewness by looking at the outliers or the long tail (slightly skewed to the right or positively skewed, slightly skewed to the left or negatively skewed, as shown in Figure 1). Now identify whether the outliers contribute towards making the prediction or not.
  • After that, observe the data's kurtosis. More technically, check whether your kurtosis is mesokurtic (approximately equal to 3), leptokurtic (greater than 3), or platykurtic (less than 3). Note that the kurtosis of any univariate normal distribution is 3 (the formulas for skewness and kurtosis are given after Figure 1).
  • Now play with the tail and observe what happens when you remove the long tail (do the predictions get better?).

Figure 1: Skewness of the data distribution (x-axis = data, y-axis = density).
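
For reference, the sample skewness and kurtosis used throughout this section (and computed by the R moments package in Example 1 below) are the standard moment-based definitions, where $m_k$ is the k-th sample central moment:

$$m_k = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^k, \qquad \mathrm{skewness} = \frac{m_3}{m_2^{3/2}}, \qquad \mathrm{kurtosis} = \frac{m_4}{m_2^{2}}$$

A perfectly symmetric distribution has skewness 0, and a univariate normal distribution has kurtosis 3, which is why 3 is the reference point for the mesokurtic/leptokurtic/platykurtic distinction.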

You can use simple visualization tools such as density plots for doing this, as explained by the following example.

Example 1. Suppose you are interested in fitness walking, and you walked at a sports ground or in the countryside over the last four weeks (excluding weekends). You spent the following times (in minutes) to finish a 4 km walking track: 15, 16, 18, 17.16, 16.5, 18.6, 19.0, 20.4, 20.6, 25.15, 27.27, 25.24, 21.05, 21.65, 20.92, 22.61, 23.71, 35, 39, and 50. Now let's compute and interpret the skewness and kurtosis of these values using R.

Tip

We will show how to configure and work with SparkR in Chapter 10, Configuring and Working with External Libraries, and how to execute the same code with SparkR. The reason is that some plotting packages, such as ggplot2, are still not directly available in the current version of Spark for SparkR. However, ggplot2 is available as a combined package named ggplot2.SparkR on GitHub, which can be installed and configured using the following command:

devtools::install_github("SKKU-SKT/ggplot2.SparkR")

However, there are numerous dependencies that need to be resolved before and during the configuration process; therefore, we defer this issue to a later chapter. For the time being, we assume you have basic knowledge of using R; if you have R installed and configured on your computer, please follow the steps below. A step-by-step example of how to install and configure SparkR using RStudio will be shown in Chapter 10, Configuring and Working with External Libraries.

Now just copy the following code snippets and execute them to make sure you get the correct values of skewness and kurtosis.

Install the moments package for calculating Skewness and Kurtosis:

install.packages("moments")  

Use the moments package:

library(moments) 

Make a vector for the time you have taken during the workout:

time_taken <- c(15, 16, 18, 17.16, 16.5, 18.6, 19.0, 20.4, 20.6, 25.15, 27.27, 25.24, 21.05, 21.65, 20.92, 22.61, 23.71, 35, 39, 50)

Convert the time values into a data frame (used later for plotting):

df <- data.frame(time_taken)

Now calculate the skewness:

skewness(time_taken)
[1] 1.769592

Now calculate the kurtosis:

kurtosis(time_taken)
[1] 5.650427

Interpretation of the result: The skewness of your workout times is 1.769592, which means your data is skewed to the right, or positively skewed. The kurtosis, on the other hand, is 5.650427, which means the distribution of the data is leptokurtic. Now, to check for outliers or tails, look at the following density plot. Again, for simplicity, we will use R to plot the density of your workout times.

Install the ggplot2 package for plotting:

install.packages("ggplot2") 

Use the ggplot2 package:

library(ggplot2)

Now plot the density of the workout times using ggplot2:

ggplot(df, aes(x = time_taken)) +
  stat_density(geom = "line", col = "green", size = 1, bw = 4) +
  theme_bw()

Figure 2: Density plot of the workout time (right-skewed).

The interpretation of Figure 2 shows that the distribution of the data (workout times) is skewed to the right, and the density plot is leptokurtic. Besides the density plot, you can also look at a box plot for each individual feature. The box plot displays the data distribution based on a five-number summary: minimum, first quartile, median, third quartile, and maximum, as shown in Figure 3, where we can look for outliers beyond 1.5 times the Inter-Quartile Range (IQR) from the quartiles (points beyond 3 times the IQR are considered extreme outliers):

Figure 3: A box plot displaying the distribution based on a five-number summary (figure courtesy of Box Plot: Display of Distribution, http://www.physics.csbsju.edu/stats/box2.html).

Bootstrapping the dataset also sometimes offers insights into outliers. If the data volume is very large (that is, big data), Data Sidekick-style evaluations and predictions are also useful. The idea of Data Sidekick is to use a small part of the available data to figure out what insights can be drawn from the whole dataset; it is commonly described as using small data to multiply the value of big data.

This is very useful for large-scale text analytics. For example, suppose you have a huge corpus of text; you can use a small portion of it to test various sentiment analysis models and choose the one that gives the best results in terms of performance (computation time, memory usage, scalability, and throughput), as sketched below.
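
As a rough, hedged illustration of this small-sample idea in Spark (Scala), the following minimal sketch draws an approximately 1% sample of a large text corpus for quick prototyping. The input path, sampling fraction, and seed are placeholder values chosen for this example, not prescribed by the text:

import org.apache.spark.sql.SparkSession

object SmallSampleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SmallSampleSketch")
      .master("local[*]") // local mode, just for experimentation
      .getOrCreate()

    // Hypothetical path to a large text corpus
    val corpus = spark.read.textFile("data/large_corpus.txt")

    // Draw a small (~1%) sample without replacement for quick prototyping
    val sample = corpus.sample(withReplacement = false, fraction = 0.01, seed = 12345L)

    println(s"Full corpus: ${corpus.count()} lines, sample: ${sample.count()} lines")
    spark.stop()
  }
}

Once a model or technique looks promising on the sample, the same pipeline can be rerun on the full corpus.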

Now we would like to draw your attention to other aspects of feature engineering. For instance, converting continuous variables into categorical variables (with a certain combination of features) can result in better predictor variables (a small sketch follows the tip below).

Tip

In statistical language, a variable in your data represents measurements either on some continuous scale or of some categorical or discrete characteristic. For example, the weight, height, and age of an athlete are continuous variables. Alternatively, survival or failure time is also considered a continuous variable. A person's gender, occupation, or marital status, on the other hand, is a categorical or discrete variable. Statistically, some variables can be considered either way. For example, a movie viewer's rating of a movie on a 10-point scale may be treated as a continuous variable, or we may consider it a discrete variable with 10 categories. Time series data or real-time streaming data are usually collected for continuous variables up to a certain time.
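
To make the continuous-to-categorical idea concrete, here is a minimal, hedged Spark ML sketch in Scala that buckets the workout times from Example 1 into three categories using Bucketizer. The split points (20 and 25 minutes) are arbitrary choices for illustration only:

import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

object BucketizerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BucketizerSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Workout times (minutes) from Example 1
    val times = Seq(15.0, 16.0, 18.0, 17.16, 16.5, 18.6, 19.0, 20.4, 20.6, 25.15,
      27.27, 25.24, 21.05, 21.65, 20.92, 22.61, 23.71, 35.0, 39.0, 50.0)
    val df = times.toDF("time_taken")

    // Arbitrary buckets: fast (< 20), medium (20 to 25), slow (>= 25)
    val splits = Array(Double.NegativeInfinity, 20.0, 25.0, Double.PositiveInfinity)

    val bucketizer = new Bucketizer()
      .setInputCol("time_taken")
      .setOutputCol("time_category")
      .setSplits(splits)

    bucketizer.transform(df).show()
    spark.stop()
  }
}

The resulting time_category column (0.0, 1.0, or 2.0) can then be treated as a categorical predictor.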

In parallel, considering the square or cube of the features, or even applying non-linear transformations to them, can also provide better insights. Also, apply forward selection or backward selection wisely, since both of them are computationally expensive.

Finally, when the number of features becomes significantly large, it is wise to use a technique such as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) to find the right combination of features, as sketched below.
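
As a hedged sketch of what this looks like with the Spark ML API (Scala), the following example projects a few made-up five-dimensional feature vectors onto their top three principal components; the vectors and the choice of k = 3 are purely illustrative:

import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object PCASketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PCASketch").master("local[*]").getOrCreate()

    // Made-up 5-dimensional feature vectors
    val data = Seq(
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
      Vectors.dense(6.0, 1.0, 9.0, 8.0, 9.0)
    ).map(Tuple1.apply)
    val df = spark.createDataFrame(data).toDF("features")

    // Fit PCA and project onto the top 3 principal components
    val pcaModel = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(3)
      .fit(df)

    pcaModel.transform(df).select("pcaFeatures").show(truncate = false)
    spark.stop()
  }
}

An RDD-based variant on RowMatrix is sketched later in this chapter, in the Feature selection versus dimensionality reduction section.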

Feature extraction – creating features out of data

Feature extraction is the automatic construction of new features from the raw data you have or will be collecting. During the feature extraction process, the dimensionality of complex raw data is usually reduced by automatically turning the observations into a much smaller set that can be modeled at later stages. Projection methods such as PCA and unsupervised clustering methods are used for tabular data in TXT, CSV, TSV, or RDB formats. However, feature extraction from other data formats can be very complex. In particular, parsing data formats such as XML and SDRF is a tedious process if the number of fields to extract is huge.

For multimedia data such as images, the most common techniques include line or edge detection and image segmentation. However, subject to the domain, image, video, and audio observations lend themselves to many of the same types of Digital Signal Processing (DSP) methods, where the analogue observations are typically stored in digital formats.

The key advantage of feature extraction is that the methods that have been developed and are available are automatic; therefore, they can solve the problem of unmanageably high-dimensional data. As we stated in Chapter 3, Understanding the Problem by Understanding the Data, more data exploration and better feature extraction eventually increase the performance of your ML model (since feature extraction also involves feature selection). In reality, more data will eventually provide more insight into the performance of the predictive models. However, the data has to be useful, and collecting unwanted data will waste your valuable time; therefore, think about the meaning of this statement before collecting your data.

There are several steps involved in the feature extraction process, including data transformation and feature transformation. As we have stated several times, a machine learning model is likely to provide better results if it is well trained with better features extracted from the raw data. A key characteristic of good data is that it is optimized for learning and generalization. Therefore, the process of putting the data into this optimal format is achieved through data processing steps such as cleaning, missing-value handling, and intermediate transformations such as turning a text document into words, as sketched below.
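
For instance, the text-to-words transformation mentioned above can be sketched with Spark ML's Tokenizer in Scala; the two sample sentences are invented for illustration:

import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession

object TokenizerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TokenizerSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Invented sample documents
    val docs = Seq(
      (0, "Feature engineering is basically applied machine learning"),
      (1, "Better features usually mean better results")
    ).toDF("id", "text")

    // Split each document into lowercase words
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")

    tokenizer.transform(docs).select("id", "words").show(truncate = false)
    spark.stop()
  }
}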

The methods that help to create new features as predictor variables are collectively called feature transformation. Feature transformation is essentially required for dimensionality reduction. Usually, when the transformed features have a descriptive dimension, they are likely to be better ordered than the original features.

Therefore, less descriptive features can be dropped from the training or test samples when building machine learning models. The most common tasks in feature transformation are non-negative matrix factorization, principal component analysis, and factor analysis, carried out using scaling, decomposition, and aggregation operations; a scaling sketch is shown below.
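
As one concrete example of a scaling operation, the following hedged Scala sketch uses Spark ML's StandardScaler to standardize some made-up feature vectors to zero mean and unit standard deviation:

import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object ScalerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ScalerSketch").master("local[*]").getOrCreate()

    // Made-up raw feature vectors on very different scales
    val df = spark.createDataFrame(Seq(
      Vectors.dense(15.0, 0.1, 200.0),
      Vectors.dense(25.0, 0.4, 150.0),
      Vectors.dense(35.0, 0.9, 100.0)
    ).map(Tuple1.apply)).toDF("features")

    // Standardize each feature column to zero mean and unit standard deviation
    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .setWithMean(true)
      .setWithStd(true)

    scaler.fit(df).transform(df).show(truncate = false)
    spark.stop()
  }
}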

Examples of feature extraction include the extraction of contours from images, the extraction of diagrams from text, the extraction of phonemes from recordings of spoken text, and so on. Feature extraction involves a transformation of the features that is often not reversible, because some information is eventually lost in the process of dimensionality reduction.

Feature selection – filtering features from data

Feature selection is the process of preparing the training or validation dataset for predictive modeling and analytics. Feature selection has practical implications for most machine learning problem types, including classification, clustering, dimensionality reduction, collaborative filtering, regression, and so on.

Therefore, the ultimate goal is to select a subset of the features from the large collection in the original dataset. Often, dimensionality reduction algorithms such as Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) are applied as well.

An interesting strength of feature selection techniques is that a minimal feature set can represent the maximum amount of variance in the available data. In other words, a minimal subset of the features is enough to train your machine learning model quite efficiently.

This subset of features is used to train the model. There are two classic search strategies for feature selection, namely forward selection and backward selection. Forward selection starts with the strongest feature and keeps adding more features. Backward selection, on the contrary, starts with all the features and removes the weakest ones. However, both techniques are computationally expensive; the following sketch illustrates the forward-selection loop.
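
The greedy forward-selection loop can be written down in a few lines; the following Scala sketch is deliberately abstract, and evaluateSubset is a hypothetical stand-in for whatever cross-validated metric you would actually use (it is not a Spark API):

object ForwardSelectionSketch {
  // Hypothetical scoring function: in practice, train and validate a model
  // on the given feature subset and return its accuracy (higher is better).
  def evaluateSubset(features: Set[String]): Double =
    features.size.toDouble // placeholder score so the sketch runs as-is

  def forwardSelect(allFeatures: Set[String], maxFeatures: Int): Set[String] = {
    var selected = Set.empty[String]
    var bestScore = Double.NegativeInfinity

    while (selected.size < maxFeatures) {
      // Try adding each remaining feature and keep the one that helps most
      val candidates = (allFeatures -- selected).toSeq
        .map(f => (f, evaluateSubset(selected + f)))
      if (candidates.isEmpty) return selected
      val (bestFeature, score) = candidates.maxBy(_._2)
      if (score <= bestScore) return selected // no improvement, stop early
      selected += bestFeature
      bestScore = score
    }
    selected
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical feature names
    val features = Set("age", "height", "weight", "heart_rate", "speed")
    println(forwardSelect(features, maxFeatures = 3))
  }
}

Backward selection follows the same pattern, starting from the full set and repeatedly removing the feature whose removal hurts the score the least.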

Importance of feature selection

Not all features are equally important; consequently, you will find that some features are more important than others for making the model more accurate, while other attributes can be treated as irrelevant to the problem. As a result, you need to remove those irrelevant features before preparing the training and test sets. Sometimes, the same technique might be applied to the validation sets.

In parallel to importance, you will always find some features that are redundant in the context of other features. Feature selection does not only involve removing irrelevant or redundant features; it also serves other purposes that are important for increasing the model's accuracy, as stated here:

  • Feature selection increases the predictive accuracy of the model you are using by eliminating irrelevant, null/missing, and redundant features. It also deals with highly correlated features.
  • Feature selection techniques make the model training process more robust and faster by decreasing the number of features.

Feature selection versus dimensionality reduction

Using feature selection techniques, it is certainly possible to reduce the number of features by selecting certain features in the dataset, and the resulting subset is then used to train the model. However, this process usually cannot be used interchangeably with the term dimensionality reduction.

The reality is that the feature selection methods are used to extract a subset from the total set in the data without changing their underlying properties.

In contrast, the dimensionality reduction method employs already engineered features and transforms the original features into corresponding feature vectors by reducing the number of variables under certain considerations and requirements of the machine learning problem.

Thus, it actually modifies the underlying data by compressing it, extracting new features from the raw and noisy ones; it maintains most of the original structure but is usually irreversible. Typical examples of dimensionality reduction methods include Principal Component Analysis (PCA), Canonical Correlation Analysis (CCA), and Singular Value Decomposition (SVD).

Other feature selection techniques use filter-based, wrapper, or embedded methods, evaluating the relationship between each feature and the target attribute in a supervised context. Filter methods apply statistical measures to assign a score to each feature.

The features are then ranked based on their scores, which helps to eliminate specific features. Examples of such techniques are information gain, correlation coefficient scores, and the Chi-squared test. An example of the wrapper methods, which treat feature selection as a search problem, is the recursive feature elimination algorithm. On the other hand, Least Absolute Shrinkage and Selection Operator (LASSO), Elastic Net, and Ridge Regression are typical examples of embedded methods of feature selection, also known as regularization methods; a Chi-squared-based filter sketch in Spark is shown below.
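
To connect the filter approach with Spark, the following hedged Scala sketch uses Spark ML's ChiSqSelector to keep the single feature most associated with the label according to the Chi-squared test; the tiny labeled dataset is invented for illustration:

import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object ChiSqSelectorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ChiSqSelectorSketch").master("local[*]").getOrCreate()

    // Invented labeled data: (id, features, label)
    val df = spark.createDataFrame(Seq(
      (1, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
      (2, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
      (3, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
    )).toDF("id", "features", "label")

    // Keep the single feature most associated with the label (Chi-squared test)
    val selector = new ChiSqSelector()
      .setNumTopFeatures(1)
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setOutputCol("selectedFeatures")

    selector.fit(df).transform(df).show()
    spark.stop()
  }
}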

The current implementation of Spark MLlib provides support for dimensionality reduction only on the RowMatrix class, and only for SVD and PCA (see the sketch below). On the other hand, typical steps from raw data collection to feature selection are feature extraction, feature transformation, and feature selection.
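
For completeness, here is a minimal, hedged sketch of both operations on an MLlib RowMatrix (the RDD-based API), using a handful of made-up rows:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

object RowMatrixSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RowMatrixSketch").master("local[*]").getOrCreate()

    // Made-up rows of a tall-and-skinny matrix
    val rows = spark.sparkContext.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0, 4.0),
      Vectors.dense(2.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 3.0, 2.0, 1.0),
      Vectors.dense(6.0, 7.0, 8.0, 9.0)
    ))
    val mat = new RowMatrix(rows)

    // Truncated SVD: keep the top 2 singular values (and compute U as well)
    val svd = mat.computeSVD(2, computeU = true)
    println(s"Singular values: ${svd.s}")

    // Top 2 principal components, returned as a local matrix
    val pc = mat.computePrincipalComponents(2)
    println(pc)

    spark.stop()
  }
}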

Tip

Interested readers are suggested to read the API documentation for the feature selection and dimensionality reduction at: http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html.
