How it works...

If you recall our customer churn dataset, there are 14 features:

After performing step 1, you have 11 valid features remaining. The marked features have no bearing on the prediction outcome. For example, a customer's name doesn't influence whether that customer will leave the organization.

In the above screenshot, we have marked the features that are not required for training. These features can be removed from the dataset, as they don't have any impact on the outcome.

In step 1, we tagged the noise features (RowNumber, Customerid, and Surname) in our dataset for removal during the schema transformation process using the removeColumns() method.
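To make the idea of removing columns by name concrete, here is a minimal plain-Java sketch. It does not use DataVec and is not how removeColumns() is implemented; the class name, the helper methods, the CreditScore and Age column names, and the sample row are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, library-free illustration of dropping noise columns by name.
class ColumnRemovalSketch {

    // Returns the indices of the columns that are NOT listed for removal.
    static List<Integer> keptIndices(List<String> header, Set<String> toRemove) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < header.size(); i++) {
            if (!toRemove.contains(header.get(i))) {
                kept.add(i);
            }
        }
        return kept;
    }

    // Projects a row (or the header itself) onto the kept column indices.
    static List<String> project(List<String> row, List<Integer> kept) {
        List<String> out = new ArrayList<>();
        for (int idx : kept) {
            out.add(row.get(idx));
        }
        return out;
    }

    public static void main(String[] args) {
        // Shortened, made-up header and row; the real dataset has 14 columns.
        List<String> header = Arrays.asList("RowNumber", "Customerid", "Surname", "CreditScore", "Age");
        List<String> row = Arrays.asList("1", "1001", "Smith", "650", "40");
        Set<String> noise = new HashSet<>(Arrays.asList("RowNumber", "Customerid", "Surname"));

        List<Integer> kept = keptIndices(header, noise);
        System.out.println(project(header, kept));
        System.out.println(project(row, kept));
    }
}
```

The same index list is applied to the header and to every row, so the schema and the data stay aligned after the noise columns are dropped.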

The customer churn dataset used in this chapter has only 14 features, and the feature labels are meaningful, so a manual inspection was sufficient. With a large number of features, you might need to consider using PCA (short for Principal Component Analysis), as explained in the previous chapter.

In step 2, we used the AnalyzeLocal utility class to find the missing values in the dataset by calling analyzeQuality(). You should see the following result when you print out the information in the DataQualityAnalysis object: 

As you can see in the preceding screenshot, each of the features is analyzed for its quality (in terms of invalid/missing data), and the count is displayed for us to decide if we need to normalize it further. Since all features appeared to be OK, we can proceed further. 
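The kind of per-feature count described above can be approximated without any library. The following plain-Java sketch counts null or blank entries per column; it only illustrates the idea and is not what analyzeQuality() does internally. The class name, method name, and sample rows are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, library-free illustration of a per-column missing-value count.
class MissingValueCountSketch {

    // Counts, for each column, how many rows hold a null or blank value.
    static int[] countMissing(List<String[]> rows, int numColumns) {
        int[] missing = new int[numColumns];
        for (String[] row : rows) {
            for (int c = 0; c < numColumns; c++) {
                // A row shorter than the header is treated as missing its trailing values.
                String value = c < row.length ? row[c] : null;
                if (value == null || value.trim().isEmpty()) {
                    missing[c]++;
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up rows with two columns.
        List<String[]> rows = Arrays.asList(
                new String[] {"650", "40"},
                new String[] {"", "35"},
                new String[] {"700", null});
        System.out.println(Arrays.toString(countMissing(rows, 2)));
    }
}
```

A column whose count is zero needs no further attention, which corresponds to the "all features appeared to be OK" outcome described above.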

There are two ways to handle missing values: we can either remove the entire record or replace the missing value with another value. In most cases, we don't remove records; instead, we replace the missing value with one that indicates absence. We can do this during the transformation process using conditionalReplaceValueTransform() or conditionalReplaceValueTransformWithDefault(). In steps 3 and 4, we removed missing or invalid values from the dataset. Note that the feature to be checked needs to be known beforehand; we cannot check the whole set of features at once, because DataVec doesn't support this functionality at the moment. You may perform step 2 to identify the features that need attention.
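The two strategies can be contrasted in a small plain-Java sketch. It does not use DataVec and does not reproduce the behavior of conditionalReplaceValueTransform(); the class name, method names, and placeholder value are hypothetical. As in the text, the column to check is passed in explicitly, and each row is assumed to contain that column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, library-free illustration of the two missing-value strategies.
class MissingValueHandlingSketch {

    static boolean isMissing(String value) {
        return value == null || value.trim().isEmpty();
    }

    // Option 1: drop every record whose value in the given column is missing.
    static List<String[]> dropRecords(List<String[]> rows, int column) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            if (!isMissing(row[column])) {
                out.add(row);
            }
        }
        return out;
    }

    // Option 2: keep every record and substitute a placeholder for missing values.
    // Each row is copied, so the input rows are left unmodified.
    static List<String[]> replaceMissing(List<String[]> rows, int column, String placeholder) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            String[] copy = Arrays.copyOf(row, row.length);
            if (isMissing(copy[column])) {
                copy[column] = placeholder;
            }
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up rows; column 0 has one blank entry.
        List<String[]> rows = Arrays.asList(
                new String[] {"650", "40"},
                new String[] {"", "35"});
        System.out.println(dropRecords(rows, 0).size());
        System.out.println(Arrays.toString(replaceMissing(rows, 0, "0").get(1)));
    }
}
```

Replacement keeps the record count intact, which is usually why it is preferred over dropping rows when only a few values are absent.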
