How it works...

Before we start schema creation, we need to examine all the features in our dataset. Then, we need to clear out the noisy features, such as name, where it is fair to assume that they have no effect on the produced outcome. If you are unsure about a feature, keep it as is and include it in the schema. If you unknowingly remove a feature that happens to be a signal, you'll degrade the performance of the neural network. This process of removing noise and keeping signals (valid features) is what step 1 refers to.

Principal Component Analysis (PCA) is a good choice for this, and it is implemented in ND4J. The PCA class performs dimensionality reduction for datasets with a large number of features, where you want fewer features in order to reduce complexity. PCA does this by projecting the data onto a smaller set of components that retain most of the variance, so the irrelevant part (noise) is dropped. In step 1, we generated a PCA factor matrix by calling pca_factor() with the following arguments:

inputFeatures: Input features as a matrix
projectedDimension: The number of features to project from the actual set of features (for example, 100 important features out of 1,000)
normalize: A Boolean variable (true/false) indicating whether the features are to be normalized (zero mean)

Matrix multiplication is performed by calling the mmul() method, and the end result, reduced, is the feature matrix that we use after performing dimensionality reduction based on the PCA factor. Note that you may need to run multiple training sessions using input features generated with the PCA factor to understand which features are signals.
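To make the shapes in this step concrete, here is a minimal plain-Java sketch of the projection. It does not use ND4J and does not compute PCA: the factor matrix is hand-picked for illustration, and the class and method names (PcaProjectionSketch, center, mmul) are hypothetical. The center method only illustrates what zero-mean normalization means; whether you center the features before multiplying is an assumption that depends on how you call pca_factor().

```java
import java.util.Arrays;

// Plain-Java illustration only; in the recipe, ND4J's INDArray.mmul() does this work.
class PcaProjectionSketch {

    // Subtracts each column's mean so that every feature has zero mean.
    static double[][] center(double[][] x) {
        int rows = x.length, cols = x[0].length;
        double[][] out = new double[rows][cols];
        for (int j = 0; j < cols; j++) {
            double mean = 0.0;
            for (int i = 0; i < rows; i++) mean += x[i][j];
            mean /= rows;
            for (int i = 0; i < rows; i++) out[i][j] = x[i][j] - mean;
        }
        return out;
    }

    // Multiplies an (n x d) matrix by a (d x k) matrix, giving an (n x k) matrix.
    static double[][] mmul(double[][] a, double[][] b) {
        int n = a.length, d = b.length, k = b[0].length;
        double[][] out = new double[n][k];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < k; j++)
                for (int p = 0; p < d; p++)
                    out[i][j] += a[i][p] * b[p][j];
        return out;
    }

    public static void main(String[] args) {
        // 4 samples with 3 features each
        double[][] inputFeatures = {{1, 2, 3}, {3, 4, 5}, {5, 6, 7}, {7, 8, 9}};
        // Hand-picked 3 x 2 factor (an assumption, NOT a real PCA result):
        // it projects 3 features down to 2 dimensions.
        double[][] factor = {{1, 0}, {0, 1}, {0, 0}};
        double[][] reduced = mmul(center(inputFeatures), factor);
        System.out.println(reduced.length + " x " + reduced[0].length);
        System.out.println(Arrays.deepToString(reduced));
    }
}
```

The point to take away is the shape: a feature matrix with d columns multiplied by a d x k factor yields k columns per sample, where k plays the role of projectedDimension.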

In step 2, we used the customer churn dataset (the simple dataset that we use in the next chapter) to demonstrate the Schema creation process. The data types mentioned in the schema are those of the respective features or labels. For example, to add a schema definition for an integer feature, you would use addColumnInteger(). Similarly, there are other Schema methods available that we can use to manage other data types.

Categorical variables can be added using addColumnCategorical(), as shown in step 2. Here, we marked the categorical variables and supplied their possible values. Even if we get a masked set of features, we can still construct their schema if the features are arranged in a numbered format (for example, column1, column2, and so on).
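For masked features in numbered format, the column names can be generated rather than typed by hand. The following is a small plain-Java sketch; the class and method names (MaskedColumnNames, numbered) are hypothetical and are not part of DataVec. The generated names could then be passed one by one to schema methods such as addColumnInteger(). DataVec's Schema.Builder also appears to offer pattern-based helpers such as addColumnsDouble(pattern, min, max) for this purpose; check the API reference for your version before relying on the exact signature.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: builds names such as column1, column2, ... for masked features.
class MaskedColumnNames {

    // Returns prefix + index for every index from first to last (both inclusive).
    static List<String> numbered(String prefix, int first, int last) {
        List<String> names = new ArrayList<>();
        for (int i = first; i <= last; i++) {
            names.add(prefix + i);
        }
        return names;
    }

    public static void main(String[] args) {
        // Names for five masked features
        System.out.println(numbered("column", 1, 5));
    }
}
```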
