How to do it...

Identify the outliers in the data: For a small dataset with just a few features, we can spot outliers/noise via manual inspection. For a dataset with a large number of features, we can perform Principal Component Analysis (PCA), as shown in the following code:

INDArray factor = org.nd4j.linalg.dimensionalityreduction.PCA.pca_factor(inputFeatures, projectedDimension, normalize);
 INDArray reduced = inputFeatures.mmul(factor);

Use a schema to define the structure of the data: The following is an example of a basic schema for a customer churn dataset. You can download the dataset from https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling/downloads/bank-customer-churn-modeling.zip/1:

  Schema schema = new Schema.Builder()
 .addColumnString("RowNumber")
 .addColumnInteger("CustomerId")
 .addColumnString("Surname")
 .addColumnInteger("CreditScore")
 .addColumnCategorical("Geography",  
  Arrays.asList("France","Germany","Spain"))
 .addColumnCategorical("Gender", Arrays.asList("Male","Female"))
 .addColumnsInteger("Age", "Tenure")
 .addColumnDouble("Balance")
 .addColumnsInteger("NumOfProducts","HasCrCard","IsActiveMember")
 .addColumnDouble("EstimatedSalary")
 .build();

Table of Contents for How to do it...

Create new playlist

Sign In

Sign Up

Table of Contents for
How to do it...