How to do it...

  1. Identify the outliers in the data: For a small dataset with only a few features, we can spot outliers and noise by manual inspection. For a dataset with a large number of features, we can first reduce its dimensionality with Principal Component Analysis (PCA), as shown in the following code:
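// Compute the factor matrix that maps the features onto projectedDimension principal components.
INDArray factor = org.nd4j.linalg.dimensionalityreduction.PCA.pca_factor(inputFeatures, projectedDimension, normalize);
// Project the input features into the reduced space.
INDArray reduced = inputFeatures.mmul(factor);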
  2. Use a schema to define the structure of the data: The following is an example of a basic schema for a customer churn dataset. You can download the dataset from https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling/downloads/bank-customer-churn-modeling.zip/1:
Schema schema = new Schema.Builder()
    .addColumnString("RowNumber")
    .addColumnInteger("CustomerId")
    .addColumnString("Surname")
    .addColumnInteger("CreditScore")
    .addColumnCategorical("Geography",
        Arrays.asList("France","Germany","Spain"))
    .addColumnCategorical("Gender", Arrays.asList("Male","Female"))
    .addColumnsInteger("Age", "Tenure")
    .addColumnDouble("Balance")
    .addColumnsInteger("NumOfProducts","HasCrCard","IsActiveMember")
    .addColumnDouble("EstimatedSalary")
    .build();
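
For the small-dataset case in step 1, manual inspection can be backed by a simple numeric check. The following is a minimal sketch in plain Java that flags values in a single numeric column using the common 1.5 x IQR rule. This rule is a swapped-in illustration and not part of the recipe's PCA approach; the class name `IqrOutlierCheck`, its methods, and the sample values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of ND4J or DataVec.
final class IqrOutlierCheck {

    // Quantile by linear interpolation on an already sorted, non-empty array (q in [0, 1]).
    static double quantile(double[] sorted, double q) {
        double pos = q * (sorted.length - 1);
        int lower = (int) Math.floor(pos);
        int upper = (int) Math.ceil(pos);
        return sorted[lower] + (pos - lower) * (sorted[upper] - sorted[lower]);
    }

    // Returns the indices of values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
    static List<Integer> findOutlierIndices(double[] values) {
        List<Integer> outliers = new ArrayList<>();
        if (values.length == 0) {
            return outliers; // nothing to inspect
        }
        double[] sorted = values.clone(); // keep the caller's order intact
        Arrays.sort(sorted);
        double q1 = quantile(sorted, 0.25);
        double q3 = quantile(sorted, 0.75);
        double iqr = q3 - q1;
        double low = q1 - 1.5 * iqr;
        double high = q3 + 1.5 * iqr;
        for (int i = 0; i < values.length; i++) {
            if (values[i] < low || values[i] > high) {
                outliers.add(i);
            }
        }
        return outliers;
    }

    public static void main(String[] args) {
        // Made-up column values; index 5 is far from the rest.
        double[] column = {10, 12, 11, 13, 12, 100, 11, 12};
        System.out.println(findOutlierIndices(column)); // prints [5]
    }
}
```

The 1.5 multiplier is a convention rather than a fixed requirement. A check like this looks at one column at a time, so it does not replace PCA when outliers only show up across many features together.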