Identify the outliers in the data: For a small dataset with just a few features, we can spot outliers and noise by manual inspection. For a dataset with a large number of features, we can first perform Principal Component Analysis (PCA) to project the data onto fewer dimensions, where unusual points are easier to inspect, as shown in the following code:
INDArray factor = org.nd4j.linalg.dimensionalityreduction.PCA.pca_factor(inputFeatures, projectedDimension, normalize);
INDArray reduced = inputFeatures.mmul(factor);
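Once the data is down to a few columns (raw features or PCA components), a simple numeric check can back up the manual inspection. The following is a minimal sketch in plain Java, not part of ND4J: it flags values whose z-score (distance from the mean, in standard deviations) exceeds a threshold. The class name `OutlierSketch`, the method `flagOutliers`, the sample values, and the threshold of 2.5 are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlierSketch {

    /** Returns the indices of values whose absolute z-score exceeds the threshold. */
    public static List<Integer> flagOutliers(double[] values, double threshold) {
        // Mean of the column.
        double mean = 0.0;
        for (double v : values) {
            mean += v;
        }
        mean /= values.length;

        // Population standard deviation of the column.
        double variance = 0.0;
        for (double v : values) {
            variance += (v - mean) * (v - mean);
        }
        double std = Math.sqrt(variance / values.length);

        List<Integer> flagged = new ArrayList<>();
        if (std == 0.0) {
            return flagged; // all values are identical, so nothing stands out
        }
        for (int i = 0; i < values.length; i++) {
            if (Math.abs(values[i] - mean) / std > threshold) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Illustrative column: nine similar values and one extreme value.
        double[] column = {10, 11, 9, 10, 12, 10, 11, 9, 10, 95};
        System.out.println(flagOutliers(column, 2.5)); // prints [9]
    }
}
```

A z-score check assumes the column is roughly bell-shaped, and a single extreme value inflates the standard deviation it is measured against, so treat the flagged indices as candidates to inspect rather than as rows to drop automatically.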
Use a schema to define the structure of the data: The following is an example of a basic schema for a customer churn dataset, which you can download from https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling/downloads/bank-customer-churn-modeling.zip/1:
Schema schema = new Schema.Builder()
    .addColumnString("RowNumber")
    .addColumnInteger("CustomerId")
    .addColumnString("Surname")
    .addColumnInteger("CreditScore")
    .addColumnCategorical("Geography", Arrays.asList("France","Germany","Spain"))
    .addColumnCategorical("Gender", Arrays.asList("Male","Female"))
    .addColumnsInteger("Age", "Tenure")
    .addColumnDouble("Balance")
    .addColumnsInteger("NumOfProducts","HasCrCard","IsActiveMember")
    .addColumnDouble("EstimatedSalary")
    .build();
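To make the role of a schema concrete, here is a minimal plain-Java sketch of the idea behind it: an ordered list of column names with declared types, against which each raw row can be checked. It is not the DataVec implementation. The class `SchemaSketch`, the enum `ColumnType`, the method `matches`, the three-column subset, and the sample rows are all hypothetical and chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaSketch {

    enum ColumnType { STRING, INTEGER, DOUBLE }

    // Hypothetical subset of the churn columns; insertion order is the file order.
    static final Map<String, ColumnType> COLUMNS = new LinkedHashMap<>();
    static {
        COLUMNS.put("Surname", ColumnType.STRING);
        COLUMNS.put("CreditScore", ColumnType.INTEGER);
        COLUMNS.put("Balance", ColumnType.DOUBLE);
    }

    /** Returns true when the row has exactly one parseable value per declared column. */
    public static boolean matches(String[] row) {
        if (row.length != COLUMNS.size()) {
            return false;
        }
        int i = 0;
        for (ColumnType type : COLUMNS.values()) {
            String value = row[i++];
            try {
                switch (type) {
                    case INTEGER:
                        Integer.parseInt(value);
                        break;
                    case DOUBLE:
                        Double.parseDouble(value);
                        break;
                    default:
                        break; // any string is accepted for STRING columns
                }
            } catch (NumberFormatException e) {
                return false; // value does not fit the declared type
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up rows, not taken from the dataset.
        System.out.println(matches(new String[] {"Smith", "650", "1200.50"})); // prints true
        System.out.println(matches(new String[] {"Smith", "n/a", "1200.50"})); // prints false
    }
}
```

Declaring the types up front is what lets later transformation steps refer to columns by name and reject malformed rows early, instead of failing deep inside training.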