There's more...

In a nutshell, here is what you need to do to build the schema for your datasets:

  • Understand your data well. Identify the noise and signals.
  • Capture features and labels. Identify categorical variables.
  • Identify categorical features that one-hot encoding can be applied to.
  • Pay attention to missing or bad data.
  • Add each feature with the Builder method that matches its type. For integer features, use addColumnInteger() for a single column or addColumnsInteger() for several columns at once. Use the corresponding Builder methods for other data types.
  • Add categorical variables using addColumnCategorical().
  • Call the build() method to build the schema.
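The last three steps can be sketched as follows. This is a minimal illustration that assumes the datavec-api library is on the classpath; the class name, column names, and category states are hypothetical and are not taken from any dataset in this recipe.

```java
import org.datavec.api.transform.schema.Schema;

public class SchemaSketch {

    // Builds a schema for a hypothetical six-column dataset.
    // Columns must be added in the same order as they appear in the data.
    public static Schema buildSchema() {
        return new Schema.Builder()
                .addColumnInteger("customerId")                   // single integer column
                .addColumnsInteger("age", "tenure")               // several integer columns at once
                .addColumnDouble("balance")                       // type-specific method for doubles
                .addColumnCategorical("gender", "Male", "Female") // categorical column and its states
                .addColumnInteger("label")                        // target variable
                .build();
    }

    public static void main(String[] args) {
        // Printing the schema lists each column with its index, name, and type.
        System.out.println(buildSchema());
    }
}
```

Declaring the categorical column with its possible states up front is what later allows a transformation step to one-hot encode it.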

Note that every feature in the dataset must be declared in the schema; you cannot skip or ignore a feature simply by leaving it out. This leaves you with two options. You can remove the outlying features from the dataset, create a schema from the remaining features, and then move on to the transformation process. Alternatively, you can keep all the features in the dataset, declare all of them in the schema, and then deal with the outlying features during the transformation process.
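The second option might look like the following sketch, which assumes the datavec-api library is on the classpath. The class name and column names are hypothetical; here, rowId stands in for a feature we do not want to keep.

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.schema.Schema;

public class RemoveColumnsSketch {

    // The schema declares every column in the (hypothetical) dataset,
    // including the one we intend to drop.
    public static Schema fullSchema() {
        return new Schema.Builder()
                .addColumnString("rowId")
                .addColumnsInteger("age", "tenure")
                .addColumnInteger("label")
                .build();
    }

    // The unwanted column is removed during transformation, not in the schema.
    public static TransformProcess dropUnwanted(Schema schema) {
        return new TransformProcess.Builder(schema)
                .removeColumns("rowId")
                .build();
    }

    public static void main(String[] args) {
        TransformProcess process = dropUnwanted(fullSchema());
        // The final schema describes the data after all transformations are applied.
        System.out.println(process.getFinalSchema());
    }
}
```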

When it comes to feature engineering and data analysis, DataVec comes with its own analytic engine for analyzing feature and target variables. For local execution, we can use AnalyzeLocal to obtain a data analysis object that holds information about each column in the dataset. Here is how you can create a data analysis object from a record reader object:

DataAnalysis analysis = AnalyzeLocal.analyze(mySchema, csvRecordReader);
System.out.println(analysis);

You can also analyze your dataset for missing values and check whether it is schema-compliant by calling analyzeQuality():

DataQualityAnalysis quality = AnalyzeLocal.analyzeQuality(mySchema, csvRecordReader);
System.out.println(quality);
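To make the idea of a quality check more concrete, here is a plain-Java sketch of the kind of counting involved for a single integer column. This is not DataVec's implementation, and the class and method names are hypothetical; it only illustrates how raw values can be classified as valid, missing, or invalid against the declared column type.

```java
import java.util.Arrays;
import java.util.List;

public class ColumnQualitySketch {

    /** Counts of valid, missing, and invalid entries for one integer column. */
    public static final class IntegerQuality {
        public final long valid;
        public final long missing;
        public final long invalid;

        IntegerQuality(long valid, long missing, long invalid) {
            this.valid = valid;
            this.missing = missing;
            this.invalid = invalid;
        }

        @Override
        public String toString() {
            return "valid=" + valid + ", missing=" + missing + ", invalid=" + invalid;
        }
    }

    // Classifies each raw value of a column that the schema declares as integer.
    public static IntegerQuality analyzeIntegerColumn(List<String> rawValues) {
        long valid = 0;
        long missing = 0;
        long invalid = 0;
        for (String raw : rawValues) {
            if (raw == null || raw.trim().isEmpty()) {
                missing++;               // null or blank entry: missing value
                continue;
            }
            try {
                Integer.parseInt(raw.trim());
                valid++;                 // parses as an integer: schema-compliant
            } catch (NumberFormatException e) {
                invalid++;               // present, but not an integer
            }
        }
        return new IntegerQuality(valid, missing, invalid);
    }

    public static void main(String[] args) {
        // Hypothetical raw values for an "age" column.
        List<String> age = Arrays.asList("42", "", "thirty", "35", null);
        System.out.println(analyzeIntegerColumn(age));
    }
}
```

A column with a high missing or invalid count is a candidate for cleaning or removal before training.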

For sequence data, you need to use analyzeQualitySequence() instead of analyzeQuality(). For data analysis on Spark, you can use the AnalyzeSpark utility class in place of AnalyzeLocal.
