Data Extraction, Transformation, and Loading

Let's discuss the most important part of any machine learning puzzle: data preprocessing and normalization. Garbage in, garbage out applies here more than anywhere else: the more noise we let through, the more undesirable the outputs we will receive. Therefore, we need to remove noise while keeping the signal intact.

Another challenge is handling various types of data. We need to convert raw datasets into a format that a neural network can understand and perform scientific computations on. That means converting the data into numeric vectors, since vectors are the only type of data a neural network can consume.
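To make the idea of vectorization concrete, here is a minimal plain-Java sketch of turning a raw record with a categorical field into a flat numeric vector via one-hot encoding. DataVec automates exactly this kind of conversion; the class, field names, and category values below are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Simplified illustration of vectorization: a record with numeric fields
// (age, income) and a categorical field (city) becomes one numeric vector.
public class Vectorizer {

    // Known categories for the categorical column (hypothetical values).
    static final List<String> CATEGORIES = Arrays.asList("LONDON", "PARIS", "TOKYO");

    // One-hot encode a category: a vector with a 1 at the category's index.
    static double[] oneHot(String category) {
        double[] encoded = new double[CATEGORIES.size()];
        int index = CATEGORIES.indexOf(category);
        if (index >= 0) {
            encoded[index] = 1.0;
        }
        return encoded;
    }

    // Combine the numeric features with the one-hot part into a single vector.
    static double[] toVector(double age, double income, String city) {
        double[] oneHot = oneHot(city);
        double[] vector = new double[2 + oneHot.length];
        vector[0] = age;
        vector[1] = income;
        System.arraycopy(oneHot, 0, vector, 2, oneHot.length);
        return vector;
    }

    public static void main(String[] args) {
        // (age=35, income=52000, city=PARIS) -> [35.0, 52000.0, 0.0, 1.0, 0.0]
        System.out.println(Arrays.toString(toVector(35, 52000, "PARIS")));
    }
}
```

In DataVec, the same job is done declaratively through a schema and a transform process rather than hand-written encoders, as the recipes in this chapter will show.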

There also has to be a strategy for how data is loaded into a neural network. We cannot feed a million records to a neural network at once – that would degrade performance (by performance, we mean training time here). To keep training time reasonable, we make use of data pipelines, batch training, and other sampling techniques.
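The data side of batch training can be sketched in a few lines: instead of handing the network the whole dataset, we split the records into fixed-size mini-batches. In DL4J this is handled for you by a `DataSetIterator`; the stdlib version below only illustrates the idea, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of mini-batching: split a large record list into
// fixed-size chunks so the network never processes the whole dataset at once.
public class MiniBatcher {

    static <T> List<List<T>> batches(List<T> records, int batchSize) {
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < records.size(); start += batchSize) {
            int end = Math.min(start + batchSize, records.size());
            result.add(new ArrayList<>(records.subList(start, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) records.add(i);
        // 10 records with batch size 4 -> batches of sizes 4, 4, and 2
        System.out.println(batches(records, 4));
    }
}
```

Note that the last batch may be smaller than the others; real iterators make the same choice, which is why batch size is a pipeline parameter rather than a property of the data.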

DataVec is an input/output format system that can manage everything we just mentioned, solving the biggest headaches in any deep learning puzzle. DataVec supports all major types of input data, such as text, images, CSV files, and videos. The DataVec library manages the data pipeline in DL4J.

In this chapter, we will learn how to perform ETL operations using DataVec. This is the first step in building a neural network in DL4J.

We will cover the following recipes:

  • Reading and iterating through data
  • Performing schema transformations
  • Serializing transforms
  • Building a transform process
  • Executing a transform process
  • Normalizing data for network efficiency
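As a preview of the last recipe above, here is the math behind min-max normalization: scaling each value of a feature column into the range [0, 1]. DL4J provides `NormalizerMinMaxScaler` for this; the plain-Java sketch below (class name hypothetical) only shows the underlying formula.

```java
import java.util.Arrays;

// Minimal sketch of min-max scaling: x' = (x - min) / (max - min),
// mapping every value of a feature column into [0, 1].
public class MinMaxScaler {

    static double[] scale(double[] values) {
        double min = Arrays.stream(values).min().orElse(0);
        double max = Arrays.stream(values).max().orElse(0);
        double range = max - min;
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // Guard against a constant column (range == 0).
            scaled[i] = range == 0 ? 0.0 : (values[i] - min) / range;
        }
        return scaled;
    }

    public static void main(String[] args) {
        // {10, 20, 30} -> [0.0, 0.5, 1.0]
        System.out.println(Arrays.toString(scale(new double[]{10, 20, 30})));
    }
}
```

Keeping features on a common scale like this prevents columns with large raw magnitudes from dominating the gradient updates, which is what "network efficiency" refers to in the recipe title.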