How it works...

Data can be spread across multiple files, subdirectories, or multiple clusters. We need a mechanism to extract and handle it in different ways, depending on constraints such as size. In distributed environments, large amounts of data are often stored as chunks across multiple clusters. DataVec uses InputSplit for this purpose.

In step 1, we looked at FileSplit, an InputSplit implementation that splits the root directory into files. FileSplit recursively looks for files inside the specified directory location. You can also pass an array of strings as a parameter to denote the allowed extensions (a usage sketch follows the sample output below):

  • Sample input: A directory location with files.

  • Sample output: A list of URIs with the extension filter applied.
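The following is a minimal sketch of this usage; the directory path and the .jpeg filter are placeholders that you would replace with your own:

import java.io.File;
import java.net.URI;
import org.datavec.api.split.FileSplit;

public class FileSplitExample {
    public static void main(String[] args) {
        File rootDir = new File("/tmp/images");   // hypothetical directory
        String[] allowedExtensions = {"jpeg"};    // keep only .jpeg files
        // FileSplit recursively collects matching files under rootDir
        FileSplit fileSplit = new FileSplit(rootDir, allowedExtensions);
        for (URI uri : fileSplit.locations()) {
            System.out.println(uri);
        }
    }
}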

In the sample output, we filtered out any file paths that are not in the .jpeg format. CollectionInputSplit would be useful here if you want to extract data from a list of URIs, as we did in step 2. In step 2, the temp directory has a list of files in it. We used CollectionInputSplit to generate a list of URIs from those files. While FileSplit is specifically for splitting a directory into files (a list of URIs), CollectionInputSplit is a simple InputSplit implementation that handles a collection of URI inputs. If we already have a list of URIs to process, we can simply use CollectionInputSplit instead of FileSplit (a sketch follows the sample output below).

  • Sample input: A directory location with image files.

  • Sample output: A list of URIs generated by CollectionInputSplit from the preceding input.
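Here is a minimal sketch, using hypothetical file URIs in place of the temp directory contents:

import java.net.URI;
import java.util.Arrays;
import java.util.List;
import org.datavec.api.split.CollectionInputSplit;

public class CollectionSplitExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical URIs; in step 2, these came from files in the temp directory
        List<URI> uris = Arrays.asList(
                new URI("file:/tmp/file1.jpeg"),
                new URI("file:/tmp/file2.jpeg"));
        CollectionInputSplit split = new CollectionInputSplit(uris);
        // The split simply exposes the same URIs, ready for a record reader
        System.out.println(Arrays.toString(split.locations()));
    }
}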

In step 3, NumberedFileInputSplit generates URIs based on the specified numbering format. 

Note that we need to pass an appropriate pattern to generate filenames in a sequential format; otherwise, it will throw runtime errors. The pattern allows us to accept inputs in various numbered formats. NumberedFileInputSplit will generate a list of URIs that you can pass down the pipeline to extract and process data. We added the %d placeholder at the end of the filename to specify that the numbering is present at the trailing end (a sketch follows the sample output below).

  • Sample input: A directory location with files in a numbered naming format, for example, file1.txt, file2.txt, and file3.txt.
  • Sample output: A list of URIs.
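A minimal sketch, assuming files named file1.txt through file3.txt under a hypothetical directory:

import java.net.URI;
import org.datavec.api.split.NumberedFileInputSplit;

public class NumberedSplitExample {
    public static void main(String[] args) {
        // %d marks where the trailing number appears in each filename
        NumberedFileInputSplit split =
                new NumberedFileInputSplit("/tmp/data/file%d.txt", 1, 3);
        for (URI uri : split.locations()) {
            System.out.println(uri); // file1.txt, file2.txt, file3.txt
        }
    }
}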

If you need to map input URIs to different output URIs, then you will need TransformSplit. We used it in step 4 to normalize/transform the data URIs into the required format. It is especially helpful if features and labels are kept at different locations. When step 4 is executed, the "." string will be stripped from the URIs, which results in the following URIs:

  • Sample input: A collection of URIs, just like what we saw with CollectionInputSplit; however, TransformSplit can also accept erroneous URIs.

  • Sample output: A list of URIs after formatting them.

After executing step 5, the -in.csv substrings in the URIs will be replaced with -out.csv.
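A minimal sketch of this search-and-replace transformation, using a hypothetical input URI:

import java.net.URI;
import java.util.Arrays;
import org.datavec.api.split.CollectionInputSplit;
import org.datavec.api.split.TransformSplit;

public class TransformSplitExample {
    public static void main(String[] args) throws Exception {
        CollectionInputSplit source = new CollectionInputSplit(
                Arrays.asList(new URI("file:/tmp/data-in.csv"))); // hypothetical URI
        // Replaces every "-in.csv" substring with "-out.csv"
        TransformSplit transformed =
                TransformSplit.ofSearchReplace(source, "-in.csv", "-out.csv");
        System.out.println(Arrays.toString(transformed.locations())); // .../data-out.csv
    }
}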

CSVRecordReader is a simple CSV record reader for streaming CSV data. We can form data stream objects based on delimiters and specify various other parameters, such as the number of lines to skip at the beginning. In step 6, we used CSVRecordReader for this purpose.

For the CSVRecordReader example, use the titanic.csv file that's included in this chapter's GitHub repository. You need to update the directory path in the code to be able to use it.
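The following is a minimal sketch; the file path is a placeholder for your local copy of titanic.csv, and note that, depending on your DataVec version, the delimiter parameter may be a char or a String:

import java.io.File;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;

public class CsvReaderExample {
    public static void main(String[] args) throws Exception {
        int linesToSkip = 1;   // skip the header row
        char delimiter = ',';
        CSVRecordReader reader = new CSVRecordReader(linesToSkip, delimiter);
        reader.initialize(new FileSplit(new File("/tmp/titanic.csv"))); // placeholder path
        while (reader.hasNext()) {
            System.out.println(reader.next()); // each record is a List<Writable>
        }
        reader.close();
    }
}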

ImageRecordReader is an image record reader that's used for streaming image data.

In step 7, we read images from the local filesystem. Then, we scaled and converted them according to a given height, width, and number of channels. We can also specify the labels to be tagged for the image data. In order to specify labels for the image set, create separate subdirectories under the root directory; each subdirectory name represents a label.

In step 7, the first two parameters of the ImageRecordReader constructor represent the height and width to which the images are to be scaled. We usually give a value of 3 for channels, representing R, G, and B. parentPathLabelGenerator defines how to tag labels for the images. trainData is the InputSplit we need in order to specify the range of records to load, while transform is the image transformation to be applied while loading the images.
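Here is a minimal sketch of this constructor in use; the dimensions and directory path are example values, and the parent directory is assumed to contain one subdirectory per label:

import java.io.File;
import org.datavec.api.io.labels.ParentPathLabelGenerator;
import org.datavec.api.split.FileSplit;
import org.datavec.image.recordreader.ImageRecordReader;

public class ImageReaderExample {
    public static void main(String[] args) throws Exception {
        int height = 224, width = 224, channels = 3;   // example dimensions
        ParentPathLabelGenerator labelMaker = new ParentPathLabelGenerator();
        ImageRecordReader reader =
                new ImageRecordReader(height, width, channels, labelMaker);
        // Parent directory with one subdirectory per label, e.g. /tmp/images/dog
        FileSplit trainData = new FileSplit(new File("/tmp/images"));
        // An ImageTransform can be passed as a second argument if needed
        reader.initialize(trainData);
    }
}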

For the ImageRecordReader example, you can download some sample images from ImageNet. Each category of images will be represented by a subdirectory. For example, you can download dog images and put them under a subdirectory named "dog". You will need to provide the parent directory path where all the possible categories will be included. 

The ImageNet website can be found at http://www.image-net.org/.

TransformProcessRecordReader requires a bit of explanation when it's used in the schema transformation process. TransformProcessRecordReader is the end product of applying a schema transformation to a record reader. It ensures that the defined transformation process is applied to the records before they are fed in as training data.

In step 8, transformProcess defines an ordered list of transformations to be applied to the given dataset. This can be the removal of unwanted features, feature data type conversions, and so on. The intent is to make the data suitable for the neural network to process further. You will learn how to create a transformation process in the upcoming recipes in this chapter. 
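As a brief illustration, here is a sketch of a record reader wrapped with a transformation process; the schema and column names are hypothetical:

import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.records.reader.impl.transform.TransformProcessRecordReader;
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.schema.Schema;

public class TransformProcessExample {
    public static void main(String[] args) {
        // Hypothetical schema for a two-column CSV file
        Schema schema = new Schema.Builder()
                .addColumnString("name")
                .addColumnInteger("age")
                .build();
        // Remove an unwanted feature before the data reaches the network
        TransformProcess tp = new TransformProcess.Builder(schema)
                .removeColumns("name")
                .build();
        RecordReader reader = new TransformProcessRecordReader(new CSVRecordReader(), tp);
        // Initialize the reader with an InputSplit before reading records
    }
}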

For the TransformProcessRecordReader example, use the transform-data.csv file that's included in this chapter's GitHub repository. You need to update the file path in code to be able to use it.

In step 9, we looked at some of the implementations of SequenceRecordReader. We use this record reader if we have a sequence of records to process. This record reader can be used locally as well as in distributed environments (such as Spark).

For the SequenceRecordReader example, you need to extract the dataset.zip file from this chapter's GitHub repository. After the extraction, you will see two subdirectories underneath: features and labels. In each of them, there is a sequence of files. You need to provide the absolute path to these two directories in the code. 
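A minimal sketch of reading such a dataset with CSVSequenceRecordReader and NumberedFileInputSplit; the directory paths, the %d file naming, and the 0 to 9 range are assumptions about the extracted data:

import org.datavec.api.records.reader.SequenceRecordReader;
import org.datavec.api.records.reader.impl.csv.CSVSequenceRecordReader;
import org.datavec.api.split.NumberedFileInputSplit;

public class SequenceReaderExample {
    public static void main(String[] args) throws Exception {
        SequenceRecordReader features = new CSVSequenceRecordReader();
        features.initialize(new NumberedFileInputSplit("/tmp/dataset/features/%d.csv", 0, 9));
        SequenceRecordReader labels = new CSVSequenceRecordReader();
        labels.initialize(new NumberedFileInputSplit("/tmp/dataset/labels/%d.csv", 0, 9));
        // Each file is one sequence; each line within it is one time step
    }
}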

CodecRecordReader is a record reader that handles multimedia datasets and can be used for the following purposes:

  • H.264 (AVC) main profile decoder
  • MP3 decoder/encoder
  • Apple ProRes decoder and encoder
  • H264 Baseline profile encoder
  • Matroska (MKV) demuxer and muxer
  • MP4 (ISO BMF, QuickTime) demuxer/muxer and tools
  • MPEG 1/2 decoder
  • MPEG PS/TS demuxer
  • Java player applet parsing
  • VP8 encoder
  • MXF demuxer

CodecRecordReader makes use of jcodec as the underlying media parser.

For the CodecRecordReader example, you need to provide the directory location of a short video file in the code. This video file will be the input for the CodecRecordReader example. 
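The following sketch shows a typical configuration; the frame counts, dimensions, and video path are example values:

import java.io.File;
import org.datavec.api.conf.Configuration;
import org.datavec.api.split.FileSplit;
import org.datavec.codec.reader.CodecRecordReader;

public class CodecReaderExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set(CodecRecordReader.RAVEL, "true");
        conf.set(CodecRecordReader.START_FRAME, "0");
        conf.set(CodecRecordReader.TOTAL_FRAMES, "100"); // example value
        conf.set(CodecRecordReader.ROWS, "80");          // frame height (example)
        conf.set(CodecRecordReader.COLUMNS, "46");       // frame width (example)
        CodecRecordReader reader = new CodecRecordReader();
        reader.initialize(conf, new FileSplit(new File("/tmp/video.mp4"))); // placeholder
    }
}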

RegexSequenceRecordReader considers the entire file as a single sequence and reads it one line at a time. It then splits each line using the specified regular expression. We can combine RegexSequenceRecordReader with NumberedFileInputSplit to read file sequences. In step 9, we used RegexSequenceRecordReader to read transactional logs that were recorded over time steps (time series data). In our dataset (logdata.zip), the transactional logs are unsupervised data with no specification of features or labels.

For the RegexSequenceRecordReader example, you need to extract the logdata.zip file from this chapter's GitHub repository. After the extraction, you will see a sequence of transactional logs with a numbered file naming format. You need to provide the absolute path to the extracted directory in the code. 
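Here is a sketch of the combination described above; the regular expression and the log file naming are assumptions about the log format:

import org.datavec.api.records.reader.SequenceRecordReader;
import org.datavec.api.records.reader.impl.regex.RegexSequenceRecordReader;
import org.datavec.api.split.NumberedFileInputSplit;

public class RegexSequenceExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical pattern: timestamp, log level, and message per line
        String regex = "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) (\\S+) (.*)";
        SequenceRecordReader reader = new RegexSequenceRecordReader(regex, 0);
        // %d matches the numbered file names, e.g. logdata_0.txt to logdata_9.txt
        reader.initialize(new NumberedFileInputSplit("/tmp/logdata/logdata_%d.txt", 0, 9));
    }
}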

CSVSequenceRecordReader reads the sequences of data in CSV format. Each sequence represents a separate CSV file. Each line represents one time step.

In step 10, JacksonLineRecordReader reads JSON/XML/YAML data line by line. It expects a valid JSON entry on each line, without a separator at the end. This follows the Hadoop convention of ensuring that splits work properly in a cluster environment. If a record spans multiple lines, the split won't work as expected and may result in calculation errors. Unlike JacksonRecordReader, JacksonLineRecordReader doesn't create the labels automatically, so you will need to specify them via the configuration during training.

For the JacksonLineRecordReader example, you need to provide the directory location of irisdata.txt, which is located in this chapter's GitHub repository. In the irisdata.txt file, each line represents a JSON object. 
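A minimal sketch, assuming hypothetical JSON keys for the Iris measurements; the field names and path are placeholders, and the shaded Jackson classes bundled with ND4J are used here:

import java.io.File;
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.jackson.FieldSelection;
import org.datavec.api.records.reader.impl.jackson.JacksonLineRecordReader;
import org.datavec.api.split.FileSplit;
import org.nd4j.shade.jackson.core.JsonFactory;
import org.nd4j.shade.jackson.databind.ObjectMapper;

public class JacksonLineExample {
    public static void main(String[] args) throws Exception {
        // Field names are assumptions about the JSON keys in irisdata.txt
        FieldSelection selection = new FieldSelection.Builder()
                .addField("sepalLength")
                .addField("sepalWidth")
                .addField("petalLength")
                .addField("petalWidth")
                .build();
        RecordReader reader = new JacksonLineRecordReader(
                selection, new ObjectMapper(new JsonFactory()));
        reader.initialize(new FileSplit(new File("/tmp/irisdata.txt"))); // placeholder path
    }
}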