Data can be spread across multiple files, subdirectories, or clusters. We need a mechanism to extract and handle data in different ways, depending on constraints such as size. In distributed environments, a large amount of data can be stored as chunks across multiple clusters. DataVec uses InputSplit for this purpose.
In step 1, we looked at FileSplit, an InputSplit implementation that splits the root directory into files. FileSplit recursively looks for files inside the specified directory location. You can also pass an array of strings as a parameter to denote the allowed file extensions:
- Sample input: A directory location with files:
- Sample output: A list of URIs with the filter applied:
In the sample output, we filtered out any file paths that are not in the .jpeg format. CollectionInputSplit would be useful here if you want to extract data from a list of URIs, as we did in step 2. In step 2, the temp directory has a list of files in it. We used CollectionInputSplit to generate a list of URIs from those files. While FileSplit is specifically for splitting a directory into files (a list of URIs), CollectionInputSplit is a simple InputSplit implementation that handles a collection of URI inputs. If we already have a list of URIs to process, we can simply use CollectionInputSplit instead of FileSplit.
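The extension filtering that FileSplit performs can be approximated in plain Java. The following sketch (the class name, temp directory, and file names are hypothetical, not part of the DataVec API) walks a directory recursively and keeps only files with the allowed extensions:

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ExtensionFilterSketch {
    // Walk a directory recursively and keep only files with the allowed
    // extensions -- roughly what FileSplit does when given an extension array.
    public static List<URI> listUris(Path root, String... allowedExtensions) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> Arrays.stream(allowedExtensions)
                                           .anyMatch(ext -> p.toString().endsWith(ext)))
                        .map(Path::toUri)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("filesplit-demo");
        Files.createFile(tmp.resolve("cat.jpeg"));
        Files.createFile(tmp.resolve("notes.txt"));
        System.out.println(listUris(tmp, ".jpeg").size()); // prints 1: only cat.jpeg passes
    }
}
```

The real FileSplit does this traversal and filtering internally and hands the surviving URIs to the record reader.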
- Sample input: A directory location with files. Refer to the following screenshot (directory with image files as input):
- Sample output: A list of URIs. Refer to the following list of URIs generated by CollectionInputSplit from the aforementioned input.
In step 3, NumberedFileInputSplit generates URIs based on the specified numbering format.
Note that we need to pass a filename pattern with an appropriate numbering placeholder so that filenames are generated in a sequential format. Otherwise, it will throw runtime errors. The pattern allows us to accept inputs in various numbered formats. NumberedFileInputSplit will generate a list of URIs that you can pass downstream in order to extract and process data. We added the %d format specifier at the end of the filename to indicate that the numbering is present at the trailing end.
- Sample input: A directory location with files in a numbered naming format, for example, file1.txt, file2.txt, and file3.txt.
- Sample output: A list of URIs:
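The expansion of the %d placeholder can be sketched without DataVec. The following hypothetical helper mimics what NumberedFileInputSplit does with a pattern such as file%d.txt and an inclusive index range, including the runtime rejection of patterns that lack %d:

```java
import java.util.ArrayList;
import java.util.List;

public class NumberedSplitSketch {
    // Expand a pattern such as "file%d.txt" over an inclusive index range,
    // mimicking how NumberedFileInputSplit generates its URI list.
    public static List<String> expand(String pattern, int minIdx, int maxIdx) {
        if (!pattern.contains("%d")) {
            // NumberedFileInputSplit likewise fails at runtime without %d
            throw new IllegalArgumentException("pattern must contain %d");
        }
        List<String> names = new ArrayList<>();
        for (int i = minIdx; i <= maxIdx; i++) {
            names.add(String.format(pattern, i));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(expand("file%d.txt", 1, 3)); // [file1.txt, file2.txt, file3.txt]
    }
}
```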
If you need to map input URIs to different output URIs, then you will need TransformSplit. We used it in step 4 to normalize/transform the data URI into the required format. It is especially helpful when features and labels are kept at different locations. When step 4 is executed, the "." string will be stripped from the URIs, which results in the following URIs:
- Sample input: A collection of URIs, just like what we saw in CollectionInputSplit. However, TransformSplit can accept erroneous URIs:
- Sample output: A list of URIs after formatting them:
After executing step 5, the -in.csv substrings in the URIs will be replaced with -out.csv.
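The search-and-replace mapping behind that step can be sketched in plain Java. The class and the example URI below are hypothetical; the sketch only illustrates the input-URI to output-URI mapping that TransformSplit performs:

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SearchReplaceSketch {
    // Map each input URI to an output URI by substring replacement,
    // the same idea as a search/replace TransformSplit.
    public static List<URI> searchReplace(List<URI> inputs, String search, String replace) {
        return inputs.stream()
                     .map(u -> URI.create(u.toString().replace(search, replace)))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<URI> in = Arrays.asList(URI.create("file:/data/shipments-in.csv"));
        System.out.println(searchReplace(in, "-in.csv", "-out.csv")); // [file:/data/shipments-out.csv]
    }
}
```

This is why keeping features and labels at parallel locations works well: one substring substitution maps every feature URI to its label URI.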
CSVRecordReader is a simple CSV record reader for streaming CSV data. We can form data stream objects based on the delimiter and specify various other parameters, such as the number of lines to skip at the beginning. In step 6, we used CSVRecordReader for this purpose.
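The effect of those two parameters can be sketched in plain Java. The helper below (a hypothetical stand-in, not the DataVec class) skips a number of header lines and then splits each remaining line on the delimiter:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class CsvReaderSketch {
    // Split delimited lines into records after skipping header lines,
    // echoing CSVRecordReader's skip-lines and delimiter parameters.
    public static List<List<String>> read(List<String> lines, int skipNumLines, char delimiter) {
        String quoted = Pattern.quote(String.valueOf(delimiter));
        return lines.stream()
                    .skip(skipNumLines)
                    .map(line -> Arrays.asList(line.split(quoted, -1)))
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("id,amount", "1,20.5", "2,13.0");
        System.out.println(read(lines, 1, ',')); // [[1, 20.5], [2, 13.0]]
    }
}
```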
ImageRecordReader is an image record reader that's used for streaming image data.
In step 7, we read images from the local filesystem. We then scaled and converted them according to a given height, width, and number of channels. We can also specify the labels that are to be tagged for the image data. To specify the labels for the image set, create separate subdirectories under the root directory; each subdirectory name represents a label.
In step 7, the first two parameters from the ImageRecordReader constructor represent the height and width to which images are to be scaled. We usually give a value of 3 for channels representing R, G, and B. parentPathLabelGenerator will define how to tag labels in images. trainData is the inputSplit we need in order to specify the range of records to load, while transform is the image transformation to be applied while loading images.
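The labeling convention behind parentPathLabelGenerator can be sketched in a few lines. The class and paths below are hypothetical; the sketch only shows the rule of taking the immediate parent directory name as the label:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ParentLabelSketch {
    // Derive the label from the immediate parent directory name -- the same
    // idea ParentPathLabelGenerator applies to each image URI.
    public static String labelFor(Path imagePath) {
        return imagePath.getParent().getFileName().toString();
    }

    public static void main(String[] args) {
        // e.g. <root>/cats/img001.jpeg is tagged with the label "cats"
        System.out.println(labelFor(Paths.get("dataset", "cats", "img001.jpeg"))); // cats
    }
}
```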
The ImageNet website can be found at http://www.image-net.org/.
TransformProcessRecordReader requires a bit of explanation when it's used in the schema transformation process. TransformProcessRecordReader is the end product of applying a schema transformation to a record reader. It ensures that a defined transformation process is applied to the data before it is fed to the network for training.
In step 8, transformProcess defines an ordered list of transformations to be applied to the given dataset. This can be the removal of unwanted features, feature data type conversions, and so on. The intent is to make the data suitable for the neural network to process further. You will learn how to create a transformation process in the upcoming recipes in this chapter.
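The core idea of an ordered transformation chain can be sketched without DataVec types. In the following hypothetical sketch, each step is a plain function from record to record, standing in for TransformProcess steps such as column removal or type conversion:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class TransformChainSketch {
    // Apply an ordered list of record transformations one after another,
    // mirroring how a TransformProcess applies its steps in sequence.
    public static List<String> apply(List<String> record,
                                     List<UnaryOperator<List<String>>> steps) {
        List<String> current = record;
        for (UnaryOperator<List<String>> step : steps) {
            current = step.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // Hypothetical steps: drop the last (unwanted) column, then
        // upper-case the first one.
        List<UnaryOperator<List<String>>> steps = Arrays.asList(
            r -> r.subList(0, r.size() - 1),
            r -> {
                List<String> copy = new ArrayList<>(r);
                copy.set(0, copy.get(0).toUpperCase());
                return copy;
            }
        );
        System.out.println(apply(Arrays.asList("usd", "42", "unused"), steps)); // [USD, 42]
    }
}
```

Because the steps run in order, the output of each transformation becomes the input of the next, which is exactly why the ordering of a TransformProcess matters.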
In step 9, we looked at some of the implementations of SequenceRecordReader. We use this record reader if we have a sequence of records to process. This record reader can be used locally as well as in distributed environments (such as Spark).
CodecRecordReader is a record reader that handles multimedia datasets and can be used for the following purposes:
- H.264 (AVC) main profile decoder
- MP3 decoder/encoder
- Apple ProRes decoder and encoder
- H264 Baseline profile encoder
- Matroska (MKV) demuxer and muxer
- MP4 (ISO BMF, QuickTime) demuxer/muxer and tools
- MPEG 1/2 decoder
- MPEG PS/TS demuxer
- Java player applet parsing
- VP8 encoder
- MXF demuxer
CodecRecordReader makes use of jcodec as the underlying media parser.
RegexSequenceRecordReader will consider the entire file as a single sequence and will read it one line at a time. Then, it will split each of them using the specified regular expression. We can combine RegexSequenceRecordReader with NumberedFileInputSplit to read file sequences. In step 9, we used RegexSequenceRecordReader to read the transactional logs that were recorded over the time steps (time series data). In our dataset (logdata.zip), transactional logs are unsupervised data with no specification for features or labels.
CSVSequenceRecordReader reads the sequences of data in CSV format. Each sequence represents a separate CSV file. Each line represents one time step.
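That one-file-per-sequence, one-line-per-time-step layout can be sketched in plain Java. The class below is a hypothetical stand-in that reads a single CSV file as one sequence:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CsvSequenceSketch {
    // Read one CSV file as one sequence: each line is a time step and
    // each comma-separated value a feature, as CSVSequenceRecordReader assumes.
    public static List<List<String>> readSequence(Path csvFile) throws IOException {
        return Files.readAllLines(csvFile).stream()
                    .map(line -> Arrays.asList(line.split(",", -1)))
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("seq", ".csv");
        Files.write(f, Arrays.asList("0.1,0.2", "0.3,0.4"));
        System.out.println(readSequence(f)); // two time steps of two values each
    }
}
```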
In step 10, JacksonLineRecordReader reads JSON/XML/YAML data line by line. It expects a valid JSON entry on each line, with no separator at the end. This follows the Hadoop convention of ensuring that splits work properly in a cluster environment. If a record spans multiple lines, the split won't work as expected and may result in calculation errors. Unlike JacksonRecordReader, JacksonLineRecordReader doesn't create labels automatically, so you will need to specify them via the configuration during training.
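The one-record-per-line contract can be illustrated with a small sketch. The class below is hypothetical and uses a deliberately simplistic completeness check (a real reader would parse the JSON properly); it only demonstrates why a record spanning multiple lines breaks the split:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class JsonLineSketch {
    // Treat every non-empty line as one self-contained record, as
    // JacksonLineRecordReader expects; a line that is plainly an incomplete
    // object is rejected (a simplistic stand-in for real JSON validation).
    public static List<String> readRecords(List<String> lines) {
        List<String> records = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            if (!(line.startsWith("{") && line.endsWith("}"))) {
                throw new IllegalArgumentException("record spans multiple lines: " + line);
            }
            records.add(line);
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("{\"sensor\":1}", "{\"sensor\":2}");
        System.out.println(readRecords(lines).size()); // prints 2
    }
}
```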