Chapter 2. Loading Data

There are many ways of loading data into RapidMiner. This chapter covers the most important ones: reading from files and databases. Keeping in mind that we are exploring data, the chapter focuses on using RapidMiner's features to work around the typical issues that real data can cause. Some issues can occur without being noticed, so it is particularly important to be vigilant.

Reading files

The most used operator for importing flat files is Read CSV. This is a powerful operator that is capable of far more than importing comma-separated values. The operator has a wizard that can be used to take some of the tedious work out of setting up the details of the fields to import, and generally, this works well. However, a few points require care.

An example process, ReadCSVGuessing.xml, is provided with the files that accompany this book. This process creates a CSV file and then reads it back in, and can be used alongside the following sections to understand them more quickly. Running it once creates the CSV file. The process includes a Read CSV operator, and opening the wizard from the operator's configuration options will make the next sections easier to follow.

At the most basic level, the result of running the Read CSV operator is an example set based on the contents of the file. The following screenshot shows a subset of the CSV file generated by the ReadCSVGuessing.xml process:

[Screenshot: a subset of the CSV file generated by ReadCSVGuessing.xml]

Each line consists of three entries separated by semicolons. The first line corresponds to the names of the attributes, and the subsequent lines correspond to the values for the rows. Notice how the fields are surrounded by double quotation marks. All in all, this is a very well-behaved file and easy to read. Real files often have all sorts of issues, which we will cover shortly.

After the file is read, the RapidMiner example set corresponding to this data is displayed as shown in the following screenshot:

[Screenshot: the example set produced by Read CSV]

This is the first opportunity to explore the data, and an inspection of this process shows how the fields in the CSV file map to the examples and attributes in the example set.

As noted earlier, we are exploring real data, which is always more challenging. A good first step is to use the wizard built into the Read CSV operator. The wizard is started by clicking on the Import Configuration Wizard button in the operator's parameter view.

The wizard uses the first 100 rows that it encounters to infer the types of the attributes (this value is configurable from the RapidMiner settings options). If these rows contain two possible values for an attribute, RapidMiner will guess that the type is binominal. If the 101st row contains a third possible value, the import will generate an error. Using the ReadCSVGuessing.xml process from earlier, deselect the checkbox to preview the first 100 rows and click on the Reload data button. Errors should show up as shown in the following screenshot:

[Screenshot: errors reported when the data is reloaded]

In this case, changing the type of the last column to polynominal and reloading the data will make the error disappear. Note, however, that clicking Guess value types again will set the type back to binominal.
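The guessing pitfall can be simulated outside RapidMiner. The sketch below is plain Python, not RapidMiner's actual implementation, and the column and value names are invented for illustration: a guesser that inspects only the first 100 rows concludes binominal, and the 101st row invalidates that guess.

```python
# Build a small semicolon-delimited file in memory: the second column
# holds only "yes"/"no" in the first 100 data rows, so a type-guesser
# limited to 100 rows would call it binominal; row 101 adds a third value.
rows = ["id;response"]
rows += [f"{i};{'yes' if i % 2 else 'no'}" for i in range(1, 101)]
rows.append("101;unknown")

sample = [line.split(";")[1] for line in rows[1:101]]   # first 100 data rows
guess = "binominal" if len(set(sample)) == 2 else "polynominal"
print(guess)                        # binominal -- based on the sample only

all_values = {line.split(";")[1] for line in rows[1:]}
print(len(all_values))              # 3 -- the guess fails on the full file
```

This is exactly why changing the type to polynominal by hand fixes the error: the full file, not just the sample, determines the correct type.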

If the file is large, the wizard may take some time, so it may be advisable either to run the import and follow it with the Filter Examples operator (for more details, refer to Chapter 8, Reducing Data Size) to find examples with missing attributes, or to split the file into smaller parts as described in the Splitting files into smaller pieces section.

Alternative delimiters

The field delimiter used in the previous example is a semicolon. In real explorations, many different delimiters will be encountered. It is possible to parse these using regular expressions. For example, the default regular expression provided in step 2 of the wizard is as follows:

,\s*|;\s*

This expression means the following:

Look for a comma followed by some optional white space or look for a semicolon followed by some optional white space.

The expression allows a file containing a mix of commas and semicolons as delimiters to be read, although usually only one type of delimiter is used, most commonly a semicolon or a comma.
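The default expression can be tried with any regular expression engine. This Python sketch (the sample line is invented for illustration) splits a line that mixes both delimiters:

```python
import re

# RapidMiner's default column-separator expression: a comma or a
# semicolon, each optionally followed by white space.
delimiter = re.compile(r",\s*|;\s*")

line = '"alpha", "beta";"gamma",  "delta"'
fields = delimiter.split(line)
print(fields)   # ['"alpha"', '"beta"', '"gamma"', '"delta"']
```

Note that the quotation marks are still attached to each field; stripping quotes is a separate step, which the wizard handles with its "use quotes" option.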

Another example would be as follows:

:{3}

This expression will look for three consecutive colons and treat them as the delimiter.
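Again, this behaves the same in any regex engine; the sample data below is invented for illustration:

```python
import re

# Three consecutive colons act as the delimiter; single colons
# inside values are left alone.
fields = re.split(r":{3}", "time:12:30:::place:home:::count:3")
print(fields)   # ['time:12:30', 'place:home', 'count:3']
```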

In passing, it is worth mentioning that regular expressions are used very commonly within RapidMiner. They will appear at various points in this book, and where possible, an English translation will be given. In addition, refer to Chapter 10, Debugging, for a further summary of why they are so important, where they are used, and some tools to help.

Delimiter handling can be challenging if the delimiter also appears as valid text within some of the field values. This particularly affects commas because many number formats use them as thousands separators. Vigilance is needed to spot these, and sometimes the only option is to parse the contents outside the wizard, as described in the next section.
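The problem is easy to demonstrate by comparing a naive split with a quote-aware parser. This sketch uses Python's csv module as the quote-aware parser; RapidMiner's "use quotes" option plays the same role, and the sample line is invented for illustration:

```python
import csv
import io

line = '"Widget, large","1,250",EUR\n'

naive = line.strip().split(",")     # ignores the quotes entirely
print(len(naive))                   # 5 -- commas inside quotes split too

# A quote-aware parser keeps quoted commas inside their fields.
quote_aware = next(csv.reader(io.StringIO(line)))
print(quote_aware)                  # ['Widget, large', '1,250', 'EUR']
```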

Reading complete lines

When using the Read CSV wizard to explore real data, you will inevitably encounter a situation where lines are too complex to parse. In such a case, reading each complete line one by one is a sensible course of action as it allows more complex multistep processing to be performed later. In this case, the line terminators are treated as delimiters. To do this, use a regular expression that matches the carriage-return and line-feed characters:

\r\n

It is generally necessary to deselect the use of quotes to make this expression work.

The number of examples will correspond to the number of lines in the file. Some UNIX files may not use both a carriage return and a line feed as line terminators. In this case, it is necessary to use either \r or \n by itself as the regular expression.
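The effect of treating line terminators as delimiters can be sketched in plain Python; the sample content is invented and mixes all three terminator conventions:

```python
import re

# Treat line terminators as the delimiter: Windows files end lines
# with \r\n, classic Mac files with \r, and UNIX files with \n.
# \r\n must come first in the alternation so it is matched as a pair.
content = "first line\r\nsecond line\nthird line\r"
lines = [l for l in re.split(r"\r\n|\r|\n", content) if l]
print(lines)   # ['first line', 'second line', 'third line']
```

Each surviving element would become one example, with the whole line held in a single text attribute ready for later parsing.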

Reading large numbers of attributes

The wizard does a good job of inferring attribute types. But by using the first 100 examples (or whatever the configuration is set to), it can sometimes get this wrong if attributes that are beyond the first 100 have a different type. This particularly affects binominal values, real values, and dates. An incorrect guess can lead to errors when the whole file is read. For example, if the first 100 rows for a particular attribute contained two possible values, the wizard would guess that the type of the attribute is binominal. In fact, if there are more than two possible values, the attribute type should be polynominal. When the wizard encounters the unexpected value, it will—not unreasonably—show an error.

When the data to be read has hundreds of attributes, it can be tedious to use the GUI to correct any incorrect guesses. In this situation, the easiest approach is to edit the XML for the process directly.

For example, the following screenshot shows a small fragment of the XML for a Read CSV operator, which can be seen by enabling the XML view from the RapidMiner Studio GUI (this process is available as ReadCSVMetaData.xml in the files that accompany this book):

[Screenshot: XML fragment for the Read CSV operator]

The following screenshot shows how this XML is shown in the GUI view. The relationship between the GUI view and the XML should be obvious from this example:

[Screenshot: the same configuration in the GUI view]

Editing the XML is straightforward and can be done by copying and pasting it into an external editor. When the finished XML is pasted back into the RapidMiner Studio GUI, it can be validated by pressing F12 or by clicking on the relevant button in the XML view.

Generally, for very large imports, that is, those with many attributes, it makes sense to set all the attribute types to text or polynominal and then use various operators to parse, validate, and set the types of the attributes. This usually avoids errors on import and allows each attribute to be processed in a more controlled way.
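The read-as-text-first approach can be sketched in plain Python: every column comes in as a string, and each is then converted and validated in its own controlled step. The column names, sample data, and conversion rule here are all invented for illustration:

```python
import csv
import io

raw = "id;amount;when\n1;3.5;2014-01-01\n2;oops;2014-01-02\n"

# Step 1: import every field as text -- nothing can fail here.
reader = csv.DictReader(io.StringIO(raw), delimiter=";")
rows = list(reader)

# Step 2: convert one attribute at a time, trapping bad values
# instead of letting them abort the whole import.
def to_float(value):
    try:
        return float(value)
    except ValueError:
        return None          # treat the bad value as missing

amounts = [to_float(r["amount"]) for r in rows]
print(amounts)   # [3.5, None]
```

In RapidMiner, operators such as Parse Numbers play the role of step 2, one attribute at a time.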

Splitting files into smaller pieces

Processing a single large file that results in many attributes may exceed available memory, which is ultimately dictated by the computer on which RapidMiner is running. In this situation, it is sometimes possible to split a file into chunks using the capabilities of RapidMiner.

An example process that does this is shown in the following screenshot:

[Screenshot: the file-splitting process]

This process reads the entire CSV file line by line so that it can be split into chunks. No other processing is performed, thereby avoiding the overhead of creating many attributes. The following screenshot shows the inner operators within the Loop Batches operator:

[Screenshot: inner operators of the Loop Batches operator]

The Generate Macro operator increments a counter that is used to label the different files created with the Write CSV operator. Macros are initially defined in the context view. Macros are like global variables that are available to all operators and can be accessed and manipulated in numerous ways. Refer to the upcoming section Using macros for more information on macros in general.
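The batching logic can be sketched in plain Python, with an ordinary counter playing the role of the macro. To keep the sketch self-contained it collects the chunks in a dictionary keyed by invented file names; in the RapidMiner process, Write CSV writes each batch to disk instead:

```python
# A plain-Python analogue of Loop Batches: consume the input in
# fixed-size batches of lines and label each batch with a counter.
def chop_file(lines, batch_size):
    counter = 0
    chunks = {}
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            counter += 1                            # the "macro" increments
            chunks[f"chunk_{counter}.csv"] = list(batch)
            batch = []
    if batch:                                       # final, possibly short batch
        counter += 1
        chunks[f"chunk_{counter}.csv"] = batch
    return chunks

chunks = chop_file([f"row{i}" for i in range(10)], batch_size=4)
print(sorted(chunks))              # ['chunk_1.csv', 'chunk_2.csv', 'chunk_3.csv']
print(len(chunks["chunk_3.csv"]))  # 2
```

Because each batch is handled and released in turn, memory use is bounded by the batch size rather than the size of the whole file.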

The whole process titled chopFiles.xml is available from the code that accompanies this book. This file can be imported directly into RapidMiner using Import Process from the File menu. Alternatively, the XML can be pasted in its entirety into the XML view of the GUI. Pressing F12 or selecting the Validate Process option from the Process menu will check the validity of the process and load it.
