Chapter 4. Parsing and Converting Attributes

Once the data has been read, understanding and exploring it usually requires converting attributes into different formats, parsing them to extract additional information or features, and creating new attributes that represent the data in ways that yield new insights.

For example, an extremely common task is converting date and time values into a common format so they can be manipulated. Another example is extracting file names from file paths or domain names from URLs. Furthermore, combining two or more attributes with summary information from the rest of the data or from external sources to make a new attribute may help make a predictive model more powerful.

For real data, it is important to anticipate values that have not been seen before, because a model can only function accurately if such data is handled correctly.

Some operators automatically generate attributes with names reflecting the operations performed. These attributes often need to be renamed in order to be used later on because the attribute names contain characters that are interpreted as mathematical operations. The values themselves often need global search and replace operations on them as part of the imposition of a data dictionary across larger projects.

RapidMiner Studio has operators that can be used individually or together to achieve the previous objectives; they allow attributes to be created and renamed with values derived from other attributes, and they allow values to be modified in systematic ways.

Generating attributes

The Generate Attributes operator is used very frequently. It allows new attributes to be generated from other attributes, constant values, macros, and built-in functions. The way to think of this operator is to regard it as an automatic loop over all the examples in an example set. The newly generated attribute is added to all the examples. If the value of the new attribute is derived from the values of other attributes, the single value for the new attribute is taken from the values of the other attributes of the current example. Because the expression is evaluated when the operator runs, any macros it uses must already be defined before the new attributes are generated.

The simplest expression to create a new attribute is shown in the following screenshot:

[Screenshot: Generating attributes]

In the previous screenshot, a1 and a2 are the existing attributes and the new attribute created, called newAttribute, is the sum of these for the example being processed. A subtle point is that the type of the new attribute is worked out dynamically. If new test data is encountered and one of the attributes is a polynominal while the other attribute is a number, the result will be an error. If both attributes are polynominals, the result will also be a polynominal. Data import should take care of getting the type of data correct, but it is worth bearing in mind for unseen data.
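The loop-over-examples view and the dynamic typing of the result can be sketched in Python. This is only an analogy under stated assumptions, not RapidMiner's implementation: the function generate_attribute and the use of dictionaries for examples are hypothetical, and Python strings stand in for polynominal values.

```python
def generate_attribute(examples, name, expression):
    """Add attribute `name` to every example, one example at a time."""
    for example in examples:
        # The new value depends only on the current example's attributes.
        example[name] = expression(example)
    return examples


def add(example):
    # Stand-in for the function expression a1 + a2.
    return example["a1"] + example["a2"]


# Two numbers give a sum.
numeric = generate_attribute(
    [{"a1": 1, "a2": 2}, {"a1": 10, "a2": 20}], "newAttribute", add
)

# Two strings (standing in for polynominals) give a concatenation.
nominal = generate_attribute([{"a1": "x", "a2": "y"}], "newAttribute", add)

# A string mixed with a number is an error.
try:
    generate_attribute([{"a1": "x", "a2": 2}], "newAttribute", add)
    mixed_error = None
except TypeError as error:
    mixed_error = error

print([ex["newAttribute"] for ex in numeric])  # [3, 30]
print([ex["newAttribute"] for ex in nominal])  # ['xy']
print(type(mixed_error).__name__)              # TypeError
```

The same expression produces three different outcomes depending only on the types of the inputs, which is why unseen data with an unexpected type deserves attention.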

Macros can be used. For example, %{m1} + a2 will add the value of the macro m1 to the attribute a2. If the macro is not a number, the expression %{m1} + a2 would treat the value of the macro as an attribute name. For example, if the macro m1 has the value a1, the previous expression would become a1 + a2, so the end result would be the sum or concatenation of attributes a1 and a2 depending on their types. To force the macro to be treated as a string, place it in quotes. With the macro m1 equal to the string a1, the expression "%{m1}" + a2 would evaluate to "a1" + a2.

The following table summarizes these different situations:

[Table image: Generating attributes]

A process called generateAttributeExamples.xml is available with the files that accompany this book, which illustrates the previous example.
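The macro behavior described earlier is consistent with plain text substitution: the macro value is pasted into the expression before the expression is interpreted. The following Python sketch illustrates that idea only; expand_macros is a hypothetical helper and does not reproduce how RapidMiner Studio actually resolves macros.

```python
import re


def expand_macros(expression, macros):
    """Replace every %{name} in the expression with the macro's text value."""
    return re.sub(r"%\{(\w+)\}", lambda match: macros[match.group(1)], expression)


# A numeric macro becomes a constant in the expression.
numeric_case = expand_macros("%{m1} + a2", {"m1": "5"})

# A non-numeric macro ends up looking like an attribute name.
attribute_case = expand_macros("%{m1} + a2", {"m1": "a1"})

# Quoting the macro forces its value to be treated as a string.
string_case = expand_macros('"%{m1}" + a2', {"m1": "a1"})

print(numeric_case)    # 5 + a2
print(attribute_case)  # a1 + a2
print(string_case)     # "a1" + a2
```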

A large number of functions are available and there is help available from the operator description as well as the Edit Expression dialog, which is accessed by pressing the button to the right of the function expressions in the previous screenshot. How these functions work is usually obvious from the name. There are subtle points relating to some of the functions and these are described in the following sections.

Date functions

The functions in the Date group manipulate dates and can convert strings to dates and back. Some detailed examples are given in the next screenshot because dates can be a source of difficulty when encountered in real data.

The following screenshot shows some examples of date calculations within the Generate Attributes operator. For reference, the dates.xml process is available with the files that accompany this book.

[Screenshot: Date functions]

Various calculations are performed on the date provided as the first attribute. Note how the result of a previous step can be used in subsequent steps.

The result of running this is shown in the following screenshot, which shows the meta-data of the created attributes:

[Screenshot: Date functions]

Note that the strings created using the DATE_SHORT built-in format are ambiguous because the month and day have been swapped. This is caused by locales and country codes, and working around it is rarely worth the effort. The safest approach is to use the date_str_custom function, which takes an explicit format string. As a result, the strings ADate and ADateAsStringCustom are the same. The other point to note is the use of UNIX timestamps to represent dates and times. Here the UNIX time is in milliseconds rather than seconds, so it is important to remember this when performing calculations.
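Both pitfalls can be reproduced outside RapidMiner Studio. The following Python sketch is an illustration under assumptions: the sample date is arbitrary, and the format strings use Python's strftime syntax rather than the syntax expected by date_str_custom.

```python
from datetime import datetime, timedelta, timezone

# An arbitrary sample date: 6 November 2012, 22:19:29 UTC.
a_date = datetime(2012, 11, 6, 22, 19, 29, tzinfo=timezone.utc)

# Short formats depend on convention: month first or day first.
us_short = a_date.strftime("%m/%d/%y")  # 11/06/12
uk_short = a_date.strftime("%d/%m/%y")  # 06/11/12

# Reading a day-first string with a month-first pattern silently
# yields a different, but still valid, date: 11 June 2012.
misread = datetime.strptime(uk_short, "%m/%d/%y")

# An explicit custom format leaves no room for interpretation.
custom = a_date.strftime("%Y-%m-%d %H:%M:%S")

# Millisecond timestamps: one day is 86,400,000 ms, not 86,400.
millis = int(a_date.timestamp() * 1000)
one_day_millis = 24 * 60 * 60 * 1000
next_day = datetime.fromtimestamp((millis + one_day_millis) / 1000, tz=timezone.utc)

print(us_short, uk_short)
print(misread.date())
print(custom)
print(millis)
print(next_day)
```

Forgetting the factor of 1,000 moves a date by seconds instead of days, so it helps to name the unit in every attribute that holds a timestamp.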

Regular expression functions

Another useful function is finds(), which returns true if a subsequence of an attribute is matched by the provided regular expression and false otherwise. For example, if an attribute called sentence contains the string The quick brown fox jumped over the lazy dog, and the Generate Attributes operator creates a new attribute using the following function expression, the result will be true:

finds(sentence,"\\bThe\\W+(?:\\w+\\W+){1,6}?the\\b")

This regular expression determines whether between one and six words separate The from the following the. Changing the {1,6} to {1,4} changes the result to false because there are five words between The and the. Note that two backslashes are required in the string to escape the backslash required for the \w, \W, and \b special characters.
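The behavior of this pattern can be checked with any regular expression engine. The sketch below uses Python's re module as a stand-in for finds(); re.search looks for a matching subsequence anywhere in the string. Python raw strings avoid the doubled backslashes that the function expression needs.

```python
import re

sentence = "The quick brown fox jumped over the lazy dog"

# Up to six intervening words: five are present, so this matches.
within_six = re.search(r"\bThe\W+(?:\w+\W+){1,6}?the\b", sentence) is not None

# Up to four intervening words: five are present, so this fails.
within_four = re.search(r"\bThe\W+(?:\w+\W+){1,4}?the\b", sentence) is not None

print(within_six)   # True
print(within_four)  # False
```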

The use of this technique allows powerful validation processes to be created to confirm that imported data and generated attributes are valid. Sometimes, unseen test data contains values outside of what was seen during the training phase, and this can be a source of subtle, hard-to-find problems. Creating a new attribute that is true or false depending on some regular expression will allow all invalid examples to be flagged. The Filter Examples operator allows the invalid examples to be filtered and counted.
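The flag-then-filter pattern can be sketched in Python as follows. The attribute names, the sample values, and the rule that a valid date looks like YYYY-MM-DD are all assumptions made for illustration; the list comprehension plays the role of the Filter Examples operator.

```python
import re

# Hypothetical validation rule: dates must look like YYYY-MM-DD.
VALID_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

examples = [
    {"id": 1, "date": "2012-11-06"},
    {"id": 2, "date": "06/11/12"},
    {"id": 3, "date": "2012-11-07"},
    {"id": 4, "date": ""},
]

# Step 1: generate a true/false attribute from the regular expression.
for example in examples:
    example["isValid"] = VALID_DATE.search(example["date"]) is not None

# Step 2: filter the invalid examples and count them.
invalid = [example for example in examples if not example["isValid"]]

print(len(invalid))                # 2
print([ex["id"] for ex in invalid])  # [2, 4]
```

Keeping the flag as an attribute, rather than dropping rows immediately, makes it possible to report how many examples failed and why before deciding what to do with them.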

The other important use of this technique is to filter out data that is not required. This is especially important in the real world, where data volumes can be large.

Generating extracts

Other operators are available to extract data from within the value of attributes. For example, the Generate Extract operator can parse an attribute's value to extract selected data. This operator has a number of options for querying. Most typically, regular expressions are used for unstructured data, and XPath for structured data.

Regular expressions

Revisiting the finds() example used with Generate Attributes, an attribute called sentence contains the value The quick brown fox jumped over the lazy dog, and the requirement is to extract the text where the word The is within six words of the word the.

The parameters for Generate Extract are shown in the following screenshot:

[Screenshot: Regular expressions]

The regular expression needed is shown in the following screenshot:

[Screenshot: Regular expressions]

The regular expression is slightly different from that used previously. The double backslash form is not required to escape backslashes. The whole expression is enclosed in parentheses and quotes are not required around the expression.

The resulting attribute contains "The quick brown fox jumped over the".

The parentheses at the beginning and end implement a regular expression capturing group, without which the operator will fail. This is because the operator uses the first capturing group as the result to populate the result attribute.

To illustrate how this can be modified, the regular expression is changed as follows:

\bThe\W+((?:\w+\W+){1,6}?)the\b

The result is quick brown fox jumped over.
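The effect of moving the capturing group can be checked in Python, where group(1) returns the text of the first capturing group. This is a sketch using Python's re module rather than the Generate Extract operator itself. The first pattern is assumed to be the earlier finds() expression wrapped in parentheses, as the text describes.

```python
import re

sentence = "The quick brown fox jumped over the lazy dog"

# Whole expression in parentheses: group 1 is the entire match.
whole = re.search(r"(\bThe\W+(?:\w+\W+){1,6}?the\b)", sentence).group(1)

# Parentheses around the repeated part only: group 1 is the text
# between The and the, including the trailing space.
between = re.search(r"\bThe\W+((?:\w+\W+){1,6}?)the\b", sentence).group(1)

print(repr(whole))    # 'The quick brown fox jumped over the'
print(repr(between))  # 'quick brown fox jumped over '
```

The (?: ... ) form is a non-capturing group, so it does not count when working out which group is first.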

The first capturing group now captures all the words and white space between The and the. An extremely useful feature is the regular expression editor. This can be started by clicking on the button to the right of query expression, as shown in the previous screenshot.

The editor lets regular expressions be tried interactively. An example is shown in the next screenshot. The yellow highlights show the successful matches, the result preview shows what a replace operation would look like, and the result list outputs the text of the matches.

[Screenshot: Regular expressions]

Regular expressions can be daunting to learn, and this book has deliberately gone straight to a relatively complex example to show what is possible. This and other examples in this book have been inspired by various sources on the Internet, too numerous to mention (refer to Chapter 10, Debugging, for some suggestions). To get the best out of RapidMiner Studio, it is well worth getting comfortable with regular expressions.

XPath

No book on RapidMiner Studio would be complete without some discussion of XPath. This too has a learning curve but, as with regular expressions, there are plenty of examples available on the Internet. Again, this book has deliberately gone to a relatively complex first example in order to show the possibilities. Refer to Chapter 10, Debugging, for some suggestions for learning resources.

XPath works on structured XML data and an example is shown in the following code:

<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <ns2:getFolderContentsResponse xmlns:ns2="http://service.web.rapidanalytics.de/">
      <return>
        <status>0</status>
        <entries>
          <status>0</status>
          <date>1352240369000</date>
          <latestRevision>1</latestRevision>
          <location>/home/rapidanalytics/svm</location>
          <size>15561</size>
          <type>process</type>
          <user>rapidanalytics</user>
        </entries>
        <entries>
          <status>0</status>
          <date>1352332393000</date>
          <latestRevision>1</latestRevision>
          <location>/home/rapidanalytics/t1</location>
          <size>1639</size>
          <type>process</type>
          <user>rapidanalytics</user>
        </entries>
        <location>/home/rapidanalytics</location>
      </return>
    </ns2:getFolderContentsResponse>
  </soap:Body>
</soap:Envelope>

This is a Simple Object Access Protocol (SOAP) response from a RapidMiner Server instance showing the contents of a folder in its repository. To extract the timestamp of the /home/rapidanalytics/svm process, the XPath shown in the following screenshot could be used in the Generate Extract operator.

[Screenshot: XPath]

This XPath searches for the named process within all location nodes. Having found it, the search moves up one level in the document and then extracts the text within the date node. This example is more complex than it needs to be and could be reduced to //location[text()="/home/rapidanalytics/svm"]/../date/text().

However, the original illustrates the use of namespaces. These are defined in the namespace parameters and the next screenshot shows this. The values are taken from the original document.

[Screenshot: XPath]

It is also important to uncheck the assume HTML checkbox for this to function correctly. The end result is the value 1352240369000, which is a UNIX time in milliseconds for the specific process. Refer to the generateExtract.xml process in the files that accompany this book for more information.
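The same lookup can be sketched with Python's standard library. Note the assumptions: xml.etree.ElementTree supports only a subset of XPath, so the path below selects the entries node whose location child has the required text and then reads its date child, rather than using text() and the parent step. The XML is an abbreviated copy of the response shown earlier, and the prefix-to-URI mapping plays the role of the namespace parameters.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the SOAP response (some child elements omitted).
RESPONSE = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <ns2:getFolderContentsResponse xmlns:ns2="http://service.web.rapidanalytics.de/">
      <return>
        <status>0</status>
        <entries>
          <date>1352240369000</date>
          <location>/home/rapidanalytics/svm</location>
        </entries>
        <entries>
          <date>1352332393000</date>
          <location>/home/rapidanalytics/t1</location>
        </entries>
        <location>/home/rapidanalytics</location>
      </return>
    </ns2:getFolderContentsResponse>
  </soap:Body>
</soap:Envelope>"""

# Prefixes mapped to the namespace URIs taken from the document.
NAMESPACES = {
    "soap": "http://schemas.xmlsoap.org/soap/envelope/",
    "ns2": "http://service.web.rapidanalytics.de/",
}

root = ET.fromstring(RESPONSE)

# Fully qualified path, using the namespaces.
qualified = root.find(
    "soap:Body/ns2:getFolderContentsResponse/return/"
    "entries[location='/home/rapidanalytics/svm']/date",
    NAMESPACES,
).text

# Shorter form that searches all descendants instead.
short = root.find(".//entries[location='/home/rapidanalytics/svm']/date").text

print(qualified)  # 1352240369000
print(short)      # 1352240369000
```

The return, entries, date, and location elements carry no namespace prefix, which is why only the outer two steps of the qualified path need one.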
