Having read the data, in order to understand and explore it more effectively, it is usually necessary to convert attributes into different formats, to parse them to extract additional information or features, and to create additional attributes that represent the data in new ways for new insights.
For example, an extremely common task is converting date and time values into a common format so that they can be manipulated. Another is extracting file names from file paths or domain names from URLs. Furthermore, combining two or more attributes with summary information from the rest of the data, or from external sources, to make a new attribute may make a predictive model more powerful.
With real data, it is also important to think about unseen data that could be encountered later, because such data must be handled correctly for models to function accurately.
Some operators automatically generate attributes with names reflecting the operations performed. These attributes often need to be renamed in order to be used later on because the attribute names contain characters that are interpreted as mathematical operations. The values themselves often need global search and replace operations on them as part of the imposition of a data dictionary across larger projects.
RapidMiner Studio has operators that can be used individually or together to achieve the previous objectives; they allow attributes to be created and renamed with values derived from other attributes, and they allow values to be modified in systematic ways.
The Generate Attributes operator is used very frequently. It allows new attributes to be generated from other attributes, constant values, macros, and built-in functions. A good way to think of this operator is as an automatic loop over all the examples in an example set: the newly generated attribute is added to every example, and if its value is derived from other attributes, it is computed from the values of those attributes in the current example. Note that if macros are to be used, they must be defined before the generation of new attributes.
The simplest expression to create a new attribute is shown in the following screenshot:
In the previous screenshot, a1 and a2 are the existing attributes and the new attribute created, called newAttribute, is the sum of these for the example being processed. A subtle point is that the type of the new attribute is worked out dynamically. If new test data is encountered and one of the attributes is a polynominal while the other attribute is a number, the result will be an error. If both attributes are polynominals, the result will also be a polynominal. Data import should take care of getting the type of data correct, but it is worth bearing in mind for unseen data.
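For intuition, the implicit loop can be sketched outside RapidMiner. In this hypothetical Python analogue, each dictionary plays the part of an example, the attribute names a1, a2, and newAttribute follow the screenshot, and the data values are invented:

```python
# Each example (row) receives the new attribute, computed only from
# that example's own values -- Generate Attributes' implicit loop.
examples = [{"a1": 1.0, "a2": 2.5},
            {"a1": 3.0, "a2": 4.0}]

for example in examples:
    example["newAttribute"] = example["a1"] + example["a2"]

print([e["newAttribute"] for e in examples])  # [3.5, 7.0]
```

As in RapidMiner, the type of the result follows from the types of the operands: adding two numbers yields a number, while applying + to two strings would concatenate them instead.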
Macros can be used. For example, %{m1} + a2 will add the value of the macro m1 to the attribute a2. If the macro is not a number, the expression %{m1} + a2 would treat the value of the macro as an attribute name. For example, if the macro m1 has the value a1, the previous expression would become a1 + a2, so the end result would be the sum or concatenation of attributes a1 and a2, depending on their types. To force the macro to be treated as a string, place it in quotes. With the macro m1 equal to the string a1, the expression "%{m1}" + a2 would evaluate to "a1" + a2.
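The quoting rules can be mimicked in Python to make the distinction concrete. This is a sketch only: the attribute values are invented, and the variable m1 stands in for the macro:

```python
# Illustrative example set row; the values are invented.
example = {"a1": "foo", "a2": "bar"}
m1 = "a1"  # the macro's value happens to name an attribute

# Unquoted: %{m1} + a2 expands to a1 + a2, so the VALUE of a1 is used.
print(example[m1] + example["a2"])  # foobar

# Quoted: "%{m1}" + a2 expands to "a1" + a2, a literal concatenation.
print(m1 + example["a2"])           # a1bar
```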
The following table summarizes these different situations:
A process called generateAttributeExamples.xml, which illustrates the previous example, is available with the files that accompany this book.
A large number of functions are available, and help can be found in the operator description as well as in the Edit Expression dialog, which is accessed by pressing the button to the right of the function expressions in the previous screenshot. How these functions work is usually obvious from their names, but there are subtle points relating to some of them, and these are described in the following sections.
The functions in the Date group manipulate dates and can convert strings to dates and back. Some detailed examples are given in the next screenshot because dates can be a source of difficulty when encountered in real data.
The following screenshot shows some examples of date calculations within the Generate Attributes operator. For reference, the dates.xml process is available with the files that accompany this book.
Various calculations are performed on the date provided as the first attribute. Note how the result of a previous step can be used in subsequent steps.
The result of running this is shown in the following screenshot, which shows the meta-data of the created attributes:
Note that the strings created using the DATE_SHORT built-in format are ambiguous because the month and day have been swapped. This is caused by locales and country codes, and life is usually too short to work out how to get around it, so the safest approach is to use the date_str_custom function, which takes an explicit format string. The strings ADate and ADateAsStringCustom are the same as a result. The other point to note is the use of UNIX timestamps to represent dates and times. The UNIX time here is in fact in milliseconds, so it is important to remember this when performing calculations.
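Both pitfalls can be reproduced in plain Python as a sanity check. The sample date below is invented; the millisecond timestamp is the one that appears in the XPath example later in this chapter:

```python
from datetime import datetime, timezone

# Parse and format with explicit patterns (the analogue of using
# date_str_custom), avoiding locale-dependent short formats that
# can silently swap day and month.
a_date = datetime.strptime("2013-07-04 11:30:00", "%Y-%m-%d %H:%M:%S")
print(a_date.strftime("%d/%m/%Y"))  # 04/07/2013

# These UNIX times are in milliseconds, so divide by 1000 before
# using standard epoch-second APIs.
millis = 1352240369000
when = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
print(when.isoformat())  # 2012-11-06T22:19:29+00:00
```

Forgetting the division by 1000 would place the date tens of thousands of years in the future, which is a common symptom of mixing up second- and millisecond-based timestamps.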
Another useful function is finds(), which returns true or false depending on whether a subsequence of an attribute is matched by the provided regular expression. For example, if an attribute called sentence contains the string The quick brown fox jumped over the lazy dog, and the Generate Attributes operator creates a new attribute using the following function expression, the result will be true:
finds(sentence,"\bThe\W+(?:\w+\W+){1,6}?the\b")
This regular expression determines whether The is between one and six words from the. Changing the {1,6} to {1,4} changes the result to false because there are five words between The and the. Note that, in the expression string, two backslashes are required to escape the backslash needed for \w, \W, and other special characters.
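The same regular expression can be tried outside RapidMiner. In Python, for instance, raw strings remove the need for the doubled backslashes; this sketch uses the sentence from the text:

```python
import re

sentence = "The quick brown fox jumped over the lazy dog"

# "The", then one to six whole words, then "the" -- the pattern
# passed to finds() above.
pattern = r"\bThe\W+(?:\w+\W+){1,6}?the\b"
print(bool(re.search(pattern, sentence)))   # True

# Tightening {1,6} to {1,4} fails to match, because five words
# ("quick brown fox jumped over") separate "The" and "the".
tighter = r"\bThe\W+(?:\w+\W+){1,4}?the\b"
print(bool(re.search(tighter, sentence)))   # False
```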
The use of this technique allows powerful validation processes to be created to confirm that imported data and generated attributes are valid. Sometimes, unseen test data contains values outside of what was seen during training, and this can be a source of subtle, hard-to-find problems. Creating a new attribute that is true or false depending on some regular expression allows all invalid examples to be flagged, and the Filter Examples operator then allows them to be filtered and counted.
The other important use of this technique is to filter out data that is not required. This is especially important in the real world, where data volumes can be large.
Other operators are available to extract data from within the value of attributes. For example, the Generate Extract operator can parse an attribute's value to extract selected data. This operator has a number of options for querying; most typically, regular expressions are used for unstructured data and XPath for structured data.
Revisiting the finds() function example from Generate Attributes, an attribute called sentence contains the value The quick brown fox jumped over the lazy dog, and the requirement is to extract the text where the word The is within six words of the word the.
The parameters for Generate Extract are shown in the following screenshot:
The regular expression needed is shown in the following screenshot:
The regular expression is slightly different from that used previously. The double backslash form is not required to escape backslashes. The whole expression is enclosed in parentheses and quotes are not required around the expression.
The resulting attribute contains "The quick brown fox jumped over the".
The parentheses at the beginning and end form a regular expression capturing group, without which the operator will fail, because the operator uses the first capturing group to populate the result attribute.
To illustrate how this can be modified, the regular expression is changed as follows:
The\W+((?:\w+\W+){1,6}?)the
The result is quick brown fox jumped over.
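The effect of moving the capturing group can be checked with Python's re module, which, like Generate Extract, reports the contents of group 1:

```python
import re

sentence = "The quick brown fox jumped over the lazy dog"

# Group around the whole match: group 1 is the full extracted phrase.
outer = re.search(r"(The\W+(?:\w+\W+){1,6}?the)\b", sentence)
print(outer.group(1))           # The quick brown fox jumped over the

# Group moved inward: only the words between "The" and "the" are
# captured (with a trailing separator from \W+, stripped here).
inner = re.search(r"The\W+((?:\w+\W+){1,6}?)the\b", sentence)
print(inner.group(1).strip())   # quick brown fox jumped over
```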
The first capturing group now captures all the words and white space between The and the. An extremely useful feature is the regular expression editor, which can be started by clicking on the button to the right of the query expression, as shown in the previous screenshot.
The editor lets regular expressions be tried interactively. An example is shown in the next screenshot. The yellow highlights show the successful matches, the result preview shows what a replace operation would look like, and the result list outputs the text of the matches.
Regular expressions can be daunting to learn, and this book has deliberately gone straight to a relatively complex example to show what is possible. This and other examples in this book have been inspired by various sources on the Internet, too numerous to mention (refer to Chapter 10, Debugging, for some suggestions). To get the best out of RapidMiner Studio, it is well worth getting comfortable with regular expressions.
No book on RapidMiner Studio would be complete without some discussion of XPath. This too has a learning curve but, as with regular expressions, there are plenty of examples available on the Internet. Again, this book has deliberately gone to a relatively complex first example in order to show the possibilities. Refer to Chapter 10, Debugging, for some suggestions for learning resources.
XPath works on structured XML data and an example is shown in the following code:
<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <ns2:getFolderContentsResponse xmlns:ns2="http://service.web.rapidanalytics.de/">
      <return>
        <status>0</status>
        <entries>
          <status>0</status>
          <date>1352240369000</date>
          <latestRevision>1</latestRevision>
          <location>/home/rapidanalytics/svm</location>
          <size>15561</size>
          <type>process</type>
          <user>rapidanalytics</user>
        </entries>
        <entries>
          <status>0</status>
          <date>1352332393000</date>
          <latestRevision>1</latestRevision>
          <location>/home/rapidanalytics/t1</location>
          <size>1639</size>
          <type>process</type>
          <user>rapidanalytics</user>
        </entries>
        <location>/home/rapidanalytics</location>
      </return>
    </ns2:getFolderContentsResponse>
  </soap:Body>
</soap:Envelope>
This is a Simple Object Access Protocol (SOAP) response from a RapidMiner Server instance showing the contents of a folder in its repository. To extract the timestamp of the /home/rapidanalytics/svm process, the XPath shown in the following screenshot could be used in the Generate Extract operator.
This XPath searches for the named process within all location nodes. Having found it, the search moves up one level in the document and then extracts the text within the date node. This example is more complex than it needs to be and could be reduced to //location[text()="/home/rapidanalytics/svm"]/../date/text().
However, the original illustrates the use of namespaces. These are defined in the namespace parameters, as the next screenshot shows; the values are taken from the original document.
It is also important to uncheck the assume HTML checkbox for this to function correctly. The end result is the value 1352240369000, which is a UNIX time in milliseconds for the specific process. Refer to the generateExtract.xml process in the files that accompany this book for more information.
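As a cross-check of the extraction logic (not of the operator itself), the same lookup can be sketched with Python's standard library. ElementTree's limited XPath subset does not fully support the parent axis used above, so this sketch simply iterates over the entries elements; the XML is condensed from the SOAP response shown earlier:

```python
import xml.etree.ElementTree as ET

soap = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <ns2:getFolderContentsResponse xmlns:ns2="http://service.web.rapidanalytics.de/">
      <return>
        <entries>
          <date>1352240369000</date>
          <location>/home/rapidanalytics/svm</location>
        </entries>
        <entries>
          <date>1352332393000</date>
          <location>/home/rapidanalytics/t1</location>
        </entries>
      </return>
    </ns2:getFolderContentsResponse>
  </soap:Body>
</soap:Envelope>"""

root = ET.fromstring(soap)
# Find the <entries> element whose <location> names the process, then
# read the sibling <date> -- the same navigation the XPath performs.
for entries in root.iter("entries"):
    if entries.findtext("location") == "/home/rapidanalytics/svm":
        print(entries.findtext("date"))  # 1352240369000
```

Note that the entries, location, and date elements carry no namespace prefix in the document, which is why the lookup needs no namespace map, while the soap and ns2 prefixes on the envelope would.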