Chapter 7. Transforming Data

Transforming data can often make it easier to mine. This chapter discusses some common data transformations whose objective is to present data as example sets, that is, collections of examples that each have their own attributes. Typically, one of these attributes is predicted from the values of the others.

Once the concepts of the example and the example set are clear, you will find that data mining becomes a lot easier. When looking at new data for the first time, you will find yourself mentally transforming it into an example-set-like format. It is also good discipline to produce all new data in such a format so that future data exploration and mining activities are easier.

However, some learning and practice are required to become confident with the data transformation operators. This chapter covers some detailed examples that show you how to make example sets from typical real data.

The first step is to revisit attribute creation to show how to create new attributes based on other attributes in the same example as well as other values from the example set as a whole.

Next, we will look at aggregation. This is a way of summarizing data that is also useful for combining fragments of a transaction into a single example.

Pivoting and de-pivoting are then considered. Pivoting is useful when data is available as rows of name-value pairs that should be considered together in single examples. De-pivoting is useful when multiple attributes contain an implied additional dimension that is better represented as a separate dimension.

Finally, windowing is covered. This is useful when consecutive examples representing a time series need to be converted into single examples representing a time period within the series.

Creating new attributes

We have already covered attribute generation in a previous chapter, but there are some additional techniques to allow new attributes to be generated from attributes in the same example as well as from values in other examples.

Consider the simple example set shown in the following screenshot (see the process readFruitAndVeg.xml in the files that accompany this book to recreate this):

The previous screenshot shows the count of each item in a number of transactions. Apple and banana are fruits, and carrot and daikon are vegetables. To calculate the total number of fruits and the total number of vegetables, we can use the Generate Aggregation operator.

The following screenshot shows typical parameters that could be used to generate such a set of totals (refer to the process manipulateFruitAndVeg.xml to see this):

The attribute name field is set to the name of the attribute to be created. The attribute filter type parameter is set to regular_expression, and in this case the regular expression is apple|banana, which selects the two items of fruit in the example set. The aggregation function parameter is set to sum in order to add the attribute values for apple and banana together. The keep all check box is cleared, which has the effect of deleting the apple and banana attributes from the example set. A second operator is needed to do the equivalent for vegetables, and in this case the regular expression is carrot|daikon. After applying these operators, the example set appears as shown in the following screenshot:

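
For readers who prefer to think in code, the effect of the two Generate Aggregation operators can be sketched outside RapidMiner. The following Python is only an illustration under stated assumptions: the transaction counts are made up (they are not the values from readFruitAndVeg.xml), the helper name aggregate_by_regex is hypothetical, and matching the whole attribute name against the regular expression is an assumption about how the filter behaves. It sums the attributes selected by a regular expression and, like a cleared keep all check box, drops the originals.

```python
import re


def aggregate_by_regex(examples, new_name, pattern, keep_all=False):
    """Sum the attributes whose names match `pattern` into `new_name`.

    Hypothetical helper: it only imitates the behaviour described in the
    text and is not part of RapidMiner.
    """
    regex = re.compile(pattern)
    result = []
    for example in examples:
        # Assumption: the whole attribute name must match the expression.
        matched = [name for name in example if regex.fullmatch(name)]
        # With keep_all=False the matched attributes are removed.
        new_example = {
            name: value
            for name, value in example.items()
            if keep_all or name not in matched
        }
        new_example[new_name] = sum(example[name] for name in matched)
        result.append(new_example)
    return result


# Made-up transaction counts, for illustration only.
transactions = [
    {"apple": 1, "banana": 0, "carrot": 2, "daikon": 1},
    {"apple": 0, "banana": 3, "carrot": 0, "daikon": 0},
]

# One call per operator: first the fruits, then the vegetables.
totals = aggregate_by_regex(transactions, "fruit", "apple|banana")
totals = aggregate_by_regex(totals, "vegetable", "carrot|daikon")
print(totals)
```

Each call plays the role of one operator in the process, which is why two calls are needed to produce both the fruit and vegetable totals.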

To obtain the mean and standard deviation for the fruit and vegetable attributes, the Extract Macro operator can be used. This operator allows various summary statistics about the example set to be placed into a macro. Note that the sample standard deviation is calculated, not the population standard deviation. For example, the following screenshot shows the parameters needed for the Extract Macro operator to determine the average for the fruit attribute and place the value into a macro called averageFruit.

The macro name (the macro parameter in the screenshot we just saw) is set to averageFruit, and macro type is set to statistics from the drop-down list (a number of other options are available, and the interested reader is encouraged to experiment with them). The statistics option is set to average, and the fruit attribute must be chosen so that the average is calculated for it.

To calculate the standard deviation of the fruit attribute, a second Extract Macro operator is needed with macro set to standardDeviationFruit and the statistics parameter set to deviation. Two more operators are needed for the same calculations on the vegetable attribute (refer to the manipulateFruitAndVeg.xml process to see these).
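
The difference between the sample and population standard deviations is easy to see in a few lines of Python. This is a minimal sketch rather than a reproduction of the process: the fruit totals are made up, and the macros dictionary is just a stand-in for the values that the two Extract Macro operators would store.

```python
import statistics

# Made-up fruit totals, one per example, for illustration only.
fruit = [1, 3, 0, 2, 4]

# Stand-in for the macros set by the two Extract Macro operators.
macros = {
    "averageFruit": statistics.mean(fruit),
    # Sample standard deviation: the sum of squares is divided by n - 1.
    "standardDeviationFruit": statistics.stdev(fruit),
}

# Population standard deviation, shown only for comparison: divides by n.
population_sd = statistics.pstdev(fruit)

print(macros)
print(population_sd)
```

For the same data, the sample value is always the larger of the two, so it matters which one is used when checking results by hand.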

Once this is done, the Generate Attributes operator can be used to calculate the z-score, that is, the number of standard deviations an attribute value is away from the mean. This is shown in the following screenshot:

The macros holding the average and the standard deviation are used together with the attribute's name to calculate the value of the new attribute for each example.
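
The same calculation can be sketched in Python. The exact expression entered in Generate Attributes is not reproduced here, so this assumes the usual z-score definition, (value - mean) / standard deviation, with the sample standard deviation. The data, the helper name add_z_score, and the new attribute name fruitZ are all hypothetical.

```python
import statistics


def add_z_score(examples, attribute, new_name):
    """Return copies of the examples with a z-score attribute added.

    Hypothetical helper: assumes z = (value - mean) / sample standard
    deviation, computed over the whole example set.
    """
    values = [example[attribute] for example in examples]
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [
        {**example, new_name: (example[attribute] - mean) / sd}
        for example in examples
    ]


# Made-up fruit totals, for illustration only.
examples = [{"fruit": 1}, {"fruit": 3}, {"fruit": 0}, {"fruit": 2}, {"fruit": 4}]

scored = add_z_score(examples, "fruit", "fruitZ")
for example in scored:
    print(example)
```

The mean and standard deviation are computed once for the whole example set and then reused for every example, which is the role the macros play in the process.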

The final example set is shown in the following screenshot:

Checking this manually, we find that the average of the fruit column is 1.364 and its sample standard deviation is 1.362. For a value of 1, the z-score is (1 - 1.364) / 1.362 = -0.267, which agrees with the numbers in the screenshot.
