Removing attributes

Three different techniques for removing attributes are illustrated in the following sections:

  • Removing useless attributes, which employs simple statistical techniques.
  • Weighting, which determines how much influence or weight an individual attribute has on the label. The assumption in this case is that the data is being used for a classification problem and that removing attributes will speed up the modeling process, possibly at the cost of some accuracy.
  • Model-based, which uses a classification model to determine the attributes that are most predictive of the label. As with weighting, the assumption is that the data is being used for classification.

Removing useless attributes

The Remove Useless Attributes operator is well named, but it is worth understanding how it works to ensure that useful attributes are not accidentally removed.

The following screenshot shows the Statistics View for the first few attributes of a document vector containing 24,176 attributes (refer to the process, reduceLargeDocumentVector.xml).

[Screenshot: Statistics View for the first few attributes of the document vector]

Each attribute is a real number, and the average and standard deviation over the 20 examples in the example set are shown in the previous screenshot. The numerical min deviation parameter of the Remove Useless Attributes operator causes an attribute to be removed if its standard deviation is less than or equal to the parameter. For example, in the previous screenshot, if the parameter is set to 0.5, then of the first six attributes, aa, abaht, and abandoning would be removed, while aback, abandon, and abandoned would not.

The result of applying this to the example set is shown in the following screenshot:

[Screenshot: Statistics View after applying Remove Useless Attributes]

The total number of attributes has been reduced to 10,467. The rationale behind this approach is that an attribute with a smaller standard deviation relative to the other attributes has less variation and is therefore likely to be less influential if the data is used for classification. This approach is likely to be invalid if the attributes are not normalized. This can be understood by considering two attributes. The first has a range between 0 and 1, and the second, a copy of the first, is scaled by a factor of 1,000. The standard deviation of the scaled attribute will be 1,000 times larger than that of the original, so the Remove Useless Attributes operator will keep the scaled version in preference to the original. Extending this to different attributes, we can see that attributes with larger ranges will have larger absolute standard deviations and will consequently not be marked as useless.

One important point is that if the standard deviation of an attribute is 0, there is no variation and the attribute has the same value for all the examples. In this case, the attribute has no impact on the result of a classification, adds nothing to the understanding of differences between examples, and can safely be deleted. This is the default setting for the operator.
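To make the numeric rule concrete, here is a minimal Python sketch of the same idea (this is an illustration, not RapidMiner's implementation); the column names and data are invented, and, as discussed above, it assumes the attributes are on comparable scales:

    import pandas as pd

    def remove_useless_numeric(df, min_deviation=0.0):
        """Drop numeric columns whose standard deviation is <= min_deviation.

        With the default of 0.0, only constant columns are removed, which
        mirrors the operator's default behavior described above.
        """
        stds = df.std(numeric_only=True)
        return df.drop(columns=stds[stds <= min_deviation].index)

    # Invented data: 'abaht' barely varies, 'constant' not at all.
    data = pd.DataFrame({
        "aback": [0.0, 1.2, 0.4, 2.0],
        "abaht": [0.10, 0.11, 0.10, 0.12],
        "constant": [1.0, 1.0, 1.0, 1.0],
    })
    print(remove_useless_numeric(data, min_deviation=0.5).columns.tolist())
    # ['aback'] -- the other two fall at or below the 0.5 threshold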

When the data contains nominal attributes, the nominal useless above and nominal useless below parameters can be used. For each attribute, the operator determines the proportion of the most frequent nominal value. If this proportion is greater than or equal to the nominal useless above parameter, the attribute is deleted. This makes it possible to delete attributes with one dominant nominal value, since these are less likely to be predictive. When this parameter is set to 1 (the default), attributes that have only a single nominal value are deleted.

The nominal useless below parameter allows attributes with too many nominal values to be deleted. If the most common nominal value is present in a smaller proportion of examples than the parameter, it is likely that the different nominal values are too numerous. For example, if there are 100 examples and one particular attribute has 100 different nominal values, the proportion of the most common nominal value will be 0.01. Setting the nominal useless below parameter to this value will cause the attribute to be deleted. The nominal remove id like parameter is a shortcut for this.
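The nominal case can be sketched in the same way. The following Python fragment is again an illustration rather than the operator's actual code, and the handling of the threshold boundaries is an approximation based on the description above:

    import pandas as pd

    def remove_useless_nominal(df, useless_above=1.0, useless_below=0.0):
        """Drop nominal columns based on the share of their most frequent value."""
        to_drop = []
        for col in df.select_dtypes(include="object"):
            top_share = df[col].value_counts(normalize=True).iloc[0]
            # Dominated by one value, or spread over too many values.
            if top_share >= useless_above or top_share <= useless_below:
                to_drop.append(col)
        return df.drop(columns=to_drop)

    # Invented data: 'id' differs in every row; 'flag' has a single value.
    data = pd.DataFrame({
        "id": ["a", "b", "c", "d"],
        "flag": ["y", "y", "y", "y"],
        "colour": ["red", "red", "blue", "red"],
    })
    print(remove_useless_nominal(data, useless_above=1.0,
                                 useless_below=0.25).columns.tolist())
    # ['colour'] -- 'flag' is constant and 'id' behaves like an ID attribute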

Generally speaking, this operator can be tricky to use because it is hard to know the impact of setting the various thresholds incorrectly; a feedback loop would be required to check the effect. Nonetheless, the default behavior is very useful, and the ability to quickly remove attributes that do not vary greatly can also be a useful way of understanding the data.

Weighting attributes

When building classifiers or performing unsupervised clustering, the number of attributes can profoundly affect the processing time required. Weighting is a technique that ranks attributes by how much influence they have on the label or by how strongly they correlate with the principal components that explain the variation within the data. The attributes with the most weight can then be retained, and the effect on model accuracy can be measured and balanced against processing time. In some cases, eliminating low-weight attributes can even improve classification performance.

In addition, it can be very difficult to see which attributes are predictive of the class label when building classifiers, especially if there are thousands of attributes in the example set. Some classifiers, such as the various types of decision trees or rule induction, produce a model that can be read by a person. Many others do not, and weighting makes it possible to eliminate attributes that appear unimportant while measuring their effect on model accuracy. The more predictive attributes can then be seen, giving a domain expert the opportunity to focus on them as part of the route to a greater understanding of the data.

Refer to weightLargeLabelledDocumentVector.xml for an initial example process that performs a simple weighting using correlation, as well as a Select by Weight operator (explained later) to reduce the number of attributes.

There are a number of different weighting algorithms, and the precise details of how they work are beyond the scope of this book. It is important, however, to understand how much time the operators need, because some are considerably quicker than others.

Of the most commonly used weighting methods, weighting by correlation and chi-squared statistics is usually the quickest. Weighting by information gain and information gain ratio are slightly slower, and weighting by Principal Component Analysis (PCA) is usually the slowest.

This is illustrated in the following screenshot, which shows the time required for the Weight by Information Gain operator to run as the number of attributes is varied (the number of examples is fixed at 20; the process and method are described in Chapter 9, Resource Constraints). The bands on the graph represent the minimum and maximum times over multiple runs. In general, performance measurements vary between runs as a result of differences in the computing environment, so it is important to gather enough results to ensure a degree of statistical significance.
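As an aside, the measurement approach itself is easy to reproduce. The following Python sketch times an arbitrary function over several runs and keeps the minimum and maximum, which is all that is needed to draw such bands; the workload being timed here is just a stand-in for a real weighting run:

    import time

    def time_runs(fn, runs=5):
        """Run fn several times and return (min, max) runtime in milliseconds."""
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            fn()
            timings.append((time.perf_counter() - start) * 1000)
        return min(timings), max(timings)

    # Stand-in workload; replace with the operation being measured.
    fastest, slowest = time_runs(lambda: sum(i * i for i in range(100000)))
    print(f"min {fastest:.1f} ms, max {slowest:.1f} ms")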

[Graph: runtime of Weight by Information Gain against the number of attributes]

The results are plotted on a log-log plot and show that 10,000 attributes (10 to the power 4) require about 200,000 ms (10 to the power 5.3) to process, which is about 3 minutes and 20 seconds. This would obviously differ if a different computer was used or the number of examples changed.

Given the straight line, we can estimate the time needed for a larger number of attributes. When the number of attributes is 100,000, the estimated time would be about 7 hours. At 1 million attributes (assuming we could even process this number with the resources available), the time would be of the order of 1 month.
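The extrapolation is simply a straight-line fit in log-log space. The following Python sketch reproduces the arithmetic using the two figures quoted above (200,000 ms at 10,000 attributes and roughly 7 hours at 100,000); these are readings taken from the text rather than the raw data:

    import numpy as np

    # The two timings quoted above, as (attribute count, milliseconds).
    n = np.array([1e4, 1e5])
    t = np.array([2e5, 7 * 3.6e6])  # 7 hours in ms

    # A straight line on a log-log plot means time = 10**c * n**m.
    m, c = np.polyfit(np.log10(n), np.log10(t), 1)

    def predict_ms(attributes):
        return 10 ** (m * np.log10(attributes) + c)

    days = predict_ms(1e6) / (1000 * 60 * 60 * 24)
    print(f"1,000,000 attributes -> roughly {days:.0f} days")  # of the order of a month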

A comparison between three different weighting methods is shown in the following graph, which shows the number of attributes along the x axis and the time in milliseconds along the y axis.

[Graph: runtimes for three weighting methods against the number of attributes]

As seen in the previous screenshot, the correlation method is the most rapid and PCA is the slowest. Projecting the graphs forward, it is possible to infer that the PCA approach would require 11 days for 10,000 attributes, 65 years for 100,000, and 141,000 years for 1 million. Clearly, PCA must be handled with care because it will quickly become difficult to use for fairly normal-sized example sets.

Once a set of weights has been produced, it can be used to select attributes through the Select by Weight operator. This operator requires a weight relation to be chosen by the user, and it uses this to select attributes within the example set. The possible values for the weight relation are given as follows (this text is taken from the online help in the RapidMiner GUI):

  • less_equals: Attributes with weights equal to or less than the weight parameter are selected
  • less: Attributes with weights less than the weight parameter are selected
  • top_k: The k attributes with the highest weights are selected
  • bottom_k: The k attributes with the lowest weights are selected
  • all_but_top_k: All attributes other than the k attributes with the highest weights are selected
  • all_but_bottom_k: All attributes other than the k attributes with the lowest weights are selected
  • top_p%: The top p percent attributes with the highest weights are selected
  • bottom_p%: The bottom p percent attributes with the lowest weights are selected

The typical method of using this operator when building classifiers is to choose the top k attributes, where k is a small number, and then to investigate how accuracy is affected as k is varied.
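scikit-learn can serve as an analogue for this workflow. The sketch below weights attributes with an ANOVA F-score (a stand-in for RapidMiner's correlation weighting), keeps the top k, and measures cross-validated accuracy as k varies; the generated data is a placeholder for a real example set:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import make_pipeline

    # Generated placeholder data: 200 examples with 60 attributes.
    X, y = make_classification(n_samples=200, n_features=60, random_state=0)

    # Weight the attributes, keep the top k, and watch the accuracy as k
    # varies, in the spirit of weighting followed by Select by Weight (top_k).
    for k in (5, 10, 20, 40, 60):
        model = make_pipeline(SelectKBest(f_classif, k=k), GaussianNB())
        score = cross_val_score(model, X, y, cv=3).mean()
        print(f"top {k:2d} attributes: accuracy {score:.3f}")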

Selecting by weight is also useful for eliminating attributes from the test data that were not present in the training data used to build a model. The basic approach is to use the Data to Weights operator on the training data to create a set of weights, with a weight of 1 for every attribute. These weights can then be used with the Select by Weight operator to eliminate any new attributes that happen to find their way into the data mining operation.
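In pandas terms, the same trick amounts to remembering the training columns and restricting the test data to them. This sketch, with invented column names, captures the idea:

    import pandas as pd

    # Invented example sets; 'gamma' appears only in the test data.
    train = pd.DataFrame({"alpha": [1, 2], "beta": [3, 4]})
    test = pd.DataFrame({"alpha": [5], "beta": [6], "gamma": [7]})

    # Data to Weights effectively records the training attributes (weight 1
    # each); Select by Weight then keeps only those columns in the test data.
    train_attributes = train.columns        # the stand-in for the weight set
    aligned_test = test[train_attributes]   # 'gamma' is dropped
    print(aligned_test.columns.tolist())    # ['alpha', 'beta']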

Selecting attributes using models

Weighting by the PCA approach, mentioned previously, is an example where the combination of attributes within an example drives the generation of the principal components, and the correlation of an attribute with these components gives the attribute's weight.

When building classifiers, it is logical to take this a stage further and use the potential model itself to determine whether the addition or removal of an attribute makes for better predictions. RapidMiner provides a number of operators to facilitate this, and the following sections describe one of them in detail, with the intention of showing how applicable the techniques are to other similar operators. The operator that will be explained in detail is Forward Selection. It is similar to a number of others in the Optimization group within the Attribute selection and Data transformation section of the RapidMiner GUI operator tree, including Backward Elimination and a number of Optimize Selection operators. The techniques illustrated are transferable to these other operators.

A process that uses Forward Selection is shown in the next screenshot. This process is optimize.xml and is available with the files that accompany this book.

[Screenshot: the optimize.xml process]

The Retrieve operator (labeled 1) simply retrieves the sonar data from the local sample repository. This data has 208 examples and 60 regular attributes named attribute_1 to attribute_60. The label is named class and has two values, Rock and Mine.

The Forward Selection operator (labeled 2) tests the performance of a model on examples containing more and more attributes. The inner operators within this operator perform this testing.

The Log to Data operator (labeled 3) creates an example set from the log entries that were written inside the Forward Selection operator. Example sets are easier to process and store in the repository.

The Guess Types operator (labeled 4) changes the types of attributes based on their contents. This is simply a cosmetic step to change real numbers into integers to make plotting them look better.

Now, let's return to the Forward Selection operator, which starts by invoking its inner operators to check the model performance using each of the 60 regular attributes individually. This means it runs 60 times. The attribute that gives the best performance is then retained, and the process is repeated with two attributes: the best attribute from the first run paired with each of the remaining 59. The best pair is then retained, and the process is repeated with three attributes using each of the remaining 58. This continues until the stopping conditions are met. For illustrative purposes, the parameters shown in the following screenshot are chosen to allow it to continue for 60 iterations and use all 60 attributes.

[Screenshot: Forward Selection operator parameters]

The inner operator within Forward Selection is a simple cross-validation with the number of folds set to three. Using cross-validation ensures that the performance is an estimate of what would be achieved on unseen data. Some overfitting will inevitably occur, and setting the number of folds as low as three is likely to increase it. However, this process is for illustrative purposes and needs to run reasonably quickly, and a low fold count facilitates this.
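For reference, the greedy loop that Forward Selection performs can be sketched in a few lines of Python, here with a three-fold cross-validated Naïve Bayes as the inner learner, matching the process described below. The data is synthetic rather than the sonar data set, and only ten rounds are run to keep it quick:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    # Synthetic stand-in for the sonar data: 208 examples, 60 attributes.
    X, y = make_classification(n_samples=208, n_features=60, random_state=0)

    selected, remaining = [], list(range(X.shape[1]))
    history = []

    # Greedy forward selection: each round adds whichever remaining attribute
    # most improves the 3-fold cross-validated accuracy.
    for _ in range(10):  # ten rounds here; the process in the text runs all 60
        best_score, best_attr = -1.0, None
        for attr in remaining:
            cols = selected + [attr]
            score = cross_val_score(GaussianNB(), X[:, cols], y, cv=3).mean()
            if score > best_score:
                best_score, best_attr = score, attr
        selected.append(best_attr)
        remaining.remove(best_attr)
        history.append((len(selected), best_score))

    for n_attrs, score in history:
        print(f"{n_attrs:2d} attributes: accuracy {score:.3f}")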

Inside the Validation operator itself, there are operators to generate a model, calculate performance, and log data. These are shown in the following screenshot:

[Screenshot: operators inside the Validation operator]

The Naïve Bayes operator is a simple model that does not require a large runtime to complete. Within the Validation operator, it runs on different training partitions of the data. The Apply Model and Performance operators check the performance of the model using the test partitions. The Log operator outputs information each time it is called, and the following screenshot shows the details of what it logs.

[Screenshot: Log operator parameters]

Running the process gives the log output as shown in the following screenshot:

[Screenshot: log output from the process]

It is worth understanding this output because it gives a good overview of how the operators work and fit together in a process. For example, the applyCountPerformance, applyCountValidation, and applyCountForwardSelection attributes increment by one each time the respective operator is executed. The expected behavior is that applyCountPerformance will increment with each new row in the result, applyCountValidation will increment every three rows, corresponding to the number of cross-validation folds, and applyCountForwardSelection will remain at 1 throughout the process. Note that validationPerformance is missing for the first three rows. This is because the Validation operator has not calculated a performance yet; validationPerformance is the average of innerPerformance over the folds within the Validation operator. So, for example, the values of innerPerformance for the first three rows are 0.652, 0.514, and 0.580; these average out to 0.582, which is the value of validationPerformance in the fourth row. The featureNames attribute shows the attributes that were used to create the various performance measurements.

The results are plotted as a graph as shown:

[Graph: validationPerformance against the number of attributes]

This shows that as the number of attributes increases, the validation performance increases and reaches a maximum when the number of attributes is 23. From there, it steadily decreases as the number of attributes reaches 60.

The best performing attribute set is the one logged immediately before the maximum validationPerformance value. In this case, the attributes are:

attribute_12, attribute_40, attribute_16, attribute_11, attribute_6, attribute_28, attribute_19, attribute_17, attribute_44, attribute_37, attribute_30, attribute_53, attribute_47, attribute_22, attribute_41, attribute_54, attribute_34, attribute_23, attribute_27, attribute_39, attribute_57, attribute_36, attribute_10.

The point is that the number of attributes has been reduced and the model accuracy has actually increased. In real-world situations with large datasets, a reduction in the attribute count combined with an increase in performance is very valuable.
