A dataset often contains parts that are not helpful for analysis. One way to get rid of them is to preprocess the dataset before importing it into Weka. The other way is to remove them after the dataset is loaded in Weka, using filters. Supervised filters can take the class attribute into account, while unsupervised filters disregard it. In addition, filters operate on either the attributes or the instances that meet the filter conditions; these are attribute-based and instance-based filters, respectively. Most filters implement the OptionHandler interface, which allows you to set the filter options via a String array.
This task demonstrates how to create a filter and apply it to a dataset. Additional sections cover further cases, such as discretization and classifier-specific filtering.
Before starting, load a dataset, as shown in the previous recipe. Then, to remove, for example, the second attribute from the dataset, use the following code snippet:
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
...
String[] opts = new String[]{"-R", "2"};
Remove remove = new Remove();
remove.setOptions(opts);
remove.setInputFormat(dataset);
Instances newData = Filter.useFilter(dataset, remove);
The new dataset is now without the second attribute from the original dataset.
First, we import the Instances class, which holds our dataset.
import weka.core.Instances;
Next, we import the Filter class, which is used to run the selected filter.
import weka.filters.Filter;
For example, if you want to remove a subset of attributes from the dataset, you need this unsupervised attribute filter:
import weka.filters.unsupervised.attribute.Remove;
Now, let's build the options for the OptionHandler interface as a String array:
String[] options = new String[]{...};
The filter documentation specifies the options as follows: specify the range of attributes to act on. This is a comma-separated list of attribute indices, with first and last as valid values. Specify an inclusive range with a hyphen (-). For example, first-3,5,6-10,last.
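To make the range syntax concrete, the following is a minimal sketch of how such a string could be expanded into 1-based attribute indices. It is not Weka's own parser: the class AttributeRangeSketch and its methods are hypothetical names introduced only for illustration, and the sketch does no validation of malformed or out-of-bounds input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Weka: expands a range string such as
// "first-3,5,last" into a list of 1-based attribute indices.
class AttributeRangeSketch {

    // Resolves one token ("first", "last", or a number) to a 1-based index.
    static int resolve(String token, int numAttributes) {
        String t = token.trim();
        if (t.equals("first")) {
            return 1;
        }
        if (t.equals("last")) {
            return numAttributes;
        }
        return Integer.parseInt(t);
    }

    // Expands a comma-separated list of single indices and inclusive ranges.
    static List<Integer> expand(String spec, int numAttributes) {
        List<Integer> indices = new ArrayList<>();
        for (String part : spec.split(",")) {
            int dash = part.indexOf('-');
            int from;
            int to;
            if (dash < 0) {
                // A single index such as "5" or "last".
                from = resolve(part, numAttributes);
                to = from;
            } else {
                // An inclusive range such as "first-3" or "6-10".
                from = resolve(part.substring(0, dash), numAttributes);
                to = resolve(part.substring(dash + 1), numAttributes);
            }
            for (int i = from; i <= to; i++) {
                indices.add(i);
            }
        }
        return indices;
    }
}
```

For a dataset with 12 attributes, AttributeRangeSketch.expand("first-3,5,6-10,last", 12) yields the indices 1, 2, 3, 5, 6, 7, 8, 9, 10, and 12.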
Suppose we want to remove the second attribute. We specify the range option -R with the value 2. The first attribute index here is 1, whereas 0 is used when a new attribute is created, as shown in the previous recipe.
{"-R", "2"}
Initialize a new filter instance as follows:
Remove remove = new Remove();
Pass the options to the newly created filter as follows:
remove.setOptions(options);
Then pass the original dataset (after setting the options):
remove.setInputFormat(dataset);
And finally, apply the filter that returns a new dataset:
Instances newData = Filter.useFilter(dataset, remove);
The new dataset can now be used in other tasks.
In addition to the Remove filter, we will take a closer look at another important filter: attribute discretization, which transforms a real-valued attribute into a nominal-valued attribute. Further, we will demonstrate how to prepare a classifier-specific filter that applies filtering on the fly.
We will first see how an attribute filter discretizes a range of numeric attributes in the dataset into nominal attributes.
Use the following code snippet to discretize all the numeric attributes into two intervals:
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;
...
String[] options = new String[4];
Specify the number of discrete intervals, for example 2:
options[0] = "-B"; options[1] = "2";
Specify the range of the attribute on which you want to apply the filter, for example, all the attributes:
options[2] = "-R"; options[3] = "first-last";
Apply the filter:
Discretize discretize = new Discretize();
discretize.setOptions(options);
discretize.setInputFormat(dataset);
Instances newData = Filter.useFilter(dataset, discretize);
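To illustrate what splitting a numeric attribute into a fixed number of intervals means, here is a minimal sketch of equal-width binning in plain Java. It assumes the filter is used in an equal-width mode and does not reproduce Weka's implementation: the class EqualWidthBinningSketch is a hypothetical name, and the handling of values that fall exactly on an interval boundary may differ from what Weka does.

```java
// Hypothetical illustration, not Weka code: assigns each numeric value
// to one of `bins` equal-width intervals between the observed min and max.
class EqualWidthBinningSketch {

    // Returns the 0-based interval index for a single value.
    static int binOf(double value, double min, double max, int bins) {
        if (max == min) {
            return 0; // all values are identical, so there is only one interval
        }
        double width = (max - min) / bins;
        int bin = (int) ((value - min) / width);
        // Clamp so that the maximum value falls into the last interval.
        return Math.min(Math.max(bin, 0), bins - 1);
    }

    // Discretizes all values of one attribute into interval indices.
    static int[] discretize(double[] values, int bins) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        int[] result = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = binOf(values[i], min, max, bins);
        }
        return result;
    }
}
```

With two intervals, the values 1.0, 2.0, 9.0, and 10.0 are mapped to the interval indices 0, 0, 1, and 1, which corresponds to the binary outcome requested by the -B 2 option above.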
An easy way to filter data on the fly is to use the FilteredClassifier class. This is a meta-classifier that removes the need to filter the data separately before training the classifier and before prediction. This example demonstrates a meta-classifier that combines the Remove filter with the J48 decision tree classifier to remove the first attribute in the dataset (it could be, for example, a numeric ID attribute). For additional details on classifiers, see the Training a classifier (Simple) and Building your own classifier (Advanced) recipes; for evaluation, see the Testing and evaluating your models (Simple) recipe.
Import the FilteredClassifier meta-classifier, the J48 decision tree classifier, and the Remove filter:
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.filters.unsupervised.attribute.Remove;
Initialize the filter and base classifier:
Remove rm = new Remove();
rm.setAttributeIndices("1");
J48 j48 = new J48();
Create the FilteredClassifier object, and specify the filter and the base classifier:
FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(rm);
fc.setClassifier(j48);
Build the meta-classifier:
Instances dataset = ...
fc.buildClassifier(dataset);
To classify an instance, you can simply use the following:
Instance instance = ...
double prediction = fc.classifyInstance(instance);
The instance is automatically filtered before classification; in our case, the first attribute is removed.
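The idea behind filtering on the fly can be sketched without Weka: a wrapper applies the same attribute removal to every row before handing it to the underlying model. The class FilteredModelSketch below is a hypothetical illustration of that pattern, not the FilteredClassifier implementation; the base model is assumed to be any function from a row of numeric values to a prediction, and no bounds checking is performed.

```java
import java.util.function.ToDoubleFunction;

// Hypothetical illustration, not Weka code: wraps a base model so that
// one attribute is removed from every row before prediction.
class FilteredModelSketch {

    private final int removeIndex; // 1-based index, as in the filter options
    private final ToDoubleFunction<double[]> baseModel;

    FilteredModelSketch(int removeIndex, ToDoubleFunction<double[]> baseModel) {
        this.removeIndex = removeIndex;
        this.baseModel = baseModel;
    }

    // Returns a copy of the row without the attribute at the 1-based index.
    static double[] removeAttribute(double[] row, int index) {
        double[] filtered = new double[row.length - 1];
        int k = 0;
        for (int i = 0; i < row.length; i++) {
            if (i != index - 1) {
                filtered[k++] = row[i];
            }
        }
        return filtered;
    }

    // Filters the row first, then delegates to the base model.
    double classify(double[] row) {
        return baseModel.applyAsDouble(removeAttribute(row, removeIndex));
    }
}
```

The caller passes unfiltered rows and never has to remember to drop the ID column, which is the convenience the meta-classifier provides for both training and prediction.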