Rule models

We can best understand rule models using the principles of discrete mathematics. Let's review some of these principles.

Let X be a set of features, the feature space, and C be a set of classes. We can define the ideal classifier for X as follows:

c: X → C

A set of examples in the feature space with class c is defined as follows:

D = {(x1, c(x1)), ..., (xn, c(xn))} ⊆ X × C

A splitting of X partitions X into a set of mutually exclusive subsets X1, ..., Xs, so we can say the following:

X = X1 ∪ ... ∪ Xs

This induces a splitting of D into D1, ..., Ds, where Dj = {(x, c(x)) ∈ D | x ∈ Xj} for j = 1, ..., s.

This simply defines Dj as the set of labelled examples in D whose instances fall in the subset Xj of the feature space; since the labels come from the ideal classifier c, every example in Dj is, by definition, correctly labelled.
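To make this notation concrete, here is a minimal Python sketch, using an invented two-feature dataset, that partitions a labelled sample D according to a split of the feature space and checks that the parts are mutually exclusive and cover D:

```python
# A tiny labelled sample D = {(x, c(x))}; the instances and labels are invented.
D = [((1, 0), "pos"), ((0, 1), "neg"), ((1, 1), "pos"), ((0, 0), "neg")]

# A split of X into X1 (first feature is 1) and X2 (first feature is 0).
in_X1 = lambda x: x[0] == 1

# The induced split of D: Dj = {(x, c(x)) in D | x in Xj}.
D1 = [(x, c) for (x, c) in D if in_X1(x)]
D2 = [(x, c) for (x, c) in D if not in_X1(x)]

assert sorted(D1 + D2) == sorted(D)   # the parts are disjoint and cover D
print(D1)  # [((1, 0), 'pos'), ((1, 1), 'pos')]
print(D2)  # [((0, 1), 'neg'), ((0, 0), 'neg')]
```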

In the following table, we define a number of measurements as sums of indicator functions. An indicator function I[...] is equal to one if the statement between the square brackets is true and zero if it is false. Here, ĉ(x) is the classifier's estimate of c(x), and + and − denote the positive and negative classes.

Let's take a look at the following table:

Number of positives: Pos = Σx∈D I[c(x) = +]

Number of negatives: Neg = Σx∈D I[c(x) = −]

True positives: TP = Σx∈D I[ĉ(x) = c(x) = +]

True negatives: TN = Σx∈D I[ĉ(x) = c(x) = −]

False positives: FP = Σx∈D I[ĉ(x) = +, c(x) = −]

False negatives: FN = Σx∈D I[ĉ(x) = −, c(x) = +]

Accuracy: acc = (1/|D|) Σx∈D I[ĉ(x) = c(x)]

Error rate: err = (1/|D|) Σx∈D I[ĉ(x) ≠ c(x)] = 1 − acc

True positive rate (sensitivity, recall): tpr = TP/Pos

True negative rate (negative recall): tnr = TN/Neg

Precision, confidence: prec = TP/(TP + FP)
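As a quick illustration of these definitions, the following sketch computes the counts and rates directly as sums of indicator functions over a small invented set of true and predicted labels:

```python
# Invented true labels c(x) and predictions c_hat(x) for six examples.
true = ["pos", "pos", "pos", "neg", "neg", "neg"]
pred = ["pos", "pos", "neg", "neg", "neg", "pos"]

# Counts as sums of indicator functions I[...].
Pos = sum(t == "pos" for t in true)
Neg = sum(t == "neg" for t in true)
TP = sum(p == t == "pos" for p, t in zip(pred, true))
TN = sum(p == t == "neg" for p, t in zip(pred, true))
FP = sum(p == "pos" and t == "neg" for p, t in zip(pred, true))
FN = sum(p == "neg" and t == "pos" for p, t in zip(pred, true))

accuracy = (TP + TN) / len(true)
error_rate = 1 - accuracy
tpr = TP / Pos          # true positive rate (sensitivity, recall)
tnr = TN / Neg          # true negative rate (negative recall)
precision = TP / (TP + FP)

print(TP, TN, FP, FN)            # 2 2 1 1
print(accuracy, tpr, precision)  # all 0.666... for this invented data
```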

Rule models comprise not only sets or lists of rules but, importantly, a specification of how to combine these rules to form predictions. They are logical models but differ from the tree approach in that trees split into mutually exclusive branches, whereas rules can overlap, possibly carrying additional information. In supervised learning, there are essentially two approaches to rule models. One is to find a combination of literals, as we did previously, to form a hypothesis that covers a sufficiently homogeneous set of samples, and then find a label. Alternatively, we can do the opposite; that is, we can first select a class and then find rules that cover a sufficiently large subset of samples of that class. The first approach tends to lead to an ordered list of rules, while in the second approach, rules form an unordered set. Each deals with overlapping rules in its own characteristic way, as we will see. Let's look at the ordered list approach first.

The ordered list approach

As we add literals to a conjunctive rule, we aim to increase the homogeneity of each subsequent set of instances covered by the rule. This is similar to constructing a path in the hypothesis space as we did for our logical trees in the last section. A key difference with the rule approach is that we are only interested in the purity of one of the children, the one where the added literal is true. With tree-based models, we use the weighted average of both children to find the purity of both branches of a binary split. Here, we are still interested in calculating the purity of subsequent rules; however, we only follow one side of each split. We can still use the same methods for finding purity, but we no longer need to average over all children. As opposed to the divide and conquer strategy of decision trees, rule-based learning is often described as separate and conquer.

Let's briefly consider an example using our conifer categorization problem from the previous section.


There are several options for choosing a rule that will result in the purest split. Supposing we choose the rule If scaled = N then class is negative, we have covered three out of four negative samples. In the next iteration, we remove these samples from consideration and continue this process of searching for literals with maximum purity. Effectively, what we are doing is building an ordered list of rules joined with the if and else clauses. We can rewrite our rules to be mutually exclusive, and this would mean that the set of rules does not need to be ordered. The tradeoff here is that we would have to use either negated literals or internal disjunctions to deal with features that have more than two values.

There are certain refinements we can make to this model. For example, we can introduce a stopping criterion that halts iteration if certain conditions are met, such as in the case of noisy data where we may want to stop iteration when the number of samples in each class falls below a certain number.
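A minimal sketch of this separate-and-conquer loop might look like the following. The data and the candidate literals are invented for illustration (only the scaled feature echoes the conifer example), and a real learner would search and score literals far more carefully:

```python
# Each example is (features, label); the feature values and labels are invented.
data = [
    ({"scaled": "N", "arms": "Y"}, "neg"),
    ({"scaled": "N", "arms": "N"}, "neg"),
    ({"scaled": "Y", "arms": "Y"}, "pos"),
    ({"scaled": "Y", "arms": "N"}, "neg"),
    ({"scaled": "Y", "arms": "Y"}, "pos"),
]

def purity(covered):
    """Fraction of the majority class among the covered examples."""
    if not covered:
        return 0.0
    pos = sum(label == "pos" for _, label in covered)
    return max(pos, len(covered) - pos) / len(covered)

def best_literal(examples):
    """Greedily pick the (feature, value) literal whose covered set is purest."""
    candidates = {(f, v) for x, _ in examples for f, v in x.items()}
    return max(candidates,
               key=lambda lit: purity([(x, c) for x, c in examples
                                       if x[lit[0]] == lit[1]]))

rules = []                      # ordered list of (literal, majority label)
remaining = list(data)
while remaining:
    lit = best_literal(remaining)
    covered = [(x, c) for x, c in remaining if x[lit[0]] == lit[1]]
    labels = [c for _, c in covered]
    rules.append((lit, max(set(labels), key=labels.count)))
    # Separate: remove what this rule covers, then conquer the rest.
    remaining = [(x, c) for x, c in remaining if x[lit[0]] != lit[1]]

print(rules)
```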

Ordered rule models have a lot in common with decision trees, especially in that they use an objective function based on the notion of purity, that is, the relative number of positive and negative class instances in each split. They have structures that are easy to visualize, and they are used in many different machine learning settings.

Set-based rule models

With set-based rule models, rules are learned one class at a time, and our objective function simply becomes maximizing p rather than minimizing min(p, 1−p). Algorithms that use this method typically iterate over each class in turn and, once a rule is found, remove only the covered samples of that class. Set-based models use precision (refer to table 4-1) as a search heuristic, and this can make the model focus too much on the purity of the rule; it may miss near-pure rules that can be further specialized to form a pure rule. Another approach, called beam search, uses the heuristic to order a predetermined number of best partial solutions.
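As a rough illustration of the set-based idea, the following hypothetical snippet uses beam search to grow conjunctive rules for a chosen class, scoring candidates by precision and keeping only a small beam of the best partial rules; the data, feature names, beam width, and maximum rule length are all invented:

```python
# Invented examples: (features, label).
data = [
    ({"color": "red", "size": "big"}, "pos"),
    ({"color": "red", "size": "small"}, "pos"),
    ({"color": "blue", "size": "big"}, "neg"),
    ({"color": "blue", "size": "small"}, "neg"),
    ({"color": "red", "size": "big"}, "neg"),
]

def covers(rule, x):
    """A rule is a set of (feature, value) literals, read as a conjunction."""
    return all(x.get(f) == v for f, v in rule)

def precision(rule, target):
    covered = [c for x, c in data if covers(rule, x)]
    return (sum(c == target for c in covered) / len(covered)) if covered else 0.0

def beam_search(target, beam_width=2, max_literals=2):
    literals = {(f, v) for x, _ in data for f, v in x.items()}
    beam = [frozenset()]          # start from the empty (most general) rule
    best = frozenset()
    for _ in range(max_literals):
        # Specialise every rule in the beam by one extra literal.
        candidates = {rule | {lit} for rule in beam for lit in literals if lit not in rule}
        beam = sorted(candidates, key=lambda r: precision(r, target), reverse=True)[:beam_width]
        if beam and precision(beam[0], target) > precision(best, target):
            best = beam[0]
    return best

print(beam_search("pos"))   # e.g. frozenset({('color', 'red'), ('size', 'small')})
```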

Note

Ordered lists give us a convex coverage for the training set. This is not necessarily true of the unordered set-based approach, where there is no global optimum order for a given set of rules. Because of this, we have access to rule overlaps expressed as a conjunction A∧B, where A and B are two rules. If these two rules are in an ordered list, we have either A = (A∧B) ∨ (A∧¬B) if the order is AB, or B = (A∧B) ∨ (¬A∧B) if the order is BA. This means that the rule space is potentially enlarged; however, because we have to estimate the coverage of overlaps, we sacrifice convexity.

Rule models, in general, are well suited to building predictive models. We can, as we will see later, extend our rule models to perform tasks such as clustering and regression. Another important application of rule models is to build descriptive models. When we are building classification models, we generally look for rules that will create pure subsets of the training samples. However, this is not necessarily what we want if we are looking for other distinguishing characteristics of a particular sample set. This is sometimes referred to as subgroup discovery. Here, we are not interested in a heuristic that is based on class purity but rather in one that looks for distinguishing class distributions.

This is done using a defined quality function based on the idea of local exceptional testing. This function can take the form q = TP/(FP + g). Here, g is a generalization factor that determines the allowable number of nontarget class instances relative to the number of instances covered by the rule. For a small value of g, say less than 1, the generated rules will be more specific because every additional nontarget example incurs a greater relative expense. Higher values of g, say greater than 10, create more general rules covering more nontarget samples. There is no theoretical maximum value for g; however, it does not make much sense for it to exceed the number of samples. The value of g is governed by the size of the data and the proportion of positive samples. The value of g can be varied, thus guiding subgroup discovery to certain points in the TP versus FP space.
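To see how g trades specificity against generality, this small sketch scores two invented candidate rules, one specific and one general, with the quality function q = TP/(FP + g) for several values of g:

```python
def quality(tp, fp, g):
    """Subgroup-discovery quality q = TP / (FP + g)."""
    return tp / (fp + g)

# Two invented candidate rules: a specific one (few false positives)
# and a general one (more coverage, but also more false positives).
specific = {"tp": 20, "fp": 2}
general = {"tp": 60, "fp": 30}

for g in (0.5, 5, 50):
    qs = quality(specific["tp"], specific["fp"], g)
    qg = quality(general["tp"], general["fp"], g)
    winner = "specific" if qs > qg else "general"
    print(f"g={g:>4}: specific={qs:.2f}  general={qg:.2f}  -> {winner}")

# With small g, the specific rule wins because each false positive is
# relatively expensive; with large g, the more general rule's extra
# coverage dominates and it comes out ahead.
```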

We can use subjective or objective quality functions. Subjective interestingness coefficients can be incorporated into the model to reflect things such as understandability, unexpectedness, or relationship patterns based on templates describing the interesting class. Objective measurements are derived from the statistical and structural properties of the data itself. They are very amenable to the use of coverage plots to highlight subgroups whose statistical properties differ from the population as a whole.

Finally, in this section on rule-based models, we will consider rules that can be learned entirely unsupervised. This is called association rule learning, and its typical use cases include data mining, recommender systems and natural language processing. We will use as an example a hardware shop that sells four items: hammers, nails, screws, and paint.

Let's take a look at the following table:

Transaction    Items
1              Nails
2              Hammers and nails
3              Hammers, nails, paint, and screws
4              Hammers, nails, and paint
5              Screws
6              Paint and screws
7              Screws and nails
8              Paint

In this table, we have grouped transactions with items. We could also have grouped each item with the transactions it was involved in. For example, nails were involved in transactions 1, 2, 3, 4, and 7, and hammers were involved in 2, 3, 4, and so on. We can also do this with sets of items, for example, hammers and nails were both involved in transactions 2, 3, and 4. We can write this as the item set {hammer, nails} covers the transaction set {2, 3, 4}. There are 16 item sets including the empty set, which covers all transactions.

The relationship between transaction sets forms a lattice structure connecting items with their respective sets. In order to build association rules, we need to find frequent item sets, that is, item sets whose support meets or exceeds the threshold FT. For example, frequent item sets where FT = 3 include {screws}, {hammer, nails}, and {paint}. These are simply the item sets that are associated with three or more transactions. The following is a diagram showing part of the lattice from our example. In a similar way to how we found the least general generalization in our hypothesis space mapping, here we are interested in the largest item sets on the lowest boundary of the lattice that are still frequent; in this example, that is {nails, hammer}.

[Diagram: part of the item set lattice for the hardware shop example, showing the frequency of each item set]

We can now create association rules of the form if A then B, where A and B are item sets that frequently appear together in a transaction. If we select an edge on this diagram, say the edge between {nails}, with a frequency of 5, and {nails, hammer}, with a frequency of 3, then we can say that the confidence of the association rule if nails then hammer is 3/5. Using a frequency threshold together with the confidence of a rule, an algorithm can find all rules that exceed this threshold. This is called association rule mining, and it often includes a post-processing phase where unnecessary rules are filtered out, for example, rules where a more specific rule does not have a higher confidence than its more general parent.
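A small sketch of this mining process on the hardware shop transactions might look as follows; the frequency threshold FT = 3 comes from the text, while the code itself is just one illustrative way to enumerate frequent item sets and their rule confidences:

```python
from itertools import combinations

# The eight transactions from the table above.
transactions = [
    {"nails"},
    {"hammer", "nails"},
    {"hammer", "nails", "paint", "screws"},
    {"hammer", "nails", "paint"},
    {"screws"},
    {"paint", "screws"},
    {"screws", "nails"},
    {"paint"},
]

def support(itemset):
    """Number of transactions covered by the item set."""
    return sum(itemset <= t for t in transactions)

items = set().union(*transactions)
FT = 3  # frequency threshold from the text

# All frequent item sets: non-empty subsets whose support is at least FT.
frequent = [set(s) for r in range(1, len(items) + 1)
            for s in combinations(sorted(items), r) if support(set(s)) >= FT]

# Association rules "if A then B" with confidence support(A and B) / support(A).
for itemset in frequent:
    for a in itemset:
        body, head = itemset - {a}, {a}
        if body:
            conf = support(itemset) / support(body)
            print(f"if {sorted(body)} then {sorted(head)}: confidence {conf:.2f}")
```

Running this recovers, among others, the rule if nails then hammer with confidence 0.60, matching the 3/5 computed previously.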
