Chapter 4. Models – Learning from Information

So far in this book, we have examined a range of tasks and techniques. We introduced the basics of data types, structures, and properties, and we familiarized ourselves with some of the machine learning tools that are available.

In this chapter, we will look at three broad types of model:

  • Logical models
  • Tree models
  • Rule models

The next chapter will be devoted to another important type of model—the linear model. Much of the material in this chapter is theoretical, and its purpose is to introduce some of the mathematical and logical tools needed for machine learning tasks. I encourage you to work through these ideas and formulate them in ways that may help solve problems that we come across.

Logical models

Logical models divide the instance space, that is, the set of all possible or allowable instances, into segments. The goal is to ensure that the data in each segment is homogeneous with respect to a particular task. For example, if the task is classification, then we aim to ensure that each segment contains a majority of instances of the same class.

Logical models use logical expressions to describe a particular concept. The simplest and most general logical expressions are literals, and the most common of these is equality. The equality expression can be applied to all types: nominal, numerical, and ordinal. For numerical and ordinal types, we can also include the inequality literals greater than and less than. From here, we can build more complex expressions using four logical connectives: conjunction (logical AND), which is denoted by ∧; disjunction (logical OR), which is denoted by ∨; implication, which is denoted by →; and negation, which is denoted by ¬. This provides us with a way to express the following equivalences:

¬¬A ≡ A        A → B ≡ ¬A ∨ B

¬(A ∧ B) ≡ ¬A ∨ ¬B        ¬(A ∨ B) ≡ ¬A ∧ ¬B
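
As a quick sanity check, here is a minimal Python sketch (the implies helper and message are my own) that verifies all four equivalences over every truth assignment of A and B:

    from itertools import product

    def implies(a, b):
        # Material implication is false only when a is true and b is false
        return not (a and not b)

    # Check each equivalence over every truth assignment of A and B
    for a, b in product([True, False], repeat=2):
        assert (not (not a)) == a                        # double negation
        assert implies(a, b) == ((not a) or b)           # implication
        assert (not (a and b)) == ((not a) or (not b))   # De Morgan over ∧
        assert (not (a or b)) == ((not a) and (not b))   # De Morgan over ∨

    print("All four equivalences hold.")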

We can apply these ideas in a simple example. Let's say we come across a grove of trees that all appear to be from the same species. Our goal is to identify the defining features of this tree species for use in a classification task. For simplicity's sake, let's say we are just dealing with the following four features:

  • Size: This has three values—small, medium, and large
  • Leaf type: This has two values—scaled or non-scaled
  • Fruit: This has two values—yes or no
  • Buttress: This has two values—yes or no

The first tree we identify can be described by the following conjunction:

Size = Large ∧ Leaf = Scaled ∧ Fruit = No ∧ Buttress = Yes

The next tree that we come across is medium-sized. If we drop the size condition, then the statement becomes more general. That is, it will cover more samples:

Leaf = Scaled ∧ Fruit = No ∧ Buttress = Yes

The next tree is also medium-sized, but it does not have buttresses, so we remove this condition and generalize it to the following:

Leaf = Scaled ∧ Fruit = No

The trees in the grove all satisfy this conjunction, and we conclude that they are conifers. Obviously, in a real-world example, we would use a greater range of features and values and employ more complex logical constructs. However, even in this simple example, the instance space is 3 × 2 × 2 × 2 = 24 possible instances. If we consider the absence of a feature as an additional value, then the hypothesis space, that is, the space of conjunctions we can use to describe this set, is 4 × 3 × 3 × 3 = 108. The number of possible sets of instances, or extensions, is 2^24. So, if you were to randomly choose a set of instances, the odds that you could find a conjunctive concept that exactly describes it are well over 100,000 to one.
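
These counts are easy to reproduce in a few lines of Python; the following is a minimal sketch with variable names of my own choosing:

    # Number of values per feature: Size, Leaf, Fruit, Buttress
    feature_values = [3, 2, 2, 2]

    instances = 1
    for v in feature_values:
        instances *= v                # 3 × 2 × 2 × 2 = 24 possible instances

    # Treating "feature absent" as an extra value gives the hypothesis space
    hypotheses = 1
    for v in feature_values:
        hypotheses *= v + 1           # 4 × 3 × 3 × 3 = 108 conjunctive hypotheses

    extensions = 2 ** instances       # 2^24 = 16,777,216 possible sets of instances

    print(instances, hypotheses, extensions)
    print("Odds against an exact conjunctive description:",
          round(extensions / hypotheses), "to 1")   # roughly 155,000 to 1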

Generality ordering

We can begin to map this hypothesis space from the most general statements to the most specific statements. For example, in the neighborhood of our conifer hypothesis, the space looks like this:

[Figure: Generality ordering of the hypothesis space in the neighborhood of the conifer hypothesis]

Here, we are ordering our hypotheses by generality. At the top is the most general hypothesis: all trees are conifers. A more general hypothesis covers a greater number of instances, so the most general hypothesis, that all trees are conifers, applies to all instances. Now, while this might apply to the grove we are standing in, when we attempt to apply this hypothesis to new data, that is, to trees outside the grove, it will fail. At the bottom of the preceding diagram, we have the least general hypothesis. As we make more observations and move up through the nodes, we can eliminate hypotheses and establish the next most general complete hypothesis. The most conservative generalization we can make from the data is called the least general generalization (LGG) of these instances. We can understand this as the point in the hypothesis space where the paths upward from each of the instances intersect.

Let's describe our observations in a table:

Size    Scaled    Fruit    Buttress    Label
L       Y         N        Y           p1
M       Y         N        Y           p2
M       Y         N        N           p3
M       Y         N        Y           p4

Sooner or later, of course, you wander out of the grove and observe negative examples: trees that are clearly not conifers. You note the following features:

Size    Scaled    Fruit    Buttress    Label
S       N         N        N           n1
M       N         N        N           n2
S       N         Y        N           n3
M       Y         N        N           n4

So, with the addition of the negative examples, we can see that our least general complete hypothesis is still Scaled = Y ∧ Fruit = N. However, you will notice that a negative example, n4, is also covered. The hypothesis is therefore not consistent.
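
To make this concrete, the following Python sketch (the dictionary representation and function names are my own) computes the LGG of the four positive examples and checks which negatives it covers:

    positives = [
        {"Size": "L", "Scaled": "Y", "Fruit": "N", "Buttress": "Y"},   # p1
        {"Size": "M", "Scaled": "Y", "Fruit": "N", "Buttress": "Y"},   # p2
        {"Size": "M", "Scaled": "Y", "Fruit": "N", "Buttress": "N"},   # p3
        {"Size": "M", "Scaled": "Y", "Fruit": "N", "Buttress": "Y"},   # p4
    ]
    negatives = [
        {"Size": "S", "Scaled": "N", "Fruit": "N", "Buttress": "N"},   # n1
        {"Size": "M", "Scaled": "N", "Fruit": "N", "Buttress": "N"},   # n2
        {"Size": "S", "Scaled": "N", "Fruit": "Y", "Buttress": "N"},   # n3
        {"Size": "M", "Scaled": "Y", "Fruit": "N", "Buttress": "N"},   # n4
    ]

    def lgg_conjunction(examples):
        # Keep only the feature-value pairs shared by every positive example
        hypothesis = dict(examples[0])
        for example in examples[1:]:
            hypothesis = {f: v for f, v in hypothesis.items() if example[f] == v}
        return hypothesis

    def covers(hypothesis, instance):
        # A conjunction covers an instance if every condition matches
        return all(instance[f] == v for f, v in hypothesis.items())

    lgg = lgg_conjunction(positives)
    print("LGG:", lgg)    # {'Scaled': 'Y', 'Fruit': 'N'}
    print("Covered negatives:",
          [f"n{i + 1}" for i, n in enumerate(negatives) if covers(lgg, n)])   # ['n4']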

Version space

This simple example may lead you to the conclusion that there is only one LGG. But this is not necessarily true. We can expand our hypothesis space by adding a restricted form of disjunction called internal disjunction. In our previous example, we had three positive examples of conifers with either medium or large size. We can add a condition Size = Medium ∨ Size = Large, and we can write this as Size[m,l]. Internal disjunction will only work with features that have more than two values, because something like Leaf = Scaled ∨ Leaf = Non-scaled is always true.

In the previous conifer example, we dropped the size condition to accommodate our second and third observations. This gave us the following LGG:

Leaf = Scaled ∧ Fruit = No

Given our internal disjunction, we can rewrite the preceding LGG as follows:

Size[m,l] ∧ Leaf = Scaled ∧ Fruit = No

Now, consider the first negative (non-conifer) example:

Size = Small ∧ Leaf = Non-scaled ∧ Fruit = No

We can drop any of the three conditions in the LGG with the internal disjunction without covering this negative example. However, when we attempt to generalize further to single conditions, we see that Size[m,l] and Leaf = Scaled are OK but Fruit = No is not, since it covers the negative example.
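
A minimal sketch of how this can be checked, assuming each hypothesis with internal disjunction is represented as a mapping from features to sets of allowed values (the representation and function names are mine):

    # A feature left out of the hypothesis places no restriction
    positives = [
        {"Size": "L", "Scaled": "Y", "Fruit": "N"},
        {"Size": "M", "Scaled": "Y", "Fruit": "N"},
        {"Size": "M", "Scaled": "Y", "Fruit": "N"},
    ]
    n1 = {"Size": "S", "Scaled": "N", "Fruit": "N"}

    def lgg_internal(examples):
        # Union the observed values of each feature across the positives
        hypothesis = {}
        for example in examples:
            for feature, value in example.items():
                hypothesis.setdefault(feature, set()).add(value)
        return hypothesis

    def covers(hypothesis, instance):
        return all(instance[f] in allowed for f, allowed in hypothesis.items())

    h = lgg_internal(positives)
    print(h)               # {'Size': {'L', 'M'}, 'Scaled': {'Y'}, 'Fruit': {'N'}}
    print(covers(h, n1))   # False: the LGG with internal disjunction excludes n1

    # Generalize to a single condition and test each one against n1
    for feature, allowed in h.items():
        print(feature, "alone covers n1:", n1[feature] in allowed)
    # Size[m,l] and Scaled = Y still exclude n1, but Fruit = N covers it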

Now, we are interested in a hypothesis that is both complete and consistent, that is, one that covers all the positive examples and none of the negative ones. Let's now redraw our diagram considering just our four positive examples (p1 - p4) and one negative example (n1).

[Figure: Version space for the four positive examples (p1 - p4) and the negative example n1]

This is sometimes referred to as the version space. Note that we have one least general hypothesis, three intermediate, and, now, two most general hypotheses. The version space forms a convex set. This means we can interpolate between members of this set. If an element lies between a most general and least general member of the set, then it is also a member of that set. In this way, we can fully describe the version space by its most and least general members.
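
To see where these counts come from, here is a brute-force sketch (the representation and names are my own) that enumerates every internally disjunctive hypothesis over the four features and keeps the ones that cover all four positives while excluding n1:

    from itertools import combinations, product

    # Each hypothesis maps a feature to a non-empty set of allowed values
    # (internal disjunction); allowing every value places no restriction.
    values = {
        "Size": ["S", "M", "L"],
        "Scaled": ["Y", "N"],
        "Fruit": ["Y", "N"],
        "Buttress": ["Y", "N"],
    }
    positives = [
        {"Size": "L", "Scaled": "Y", "Fruit": "N", "Buttress": "Y"},   # p1
        {"Size": "M", "Scaled": "Y", "Fruit": "N", "Buttress": "Y"},   # p2
        {"Size": "M", "Scaled": "Y", "Fruit": "N", "Buttress": "N"},   # p3
        {"Size": "M", "Scaled": "Y", "Fruit": "N", "Buttress": "Y"},   # p4
    ]
    negatives = [
        {"Size": "S", "Scaled": "N", "Fruit": "N", "Buttress": "N"},   # n1
    ]

    def nonempty_subsets(vals):
        return [set(c) for r in range(1, len(vals) + 1)
                for c in combinations(vals, r)]

    def covers(hypothesis, instance):
        return all(instance[f] in allowed for f, allowed in hypothesis.items())

    # Keep hypotheses that are complete (cover all positives) and
    # consistent (cover no negatives)
    features = list(values)
    version_space = []
    for choice in product(*(nonempty_subsets(values[f]) for f in features)):
        h = dict(zip(features, choice))
        if all(covers(h, p) for p in positives) and not any(covers(h, n) for n in negatives):
            version_space.append(h)

    print(len(version_space))   # 6: one least general, three intermediate, two most general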

Consider a case where the least general generalization covers one or more of the negative instances. In such cases, we can say that the data is not conjunctively separable and the version space is empty. We can apply a different approach whereby we search for the most general consistent hypothesis. Here, we are interested in consistency as opposed to completeness. This essentially involves iterating through paths in the hypothesis space, starting from the most general hypothesis. We take downward steps by, for example, adding a conjunct or removing a value from an internal disjunction. At each step, we minimize the specialization of the resulting hypothesis.

Coverage space

When our data is not conjunctively separable, we need a way to optimize between consistency and completeness. A useful approach is in terms of mapping the coverage space of positive and negative instances, as shown in the following diagram:

[Figure: Coverage space of positive and negative instances]

We can see that learning a hypothesis involves finding a path through the hypothesis space ordered by generality. Logical models involve finding a pathway through a lattice-structured hypothesis space. Each hypothesis in this space covers a set of instances, and each of these sets has upper and lower bounds in, and is ordered by, generality. So far, we have only used single conjunctions of literals. With a rich logical language at our disposal, why not incorporate a variety of logical connectives into our expressions? There are basically two reasons why we may want to keep our expressions simple, as follows:

  • More expressive statements lead to specialization, which will result in a model overfitting training data and performing poorly on test data
  • Complicated descriptions are computationally more expensive than simple descriptions

As we saw when learning about the conjunctive hypothesis, uncovered positive examples allow us to drop literals from the conjunction, making it more general. On the other hand, covered negative examples require us to increase specialization by adding literals.

Rather than describing each hypothesis in terms of conjunctions of single literals, we can describe it in terms of disjunctions of clauses, where each clause can be of the form A → B. Here, A is a conjunction of literals and B is a single literal. Let's consider the following statement that covers a negative example:

Butt = Y ∧ Scaled = N ∧ Size = S ∧ ¬(Fruit = N)

To exclude this negative example, we can write the following clause:

Butt = Y ∧ Scaled = N ∧ Size = S → Fruit = N

There are, of course, other clauses that exclude the negative, such as Butt = Y → Fruit = N; however, we are interested in the most specific clause because it is less likely to also exclude covered positives.
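
As a rough illustration, the following sketch (the clause representation and function names are my own) checks whether an instance satisfies a clause of the form A → B, and shows that both the specific and the general clause exclude the negative example above:

    # A clause (body, head) reads: if every condition in the body holds,
    # then the head must hold. An instance that violates a clause is excluded.
    negative = {"Butt": "Y", "Scaled": "N", "Size": "S", "Fruit": "Y"}   # it does have fruit

    def satisfies(instance, clause):
        body, head = clause
        body_holds = all(instance[f] == v for f, v in body.items())
        head_holds = all(instance[f] == v for f, v in head.items())
        # Only violated when the body holds but the head does not
        return (not body_holds) or head_holds

    specific = ({"Butt": "Y", "Scaled": "N", "Size": "S"}, {"Fruit": "N"})
    general = ({"Butt": "Y"}, {"Fruit": "N"})

    print(satisfies(negative, specific))   # False: the negative example is excluded
    print(satisfies(negative, general))    # also False, but more likely to exclude positives too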

PAC learning and computational complexity

Given that, as we increase the complexity of our logical language, we impose a computational cost, we need a metric to gauge the learnability of a language. To this end, we can use the idea of Probably Approximately Correct (PAC) learning.

When we select one hypothesis from a set of hypotheses, the goal is to ensure that our selection will, with high probability, have a low generalization error; that is, it will perform with a high degree of accuracy on a test set. This introduces the idea of computational complexity: a formalization that gauges the computational cost of a given algorithm in relation to the accuracy of its output.

PAC learning makes allowance for mistakes on non-typical examples, and this typicality is determined by an unspecified probability distribution, D. We can evaluate the error rate of a hypothesis with respect to this distribution. For example, let's assume that our data is noise-free and that the learner always outputs a complete and consistent hypothesis on the training samples. Let's choose an arbitrary error rate ε < 0.5 and a failure rate δ < 0.5. We require our learning algorithm to output, with probability ≥ 1 - δ, a hypothesis whose error rate is less than ε. It turns out that this will always be true for any reasonably sized training set. For example, if our hypothesis space, H, contains a single bad hypothesis, then the probability that it is complete and consistent on n independent training samples is less than or equal to (1 - ε)^n. For any 0 ≤ ε ≤ 1, this probability is less than e^(-nε). We need to keep this below our failure rate, δ, which we achieve by setting n ≥ (1/ε) ln(1/δ). Now, if H contains a number of bad hypotheses, k ≤ |H|, then the probability that at least one of them is complete and consistent on n independent samples is at most:

k(1 - ε)^n ≤ |H|(1 - ε)^n ≤ |H|e^(-nε)

This maximum will be less than δ if the following condition is met:

n ≥ (1/ε)(ln|H| + ln(1/δ))

This is known as the sample complexity, and you will notice that it is logarithmic in 1/δ and linear in 1/ε.

Note

This implies that it is exponentially cheaper to reduce the failure rate than it is to reduce the error rate.
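
A quick sketch of this bound in Python, using the 108-hypothesis conifer space as an illustrative |H| (the function name and the chosen ε and δ values are my own):

    import math

    def sample_complexity(h_size, epsilon, delta):
        # n >= (1/epsilon) * (ln|H| + ln(1/delta))
        return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

    print(sample_complexity(108, epsilon=0.1, delta=0.05))    # 77 samples

    # Halving the failure rate adds only a handful of samples...
    print(sample_complexity(108, epsilon=0.1, delta=0.025))   # 84
    # ...whereas halving the error rate roughly doubles the requirement
    print(sample_complexity(108, epsilon=0.05, delta=0.05))   # 154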

To conclude this section, I will make one further point. The hypothesis space H is a subset of U, a universe of explanations for any given phenomenon. How do we know whether the correct hypothesis actually exists inside H rather than elsewhere in U? Bayes' theorem shows a relationship between the relative probabilities of H and ¬H as well as their relative prior probabilities. However, there is no real way we can know the value of P(¬H), because there is no way to calculate the probabilities of hypotheses that have not yet been conceived. Moreover, the contents of such a hypothesis consist of a currently unknown universe of possible objects. This paradox occurs in any description that uses comparative hypothesis checking, where we evaluate our current hypothesis against other hypotheses within H. Another approach would be to find a way to evaluate H itself. We can see that, as we expand H, the computability of hypotheses within it becomes more difficult. To evaluate H, we need to restrict our universe to the universe of the known. For a human, this is a lifetime of experiences imprinted in our brain and nervous system; for a machine, it is its memory banks and algorithms. The ability to evaluate this global hypothesis space is one of the key challenges of artificial intelligence.
