Training decision trees

Let's create a decision tree using an algorithm called Iterative Dichotomiser 3 (ID3). Invented by Ross Quinlan, ID3 was one of the first algorithms used to train decision trees. Assume that you have to classify animals as cats or dogs. Unfortunately, you cannot observe the animals directly and must use only a few attributes of the animals to make your decision. For each animal, you are told whether or not it likes to play fetch, whether or not it is frequently grumpy, and its favorite of three types of food.

To classify new animals, the decision tree will examine an explanatory variable at each node. The edge it follows to the next node will depend on the outcome of the test. For example, the first node might ask whether or not the animal likes to play fetch. If the animal does, we will follow the edge to the left child node; if not, we will follow the edge to the right child node. Eventually an edge will connect to a leaf node that indicates whether the animal is a cat or a dog.

The following fourteen instances comprise our training data:

| Training instance | Plays fetch | Is grumpy | Favorite food | Species |
|-------------------|-------------|-----------|---------------|---------|
| 1  | Yes | No  | Bacon    | Dog |
| 2  | No  | Yes | Dog food | Dog |
| 3  | No  | Yes | Cat food | Cat |
| 4  | No  | Yes | Bacon    | Cat |
| 5  | No  | No  | Cat food | Cat |
| 6  | No  | Yes | Bacon    | Cat |
| 7  | No  | Yes | Cat food | Cat |
| 8  | No  | No  | Dog food | Dog |
| 9  | No  | Yes | Cat food | Cat |
| 10 | Yes | No  | Dog food | Dog |
| 11 | Yes | No  | Bacon    | Dog |
| 12 | No  | No  | Cat food | Cat |
| 13 | Yes | Yes | Cat food | Cat |
| 14 | Yes | Yes | Bacon    | Dog |

From this data we can see that cats are generally grumpier than dogs. Most dogs play fetch and most cats refuse. Dogs prefer dog food and bacon, whereas cats like only cat food and bacon. The is grumpy and plays fetch explanatory variables can easily be converted to binary-valued features. The favorite food explanatory variable is a categorical variable that has three possible values; we will one-hot encode it. Recall from Chapter 3, Feature Extraction and Preprocessing, that one-hot encoding represents a categorical variable with as many binary-valued features as there are values for the variable. Representing the categorical variable with a single integer-valued feature would encode an artificial order on its values. Since favorite food has three possible states, we will represent it with three binary-valued features.

From this table, we can manually construct classification rules. For example, an animal that is grumpy and likes cat food must be a cat, while an animal that plays fetch and likes bacon must be a dog. Constructing these classification rules by hand for even a small data set is cumbersome. Instead, we will learn these rules by creating a decision tree.
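As a brief illustration of this encoding (a sketch only; the dictionary representation and the use of scikit-learn's DictVectorizer are our choices here), the explanatory variables could be vectorized as follows:

```python
from sklearn.feature_extraction import DictVectorizer

# Binary variables are encoded as 0/1; the categorical favorite food
# variable is left as a string so that DictVectorizer one-hot encodes it.
instances = [
    {'plays fetch': 1, 'is grumpy': 0, 'favorite food': 'bacon'},     # instance 1
    {'plays fetch': 0, 'is grumpy': 1, 'favorite food': 'dog food'},  # instance 2
    {'plays fetch': 0, 'is grumpy': 1, 'favorite food': 'cat food'},  # instance 3
    # ... the remaining eleven instances follow the same pattern
]

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(instances)
print(vectorizer.feature_names_)  # five binary-valued features
print(X)
```

The two Boolean variables each become a single binary feature, and favorite food becomes three binary features, for five features in total.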

Selecting the questions

Like Twenty Questions, the decision tree will estimate the value of the response variable by testing the values of a sequence of explanatory variables. Which explanatory variable should be tested first? Intuitively, a test that produces subsets that contain all cats or all dogs is better than a test that produces subsets that still contain both cats and dogs. If the members of a subset are of different classes, we are still uncertain about how to classify the instance. We should also avoid creating tests that separate only a single cat or dog from the others; such tests are analogous to asking overly specific questions in the first few rounds of Twenty Questions. More formally, these tests can classify only a few instances at a time and may do little to reduce our uncertainty. The tests that reduce our uncertainty about the classification the most are the best. We can quantify the amount of uncertainty using a measure called entropy.

Measured in bits, entropy quantifies the amount of uncertainty in a variable. Entropy is given by the following equation, where n is the number of possible outcomes and P(x_i) is the probability of the outcome x_i. Common values for the base of the logarithm, b, are 2, e, and 10. Because the log of a number less than one will be negative, the entire sum is negated to return a positive value.

H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)
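As a small sketch (the function name and its counts-based interface are our own), this quantity can be computed in Python as follows:

```python
import math

def entropy(class_counts, base=2):
    """Entropy (in bits, for base 2) of a distribution given by outcome counts."""
    total = sum(class_counts)
    h = 0.0
    for count in class_counts:
        if count == 0:
            continue  # a zero-probability outcome contributes nothing
        p = count / total
        h -= p * math.log(p, base)
    return h
```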

For example, a single toss of a fair coin has only two outcomes: heads and tails. The probability that the coin will land on heads is 0.5, and the probability that it will land on tails is 0.5. The entropy of the coin toss is equal to the following:

H(X) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1

That is, only one bit is required to represent the two equally probable outcomes, heads and tails. Two tosses of a fair coin can result in four possible outcomes: heads and heads, heads and tails, tails and heads, and tails and tails. The probability of each outcome is 0.5 x 0.5 = 0.25. The entropy of two tosses is equal to the following:

H(X) = -(0.25 \log_2 0.25 + 0.25 \log_2 0.25 + 0.25 \log_2 0.25 + 0.25 \log_2 0.25) = 2

If the coin has the same face on both sides, the variable representing its outcome has 0 bits of entropy; that is, we are always certain of the outcome and the variable will never represent new information. Entropy can also be represented as a fraction of a bit. For example, an unfair coin has two different faces, but is weighted such that the faces are not equally likely to land in a toss. Assume that the probability that an unfair coin will land on heads is 0.8, and the probability that it will land on tails is 0.2. The entropy of a single toss of this coin is equal to the following:

H(X) = -(0.8 \log_2 0.8 + 0.2 \log_2 0.2) \approx 0.7219

The outcome of a single toss of an unfair coin can have a fraction of one bit of entropy. There are two possible outcomes of the toss, but we are not totally uncertain since one outcome is more frequent.

Let's calculate the entropy of classifying an unknown animal. If an equal number of dogs and cats comprise our animal classification training data and we do not know anything else about the animal, the entropy of the decision is equal to one. All we know is that the animal could be either a cat or a dog; like the fair coin toss, both outcomes are equally likely. Our training data, however, contains six dogs and eight cats. If we do not know anything else about the unknown animal, the entropy of the decision is given by the following:

H(X) = -\left(\frac{6}{14} \log_2 \frac{6}{14} + \frac{8}{14} \log_2 \frac{8}{14}\right) \approx 0.9852
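Using the entropy helper sketched earlier, we can reproduce the coin-toss values and this figure:

```python
print(entropy([1, 1]))        # fair coin: 1.0 bit
print(entropy([1, 1, 1, 1]))  # two tosses of a fair coin: 2.0 bits
print(entropy([8, 2]))        # unfair coin weighted 0.8/0.2: ~0.7219 bits
print(entropy([6, 8]))        # six dogs and eight cats: ~0.9852 bits
```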

Since cats are more common, we are less uncertain about the outcome. Now let's find the explanatory variable that will be most helpful in classifying the animal; that is, let's find the explanatory variable that reduces the entropy the most. We can test the plays fetch explanatory variable and divide the training instances into animals that play fetch and animals that don't. This produces the two following subsets:

[Diagram: the training instances split by the plays fetch test.]

Decision trees are often visualized as diagrams that are similar to flowcharts. The top box of the previous diagram is the root node; it contains all of our training instances and specifies the explanatory variable that will be tested. At the root node we have not eliminated any instances from the training set and the entropy is equal to approximately 0.985. The root node tests the plays fetch explanatory variable. Recall that we converted this Boolean explanatory variable to a binary-valued feature. Training instances for which plays fetch is equal to zero follow the edge to the root's left child, and training instances for animals that do play fetch follow the edge to the root's right child node. The left child node contains a subset of the training data with seven cats and two dogs that do not like to play fetch. The entropy at this node is given by the following:

H(X) = -\left(\frac{7}{9} \log_2 \frac{7}{9} + \frac{2}{9} \log_2 \frac{2}{9}\right) \approx 0.7642

The right child contains a subset with one cat and four dogs that do like to play fetch. The entropy at this node is given by the following:

H(X) = -\left(\frac{1}{5} \log_2 \frac{1}{5} + \frac{4}{5} \log_2 \frac{4}{5}\right) \approx 0.7219

Instead of testing the plays fetch explanatory variable, we could test the is grumpy explanatory variable. This test produces the following tree. As with the previous tree, instances that fail the test follow the left edge, and instances that pass the test follow the right edge.

[Diagram: the training instances split by the is grumpy test.]

We could also divide the instances into animals that prefer cat food and animals that don't to produce the following tree:

[Diagram: the training instances split by the favorite food = cat food test.]

Information gain

Testing for the animals that prefer cat food resulted in one subset with six cats, zero dogs, and 0 bits of entropy and another subset with two cats, six dogs, and 0.811 bits of entropy. How can we measure which of these tests reduced our uncertainty about the classification the most? Averaging the entropies of the subsets may seem to be an appropriate measure of the reduction in entropy. In this example, the subsets produced by the cat food test have the lowest average entropy. Intuitively, this test seems to be effective, as we can use it to classify almost half of the training instances. However, selecting the test that produces the subsets with the lowest average entropy can produce a suboptimal tree. For example, imagine a test that produced one subset with two dogs and no cats and another subset with four dogs and eight cats. The entropy of the first subset is equal to the following (note that the second term is omitted because \log_2 0 is undefined):

H(X) = -\frac{2}{2} \log_2 \frac{2}{2} = 0

The entropy of the second subset is equal to the following:

H(X) = -\left(\frac{4}{12} \log_2 \frac{4}{12} + \frac{8}{12} \log_2 \frac{8}{12}\right) \approx 0.9183

The average of these subsets' entropies is only 0.459, but the subset containing most of the instances has almost one bit of entropy. This is analogous to asking specific questions early in Twenty Questions; we could get lucky and win within the first few attempts, but it is more likely that we will squander our questions without eliminating many possibilities. Instead, we will measure the reduction in entropy using a metric called information gain. Calculated with the following equation, information gain is the difference between the entropy of the parent node, H(T), and the weighted average of the children nodes' entropies. Here, T is the set of instances and a is the explanatory variable under test. x_a is the value of attribute a for instance x, |\{x \in T : x_a = v\}| is the number of instances for which attribute a is equal to the value v, and H(\{x \in T : x_a = v\}) is the entropy of the subset of instances for which the value of the explanatory variable a is v.

IG(T, a) = H(T) - \sum_{v \in vals(a)} \frac{\left|\{x \in T : x_a = v\}\right|}{|T|} H\!\left(\{x \in T : x_a = v\}\right)
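A direct translation of this equation (again a sketch, building on the entropy helper above and representing each subset by its class counts) might look like this:

```python
def information_gain(parent_counts, subset_counts_list):
    """Information gain of a test that splits a parent node into subsets.

    parent_counts: class counts at the parent node, e.g. [6, 8]
    subset_counts_list: class counts for each child subset, e.g. [[2, 7], [4, 1]]
    """
    total = sum(parent_counts)
    # Weighted average of the children's entropies, weighted by subset size.
    weighted = sum(sum(counts) / total * entropy(counts)
                   for counts in subset_counts_list)
    return entropy(parent_counts) - weighted
```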

The following table contains the information gains for all of the tests. In this case, the cat food test is still the best, as it produces the greatest information gain.

| Test | Parent's entropy | Left child's entropy | Right child's entropy | Weighted average | IG |
|------|------------------|----------------------|-----------------------|------------------|----|
| plays fetch? | 0.9852 | 0.7642 | 0.7219 | 0.7642 * 9/14 + 0.7219 * 5/14 = 0.7491 | 0.2361 |
| is grumpy? | 0.9852 | 0.9183 | 0.8113 | 0.9183 * 6/14 + 0.8113 * 8/14 = 0.8572 | 0.1280 |
| favorite food = cat food | 0.9852 | 0.8113 | 0 | 0.8113 * 8/14 + 0.0 * 6/14 = 0.4636 | 0.5216 |
| favorite food = dog food | 0.9852 | 0.8454 | 0 | 0.8454 * 11/14 + 0.0 * 3/14 = 0.6642 | 0.3210 |
| favorite food = bacon | 0.9852 | 0.9183 | 0.9710 | 0.9183 * 9/14 + 0.9710 * 5/14 = 0.9371 | 0.0481 |
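The values in this table can be reproduced with the information_gain sketch above; the class counts below are read directly off the training data, listed as [dogs, cats] for the left (test fails) and right (test passes) subsets:

```python
parent = [6, 8]  # six dogs and eight cats
tests = {
    'plays fetch?':              [[2, 7], [4, 1]],
    'is grumpy?':                [[4, 2], [2, 6]],
    'favorite food = cat food':  [[6, 2], [0, 6]],
    'favorite food = dog food':  [[3, 8], [3, 0]],
    'favorite food = bacon':     [[3, 6], [3, 2]],
}
for name, subsets in tests.items():
    print('%s: %.4f' % (name, information_gain(parent, subsets)))
```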

Now let's add another node to the tree. One of the child nodes produced by the test is a leaf node that contains only cats. The other node still contains two cats and six dogs. We will add a test to this node. Which of the remaining explanatory variables reduces our uncertainty the most? The following table contains the information gains for all of the possible tests:

| Test | Parent's entropy | Left child's entropy | Right child's entropy | Weighted average | IG |
|------|------------------|----------------------|-----------------------|------------------|----|
| plays fetch? | 0.8113 | 1 | 0 | 1.0 * 4/8 + 0 * 4/8 = 0.5 | 0.3113 |
| is grumpy? | 0.8113 | 0 | 1 | 0.0 * 4/8 + 1 * 4/8 = 0.5 | 0.3113 |
| favorite food = dog food | 0.8113 | 0.9710 | 0 | 0.9710 * 5/8 + 0.0 * 3/8 = 0.6069 | 0.2044 |
| favorite food = bacon | 0.8113 | 0 | 0.9710 | 0.0 * 3/8 + 0.9710 * 5/8 = 0.6069 | 0.2044 |

Each of the tests produces one subset with 0 bits of entropy, but the is grumpy and plays fetch tests produce the greatest information gain. ID3 breaks ties by selecting one of the best tests arbitrarily. We will select the is grumpy test, which splits its parent's eight instances into a leaf node containing four dogs and a node containing two cats and two dogs. The following is a diagram of the current tree:

[Diagram: the partially constructed decision tree after the cat food and is grumpy tests.]

We will now select another explanatory variable to test the child node's four instances. The remaining tests, favorite food = bacon, favorite food = dog food, and plays fetch, all produce a leaf node containing a single dog and a node containing the remaining two cats and one dog. The remaining tests produce equal information gains, as shown in the following table:

| Test | Parent's entropy | Left child's entropy | Right child's entropy | Weighted average | IG |
|------|------------------|----------------------|-----------------------|------------------|----|
| plays fetch? | 1 | 0.9183 | 0 | 0.688725 | 0.311275 |
| favorite food = dog food | 1 | 0.9183 | 0 | 0.688725 | 0.311275 |
| favorite food = bacon | 1 | 0 | 0.9183 | 0.688725 | 0.311275 |

We will arbitrarily select the plays fetch test to produce a leaf node containing one dog and a node containing two cats and a dog. Two explanatory variables remain; we can test for animals that like bacon, or we can test for animals that like dog food. Both of the tests will produce the same subsets and create a leaf node containing one dog and a leaf node containing two cats. We will arbitrarily choose to test for animals that like dog food. The following is a diagram of the completed decision tree:

[Diagram: the completed decision tree.]
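The procedure we just followed by hand can be automated. The following is a minimal recursive sketch of ID3 (our own simplified implementation, reusing the entropy and information_gain helpers from earlier); it represents the tree as nested dictionaries and ignores practical concerns such as unseen feature values:

```python
from collections import Counter

def id3(instances, labels, features):
    """Build a decision tree as nested dicts: {feature: {value: subtree_or_label}}."""
    counts = Counter(labels)
    # A pure node, or a node with no features left to test, becomes a leaf
    # labeled with its majority class.
    if len(counts) == 1 or not features:
        return counts.most_common(1)[0][0]

    def gain(feature):
        # Group the labels by the value of the candidate feature and compute
        # the information gain of splitting on it.
        subsets = {}
        for instance, label in zip(instances, labels):
            subsets.setdefault(instance[feature], []).append(label)
        return information_gain(list(counts.values()),
                                [list(Counter(s).values()) for s in subsets.values()])

    best = max(features, key=gain)  # ties are broken arbitrarily, as in ID3
    remaining = [f for f in features if f != best]

    tree = {best: {}}
    for value in {instance[best] for instance in instances}:
        subset = [(i, l) for i, l in zip(instances, labels) if i[best] == value]
        tree[best][value] = id3([i for i, _ in subset], [l for _, l in subset], remaining)
    return tree
```

Called with the fourteen training instances represented as dictionaries of binary features, this sketch reproduces the tree constructed above, up to the arbitrary tie-breaking.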

Let's classify some animals from the following test data:

| Testing instance | Plays fetch | Is grumpy | Favorite food | Species |
|------------------|-------------|-----------|---------------|---------|
| 1 | Yes | No  | Bacon    | Dog |
| 2 | Yes | Yes | Dog food | Dog |
| 3 | No  | Yes | Dog food | Cat |
| 4 | No  | Yes | Bacon    | Cat |
| 5 | No  | No  | Cat food | Cat |

Let's classify the first animal, which likes to play fetch, is infrequently grumpy, and loves bacon. We will follow the edge to the root node's left child since the animal's favorite food is not cat food. The animal is not grumpy, so we will follow the edge to the second-level node's left child. This is a leaf node containing only dogs; we have correctly classified this instance. To classify the third test instance, a cat whose favorite food is dog food, we follow the edge to the root node's left child, follow the edge to the second-level node's right child, follow the edge to the third-level node's left child, and finally follow the edge to the fourth-level node's right child. This leaf node contains a single dog, so the tree misclassifies this cat as a dog.

Congratulations! You've constructed a decision tree using the ID3 algorithm. Other algorithms can be used to train decision trees. C4.5 is a modified version of ID3 that can be used with continuous explanatory variables and can accommodate missing values for features. C4.5 also can prune trees. Pruning reduces the size of a tree by replacing branches that classify few instances with leaf nodes. Used by scikit-learn's implementation of decision trees, CART is another learning algorithm that supports pruning.
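As a sketch of how this looks in practice (scikit-learn's DecisionTreeClassifier implements CART rather than ID3, but setting criterion='entropy' makes it select splits by information gain), we can fit the training data encoded with the five binary features described earlier and predict the species of the test animals:

```python
from sklearn.tree import DecisionTreeClassifier

# Columns: plays fetch, is grumpy, food=bacon, food=cat food, food=dog food
X_train = [
    [1, 0, 1, 0, 0], [0, 1, 0, 0, 1], [0, 1, 0, 1, 0], [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0], [0, 1, 1, 0, 0], [0, 1, 0, 1, 0], [0, 0, 0, 0, 1],
    [0, 1, 0, 1, 0], [1, 0, 0, 0, 1], [1, 0, 1, 0, 0], [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0], [1, 1, 1, 0, 0],
]
y_train = ['dog', 'dog', 'cat', 'cat', 'cat', 'cat', 'cat',
           'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'dog']

X_test = [
    [1, 0, 1, 0, 0], [1, 1, 0, 0, 1], [0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0], [0, 0, 0, 1, 0],
]
y_test = ['dog', 'dog', 'cat', 'cat', 'cat']

clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X_train, y_train)
print(clf.predict(X_test))
print(clf.score(X_test, y_test))
```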

Gini impurity

In the previous section, we built a decision tree by creating nodes that produced the greatest information gain. Another common heuristic for learning decision trees is Gini impurity, which measures the proportions of classes in a set. Gini impurity is given by the following equation, where J is the number of classes, t is the subset of instances for the node, and P(i|t) is the probability of selecting an element of class i from the node's subset:

Gini(t) = 1 - \sum_{i=1}^{J} P(i \mid t)^2

Intuitively, Gini impurity is zero when all of the elements of the set are the same class, as the probability of selecting an element of that class is equal to one. Like entropy, Gini impurity is greatest when each class has an equal probability of being selected. The maximum value of Gini impurity depends on the number of possible classes, and it is given by the following equation:

Gini_{\max} = 1 - \frac{1}{J}
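A small sketch of this measure (the function name is ours):

```python
def gini_impurity(class_counts):
    """Gini impurity of a node whose classes have the given counts."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

print(gini_impurity([6, 8]))  # our mixed training set: ~0.4898
print(gini_impurity([7, 7]))  # two equally likely classes: the maximum, 0.5
print(gini_impurity([6, 0]))  # a pure node: 0.0
```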

Our problem has two classes, so the maximum value of the Gini impurity measure will be equal to one half. scikit-learn supports learning decision trees using both information gain and Gini impurity. There are no firm rules to help you decide when to use one criterion or the other; in practice, they often produce similar results. As with many decisions in machine learning, it is best to compare the performances of models trained using both options.
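As a sketch of such a comparison on our toy data (reusing the arrays defined in the earlier scikit-learn example; with only five test instances this is illustrative rather than a meaningful benchmark):

```python
# X_train, y_train, X_test, and y_test are the toy arrays from the earlier sketch.
from sklearn.tree import DecisionTreeClassifier

for criterion in ('gini', 'entropy'):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    clf.fit(X_train, y_train)
    print(criterion, clf.score(X_test, y_test))
```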
