Let's create a decision tree using an algorithm called Iterative Dichotomiser 3 (ID3). Invented by Ross Quinlan, ID3 was one of the first algorithms used to train decision trees. Assume that you have to classify animals as cats or dogs. Unfortunately, you cannot observe the animals directly and must use only a few attributes of the animals to make your decision. For each animal, you are told whether or not it likes to play fetch, whether or not it is frequently grumpy, and its favorite of three types of food.
To classify new animals, the decision tree will examine an explanatory variable at each node. The edge it follows to the next node will depend on the outcome of the test. For example, the first node might ask whether or not the animal likes to play fetch. If the animal does, we will follow the edge to the left child node; if not, we will follow the edge to the right child node. Eventually an edge will connect to a leaf node that indicates whether the animal is a cat or a dog.
The following fourteen instances comprise our training data:
| Training instance | Plays fetch | Is grumpy | Favorite food | Species |
|---|---|---|---|---|
| 1 | Yes | No | Bacon | Dog |
| 2 | No | Yes | Dog food | Dog |
| 3 | No | Yes | Cat food | Cat |
| 4 | No | Yes | Bacon | Cat |
| 5 | No | No | Cat food | Cat |
| 6 | No | Yes | Bacon | Cat |
| 7 | No | Yes | Cat food | Cat |
| 8 | No | No | Dog food | Dog |
| 9 | No | Yes | Cat food | Cat |
| 10 | Yes | No | Dog food | Dog |
| 11 | Yes | No | Bacon | Dog |
| 12 | No | No | Cat food | Cat |
| 13 | Yes | Yes | Cat food | Cat |
| 14 | Yes | Yes | Bacon | Dog |
From this data we can see that cats are generally grumpier than dogs. Most dogs play fetch and most cats refuse. Dogs prefer dog food and bacon, whereas cats like only cat food and bacon. The is grumpy and plays fetch explanatory variables can easily be converted to binary-valued features. The favorite food explanatory variable is a categorical variable with three possible values; we will one-hot encode it. Recall from Chapter 3, Feature Extraction and Preprocessing, that one-hot encoding represents a categorical variable with as many binary-valued features as there are values of the variable. Representing the categorical variable with a single integer-valued feature would encode an artificial order among its values. Since favorite food has three possible states, we will represent it with three binary-valued features. From this table, we can manually construct classification rules. For example, an animal that is grumpy and likes cat food must be a cat, while an animal that plays fetch and likes bacon must be a dog. Constructing these classification rules by hand is cumbersome even for a small data set. Instead, we will learn these rules by creating a decision tree.
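To make the preprocessing concrete, the following is a minimal sketch of this encoding in plain Python; the function name and feature order are our own choices for illustration, not from any particular library:

```python
# One-hot encode the favorite food variable: one binary feature per category.
FOODS = ['Bacon', 'Cat food', 'Dog food']

def encode(plays_fetch, is_grumpy, favorite_food):
    """Convert one animal's attributes to a list of five binary features."""
    features = [1 if plays_fetch else 0, 1 if is_grumpy else 0]
    features += [1 if favorite_food == food else 0 for food in FOODS]
    return features

# Training instance 1: plays fetch, is not grumpy, and prefers bacon.
print(encode(True, False, 'Bacon'))  # [1, 0, 1, 0, 0]
```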
Like Twenty Questions, the decision tree will estimate the value of the response variable by testing the values of a sequence of explanatory variables. Which explanatory variable should be tested first? Intuitively, a test that produces subsets that contain all cats or all dogs is better than a test that produces subsets that still contain both cats and dogs. If the members of a subset are of different classes, we are still uncertain about how to classify the instance. We should also avoid creating tests that separate only a single cat or dog from the others; such tests are analogous to asking overly specific questions in the first few rounds of Twenty Questions. More formally, these tests can classify only a few instances and may do little to reduce our uncertainty. The tests that reduce our uncertainty about the classification the most are the best. We can quantify the amount of uncertainty using a measure called entropy.
Measured in bits, entropy quantifies the amount of uncertainty in a variable. Entropy is given by the following equation, where n is the number of outcomes, P(x_i) is the probability of the outcome x_i, and b is the base of the logarithm. Common values for b are 2, e, and 10. Because the log of a number less than one is negative, the entire sum is negated to return a positive value.

H(X) = -Σ_{i=1}^{n} P(x_i) × log_b(P(x_i))
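This definition is straightforward to express in code. The following helper is our own sketch, not a library function:

```python
from math import log2

def entropy(probabilities):
    """Entropy in bits: -sum of p * log2(p) over the outcome probabilities.

    Zero-probability outcomes are skipped, since log2(0) is undefined.
    """
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit for a fair coin toss
print(entropy([0.8, 0.2]))  # approximately 0.7219 bits for an unfair coin
```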
For example, a single toss of a fair coin has only two outcomes: heads and tails. The probability that the coin will land on heads is 0.5, and the probability that it will land on tails is 0.5. The entropy of the coin toss is equal to the following:

H(X) = -(0.5 × log2(0.5) + 0.5 × log2(0.5)) = 1
That is, only one bit is required to represent the two equally probable outcomes, heads and tails. Two tosses of a fair coin can result in four possible outcomes: heads and heads, heads and tails, tails and heads, and tails and tails. The probability of each outcome is 0.5 × 0.5 = 0.25. The entropy of two tosses is equal to the following:

H(X) = -(4 × 0.25 × log2(0.25)) = 2
If the coin has the same face on both sides, the variable representing its outcome has 0 bits of entropy; that is, we are always certain of the outcome, and the variable will never represent new information. Entropy can also be a fraction of a bit. For example, an unfair coin has two different faces, but is weighted such that the faces are not equally likely to land in a toss. Assume that the probability that an unfair coin will land on heads is 0.8, and the probability that it will land on tails is 0.2. The entropy of a single toss of this coin is equal to the following:

H(X) = -(0.8 × log2(0.8) + 0.2 × log2(0.2)) ≈ 0.7219
The outcome of a single toss of an unfair coin can have a fraction of one bit of entropy. There are two possible outcomes of the toss, but we are not totally uncertain since one outcome is more frequent.
Let's calculate the entropy of classifying an unknown animal. If an equal number of dogs and cats comprise our animal classification training data and we do not know anything else about the animal, the entropy of the decision is equal to one. All we know is that the animal could be either a cat or a dog; like the fair coin toss, both outcomes are equally likely. Our training data, however, contains six dogs and eight cats. If we do not know anything else about the unknown animal, the entropy of the decision is given by the following:

H(X) = -(6/14 × log2(6/14) + 8/14 × log2(8/14)) ≈ 0.9852
Since cats are more common, we are less uncertain about the outcome. Now let's find the explanatory variable that will be most helpful in classifying the animal; that is, let's find the explanatory variable that reduces the entropy the most. We can test the plays fetch explanatory variable and divide the training instances into animals that play fetch and animals that don't. This produces the following two subsets:
Decision trees are often visualized as diagrams that are similar to flowcharts. The top box of the previous diagram is the root node; it contains all of our training instances and specifies the explanatory variable that will be tested. At the root node we have not eliminated any instances from the training set, and the entropy is equal to approximately 0.985. The root node tests the plays fetch explanatory variable. Recall that we converted this Boolean explanatory variable to a binary-valued feature. Training instances for which plays fetch is equal to zero follow the edge to the root's left child, and training instances for animals that do play fetch follow the edge to the root's right child node. The left child node contains a subset of the training data with seven cats and two dogs that do not like to play fetch. The entropy at this node is given by the following:

H(X) = -(7/9 × log2(7/9) + 2/9 × log2(2/9)) ≈ 0.7642
The right child contains a subset with one cat and four dogs that do like to play fetch. The entropy at this node is given by the following:

H(X) = -(1/5 × log2(1/5) + 4/5 × log2(4/5)) ≈ 0.7219
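We can verify both subset entropies with a few lines of Python; the helper function here is our own:

```python
from math import log2

def entropy(probabilities):
    # H = -sum of p * log2(p), skipping zero-probability outcomes
    return -sum(p * log2(p) for p in probabilities if p > 0)

# Left child: seven cats and two dogs that do not play fetch.
left = entropy([7 / 9, 2 / 9])
# Right child: one cat and four dogs that do play fetch.
right = entropy([1 / 5, 4 / 5])
print(round(left, 4), round(right, 4))  # 0.7642 0.7219
```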
Instead of testing the plays fetch explanatory variable, we could test the is grumpy explanatory variable. This test produces the following tree. As with the previous tree, instances that fail the test follow the left edge, and instances that pass the test follow the right edge.
We could also divide the instances into animals that prefer cat food and animals that don't to produce the following tree:
Testing for the animals that prefer cat food resulted in one subset with six cats, zero dogs, and 0 bits of entropy, and another subset with two cats, six dogs, and 0.811 bits of entropy. How can we measure which of these tests reduced our uncertainty about the classification the most? Averaging the entropies of the subsets may seem to be an appropriate measure of the reduction in entropy. In this example, the subsets produced by the cat food test have the lowest average entropy. Intuitively, this test seems to be effective, as we can use it to classify almost half of the training instances. However, selecting the test that produces the subsets with the lowest average entropy can produce a suboptimal tree. For example, imagine a test that produced one subset with two dogs and no cats and another subset with four dogs and eight cats. The entropy of the first subset is equal to the following (note that the term for the cats is omitted because 0 × log2(0) is undefined):

H(X) = -(2/2 × log2(2/2)) = 0
The entropy of the second subset is equal to the following:

H(X) = -(4/12 × log2(4/12) + 8/12 × log2(8/12)) ≈ 0.9183
The average of these subsets' entropies is only 0.459, but the subset containing most of the instances has almost one bit of entropy. This is analogous to asking specific questions early in Twenty Questions; we could get lucky and win within the first few attempts, but it is more likely that we will squander our questions without eliminating many possibilities. Instead, we will measure the reduction in entropy using a metric called information gain. Calculated with the following equation, information gain is the difference between the entropy of the parent node, H(T), and the weighted average of the children nodes' entropies. Here T is the set of instances and a is the explanatory variable under test. x_a is the value of attribute a for instance x, |{x ∈ T : x_a = v}| is the number of instances for which attribute a is equal to the value v, and H({x ∈ T : x_a = v}) is the entropy of the subset of instances for which the value of the explanatory variable a is v.

IG(T, a) = H(T) - Σ_{v ∈ vals(a)} (|{x ∈ T : x_a = v}| / |T|) × H({x ∈ T : x_a = v})
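The following sketch puts this formula into code; the helper functions are our own, and the training data is transcribed from the table above:

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    return -sum(
        (labels.count(c) / len(labels)) * log2(labels.count(c) / len(labels))
        for c in set(labels)
    )

def information_gain(labels, feature_values):
    """H(parent) minus the weighted average entropy of the children.

    feature_values[i] is the value of the explanatory variable under
    test for the instance whose class is labels[i].
    """
    weighted = 0.0
    for v in set(feature_values):
        subset = [l for l, f in zip(labels, feature_values) if f == v]
        weighted += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - weighted

# The fourteen training species, and whether each animal prefers cat food.
species = ['Dog', 'Dog', 'Cat', 'Cat', 'Cat', 'Cat', 'Cat',
           'Dog', 'Cat', 'Dog', 'Dog', 'Cat', 'Cat', 'Dog']
cat_food = [False, False, True, False, True, False, True,
            False, True, False, False, True, True, False]
print(round(information_gain(species, cat_food), 4))  # 0.5216
```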
The following table contains the information gains for all of the tests. In this case, the cat food test is still the best, as it produces the greatest information gain.
| Test | Parent's entropy | Child's entropy | Child's entropy | Weighted average | IG |
|---|---|---|---|---|---|
| plays fetch? | 0.9852 | 0.7642 | 0.7219 | 0.7642 × 9/14 + 0.7219 × 5/14 = 0.7491 | 0.2361 |
| is grumpy? | 0.9852 | 0.9183 | 0.8113 | 0.9183 × 6/14 + 0.8113 × 8/14 = 0.8572 | 0.1280 |
| favorite food = cat food | 0.9852 | 0.8113 | 0 | 0.8113 × 8/14 + 0.0 × 6/14 = 0.4636 | 0.5216 |
| favorite food = dog food | 0.9852 | 0.8454 | 0 | 0.8454 × 11/14 + 0.0 × 3/14 = 0.6642 | 0.3210 |
| favorite food = bacon | 0.9852 | 0.9183 | 0.9710 | 0.9183 × 9/14 + 0.9710 × 5/14 = 0.9371 | 0.0481 |
Now let's add another node to the tree. One of the child nodes produced by the cat food test is a leaf node that contains only cats. The other node still contains two cats and six dogs. We will add a test to this node. Which of the remaining explanatory variables reduces our uncertainty the most? The following table contains the information gains for all of the possible tests:
| Test | Parent's entropy | Child's entropy | Child's entropy | Weighted average | IG |
|---|---|---|---|---|---|
| plays fetch? | 0.8113 | 1 | 0 | 1.0 × 4/8 + 0.0 × 4/8 = 0.5 | 0.3113 |
| is grumpy? | 0.8113 | 0 | 1 | 0.0 × 4/8 + 1.0 × 4/8 = 0.5 | 0.3113 |
| favorite food = dog food | 0.8113 | 0.9710 | 0 | 0.9710 × 5/8 + 0.0 × 3/8 = 0.6069 | 0.2044 |
| favorite food = bacon | 0.8113 | 0 | 0.9710 | 0.0 × 3/8 + 0.9710 × 5/8 = 0.6069 | 0.2044 |
Each of these tests produces one subset with 0 bits of entropy, but the is grumpy and plays fetch tests produce the greatest information gain. ID3 breaks ties by selecting one of the best tests arbitrarily. We will select the is grumpy test, which splits its parent's eight instances into a leaf node containing four dogs and a node containing two cats and two dogs. The following is a diagram of the current tree:
We will now select another explanatory variable to test the child node's four instances. The remaining tests, favorite food = bacon, favorite food = dog food, and plays fetch, all produce a leaf node containing one dog or one cat and a node containing the remaining animals. That is, the remaining tests produce equal information gains, as shown in the following table:
| Test | Parent's entropy | Child's entropy | Child's entropy | Weighted average | IG |
|---|---|---|---|---|---|
| plays fetch? | 1 | 0.9183 | 0 | 0.688725 | 0.311275 |
| favorite food = dog food | 1 | 0.9183 | 0 | 0.688725 | 0.311275 |
| favorite food = bacon | 1 | 0 | 0.9183 | 0.688725 | 0.311275 |
We will arbitrarily select the plays fetch test to produce a leaf node containing one dog and a node containing two cats and a dog. Two explanatory variables remain; we can test for animals that like bacon, or we can test for animals that like dog food. Both tests will produce the same subsets, creating a leaf node containing one dog and a leaf node containing two cats. We will arbitrarily choose to test for animals that like dog food. The following is a diagram of the completed decision tree:
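The whole construction can be sketched as a compact recursive ID3 implementation. This is an illustrative sketch rather than scikit-learn's implementation; the names and data structures are our own, and because this sketch breaks ties by feature order rather than arbitrarily, the interior of its learned tree may differ from the diagram above even though it fits the same data:

```python
from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    return -sum((c / len(labels)) * log2(c / len(labels)) for c in counts.values())

def id3(instances, labels, features):
    """Recursively build a decision tree of binary tests as nested dicts."""
    if len(set(labels)) == 1:           # pure node: return a leaf
        return labels[0]
    if not features:                    # no tests left: majority vote
        return Counter(labels).most_common(1)[0][0]

    def weighted_child_entropy(f):
        # Weighted average entropy of the subsets produced by testing f;
        # minimizing this is equivalent to maximizing information gain.
        total = 0.0
        for value in (0, 1):
            subset = [l for x, l in zip(instances, labels) if x[f] == value]
            if subset:
                total += (len(subset) / len(labels)) * entropy(subset)
        return total

    best = min(features, key=weighted_child_entropy)
    node = {'feature': best}
    for value in (0, 1):
        xs = [x for x in instances if x[best] == value]
        ys = [l for x, l in zip(instances, labels) if x[best] == value]
        rest = [f for f in features if f != best]
        node[value] = id3(xs, ys, rest) if ys else Counter(labels).most_common(1)[0][0]
    return node

def classify(tree, instance):
    while isinstance(tree, dict):
        tree = tree[instance[tree['feature']]]
    return tree

# Features: plays fetch, is grumpy, and the one-hot encoded favorite food.
FEATURES = ['fetch', 'grumpy', 'cat food', 'dog food', 'bacon']
data = [  # the fourteen training instances from the table above
    ([1, 0, 0, 0, 1], 'Dog'), ([0, 1, 0, 1, 0], 'Dog'), ([0, 1, 1, 0, 0], 'Cat'),
    ([0, 1, 0, 0, 1], 'Cat'), ([0, 0, 1, 0, 0], 'Cat'), ([0, 1, 0, 0, 1], 'Cat'),
    ([0, 1, 1, 0, 0], 'Cat'), ([0, 0, 0, 1, 0], 'Dog'), ([0, 1, 1, 0, 0], 'Cat'),
    ([1, 0, 0, 1, 0], 'Dog'), ([1, 0, 0, 0, 1], 'Dog'), ([0, 0, 1, 0, 0], 'Cat'),
    ([1, 1, 1, 0, 0], 'Cat'), ([1, 1, 0, 0, 1], 'Dog'),
]
instances = [dict(zip(FEATURES, x)) for x, _ in data]
labels = [y for _, y in data]
tree = id3(instances, labels, FEATURES)
print(tree['feature'])  # the cat food test is chosen at the root, as above
```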
Let's classify some animals from the following test data:
| Testing instance | Plays fetch | Is grumpy | Favorite food | Species |
|---|---|---|---|---|
| 1 | Yes | No | Bacon | Dog |
| 2 | Yes | Yes | Dog food | Dog |
| 3 | No | Yes | Dog food | Cat |
| 4 | No | Yes | Bacon | Cat |
| 5 | No | No | Cat food | Cat |
Let's classify the first animal, which likes to play fetch, is infrequently grumpy, and loves bacon. We will follow the edge to the root node's left child since the animal's favorite food is not cat food. The animal is not grumpy, so we will follow the edge to the second-level node's left child. This is a leaf node containing only dogs; we have correctly classified this instance. To classify the third test instance as a cat, we follow the edge to the root node's left child, follow the edge to the second-level node's right child, follow the edge to the third-level node's left child, and finally follow the edge to the fourth-level node's right child.
Congratulations! You've constructed a decision tree using the ID3 algorithm. Other algorithms can be used to train decision trees. C4.5 is a modified version of ID3 that can be used with continuous explanatory variables and can accommodate missing feature values. C4.5 can also prune trees; pruning reduces the size of a tree by replacing branches that classify few instances with leaf nodes. CART, the learning algorithm used by scikit-learn's implementation of decision trees, also supports pruning.
In the previous section, we built a decision tree by creating nodes that produced the greatest information gain. Another common heuristic for learning decision trees is Gini impurity, which measures the proportions of classes in a set. Gini impurity is given by the following equation, where J is the number of classes, t is the subset of instances for the node, and P(i|t) is the probability of selecting an element of class i from the node's subset:

Gini(t) = 1 - Σ_{i=1}^{J} P(i|t)²
Intuitively, Gini impurity is zero when all of the elements of the set are the same class, as the probability of selecting an element of that class is equal to one. Like entropy, Gini impurity is greatest when each class has an equal probability of being selected. The maximum value of Gini impurity depends on the number of possible classes, n, and is given by the following equation:

Gini_max = 1 - 1/n
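A short sketch (our own helper, not a library function) illustrates both properties:

```python
def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini_impurity(['Cat'] * 6))               # 0.0 for a pure node
print(gini_impurity(['Cat', 'Dog']))            # 0.5, the two-class maximum
print(gini_impurity(['Cat', 'Dog', 'Ferret']))  # 1 - 1/3, the three-class maximum
```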
Our problem has two classes, so the maximum value of the Gini impurity measure will be equal to one half. scikit-learn supports learning decision trees using both information gain and Gini impurity. There are no firm rules to help you decide when to use one criterion or the other; in practice, they often produce similar results. As with many decisions in machine learning, it is best to compare the performances of models trained using both options.