How it works...

Let's draw a decision tree for the tennis example we created in this recipe:

This model has a depth of three levels. Which attribute to select at each split depends upon how much we can maximize information gain, which we compute by measuring the purity of the split. Purity refers to how certain we are about the outcome: a split is pure when the resulting subset can confidently be labeled positive or negative. In this example, purity increases as either the chances of playing or the chances of not playing increase.

Purity is measured using entropy. Entropy is a measure of disorder in a system. In this context, it is easier to understand it as a measure of uncertainty:
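For a two-class problem, the standard formula expresses entropy in terms of the proportion of positive examples, p+, and the proportion of negative examples, p-:

    E(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-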

An entropy of 0 represents the highest level of purity, and an entropy of 1 the lowest. Let's try to determine purity, in this case, using this formula.

When rain is yes, the probability of playing tennis is p+, which is 0/3 = 0. The probability of not playing tennis is p-, which is 3/3 = 1:
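Substituting these values into the entropy formula (with the usual convention that 0 log2 0 = 0):

    E = -(0)\log_2(0) - (1)\log_2(1) = 0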

This is a pure set.

When rain is no, the probability of playing tennis is p+, which is 2/5 = 0.4. The probability of not playing tennis is p-, which is 3/5 = 0.6:
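Substituting again:

    E = -(0.4)\log_2(0.4) - (0.6)\log_2(0.6) \approx 0.529 + 0.442 = 0.971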

This set is almost completely impure; the most impure case would be when both probabilities are 0.5.

Spark uses three measures to determine impurity; the chosen measure is passed in when training the model, as shown in the sketch after this list:

  • Gini impurity (classification)
  • Entropy (classification)
  • Variance (regression)
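In the MLlib decision tree API, the impurity measure is passed as a string when training the model. The following is a minimal sketch under the assumption that trainingData is the RDD[LabeledPoint] prepared earlier in this recipe and that the features are treated as continuous; the parameter values shown are illustrative:

    import org.apache.spark.mllib.tree.DecisionTree

    // "gini" or "entropy" for classification; trainRegressor accepts "variance".
    val entropyModel = DecisionTree.trainClassifier(
      trainingData,      // assumed RDD[LabeledPoint] from this recipe
      2,                 // numClasses: play (1.0) or do not play (0.0)
      Map[Int, Int](),   // categoricalFeaturesInfo: empty => continuous features
      "entropy",         // impurity measure
      3,                 // maxDepth
      32                 // maxBins
    )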

Information gain is the difference between the parent node impurity and the weighted sum of two child node impurities. Let's look at the first split, which partitions data of size 8 into two datasets of size 3 (left) and 5 (right). Let's call the first split s1, the parent node rain, the left child no rain, and the right child wind. So the information gain would be:
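Taking entropy as the impurity measure and weighting each child by its share of the records, the gain for this split is:

    Gain(s_1) = E(\text{parent}) - \tfrac{3}{8}E(\text{left child}) - \tfrac{5}{8}E(\text{right child})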

As we have already calculated the impurity for no rain and wind for the entropy, let's calculate the entropy for rain:
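The parent node holds all 8 records, of which 2 are positive (play) and 6 are negative (do not play):

    E(\text{rain}) = -\tfrac{2}{8}\log_2\tfrac{2}{8} - \tfrac{6}{8}\log_2\tfrac{6}{8} \approx 0.5 + 0.311 = 0.811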

Let's calculate the information gain now:
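Plugging in the child entropies computed previously:

    Gain(s_1) = 0.811 - \tfrac{3}{8}(0) - \tfrac{5}{8}(0.971) \approx 0.811 - 0.607 \approx 0.2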

So the information gain is 0.2 in the first split. Is this the best we can achieve? Let's see what our algorithm comes up with. First, let's find out the depth of the tree:

scala> model.depth
Int = 2

Here, the depth is 2, compared to 3 for the tree we built intuitively, so this model seems to be better optimized. Let's look at the structure of the tree:

scala> model.toDebugString
String = "DecisionTreeModel classifier of depth 2 with 5 nodes
  If (feature 1 <= 0.0)
   If (feature 2 <= 0.0)
    Predict: 0.0
   Else (feature 2 > 0.0)
    Predict: 1.0
  Else (feature 1 > 0.0)
   Predict: 0.0
"
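As a quick check against the printed tree, you can run a prediction for a hypothetical observation. The feature values below are purely illustrative and assume the same 0.0/1.0 encoding and feature order used when the labeled points were built; with feature 1 at 0.0 and feature 2 at 1.0, the printed tree predicts 1.0:

scala> import org.apache.spark.mllib.linalg.Vectors
scala> model.predict(Vectors.dense(0.0, 0.0, 1.0))
Double = 1.0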

Let's build it visually to get a better understanding:

We will not go into detail here, as we have already done this for the previous tree; we will straightaway calculate the information gain:

As you can see, in this case, the information gain is 0.44, which is more than double that of the first model.

If you look at the second-level nodes, the impurity is zero, which is great as we reached it at a depth of only 2. Now imagine a situation in which the depth is 50: the decision tree would work well on the training data but would do badly on test data. This situation is called overfitting.

One solution to avoid overfitting is pruning. You divide your training data into two sets: a training set and a validation set. You train the model using the training set, and then test the model against the validation set while slowly removing leaf nodes. If removing a leaf node (which is mostly a singleton, that is, it contains only one data point) improves the performance of the model on the validation set, that leaf node is pruned from the model.
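MLlib's DecisionTree does not expose post-pruning directly, so a common practical stand-in is to hold out a validation set and check whether a simpler (shallower) candidate performs at least as well on it. This is a minimal sketch, assuming data is the RDD[LabeledPoint] from this recipe; the names and split ratio are illustrative:

    import org.apache.spark.mllib.tree.DecisionTree

    // Hold out roughly 30% of the labeled data for validation.
    val Array(trainSet, validationSet) = data.randomSplit(Array(0.7, 0.3))

    // Train a candidate model on the training portion only.
    val candidate = DecisionTree.trainClassifier(
      trainSet, 2, Map[Int, Int](), "entropy", 2, 32)

    // Fraction of validation points the candidate classifies correctly.
    val correct = validationSet.filter { point =>
      candidate.predict(point.features) == point.label
    }.count()
    val accuracy = correct.toDouble / validationSet.count()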
