Decision tree

Classification trees are used to separate the data into classes belonging to the response variable. The response variable usually has two classes, such as yes or no (1 or 0) or sunny or rain. If the target variable has more than two categories, then C4.5 can be applied. C4.5 improves on the ID3 algorithm in its handling of continuous attributes, discrete attributes, and the post-construction pruning process.

Like most learning algorithms, the classification tree algorithm analyzes a training set and builds a classifier from it, so that it can correctly classify both the training data and new, unseen data in the future. A test example is an input object for which the algorithm must predict an output value. Classification trees are used when the response or target variable is categorical in nature.

Regression trees, in contrast, are needed when the response variable is continuous rather than discrete, for example, when predicting the price of a product. A regression tree is built through binary partitioning: an iterative process that splits the data into partitions or branches and then continues splitting each partition into smaller groups as the method moves down each branch. In other words, regression trees are applicable when the problem involves prediction as opposed to classification. For more details on this, we recommend you refer to books on classification and regression trees.
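As a minimal, hypothetical sketch of such binary partitioning for a continuous response, scikit-learn's DecisionTreeRegressor can be used; the data values below are made up purely for illustration:

from sklearn import tree

# Hypothetical training data: each row is [product_age_years, demand_index],
# and the target is a continuous price -- all values are made up.
X = [[1, 80], [2, 65], [3, 70], [5, 40], [7, 30], [9, 20]]
y = [120.0, 100.0, 105.0, 70.0, 55.0, 40.0]

reg = tree.DecisionTreeRegressor(max_depth=2)
reg = reg.fit(X, y)

# Predict the price of a new product; the tree returns the mean of the
# training targets in the partition (leaf) that the new point falls into.
print(reg.predict([[4, 50]]))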

When the relationship between the predictors and the response is linear, a standard regression tree is more appropriate, and when that relationship is nonlinear, C4.5 should be used. To summarize, when the response variable has only two categories, the classification tree algorithm should be used.

An example

For a decision about whether to play tennis or golf, one can easily work through the decision process by asking a question, such as: is it raining out there, or is it sunny? Then draw the decision diagram branching out at every question based on the answers. The playing nature of the two games is almost the same, and in any sporting event, if it is windy and raining, chances are that there is not going to be a game.

For tennis, if the outlook is sunny but the humidity is high, then it is recommended not to play. Similarly, if it is raining and windy, the playing conditions will be poor, so chances are that it is no fun playing tennis under these conditions either. The following diagram shows all the possible conditions:

(Figure: decision diagram of the playing conditions for the example)

We can also bring in additional attributes, such as temperature: for what range of temperatures does it not make sense to play tennis? Probably when the temperature is greater than 70 degrees Fahrenheit, that is, when it is hot. We can write the rules combining all of these as follows:

If (Outlook = Sunny) and (Humidity = High) then play=No
If (Outlook = Rain) and (Wind = Strong) then play=No
If (Outlook = Sunny) and (Humidity = Normal) or
  (Outlook = Overcast) or (Outlook=Rain and Wind=Weak) then play=Yes
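Purely to make these rules concrete, they can be written as a plain Python function; this is a sketch of the rules themselves, not of the tree-building algorithm:

def play(outlook, humidity, wind):
    """Return 'Yes' or 'No' for the play decision, following the rules above."""
    if outlook == 'Sunny':
        return 'No' if humidity == 'High' else 'Yes'
    if outlook == 'Overcast':
        return 'Yes'
    if outlook == 'Rain':
        return 'No' if wind == 'Strong' else 'Yes'
    return 'Yes'

print(play('Sunny', 'High', 'Weak'))   # No
print(play('Rain', 'Normal', 'Weak'))  # Yes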

We can run the algorithm on the following training set to select the best attribute to split on at each step:

Outlook    Temperature  Humidity  Wind    Play?
Sunny      Hot          High      Weak    No
Sunny      Hot          High      Strong  No
Overcast   Hot          High      Weak    Yes
Overcast   Cool         Normal    Strong  Yes
Sunny      Mild         High      Weak    No
Sunny      Cool         Normal    Weak    Yes
Rain       Mild         High      Weak    Yes
Rain       Cool         Normal    Weak    Yes
Rain       Cool         Normal    Strong  No
Rain       Mild         Normal    Weak    Yes
Sunny      Mild         Normal    Strong  Yes
Overcast   Mild         High      Strong  Yes
Overcast   Hot          Normal    Weak    Yes
Rain       Mild         High      Strong  No

Top-down induction of decision trees (ID3) is a method that follows these steps (a short sketch of the information-gain computation behind step 1 follows the list):

  • Iterate over the leaf nodes until the stopping condition is met:
    1. Identify the best decision attribute for the next node in the traversal.
    2. Assign that attribute as the decision attribute for the node.
    3. For each value of that attribute, create a new descendant of the node.
    4. Sort the training examples into the leaf nodes.
    5. Stopping condition for the iteration:

      The training examples are classified within the acceptable threshold.
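A minimal sketch of how step 1 can pick the best attribute using information gain, the measure ID3 uses; the helper names here are our own, and the data is the play-tennis table above:

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute_index):
    """Entropy reduction obtained by splitting on one attribute."""
    base = entropy(labels)
    values = set(row[attribute_index] for row in rows)
    remainder = 0.0
    for v in values:
        subset = [lab for row, lab in zip(rows, labels) if row[attribute_index] == v]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder

# The training set above: [Outlook, Temperature, Humidity, Wind] -> Play?
rows = [
    ['Sunny', 'Hot', 'High', 'Weak'], ['Sunny', 'Hot', 'High', 'Strong'],
    ['Overcast', 'Hot', 'High', 'Weak'], ['Overcast', 'Cool', 'Normal', 'Strong'],
    ['Sunny', 'Mild', 'High', 'Weak'], ['Sunny', 'Cool', 'Normal', 'Weak'],
    ['Rain', 'Mild', 'High', 'Weak'], ['Rain', 'Cool', 'Normal', 'Weak'],
    ['Rain', 'Cool', 'Normal', 'Strong'], ['Rain', 'Mild', 'Normal', 'Weak'],
    ['Sunny', 'Mild', 'Normal', 'Strong'], ['Overcast', 'Mild', 'High', 'Strong'],
    ['Overcast', 'Hot', 'Normal', 'Weak'], ['Rain', 'Mild', 'High', 'Strong'],
]
labels = ['No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
          'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

# Print the gain of each attribute; ID3 would split on the largest one.
for i, name in enumerate(['Outlook', 'Temperature', 'Humidity', 'Wind']):
    print(name, round(information_gain(rows, labels, i), 3))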

One clear distinction between linear regression and a decision tree algorithm is that in a decision tree the decision boundaries are parallel to the axes; for example, if we have two features (x1 and x2), it can only create rules such as x1 >= 5.2 or x2 >= 7.2. Another advantage of the decision tree algorithm is that it is robust to errors: even if the training set contains some mislabeled examples, they do not affect the resulting tree much.
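To see these axis-parallel rules directly, recent versions of scikit-learn can print a fitted tree as text; the two-feature data below is made up for illustration:

from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up two-feature data (x1, x2) with a binary class label
X = [[1.0, 2.0], [2.0, 8.0], [6.0, 3.0], [7.5, 9.0], [4.0, 7.5], [8.0, 1.0]]
y = [0, 1, 0, 1, 1, 0]

clf = DecisionTreeClassifier().fit(X, y)

# Every printed split is a single-feature threshold, that is, a boundary
# parallel to one of the axes (x1 <= t or x2 <= t).
print(export_text(clf, feature_names=['x1', 'x2']))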

Using scikit-learn (scikit-learn.org) and the following code, we can fit the decision tree classifier and export it for plotting:

from io import StringIO   # sklearn.externals.six has been removed in recent versions
from sklearn import tree
import pydot

# Four columns from the table above with values
# 1st col - 1 for Sunny, 2 for Overcast, and 3 for Rain
# 2nd col - 1 for Hot, 2 for Mild, and 3 for Cool
# 3rd col - 1 for High and 2 for Normal
# 4th col - 0 for Weak and 1 for Strong

X=[[1,1,1,0],[1,1,1,1],[2,1,1,0],[2,3,2,1],[1,2,1,0],[1,3,2,0],
[3,2,1,0],[3,3,2,0],[3,3,2,1],[3,2,2,0],[1,2,2,1],[2,2,1,1],
[2,1,2,0],[3,2,1,1]]

# 1 for Play and 0 for Don't Play
Y=[0,0,1,1,0,1,1,1,0,1,1,1,1,0]

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data)

# Recent versions of pydot return a list of graphs, hence the [0]
graph = pydot.graph_from_dot_data(dot_data.getvalue())[0]
graph.write_pdf("game.pdf")

The export functionality of sklearn converts the fitted tree into the Graphviz dot format, which can then be rendered as a graph that looks similar to the following diagram:

(Figure: the decision tree exported from sklearn for the play example)
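If installing pydot and Graphviz is inconvenient, newer scikit-learn releases can also draw the fitted tree directly with matplotlib via plot_tree; a minimal sketch reusing the clf fitted above:

import matplotlib.pyplot as plt
from sklearn import tree

# Draw the tree fitted above (clf) without Graphviz or pydot;
# plot_tree is available in newer scikit-learn releases.
plt.figure(figsize=(8, 6))
tree.plot_tree(clf, feature_names=['Outlook', 'Temperature', 'Humidity', 'Wind'],
               class_names=["Don't Play", 'Play'], filled=True)
plt.show()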

To create your own tree structure from scratch, there is also the option of using the plotting methods of matplotlib directly. To display a tree-like diagram, matplotlib's annotation feature lets you create a tree-shaped structure with labels, as shown in the following code:

import matplotlib.pyplot as plt

# Node styles: branch (decision) nodes, leaf nodes, and the start node
branchNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
startNode = dict(boxstyle="sawtooth", fc="0.9")

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    # Minimal helper (not shown in the original excerpt): draw a labeled box
    # at centerPt with an arrow pointing back to parentPt
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType,
                            arrowprops=dict(arrowstyle="<-"))

def createPlot():
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    createPlot.ax1 = plt.subplot(111, frameon=False)  # no frame; ticks left in for demo purposes
    plotNode('from here', (0.3, 0.8), (0.3, 0.8), startNode)
    plotNode('a decision node', (0.5, 0.1), (0.3, 0.8), branchNode)
    plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
    plt.show()

createPlot()

This gives an idea of how you can create a tree structure from scratch using matplotlib. Basically, the preceding example creates three nodes and connects them to form a small tree. The results of this code are shown as follows:

(Figure: the three-node tree drawn with matplotlib)