Tuning a Decision Tree model

If we use just the basic implementation of a Decision Tree, it will probably not generalize very well. Therefore, we need to tweak its parameters in order to get a good fit. This is very easy and requires little effort.

Getting ready

In this recipe, we will take an in-depth look at what it takes to tune a Decision Tree classifier. There are several options, and in the previous recipe, we only looked at one of these options.

We'll fit a basic model and actually look at what the Decision Tree looks like. Then, we'll re-examine it after each change and point out how the various changes have influenced the tree's structure.

If you want to follow along in this recipe, you'll need to install pydot.

How to do it…

Decision Trees have many more "knobs" than most other algorithms, which makes it easier to see what happens when we turn them:

>>> from sklearn import datasets
>>> X, y = datasets.make_classification(1000, 20, n_informative=3)

>>> from sklearn.tree import DecisionTreeClassifier
>>> dt = DecisionTreeClassifier()
>>> dt.fit(X, y)
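
Before we look at the tree itself, a quick optional check makes the overfitting concern concrete: compare training accuracy against cross-validated accuracy. This sketch is an addition to the recipe and assumes scikit-learn 0.18 or later, where cross_val_score lives in the model_selection module:

>>> from sklearn.model_selection import cross_val_score
>>> dt.score(X, y)                          # training accuracy; usually 1.0 for an unconstrained tree
>>> cross_val_score(dt, X, y, cv=5).mean()  # held-out accuracy is typically noticeably lower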

Ok, so now that we have a basic classifier fit, we can view it quite simply:

>>> from StringIO import StringIO
>>> from sklearn import tree
>>> import pydot

>>> str_buffer = StringIO()
>>> tree.export_graphviz(dt, out_file=str_buffer)
>>> graph = pydot.graph_from_dot_data(str_buffer.getvalue())
>>> graph.write("myfile.jpg")

The graph is almost certainly illegible, but hopefully it illustrates the complex trees that can be generated by an unoptimized Decision Tree:

[Graph of the unoptimized Decision Tree]
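
If the image is too dense to read, you can gauge the tree's size directly from the fitted tree_ attribute (this quick check is an addition to the recipe):

>>> dt.tree_.node_count, dt.tree_.max_depth   # an unconstrained tree on this data often has hundreds of nodes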

Wow! This is a very complex tree. It will most likely overfit the data. First, let's reduce the max depth value:

>>> dt = DecisionTreeClassifier(max_depth=5)
>>> dt.fit(X, y);

As an aside, if you're wondering why the semicolon: without it, the repr of the fitted model would be displayed, because the fit function actually returns the Decision Tree object itself. This also means the call can be chained:

>>> dt = DecisionTreeClassifier(max_depth=5).fit(X, y)
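
Because fit returns the fitted estimator, you can even chain straight into other methods; for example (a quick illustration, not part of the original recipe):

>>> DecisionTreeClassifier(max_depth=5).fit(X, y).predict(X[:5])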

Now, let's get back to the regularly scheduled program.

As we will plot this a few times, let's create a function:

>>> def plot_dt(model, filename):
        str_buffer = StringIO()
        tree.export_graphviz(model, out_file=str_buffer)
        graph = pydot.graph_from_dot_data(str_buffer.getvalue())
        graph.write_png(filename)

>>> plot_dt(dt, "myfile.png")

The following is the graph that will be generated:

[Graph of the Decision Tree with max_depth=5]

This is a much simpler tree. Let's look at what happens when we use entropy as the splitting criterion:

>>> dt = DecisionTreeClassifier(criterion='entropy', 
                                max_depth=5).fit(X, y)
>>> plot_dt(dt, "entropy.png")

The following is the graph that will be generated:

[Graph of the Decision Tree with entropy as the splitting criterion]

It's good to see that the first two splits use the same features, and that the next few splits involve features carrying similar amounts of information. This is a good sanity check.
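
One way to make this sanity check programmatic is to compare the feature importances of the Gini-based and entropy-based trees; the most informative features should typically rank near the top in both (this comparison is my addition, not part of the original recipe):

>>> import numpy as np
>>> dt_gini = DecisionTreeClassifier(max_depth=5).fit(X, y)
>>> dt_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=5).fit(X, y)
>>> np.argsort(dt_gini.feature_importances_)[::-1][:3]      # top three features by importance (Gini)
>>> np.argsort(dt_entropy.feature_importances_)[::-1][:3]   # top three features by importance (entropy)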

Also, note how the entropy of the first split is 0.999, while for the first split using Gini impurity it is 0.5. This is simply because the two measures of split quality are defined differently; see the following How it works... section for more information. If we also want to require a minimum number of samples in each leaf of the entropy-based tree, we can use the following command:

>>> dt = DecisionTreeClassifier(min_samples_leaf=10, 
                                criterion='entropy', 
                                max_depth=5).fit(X, y)
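
To confirm that the leaf-size constraint took effect, you can inspect the fitted tree directly; leaves are the nodes whose children_left entry is -1 (this verification step is my addition):

>>> import numpy as np
>>> leaves = dt.tree_.children_left == -1
>>> dt.tree_.n_node_samples[leaves].min()   # should be at least 10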

How it works…

Decision Trees, in general, suffer from overfitting. Left to its own devices, a Decision Tree model will quite often overfit the training data, so we need to think about how best to limit its complexity. A simpler model will, more often than not, work better in practice.

We're about to see this very idea in practice: Random Forests will build on this idea of simple models.
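
As for the entropy and Gini impurity values noted earlier, the two measures are defined differently, so they sit on different scales. For a node split almost evenly between two classes, entropy is close to 1 while Gini impurity is close to 0.5, which is exactly what we saw at the root. Here is a quick hand computation (my own sketch, not from the recipe):

>>> import numpy as np
>>> p = np.array([0.5, 0.5])     # class proportions at a balanced node
>>> -(p * np.log2(p)).sum()      # entropy = 1.0
>>> 1 - (p ** 2).sum()           # Gini impurity = 0.5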
