Doing basic classifications with Decision Trees

In this recipe, we will perform basic classifications using Decision Trees. These are very nice models because they are easily understandable, and once trained, scoring is very simple. Often, the learned rules can be translated into SQL statements, which means that the outcome can be used by a lot of people.

Getting ready

In this recipe, we'll look at Decision Trees. I like to think of Decision Trees as the base class from which a large number of other classification methods are derived. It's a pretty simple idea that works well in a bunch of situations.

First, let's get some classification data that we can practice on:

>>> from sklearn import datasets
>>> X, y = datasets.make_classification(n_samples=1000, n_features=3,
                                        n_redundant=0)
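The exact values will differ from run to run, but the shapes are fixed by the arguments we passed, so a quick sanity check is easy:

>>> X.shape
(1000, 3)
>>> y.shape
(1000,)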

How to do it…

Working with Decision Trees is easy. We first need to import the object, and then fit the model:

>>> from sklearn.tree import DecisionTreeClassifier
>>> dt = DecisionTreeClassifier()
>>> dt.fit(X, y)
DecisionTreeClassifier(compute_importances=None, criterion='gini', 
                       max_depth=None, max_features=None, 
                       max_leaf_nodes=None, min_density=None, 
                       min_samples_leaf=1, min_samples_split=2, 
                       random_state=None, splitter='best')

>>> preds = dt.predict(X)
>>> (y == preds).mean()
1.0

As you can see, we predicted every label correctly. Clearly, this was just a dry run: we fit and scored on the same data, so perfect accuracy tells us nothing about generalization.
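To get an honest read on accuracy, hold some data out. Here's a minimal sketch using train_test_split; note that its home module depends on your scikit-learn version (sklearn.cross_validation in older releases, sklearn.model_selection in newer ones):

>>> from sklearn.cross_validation import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=.25)
>>> dt = DecisionTreeClassifier()
>>> dt.fit(X_train, y_train)
>>> (dt.predict(X_test) == y_test).mean()

The held-out accuracy will still be high on this easy dataset, but almost certainly below the perfect in-sample score. Now let's investigate some of our options.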

First, if you look at the dt object, it has several keyword arguments that determine how the object will behave. How we choose these arguments is important, so we'll look at their effects in detail.
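You can list those arguments and their current values at any time with get_params(), which every scikit-learn estimator provides:

>>> dt.get_params()  # returns a dict, e.g. {'criterion': 'gini', 'max_depth': None, ...}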

The first detail we'll look at is max_depth. This is an important parameter: it determines how deep the tree is allowed to grow. This matters because, without some sort of regularization, a Decision Tree can have a hard time generalizing to out-of-sample data. Later, we'll see how we can use several shallow Decision Trees to make a better learner. Let's create a more complex dataset and see what happens when we allow different values of max_depth. We'll use this dataset for the rest of the recipe:

>>> n_features = 200
>>> X, y = datasets.make_classification(750, n_features,
                                        n_informative=5)
>>> import numpy as np

# boolean mask: roughly 75% of the rows go to training
>>> training = np.random.choice([True, False], p=[.75, .25],
                                size=len(y))

>>> accuracies = []

>>> for x in np.arange(1, n_features+1):
...     dt = DecisionTreeClassifier(max_depth=x)
...     dt.fit(X[training], y[training])
...     preds = dt.predict(X[~training])
...     accuracies.append((preds == y[~training]).mean())

>>> import matplotlib.pyplot as plt

>>> f, ax = plt.subplots(figsize=(7, 5))

>>> ax.plot(range(1, n_features+1), accuracies, color='k')

>>> ax.set_title("Decision Tree Accuracy")
>>> ax.set_ylabel("% Correct")
>>> ax.set_xlabel("Max Depth")

The following is the output:

[Plot: Decision Tree Accuracy, % Correct vs. Max Depth]

We can see that we actually get pretty good accuracy at a low max depth. Let's take a closer look at the accuracy at shallow settings, say the first 15:

>>> N = 15
>>> import matplotlib.pyplot as plt
>>> f, ax = plt.subplots(figsize=(7, 5))

>>> ax.plot(range(1, n_features+1)[:N], accuracies[:N], color='k')

>>> ax.set_title("Decision Tree Accuracy")
>>> ax.set_ylabel("% Correct")
>>> ax.set_xlabel("Max Depth")

The following is the output:

[Plot: Decision Tree Accuracy, % Correct vs. Max Depth, first 15 depths]

There's the spike we saw earlier; it's quite striking to see the quick drop afterwards, though. More likely, max depths of 1 through 3 perform roughly equivalently. Decision Trees are quite good at carving out rules, but they need to be reined in.
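One common way to rein a tree in is to choose max_depth by cross-validation rather than by eye. Here is a minimal sketch with GridSearchCV, which is not part of the original recipe; it lives in sklearn.grid_search in older versions and sklearn.model_selection in newer ones, and the depth range of 1 to 10 is an arbitrary choice:

>>> from sklearn.grid_search import GridSearchCV
>>> grid = GridSearchCV(DecisionTreeClassifier(),
                        {'max_depth': list(range(1, 11))}, cv=5)
>>> grid.fit(X[training], y[training])
>>> grid.best_params_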

Next, we'll look at the compute_importances parameter. Feature importances actually have a bit of a broader meaning for random forests, but we'll get acquainted with them here. It's also worth noting that in newer versions of scikit-learn, importances are computed automatically, so you get this for free:

>>> dt_ci = DecisionTreeClassifier(compute_importances=True)
>>> dt_ci.fit(X, y)

# plot the importances
>>> ne0 = dt_ci.feature_importances_ != 0   # keep only the nonzero importances

>>> y_comp = dt_ci.feature_importances_[ne0]
>>> x_comp = np.arange(len(dt_ci.feature_importances_))[ne0]

>>> import matplotlib.pyplot as plt

>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.bar(x_comp, y_comp)

The following is the output:

[Plot: bar chart of the nonzero feature importances]

Note

Please note that in newer versions of scikit-learn you may get an error or deprecation warning letting you know that you no longer need to explicitly set compute_importances.

As we can see, one of the features is by far the most important; several others follow at much lower importance values.
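If you want more than the plot, you can rank the features directly; here is a small sketch, where the top-5 cutoff is arbitrary:

>>> ranked = np.argsort(dt_ci.feature_importances_)[::-1]
>>> ranked[:5]
>>> dt_ci.feature_importances_[ranked[:5]]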

How it works…

In the simplest sense, we construct Decision Trees all the time. When thinking through situations and assigning probabilities to outcomes, we construct Decision Trees. Our rules are much more complex and involve a lot of context, but with Decision Trees, all we care about is the difference between outcomes, given that some information is already known about a feature.

Now, let's discuss the differences between entropy and Gini impurity.
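In scikit-learn, the choice between the two is made with the criterion keyword argument when constructing the classifier:

>>> dt_gini = DecisionTreeClassifier(criterion='gini')        # the default
>>> dt_entropy = DecisionTreeClassifier(criterion='entropy')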

Entropy matters here as more than just the entropy value of a variable at any given point; what we track is the change in entropy once we know a feature's value. This change is called Information Gain (IG); mathematically, it looks like the following:

$$ IG(D, f) = H(D) - H(D \mid f) $$

Here, $H(D)$ is the entropy of the labels in the dataset $D$, and $H(D \mid f)$ is the entropy of the labels once the value of feature $f$ is known.

For Gini impurity, we care about how likely a randomly chosen data point is to be mislabeled if we label it according to the distribution of labels implied by the new information.
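Concretely, if $p_i$ is the proportion of class $i$ in a node, the Gini impurity is:

$$ G = \sum_{i} p_i (1 - p_i) = 1 - \sum_{i} p_i^2 $$

that is, the probability of drawing a point and then mislabeling it by drawing a label from the same distribution.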

Both entropy and Gini impurity have pros and cons; that said, if you see major differences between models trained with entropy and with Gini impurity, it is probably a good idea to re-examine your assumptions.
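As a quick illustration of how closely the two measures track each other, here is a small sketch; the helper names entropy and gini are mine, not scikit-learn's:

>>> import numpy as np
>>> def entropy(p):
...     p = p[p > 0]  # convention: 0 * log(0) contributes nothing
...     return -(p * np.log2(p)).sum()
>>> def gini(p):
...     return 1 - (p ** 2).sum()

>>> p = np.array([.5, .5])   # maximally mixed node
>>> entropy(p), gini(p)      # -> 1.0 and 0.5
>>> p = np.array([.9, .1])   # nearly pure node
>>> entropy(p), gini(p)      # -> roughly 0.469 and 0.18

Both measures peak at a 50/50 mix and fall to zero for a pure node, which is why swapping criteria rarely changes the resulting tree very much.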
