Now that we are sufficiently familiar with the mathematics behind decision trees, let us implement a simple decision tree using the methods in scikit-learn. The dataset we will use is a commonly available one called the iris dataset, which has information about flower species and their petal and sepal dimensions. The purpose of this exercise is to create a classifier that can assign a flower to a species based on its petal and sepal dimensions.
To do this, let's first import the dataset and have a look at it:
import pandas as pd
data = pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/My Work/Chapter 7/iris.csv')
data.head()
The dataset looks as follows:
Sepal-length, Sepal-width, Petal-length, and Petal-width are the dimensions of the flower, while Species denotes the class the flower belongs to. There are three classes of species here, which can be listed as follows:
data['Species'].unique()
The output will be three categories of the species as follows:
The purpose of this exercise will be to classify the flowers as belonging to one of the three species based on the dimensions. Let us see how we can do this.
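If the CSV file is not at hand, an equivalent table can be built from the copy of the iris data that ships with scikit-learn. The following is a minimal sketch, not the author's loading code; the column names are assumptions chosen to resemble the ones described in the text, and the real headers in iris.csv may be spelled differently.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: these column names imitate the ones described in the text;
# the actual headers in the author's iris.csv may differ.
columns = ['Sepal-length', 'Sepal-width', 'Petal-length', 'Petal-width']

iris = load_iris()
data = pd.DataFrame(iris.data, columns=columns)

# Map the integer class codes (0, 1, 2) to the species names.
data['Species'] = [str(iris.target_names[code]) for code in iris.target]

print(data.head())
print(data['Species'].unique())
```

With the data in this shape, the first four columns are the predictors and the last one is the target, as in the steps that follow.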
Let us first get the predictors and the target variables separated:
colnames = data.columns.values.tolist()
predictors = colnames[:4]
target = colnames[4]
The first four columns of the dataset are the predictors, and the last one, Species, is the target variable.
Next, let's split the dataset into training and testing data:
import numpy as np
data['is_train'] = np.random.uniform(0, 1, len(data)) <= .75
train, test = data[data['is_train']==True], data[data['is_train']==False]
In the line that creates is_train, we generate as many uniformly distributed random numbers between 0 and 1 as there are observations in the dataset. If the random number is less than or equal to .75, that observation goes to the training dataset; otherwise, it goes to the testing dataset. On average, about three quarters of the observations therefore end up in the training dataset.
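The random-mask approach gives a split whose exact size changes from run to run. As an alternative (a swapped-in technique, not the method used in this chapter), scikit-learn's train_test_split produces a split of fixed size and can keep the class proportions equal in both parts. The sketch below assumes the bundled iris data and an arbitrary random_state of 0.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold out 25% of the rows for testing; stratify keeps the three species
# in the same proportion in both parts. random_state=0 is an arbitrary seed.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=0.25, stratify=iris.target, random_state=0)

print(len(X_train), len(X_test))
```

Fixing the seed makes the split reproducible, which helps when comparing models across runs.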
We have everything ready to create a decision tree now. As we have seen earlier, there are several criteria for creating nodes and subnodes. The criterion can be specified when instantiating the DecisionTreeClassifier class of the sklearn library:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion='entropy', min_samples_split=20, random_state=99)
dt.fit(train[predictors], train[target])
The min_samples_split parameter specifies the minimum number of observations a node must contain before it can be split into subnodes. By default, it is set to 2, which can be troublesome and lead to overfitting, because the tree can then keep growing as long as a node holds at least two observations. In this case, we have set it to 20. Our decision tree is now ready. Let us now test it by using it for prediction over the testing dataset:
preds = dt.predict(test[predictors])
pd.crosstab(test['Species'], preds, rownames=['Actual'], colnames=['Predictions'])
In the first line of the preceding code snippet, the decision tree is used to predict the class (species) for the flowers in the test dataset using the flower dimensions. The second line creates a table comparing the Actual species and the Predicted species. The table looks as follows:
This table can be interpreted as follows: all the actual setosas were classified correctly as setosas. Out of the total 13 versicolors, 11 were classified correctly and 2 were classified wrongly as virginicas. Out of the total 12 virginicas, 11 were classified correctly while 1 was classified wrongly as versicolor. With only three misclassifications in the whole test set, this accuracy rate is pretty good.
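The reading of the table above can be turned into a single accuracy figure: the correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total. The sketch below is an illustration under stated assumptions, not a reproduction of the chapter's numbers: it uses the bundled iris data and a train_test_split with an arbitrary seed, so its counts will differ from the table in the text.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# Assumption: an arbitrary 75/25 split with seed 0, not the chapter's split.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

dt = DecisionTreeClassifier(criterion='entropy', min_samples_split=20,
                            random_state=99)
dt.fit(X_train, y_train)
preds = dt.predict(X_test)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, preds)

# Accuracy = correct predictions (the diagonal) / all predictions.
accuracy = cm.trace() / cm.sum()
print(cm)
print(accuracy, accuracy_score(y_test, preds))
```

The hand-computed value and accuracy_score agree, which is a quick check that the table is being read the right way round.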
In scikit-learn, there are the following four steps to visualize a tree:
1. Create a .dot file from the Decision Tree Classifier model that is fit for the data.
2. This can be done using the export_graphviz function in the sklearn package. A .dot file contains the information necessary to draw a tree. This information includes the entropy value (or Gini) at each node, the number of observations in that node, the condition referring to that node, and the node number pointing to another node number, denoting which node is connected to which. For example, 2->3 and 3->4 mean that node 2 is connected to node 3, node 3 is connected to node 4, and so on. You can specify the directory where you want to create the .dot file:
from sklearn.tree import export_graphviz
# The with statement closes the file automatically, so no explicit close() is needed.
with open('E:/Personal/Learning/Predictive Modeling Book/My Work/Chapter 7/dtree2.dot', 'w') as dotfile:
    export_graphviz(dt, out_file=dotfile, feature_names=predictors)
3. Open the .dot file after it is created to get a better idea of its contents. It looks as follows:
4. Convert the .dot file into a tree: This can be done using the system function of the os module, which runs command-line (cmd) commands from within Python. This is done as follows:
from os import system
# The paths contain spaces, so each one is wrapped in double quotes.
system('dot -Tpng "E:/Personal/Learning/Predictive Modeling Book/My Work/Chapter 7/dtree2.dot" -o "E:/Personal/Learning/Predictive Modeling Book/My Work/Chapter 7/dtree2.png"')
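The steps above can also be sketched without hard-coded paths. When out_file=None is passed, export_graphviz returns the .dot source as a string, which makes it easy to inspect the node and edge lines directly. In the sketch below, the temporary directory and the file names dtree.dot and dtree.png are hypothetical stand-ins for the chapter's paths, the model is fit on the bundled iris data, and rendering is attempted only if the Graphviz dot executable happens to be installed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
dt = DecisionTreeClassifier(criterion='entropy', min_samples_split=20,
                            random_state=99)
dt.fit(iris.data, iris.target)

# With out_file=None, export_graphviz returns the .dot source as a string.
dot_text = export_graphviz(dt, out_file=None,
                           feature_names=list(iris.feature_names))
print(dot_text[:300])  # the first node definitions and edges such as "0 -> 1"

# Hypothetical output location: a temporary directory, not the book's path.
out_dir = Path(tempfile.mkdtemp())
dot_path = out_dir / 'dtree.dot'
png_path = out_dir / 'dtree.png'
dot_path.write_text(dot_text)

# Render only if Graphviz is available. Passing the arguments as a list
# avoids quoting problems when a path contains spaces.
if shutil.which('dot'):
    subprocess.run(['dot', '-Tpng', str(dot_path), '-o', str(png_path)],
                   check=False)
```

Using subprocess.run with an argument list instead of a single command string is a design choice here: no shell parsing is involved, so spaces in directory names need no special handling.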
This is what the tree looks like. The left arrow from a node corresponds to True and the right arrow to False for the condition given in the node. Each node carries several important pieces of information, such as the entropy at that node (remember, the lower, the better), the number of samples (observations) at that node, and the number of samples of each species (under the heading value).
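The entropy shown in a node can be recomputed from the class counts listed under value. The following is a minimal sketch assuming the usual Shannon entropy with a base-2 logarithm; the function name entropy and the count lists are hypothetical and chosen only for illustration, not read from the chapter's tree.

```python
import math

def entropy(class_counts):
    """Shannon entropy (in bits) of a node, given its per-class sample counts."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count > 0:  # a class with zero samples contributes nothing
            p = count / total
            result -= p * math.log2(p)
    return result

# Hypothetical 'value' lists, for illustration only.
print(entropy([50, 50, 50]))  # evenly mixed node: highest entropy for 3 classes
print(entropy([0, 49, 5]))    # mostly one class: lower entropy
print(entropy([50, 0, 0]))    # pure node: zero entropy
```

A pure node has an entropy of zero, which is why lower values are better: the split has separated the classes more cleanly.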
The tree is read as follows:
Some other observations from the tree are as follows:
The tree might have grown very complex even with min_samples_split set to 20. DecisionTreeClassifier has a parameter, max_depth, that limits the maximum depth to which the tree grows. Let us use this parameter, together with the cross validation accuracy score, to find an optimum depth for the tree. We are effectively pruning the tree to an optimum depth at which it neither overfits nor underfits the dataset.
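To see how these parameters affect the size of a tree, the fitted model's tree_ attribute exposes the number of nodes and the depth actually reached. The sketch below is illustrative only: it fits on the bundled iris data, the helper tree_size is a hypothetical name, and the exact counts may vary with the scikit-learn version, so no particular output is claimed.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

def tree_size(**params):
    """Fit a tree on the full iris data and return (node_count, depth)."""
    tree = DecisionTreeClassifier(criterion='entropy', random_state=99,
                                  **params)
    tree.fit(iris.data, iris.target)
    return tree.tree_.node_count, tree.tree_.max_depth

default_size = tree_size()                                  # min_samples_split=2
split20_size = tree_size(min_samples_split=20)
capped_size = tree_size(min_samples_split=20, max_depth=3)

print('default:            ', default_size)
print('min_samples_split=20:', split20_size)
print('plus max_depth=3:   ', capped_size)
```

One would expect the node count to shrink as the constraints tighten; max_depth=3 in particular guarantees that no path from the root is longer than three splits.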
We will do cross validation over the entire dataset. If you remember, cross validation splits the dataset into training and testing sets on its own and does this a number of times to generalize the results of the model.
Let us cross validate our decision tree:
X = data[predictors]
Y = data[target]
# The classifier must be created before it can be fit.
dt1 = DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_split=20, random_state=99)
dt1.fit(X, Y)
In these lines, we assigned the predictor variables to X and the target variable to Y. We then created a new decision tree that is very similar to the tree we created previously, except that it has an additional parameter, namely max_depth=5.
The next step is to import the cross validation methods in sklearn and perform the cross validation:
# Recent scikit-learn versions provide these in sklearn.model_selection; the older
# sklearn.cross_validation module, with KFold(n=..., n_folds=...), has been removed.
from sklearn.model_selection import KFold, cross_val_score
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)
score = np.mean(cross_val_score(dt1, X, Y, scoring='accuracy', cv=crossvalidation, n_jobs=1))
score
We have chosen to do a 10-fold cross validation, and the score is the mean of the accuracy scores obtained from the individual folds. The score in this case comes out to be 0.933. This score signifies the accuracy of the classification.
If we vary max_depth from 1 to 10, this is how the mean accuracy score varies:
As you can observe, for max_depth >= 4, the score remains almost constant. The maximum score is obtained when max_depth = 3. Hence, we will choose to grow our tree to only three levels from the root node.
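The sweep over max_depth described above can be sketched as a simple loop. This version assumes the bundled iris data and the sklearn.model_selection API of recent scikit-learn releases; the scores it prints depend on the library version and the fold assignment, so they may not match the figures quoted in the text.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, Y = iris.data, iris.target

# The same 10-fold, shuffled scheme is reused for every depth.
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)

scores = {}
for depth in range(1, 11):
    tree = DecisionTreeClassifier(criterion='entropy', max_depth=depth,
                                  min_samples_split=20, random_state=99)
    fold_scores = cross_val_score(tree, X, Y, scoring='accuracy',
                                  cv=crossvalidation)
    scores[depth] = float(np.mean(fold_scores))
    print(depth, round(scores[depth], 3))

# Smallest depth that reaches the highest mean accuracy.
best_depth = max(scores, key=scores.get)
print('best depth:', best_depth)
```

Because max returns the first key with the highest score, ties are resolved in favor of the shallower, simpler tree.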
Let us now do a feature importance test to determine which of the variables in the preceding dataset are actually important for the model. This can be easily done as follows:
dt1.feature_importances_
The higher the value, the more important the feature. Hence, we conclude that Petal width and Petal length are the important features (in ascending order of importance) for predicting the flower species using this dataset.
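Since feature_importances_ returns a bare array, pairing each value with its feature name makes the ranking easier to read. The sketch below assumes the bundled iris data, whose feature names differ slightly from the column headers used in this chapter, and it makes no claim about the exact values.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
dt1 = DecisionTreeClassifier(criterion='entropy', max_depth=5,
                             min_samples_split=20, random_state=99)
dt1.fit(iris.data, iris.target)

# Pair each importance with its feature name, most important first.
ranked = sorted(zip(iris.feature_names, dt1.feature_importances_),
                key=lambda pair: pair[1], reverse=True)

for name, importance in ranked:
    print(f'{name}: {importance:.3f}')
```

The importances are normalized to sum to 1, so each value can be read as the share of the total impurity reduction attributed to that feature.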