In this recipe, we're going to introduce grid search with basic Python, though we will use sklearn for the models and matplotlib for the visualization. The model we'll fit is a basic decision tree classifier, and our parameter space will be two-dimensional to help us with the visualization. The parameter space will then be the Cartesian product of two sets of parameter values; we'll see in a bit how we can iterate through this space with itertools.
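To make the Cartesian-product idea concrete before we build the real grid, here is a minimal sketch; the colors and sizes sets are made up purely for illustration:

```python
import itertools as it

# Two hypothetical parameter dimensions
colors = {'red', 'blue'}
sizes = {'small', 'medium', 'large'}

# it.product yields every (color, size) pair exactly once
grid = list(it.product(colors, sizes))
print(len(grid))  # 2 * 3 = 6 combinations
```

The number of candidate models is the product of the sizes of the individual dimensions, which is why grid search gets expensive quickly as you add parameters.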
Let's create the dataset and then get started:
>>> from sklearn import datasets
>>> X, y = datasets.make_classification(n_samples=2000, n_features=10)
Earlier we said that we'd use grid search to tune two parameters: criteria and max_features. We need to represent those as Python sets, and then use itertools.product to iterate through them:
>>> criteria = {'gini', 'entropy'}
>>> max_features = {'auto', 'log2', None}
>>> import itertools as it
>>> parameter_space = it.product(criteria, max_features)
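One caveat worth knowing: it.product returns a one-shot iterator, so parameter_space can only be looped over once; wrap it in list() if you need to reuse it. A small sketch with made-up sets:

```python
import itertools as it

pairs = it.product({'a', 'b'}, {1, 2})

first_pass = list(pairs)   # consumes the iterator
second_pass = list(pairs)  # already exhausted, comes back empty

print(len(first_pass), len(second_pass))  # 4 0
```

In this recipe we only iterate once, so the iterator form is fine as-is.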
Great! So now that we have the parameter space, let's iterate through it and check the accuracy of each model as specified by the parameters. Then, we'll store that accuracy so that we can compare the different parameter settings. We'll also use a 50/50 train/test split:
>>> import numpy as np
>>> train_set = np.random.choice([True, False], size=len(y))
>>> from sklearn.tree import DecisionTreeClassifier
>>> accuracies = {}
>>> for criterion, max_feature in parameter_space:
...     dt = DecisionTreeClassifier(criterion=criterion,
...                                 max_features=max_feature)
...     dt.fit(X[train_set], y[train_set])
...     accuracies[(criterion, max_feature)] = (dt.predict(X[~train_set])
...                                             == y[~train_set]).mean()
>>> accuracies
{('entropy', None): 0.974609375, ('entropy', 'auto'): 0.9736328125, ('entropy', 'log2'): 0.962890625, ('gini', None): 0.9677734375, ('gini', 'auto'): 0.9638671875, ('gini', 'log2'): 0.96875}
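Because the accuracies live in a dict keyed by parameter tuple, pulling out the winning combination is a one-liner with max. Here's a sketch using a small made-up dict of the same shape as the one above:

```python
# Made-up accuracies, same shape as the dict built in the recipe
accuracies = {('entropy', None): 0.9746,
              ('entropy', 'log2'): 0.9629,
              ('gini', 'log2'): 0.9688}

# max over the keys, ranked by their accuracy values
best_params = max(accuracies, key=accuracies.get)
print(best_params)  # ('entropy', None)
```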
So we now have the accuracy for each parameter combination. Let's visualize the performance:
>>> from matplotlib import pyplot as plt
>>> from matplotlib import cm
>>> cmap = cm.RdBu_r
>>> f, ax = plt.subplots(figsize=(7, 4))
>>> ax.set_xticklabels([''] + list(criteria))
>>> ax.set_yticklabels([''] + list(max_features))
>>> plot_array = []
>>> for max_feature in max_features:
...     m = []
...     for criterion in criteria:
...         m.append(accuracies[(criterion, max_feature)])
...     plot_array.append(m)
>>> colors = ax.matshow(plot_array,
...                     vmin=np.min(list(accuracies.values())) - 0.001,
...                     vmax=np.max(list(accuracies.values())) + 0.001,
...                     cmap=cmap)
>>> f.colorbar(colors)
It's fairly easy to see which combination performed best here. Hopefully, you can see how this brute-force process can be extended to larger parameter spaces.
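For comparison, scikit-learn ships a built-in implementation of exactly this brute-force loop. The sketch below assumes a reasonably recent sklearn, where the class lives in sklearn.model_selection and the 'auto' value for max_features has been replaced by 'sqrt':

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.make_classification(n_samples=200, n_features=10,
                                    random_state=0)

# The same two-dimensional grid, expressed as a dict of lists
param_grid = {'criterion': ['gini', 'entropy'],
              'max_features': ['sqrt', 'log2', None]}

# GridSearchCV fits every combination with cross-validation
gs = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)
gs.fit(X, y)
print(gs.best_params_)
```

Beyond saving us the bookkeeping, GridSearchCV scores each combination with cross-validation rather than a single random split, which gives a more stable estimate of each setting's accuracy.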