Poor man's grid search

In this recipe, we're going to introduce grid search with basic Python, though we will use sklearn for the models and matplotlib for the visualization.

Getting ready

In this recipe, we will perform the following tasks:

  • Design a basic search grid in the parameter space
  • Iterate through the grid and check the loss/score function at each point in the parameter space for the dataset
  • Choose the point in the parameter space that minimizes/maximizes the evaluation function

Also, the model we'll fit is a basic decision tree classifier. Our parameter space will be two-dimensional to help us with the visualization:

criterion ∈ {gini, entropy}
max_features ∈ {auto, log2, None}

The parameter space will then be the Cartesian product of those two sets:

{gini, entropy} × {auto, log2, None}

We'll see in a bit how we can iterate through this space with itertools.
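
As a quick preview, here is what itertools.product yields for these two sets (written as lists here so that the ordering is deterministic):

>>> import itertools as it
>>> list(it.product(['gini', 'entropy'], ['auto', 'log2', None]))
[('gini', 'auto'), ('gini', 'log2'), ('gini', None), ('entropy', 'auto'), ('entropy', 'log2'), ('entropy', None)]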

Let's create the dataset and then get started:

>>> from sklearn import datasets
>>> X, y = datasets.make_classification(n_samples=2000, n_features=10)

How to do it...

Earlier, we said that we'd use grid search to tune two parameters, criterion and max_features. We need to represent those as Python sets, and then use itertools.product to iterate through them. (Note that max_features='auto' has been removed in recent scikit-learn releases; 'sqrt' is the equivalent setting for classifiers.)

>>> criteria = {'gini', 'entropy'}
>>> max_features = {'auto', 'log2', None}
>>> import itertools as it
>>> parameter_space = it.product(criteria, max_features)
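
One thing to keep in mind: it.product returns a one-shot iterator, so it can only be looped over once. We only iterate through it once in this recipe, so the generator form is fine, but if you needed to reuse it you could materialize it first:

>>> parameter_space = list(it.product(criteria, max_features))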

Great! So now that we have the parameter space, let's iterate through it and check the accuracy of each model specified by the parameters. Then, we'll store that accuracy so that we can compare the parameter settings. We'll also use a roughly 50/50 train/test split:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# boolean mask: each row lands in the training set with probability 0.5
train_set = np.random.choice([True, False], size=len(y))

accuracies = {}
for criterion, max_feature in parameter_space:
    dt = DecisionTreeClassifier(criterion=criterion,
                                max_features=max_feature)
    dt.fit(X[train_set], y[train_set])
    # evaluate on the held-out rows
    accuracies[(criterion, max_feature)] = (dt.predict(X[~train_set])
                                            == y[~train_set]).mean()
>>> accuracies
{('entropy', None): 0.974609375, ('entropy', 'auto'): 0.9736328125, ('entropy', 'log2'): 0.962890625, ('gini', None): 0.9677734375, ('gini', 'auto'): 0.9638671875, ('gini', 'log2'): 0.96875}
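
As an aside, the boolean mask above gives only a roughly even split. If you want an exactly even split, scikit-learn ships train_test_split, which does the same job; a minimal equivalent sketch:

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)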

We now have the accuracy of each parameter combination. Let's visualize the performance:

>>> from matplotlib import pyplot as plt
>>> from matplotlib import cm
>>> cmap = cm.RdBu_r
>>> f, ax = plt.subplots(figsize=(7, 4))
>>> ax.set_xticklabels([''] + list(criteria))
>>> ax.set_yticklabels([''] + list(max_features))
>>> # one row per max_features value, one column per criterion
>>> plot_array = []
>>> for max_feature in max_features:
...     m = []
...     for criterion in criteria:
...         m.append(accuracies[(criterion, max_feature)])
...     plot_array.append(m)
>>> values = list(accuracies.values())
>>> colors = ax.matshow(plot_array, vmin=np.min(values) - 0.001,
...                     vmax=np.max(values) + 0.001, cmap=cmap)
>>> f.colorbar(colors)
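
Note that if you're running this from a script rather than an interactive session, you'll also need to render the figure explicitly:

>>> plt.show()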

The following is the output:

[Output: a heatmap of the accuracies over the criterion × max_features grid, with a colorbar.]

It's fairly easy to see which combination performed best here: entropy with max_features=None. Hopefully, you can see how this process can be scaled up to larger grids with the same brute-force approach.

How it works...

This works quite simply; we just have to perform the following steps:

  1. Choose a set of parameters.
  2. Iterate through them and find the accuracy of each step.
  3. Find the best performer by visual inspection (or programmatically, as in the sketch below).
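
If you'd rather not rely on visual inspection, the best point on the grid can be read straight out of the accuracies dictionary:

>>> max(accuracies, key=accuracies.get)
('entropy', None)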