Brute force grid search

In this recipe, we'll do an exhaustive grid search using scikit-learn. This is basically the same thing we did in the previous recipe, but here we'll use the built-in methods instead of rolling our own.

We'll also walk through an example of performing randomized optimization. This is an alternative to brute-force search. Essentially, brute force spends computer cycles to make sure that we search the entire space. The parameter space in the last recipe was fairly small. However, you could imagine a model that has several steps: first imputation to fix missing data, then PCA to reduce the dimensionality, and finally classification. Your parameter space could get very large, very fast; therefore, it can be advantageous to only search a part of that space, as the quick calculation below illustrates.
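The grid size is the product of the number of candidate values at each step, so it grows multiplicatively. The counts here are made up purely for illustration:

>>> # Hypothetical candidate counts for each pipeline step:
>>> # 3 imputation strategies, 10 PCA dimensions, 8 classifier settings.
>>> 3 * 10 * 8

240

An exhaustive search has to cross-validate a model for every one of these combinations, which is why sampling only part of the space can pay off.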

Getting ready

To get started, we'll need to perform the following steps:

  1. Create some classification data.
  2. We'll then create a LogisticRegression object that will be the model we're fitting.
  3. After that, we'll create the search objects, GridSearchCV and RandomizedSearchCV.

How to do it...

Run the following code to create some classification data:

>>> from sklearn.datasets import make_classification

>>> X, y = make_classification(1000, n_features=5)

Now, we'll create our logistic regression object:

>>> from sklearn.linear_model import LogisticRegression

>>> lr = LogisticRegression(class_weight='auto')

We need to specify the parameters we want to search. Fitting the model and looking at its representation shows the parameters that are available to tune:

>>> lr.fit(X, y)

LogisticRegression(C=1.0, class_weight='auto', dual=False,
                   fit_intercept=True, intercept_scaling=1, 
                   penalty='l2', random_state=None, tol=0.0001)

For GridSearchCV, we can just specify the ranges of values that we care about:

>>> grid_search_params = {'penalty': ['l1', 'l2'],
                          'C': [1, 2, 3, 4]}

For RandomizedSearchCV, we instead need to specify a distribution over the same space from which to sample. The only change we'll need to make is to describe the C parameter as a probability distribution rather than a list of values. We'll keep it simple right now, though we will use scipy to describe the distribution:

>>> import scipy.stats as st

>>> # st.randint's upper bound is exclusive, so (1, 5) samples the
>>> # integers {1, 2, 3, 4}, matching the grid above.
>>> random_search_params = {'penalty': ['l1', 'l2'],
                            'C': st.randint(1, 5)}
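Unlike a list, the distribution object can be sampled directly, which is exactly what RandomizedSearchCV does internally. Drawing a few values shows the idea (the output will vary from run to run):

>>> # Ten draws from the C distribution; each draw is in {1, 2, 3, 4}.
>>> st.randint(1, 5).rvs(10)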

How it works...

Now, we'll fit the classifier. This works by passing lr to the parameter search objects:

>>> from sklearn.grid_search import GridSearchCV, RandomizedSearchCV


>>> gs = GridSearchCV(lr, grid_search_params)

GridSearchCV implements the same API as the other models:

>>> gs.fit(X, y)

GridSearchCV(cv=None, estimator=LogisticRegression(C=1.0, 
             class_weight='auto', dual=False, fit_intercept=True,
             intercept_scaling=1, penalty='l2', random_state=None, 
             tol=0.0001), fit_params={}, iid=True, loss_func=None, 
             n_jobs=1, param_grid={'penalty': ['l1', 'l2'], 
             'C': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs', refit=True, 
             score_func=None, scoring=None, verbose=0)

As we can see in the param_grid parameter, our candidate values for penalty and C are both lists.

To access the scores, we can use the grid_scores_ attribute of the grid search. From these scores, we can find the optimal set of parameters, and we can also look at the marginal performance of the grid search (a sketch of this follows the output below):

>>> gs.grid_scores_

[mean: 0.90300, std: 0.01192, params: {'penalty': 'l1', 'C': 1},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 1},
 mean: 0.90200, std: 0.01117, params: {'penalty': 'l1', 'C': 2},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 2},
 mean: 0.90200, std: 0.01117, params: {'penalty': 'l1', 'C': 3},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 3},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l1', 'C': 4},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 4}]
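To look at the marginal performance, we can average the mean scores over all values of C for each penalty. A minimal sketch using the tuples stored in grid_scores_:

>>> from collections import defaultdict

>>> # Each grid_scores_ entry unpacks to (parameters, mean validation
>>> # score, per-fold scores); group the mean scores by penalty.
>>> marginals = defaultdict(list)
>>> for params, mean_score, _ in gs.grid_scores_:
...     marginals[params['penalty']].append(mean_score)
>>> {p: sum(scores) / len(scores) for p, scores in marginals.items()}

Averaging the scores shown above gives roughly 0.902 for l1 and 0.901 for l2, so l1 has a very slight edge on this data.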

Each entry in grid_scores_ is a tuple whose second element is the mean validation score. We can pull out a single score by index, or take the maximum over all entries to find the best parameter set:

>>> gs.grid_scores_[1][1]

0.90100000000000002

>>> max(gs.grid_scores_, key=lambda x: x[1])

mean: 0.90300, std: 0.01192, params: {'penalty': 'l1', 'C': 1}

The parameters obtained are the best choices for our logistic regression.
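The randomized search runs in exactly the same way; we pass the distribution-based dictionary along with the number of parameter settings to sample. Here is a minimal sketch (the choice of n_iter=5 is arbitrary):

>>> # Sample five parameter settings, cross-validating each one.
>>> rs = RandomizedSearchCV(lr, random_search_params, n_iter=5)
>>> rs.fit(X, y)

>>> # The best sampled setting and its mean score; the same best_params_
>>> # and best_score_ attributes are available on gs as well.
>>> rs.best_params_, rs.best_score_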
