In this recipe, we'll do an exhaustive grid search through scikit-learn. This is essentially the same thing we did in the previous recipe, but here we'll use the built-in methods.
We'll also walk through an example of performing randomized optimization, an alternative to brute-force search. With grid search, we're essentially spending computer cycles to make sure that we search the entire space. We kept things fairly modest in the last recipe, but imagine a model with several steps: first imputation to fix missing data, then PCA to reduce the dimensionality, then classification. Your parameter space can get very large, very fast; therefore, it can be advantageous to search only a part of that space.
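To see how quickly the grid for a multi-step pipeline grows, here is a small counting sketch; the stage names and candidate values are hypothetical, chosen purely for illustration:

```python
from itertools import product

# Hypothetical candidate values for a three-stage pipeline
# (imputation -> PCA -> classifier); the specific options are made up.
impute_strategies = ['mean', 'median']
pca_components = [2, 3, 4, 5]
clf_C = [0.1, 1.0, 10.0]
clf_penalty = ['l1', 'l2']

# An exhaustive grid search must try every combination of every stage.
grid = list(product(impute_strategies, pca_components, clf_C, clf_penalty))
print(len(grid))  # 2 * 4 * 3 * 2 = 48 candidate models
```

Even with only a handful of options per stage, the candidate count is the product of all the stage sizes, which is exactly why sampling a subset of the space can pay off.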
To get started, we'll need to perform the following steps:

1. Create some classification data.
2. Create a LogisticRegression object that will be the model we're fitting.
3. Import GridSearchCV and RandomizedSearchCV.

Run the following code to create some classification data:
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(1000, n_features=5)
Now, we'll create our logistic regression object:
>>> from sklearn.linear_model import LogisticRegression
>>> lr = LogisticRegression(class_weight='auto')
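A version note: in newer scikit-learn releases, the class_weight='auto' option was renamed to 'balanced', so depending on your version you may need the following instead:

```python
from sklearn.linear_model import LogisticRegression

# 'balanced' replaces the deprecated 'auto' option in newer scikit-learn;
# both reweight classes inversely proportional to their frequencies.
lr = LogisticRegression(class_weight='balanced')
print(lr.class_weight)
```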
We need to specify the parameters we want to search. For GridSearchCV, we can just specify the ranges that we care about, but for RandomizedSearchCV, we'll need to specify a distribution over the same space from which to sample:
>>> lr.fit(X, y)
LogisticRegression(C=1.0, class_weight={0: 0.25, 1: 0.75}, dual=False,
    fit_intercept=True, intercept_scaling=1, penalty='l2',
    random_state=None, tol=0.0001)
>>> grid_search_params = {'penalty': ['l1', 'l2'],
                          'C': [1, 2, 3, 4]}
The only change we'll need to make is to describe the C parameter as a probability distribution. We'll keep it simple right now, though we will use scipy to describe the distribution:
>>> import scipy.stats as st
>>> import numpy as np
>>> random_search_params = {'penalty': ['l1', 'l2'],
                            'C': st.randint(1, 4)}
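One subtlety worth checking: scipy.stats.randint(low, high) excludes the upper bound, so this distribution samples only 1, 2, and 3, whereas the grid above also tries C=4. A quick sanity check:

```python
import scipy.stats as st

# randint(1, 4) is a discrete uniform distribution on {1, 2, 3};
# the upper bound 4 is excluded.
dist = st.randint(1, 4)
samples = dist.rvs(size=1000, random_state=0)
print(sorted(set(int(s) for s in samples)))
```

If you want 4 to be a candidate as well, use st.randint(1, 5).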
Now, we'll fit the classifier. This works by passing lr to the parameter search objects:
>>> from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
>>> gs = GridSearchCV(lr, grid_search_params)
GridSearchCV implements the same API as the other models:
>>> gs.fit(X, y)
GridSearchCV(cv=None,
    estimator=LogisticRegression(C=1.0, class_weight='auto', dual=False,
        fit_intercept=True, intercept_scaling=1, penalty='l2',
        random_state=None, tol=0.0001),
    fit_params={}, iid=True, loss_func=None, n_jobs=1,
    param_grid={'penalty': ['l1', 'l2'], 'C': [1, 2, 3, 4]},
    pre_dispatch='2*n_jobs', refit=True, score_func=None,
    scoring=None, verbose=0)
As we can see in the param_grid parameter, our penalty and C options are both lists of candidate values.
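If you're following along on scikit-learn 0.18 or later, note that the search classes moved from sklearn.grid_search to sklearn.model_selection. A roughly equivalent modern sketch looks like this (the solver argument is needed there because the 'l1' penalty requires the liblinear solver):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(1000, n_features=5, random_state=0)

# liblinear supports both 'l1' and 'l2' penalties.
lr = LogisticRegression(solver='liblinear')
gs = GridSearchCV(lr, {'penalty': ['l1', 'l2'], 'C': [1, 2, 3, 4]})
gs.fit(X, y)

# grid_scores_ became cv_results_ in newer versions; the winning
# combination and its score are exposed directly as attributes.
print(gs.best_params_, round(gs.best_score_, 3))
```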
To access the scores, we can use the grid_scores_ attribute of the grid search. This lets us find the optimal set of parameters and look at the marginal performance of each candidate:
>>> gs.grid_scores_
[mean: 0.90300, std: 0.01192, params: {'penalty': 'l1', 'C': 1},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 1},
 mean: 0.90200, std: 0.01117, params: {'penalty': 'l1', 'C': 2},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 2},
 mean: 0.90200, std: 0.01117, params: {'penalty': 'l1', 'C': 3},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 3},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l1', 'C': 4},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 4}]
We might want to get the max score:
>>> gs.grid_scores_[1][1]
0.90100000000000002
>>> max(gs.grid_scores_, key=lambda x: x[1])
mean: 0.90300, std: 0.01192, params: {'penalty': 'l1', 'C': 1}
The parameters obtained are the best choices for our logistic regression.
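We defined random_search_params above but never ran the randomized search itself. As a sketch (using the modern sklearn.model_selection import; on older versions, import from sklearn.grid_search instead), it works the same way as the grid search, with n_iter controlling how many parameter settings are sampled:

```python
import scipy.stats as st
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(1000, n_features=5, random_state=0)
lr = LogisticRegression(solver='liblinear')

# C is sampled from a distribution instead of enumerated exhaustively.
random_search_params = {'penalty': ['l1', 'l2'], 'C': st.randint(1, 4)}
rs = RandomizedSearchCV(lr, random_search_params, n_iter=6, random_state=0)
rs.fit(X, y)
print(rs.best_params_)
```

Only n_iter candidates are evaluated rather than the full cross-product, which is the trade-off this recipe is about.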