To mitigate this problem, we have a very useful class named GridSearchCV within the sklearn.grid_search module. What we have been doing with our calc_params function is a kind of grid search in one dimension. With GridSearchCV, we can specify a grid of any number of parameters and parameter values to traverse. It will train the classifier for each combination and obtain a cross-validation accuracy to evaluate each one.
Let's use it to adjust the C and the gamma parameters at the same time.
>>> from sklearn.grid_search import GridSearchCV
>>> parameters = {
>>>     'svc__gamma': np.logspace(-2, 1, 4),
>>>     'svc__C': np.logspace(-1, 1, 3),
>>> }
>>> clf = Pipeline([
>>>     ('vect', TfidfVectorizer(
>>>         stop_words=stop_words,
>>>         token_pattern=ur"[a-z0-9_\-.]+[a-z][a-z0-9_\-.]+",
>>>     )),
>>>     ('svc', SVC()),
>>> ])
>>> gs = GridSearchCV(clf, parameters, verbose=2, refit=False, cv=3)
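To make the idea of a grid concrete, the following minimal sketch enumerates the combinations that such a search has to visit. It uses only the standard library; the values are written out by hand to match what np.logspace(-2, 1, 4) and np.logspace(-1, 1, 3) produce, and expand_grid is a hypothetical helper introduced here for illustration, not part of scikit-learn.

```python
from itertools import product

# Values equivalent to np.logspace(-2, 1, 4) and np.logspace(-1, 1, 3),
# written out by hand so that the sketch needs no third-party packages.
parameters = {
    "svc__gamma": [0.01, 0.1, 1.0, 10.0],
    "svc__C": [0.1, 1.0, 10.0],
}


def expand_grid(grid):
    """Return one dict per combination of parameter values (hypothetical helper)."""
    names = sorted(grid)  # fixed order so the result is reproducible
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


combinations = expand_grid(parameters)
print(len(combinations))  # 4 gamma values x 3 C values = 12
for combination in combinations:
    print(combination)
```

Each of the 12 dictionaries corresponds to one setting of the pipeline that is trained and evaluated with cross-validation.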
Let's execute our grid search and print the best parameter values and scores.
>>> %time _ = gs.fit(X_train, y_train)
>>> gs.best_params_, gs.best_score_
CPU times: user 304.39 s, sys: 2.55 s, total: 306.94 s
Wall time: 306.56 s
({'svc__C': 10.0, 'svc__gamma': 0.10000000000000001}, 0.81166666666666665)
With the grid search, we obtained a better combination of the C and gamma parameters: 10.0 and 0.10, respectively, with a three-fold cross-validation accuracy of 0.811. This is much better than the best value (0.76) we obtained in the previous experiment, where we adjusted only gamma and kept the C value at 1.0.
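Why can a joint search beat a one-dimensional one? The sketch below illustrates the point with a toy scoring function. Note that toy_score is entirely made up for this illustration (it is not the accuracy of our SVC); we simply assume a score surface that peaks at C = 10 and gamma = 0.1 and compare the two search strategies on it.

```python
from itertools import product
from math import log10


def toy_score(C, gamma):
    """Made-up stand-in for a cross-validation score; peaks at C=10, gamma=0.1."""
    return -((log10(C) - 1) ** 2 + (log10(gamma) + 1) ** 2)


gammas = [0.01, 0.1, 1.0, 10.0]
Cs = [0.1, 1.0, 10.0]

# One-dimensional search: C stays fixed at 1.0 and only gamma varies.
best_1d = max(((1.0, g) for g in gammas), key=lambda pair: toy_score(*pair))

# Two-dimensional grid search: every (C, gamma) pair is scored.
best_2d = max(product(Cs, gammas), key=lambda pair: toy_score(*pair))

print("gamma only:", best_1d, toy_score(*best_1d))
print("C and gamma:", best_2d, toy_score(*best_2d))
```

On this assumed surface, the search over gamma alone can never reach the peak because C is pinned to 1.0, whereas the joint search finds it.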
At this point, we could continue experimenting by adjusting not only other parameters of the SVC but also the parameters of TfidfVectorizer, which is also part of the estimator. Note that this further increases the complexity. As you might have noticed, the previous grid search experiment took about five minutes to finish. Each new parameter multiplies the number of combinations to evaluate, so the running time grows exponentially with the number of parameters. As a result, these kinds of methods are very resource- and time-intensive; this is also the reason why we used only a subset of the total instances.
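The growth in cost can be estimated before running anything. The following sketch counts how many times the classifier is trained in an exhaustive grid search; count_fits is a hypothetical helper written for this illustration, and the three vect__max_df values are an arbitrary example of an extra parameter we might add, not values used in our experiment.

```python
def count_fits(grid, cv):
    """Number of trainings in an exhaustive grid search with cv folds.

    Assumes no final refit on the whole training set (refit=False).
    """
    n_combinations = 1
    for values in grid.values():
        n_combinations *= len(values)
    return n_combinations * cv


grid = {
    "svc__gamma": [0.01, 0.1, 1.0, 10.0],
    "svc__C": [0.1, 1.0, 10.0],
}
print(count_fits(grid, cv=3))  # 4 * 3 combinations * 3 folds = 36

# Hypothetical extension: also try three values of a vectorizer parameter.
bigger_grid = dict(grid, vect__max_df=[0.5, 0.75, 1.0])
print(count_fits(bigger_grid, cv=3))  # 36 * 3 = 108
```

Adding a single parameter with three candidate values triples the work, which is why the grid has to be chosen with care.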