In machine learning, we have two types of parameters: those that are learned from the training data, for example, the weights in logistic regression, and the parameters of a learning algorithm that are optimized separately. The latter are the tuning parameters, also called hyperparameters, of a model, for example, the regularization parameter in logistic regression or the depth parameter of a decision tree.
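As a minimal sketch of this distinction, the following snippet sets the regularization hyperparameter C by hand and then inspects the weights that were learned from the data. The use of scikit-learn's built-in breast cancer dataset and the name pipe_lr are assumptions made for illustration only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a built-in dataset stands in for the data used in this chapter
X, y = load_breast_cancer(return_X_y=True)

# C is a hyperparameter: we pick its value before training starts
pipe_lr = make_pipeline(StandardScaler(),
                        LogisticRegression(C=1.0, max_iter=1000))
pipe_lr.fit(X, y)

lr = pipe_lr.named_steps['logisticregression']
# The value we chose by hand
print('Hyperparameter C:', lr.get_params()['C'])
# The weights were learned from the training data, one per feature
print('Shape of learned weights:', lr.coef_.shape)
```

Changing C and refitting would yield different learned weights, which is why hyperparameters have to be tuned in a separate step.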
In the previous section, we used validation curves to improve the performance of a model by tuning one of its hyperparameters. In this section, we will take a look at a popular hyperparameter optimization technique called grid search that can further help improve the performance of a model by finding the optimal combination of hyperparameter values.
The approach of grid search is quite simple; it's a brute-force exhaustive search paradigm where we specify a list of values for different hyperparameters, and the computer evaluates the model performance for each combination of those to obtain the optimal combination of values from this set:
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.svm import SVC
>>> pipe_svc = make_pipeline(StandardScaler(),
...                          SVC(random_state=1))
>>> param_range = [0.0001, 0.001, 0.01, 0.1,
...                1.0, 10.0, 100.0, 1000.0]
>>> param_grid = [{'svc__C': param_range,
...                'svc__kernel': ['linear']},
...               {'svc__C': param_range,
...                'svc__gamma': param_range,
...                'svc__kernel': ['rbf']}]
>>> gs = GridSearchCV(estimator=pipe_svc,
...                   param_grid=param_grid,
...                   scoring='accuracy',
...                   cv=10,
...                   n_jobs=-1)
>>> gs = gs.fit(X_train, y_train)
>>> print(gs.best_score_)
0.9846153846153847
>>> print(gs.best_params_)
{'svc__C': 100.0, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}
Using the preceding code, we initialized a GridSearchCV object from the sklearn.model_selection module to train and tune a support vector machine (SVM) pipeline. We set the param_grid parameter of GridSearchCV to a list of dictionaries to specify the parameters that we want to tune. For the linear SVM, we only evaluated the inverse regularization parameter, svc__C; for the RBF kernel SVM, we tuned both the svc__C and svc__gamma parameters. Note that the svc__gamma parameter is specific to kernel SVMs.
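To see how much work such a grid implies, we can enumerate the candidate combinations ourselves. The following sketch uses scikit-learn's ParameterGrid class on the same list of dictionaries; the variable names candidates, n_linear, n_rbf, and n_folds are introduced here for illustration only:

```python
from sklearn.model_selection import ParameterGrid

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'svc__C': param_range,
               'svc__kernel': ['linear']},
              {'svc__C': param_range,
               'svc__gamma': param_range,
               'svc__kernel': ['rbf']}]

# Each dictionary is expanded separately, so gamma is never
# combined with the linear kernel
candidates = list(ParameterGrid(param_grid))
n_linear = sum(c['svc__kernel'] == 'linear' for c in candidates)
n_rbf = sum(c['svc__kernel'] == 'rbf' for c in candidates)
print(n_linear, n_rbf, len(candidates))  # 8 64 72

# With 10-fold cross-validation, every candidate is fit 10 times
n_folds = 10
print(len(candidates) * n_folds)  # 720
```

Splitting the grid into two dictionaries keeps the search at 72 candidates; the 720 cross-validation fits are followed by one final fit of the best candidate on the whole training set when the default refit=True is used.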
After we used the training data to perform the grid search, we obtained the score of the best-performing model via the best_score_ attribute and looked at its parameters, which can be accessed via the best_params_ attribute. In this particular case, the RBF kernel SVM model with svc__C = 100.0 and svc__gamma = 0.001 yielded the best k-fold cross-validation accuracy: 98.5 percent.
Finally, we will use the independent test dataset to estimate the performance of the best-selected model, which is available via the best_estimator_ attribute of the GridSearchCV object:
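>>> clf = gs.best_estimator_
>>> # With the default refit=True, GridSearchCV has already refit
>>> # best_estimator_ on X_train, so this call is redundant but harmless
>>> clf.fit(X_train, y_train)
>>> print('Test accuracy: %.3f' % clf.score(X_test, y_test))
Test accuracy: 0.974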
Although grid search is a powerful approach for finding the optimal set of hyperparameters, evaluating all possible parameter combinations is also computationally very expensive. An alternative approach to sampling different parameter combinations using scikit-learn is randomized search. Using the RandomizedSearchCV class in scikit-learn, we can draw random parameter combinations from sampling distributions with a specified budget. More details and examples of its usage can be found at http://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-optimization.
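A randomized search over the RBF kernel SVM could look roughly like the following sketch. The dataset (scikit-learn's built-in breast cancer data), the 80/20 split, the log-uniform sampling distributions from scipy.stats, and the budget of 20 draws are all assumptions chosen for illustration, not settings taken from this chapter:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: a built-in dataset and an 80/20 stratified split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=1)

pipe_svc = make_pipeline(StandardScaler(), SVC(random_state=1))

# Distributions are sampled; plain lists are sampled uniformly
param_distributions = {
    'svc__C': loguniform(1e-4, 1e3),
    'svc__gamma': loguniform(1e-4, 1e3),
    'svc__kernel': ['rbf'],
}

rs = RandomizedSearchCV(estimator=pipe_svc,
                        param_distributions=param_distributions,
                        n_iter=20,        # budget: number of random draws
                        scoring='accuracy',
                        cv=10,
                        random_state=1)
rs.fit(X_train, y_train)
print(rs.best_score_)
print(rs.best_params_)
```

Here, n_iter fixes the cost of the search in advance: 20 candidates are evaluated regardless of how wide the sampling ranges are, whereas the cost of a grid grows with every value we add.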
Using k-fold cross-validation in combination with grid search is a useful approach for fine-tuning the performance of a machine learning model by varying its hyperparameter values, as we saw in the previous subsection. If we want to select among different machine learning algorithms, though, another recommended approach is nested cross-validation. In a nice study on the bias in error estimation, Varma and Simon concluded that the estimate of the true error is almost unbiased relative to the test set when nested cross-validation is used (Bias in Error Estimation When Using Cross-validation for Model Selection, BMC Bioinformatics, S. Varma and R. Simon, 7(1): 91, 2006).
In nested cross-validation, we have an outer k-fold cross-validation loop to split the data into training and test folds, and an inner loop is used to select the model using k-fold cross-validation on the training fold. After model selection, the test fold is then used to evaluate the model performance. The following figure explains the concept of nested cross-validation with only five outer and two inner folds, which can be useful for large datasets where computational performance is important; this particular type of nested cross-validation is also known as 5x2 cross-validation:
In scikit-learn, we can perform nested cross-validation as follows:
>>> import numpy as np
>>> from sklearn.model_selection import cross_val_score
>>> gs = GridSearchCV(estimator=pipe_svc,
...                   param_grid=param_grid,
...                   scoring='accuracy',
...                   cv=2)
>>> scores = cross_val_score(gs, X_train, y_train,
...                          scoring='accuracy', cv=5)
>>> print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores),
...                                       np.std(scores)))
CV accuracy: 0.974 +/- 0.015
The returned average cross-validation accuracy gives us a good estimate of what to expect if we tune the hyperparameters of a model and use it on unseen data. For example, we can use the nested cross-validation approach to compare an SVM model to a simple decision tree classifier; for simplicity, we will only tune its depth parameter:
>>> from sklearn.tree import DecisionTreeClassifier
>>> gs = GridSearchCV(estimator=DecisionTreeClassifier(
...                       random_state=0),
...                   param_grid=[{'max_depth': [1, 2, 3,
...                                              4, 5, 6, 7, None]}],
...                   scoring='accuracy',
...                   cv=2)
>>> scores = cross_val_score(gs, X_train, y_train,
...                          scoring='accuracy', cv=5)
>>> print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores),
...                                       np.std(scores)))
CV accuracy: 0.934 +/- 0.016
As we can see, the nested cross-validation performance of the SVM model (97.4 percent) is notably better than the performance of the decision tree (93.4 percent), and thus, we'd expect that it might be the better choice to classify new data that comes from the same population as this particular dataset.
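To make the outer and inner loops of nested cross-validation explicit, the following sketch writes out the outer loop by hand for the decision tree. It is only an approximation of what happens when a GridSearchCV object is passed to cross_val_score; the dataset (scikit-learn's built-in breast cancer data), the shuffled StratifiedKFold splitters, and the names outer_cv, inner_cv, and outer_scores are assumptions made for illustration, so the printed accuracy will not necessarily match the values reported in this section:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Assumption: a built-in dataset stands in for X_train and y_train
X, y = load_breast_cancer(return_X_y=True)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)

outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    # Inner loop: model selection sees the outer training fold only
    gs = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),
                      param_grid=[{'max_depth': [1, 2, 3, 4, 5, 6, 7, None]}],
                      scoring='accuracy',
                      cv=inner_cv)
    gs.fit(X[train_idx], y[train_idx])
    # Outer loop: the selected, refit model is scored on the held-out fold
    outer_scores.append(gs.score(X[test_idx], y[test_idx]))

print('CV accuracy: %.3f +/- %.3f' % (np.mean(outer_scores),
                                      np.std(outer_scores)))
```

Because each outer test fold is never seen during the inner grid search, the five outer scores estimate the performance of the whole tuning procedure rather than of one fixed hyperparameter setting.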