Tuning SVM

Before we start working on the hyperparameters (which differ depending on the implementation), there are two aspects to clarify when working with an SVM algorithm.

The first is the sensitivity of SVM to variables of different scales and to large numbers. As with other learning algorithms based on linear combinations, variables at different scales lead the algorithm to be dominated by the features with the larger range or variance. Moreover, extremely high or low numbers may cause problems in the optimization process. It is advisable to scale all the data to a limited interval, such as [0,+1], which is a necessary choice if you are working with sparse arrays: it preserves the zero entries, whereas otherwise the data would become dense and consume more memory. You can also scale the data into the [-1,+1] interval or, alternatively, standardize it to zero mean and unit variance. From the preprocessing module, you can use the MinMaxScaler and StandardScaler utility classes, first fitting them on the training data and then transforming both the train and test sets.
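
The fit-on-train, transform-both pattern can be sketched as follows. This is a minimal illustration on made-up dense data (the toy arrays and variable names are assumptions, not part of the IJCNN'01 example):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (an assumption for illustration): two features on very different scales
rng = np.random.RandomState(101)
X_train = np.column_stack([rng.uniform(0, 1, 80), rng.uniform(0, 10000, 80)])
X_test = np.column_stack([rng.uniform(0, 1, 20), rng.uniform(0, 10000, 20)])

# Rescale to [0, 1]: the minimum and maximum are learned from the training set only
minmax = MinMaxScaler(feature_range=(0, 1)).fit(X_train)
X_train_mm = minmax.transform(X_train)
X_test_mm = minmax.transform(X_test)  # test values may fall slightly outside [0, 1]

# Standardize to zero mean and unit variance, again using training statistics only
standard = StandardScaler().fit(X_train)
X_train_std = standard.transform(X_train)
X_test_std = standard.transform(X_test)

print(X_train_mm.min(axis=0), X_train_mm.max(axis=0))
print(X_train_std.mean(axis=0).round(3), X_train_std.std(axis=0).round(3))
```

Fitting the scalers on the training set alone keeps information about the test set from leaking into the preprocessing step.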

The second aspect concerns unbalanced classes, because the algorithm tends to favor the frequent classes. A solution, apart from resampling or downsampling (reducing the majority class to the same size as the minority one), is to weight the C penalty parameter according to the frequency of each class, so that errors on rare classes are penalized more heavily and errors on frequent classes less. There are two ways to achieve this, depending on the implementation. First, there is the class_weight parameter in SVC (which can be set to the keyword balanced, or provided with a dictionary containing specific values for each class). Then, there is the sample_weight parameter in the .fit() method of SVC, NuSVC, SVR, NuSVR, and OneClassSVM (it requires a one-dimensional array as input, where each position holds the weight of the corresponding training example).
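
As a rough sketch of both options, the following uses a synthetic unbalanced dataset; the data, the 1:9 weighting, and the variable names are assumptions chosen only for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class problem where class 1 is about 10% of the examples
X, y = make_classification(n_samples=500, n_features=5,
                           weights=[0.9, 0.1], random_state=101)

# Option 1: class_weight, either the 'balanced' keyword or an explicit dictionary
balanced_svc = SVC(kernel='rbf', class_weight='balanced',
                   random_state=101).fit(X, y)
manual_svc = SVC(kernel='rbf', class_weight={0: 1, 1: 9},
                 random_state=101).fit(X, y)

# Option 2: sample_weight, a one-dimensional array with one weight per example
weights = np.where(y == 1, 9.0, 1.0)
weighted_svc = SVC(kernel='rbf', random_state=101).fit(
    X, y, sample_weight=weights)

# Compare how many examples each model assigns to the rare class
for name, model in [('balanced', balanced_svc), ('manual', manual_svc),
                    ('sample_weight', weighted_svc)]:
    print(name, int((model.predict(X) == 1).sum()))
```

The dictionary and the sample_weight array above express the same 1:9 weighting, so they are two routes to the same adjustment of the penalty.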

Having dealt with scale and class balance, you can exhaustively search for optimal settings of the other parameters using GridSearchCV from the model_selection module in sklearn. Though SVM works fine with default parameters, they are often not optimal, and you need to test various value combinations using cross-validation in order to find the best ones.
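
A minimal sketch of such an exhaustive search might look like this. The synthetic dataset and the logarithmic grids for C and gamma are assumptions made for the sake of a quick, self-contained run:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic dataset (an assumption) so that the search runs quickly
X, y = make_classification(n_samples=300, n_features=10, random_state=101)

# Every combination in the grid is evaluated: 7 x 7 = 49 candidates
param_grid = {'C': np.logspace(-3, 3, 7),
              'gamma': np.logspace(-3, 3, 7)}

search = GridSearchCV(estimator=SVC(kernel='rbf'),
                      param_grid=param_grid,
                      scoring='accuracy', cv=5, refit=True)
search.fit(X, y)

print('Best parameters %s' % search.best_params_)
print('Cross validation accuracy: mean = %0.3f' % search.best_score_)
```

Because refit=True, the search object ends up holding a model trained on all the data with the best combination, so it can be used directly for prediction.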

In order of importance, the parameters to set are the following:

  • C: The penalty value. Decreasing it makes the margin larger, thus ignoring more noise and making the model more generalizable. A good value can usually be found in the range of np.logspace(-3, 3, 7).
  • kernel: The non-linearity workhorse for SVM can be set to linear, poly, rbf, sigmoid, or a custom kernel (for experts!). The most commonly used one is certainly rbf.
  • degree: This works with kernel='poly', setting the degree of the polynomial expansion. It is ignored by the other kernels. Usually, values from 2 to 5 work best.
  • gamma: A coefficient for 'rbf', 'poly', and 'sigmoid'. High values tend to fit the data more closely but can lead to overfitting. Intuitively, we can think of gamma as the influence that a single example exerts on the model. Low values make the influence of each example reach quite far. Since many points then have to be considered, the SVM curve tends to take a shape less influenced by local points, and the result is a smooth contour curve. High values of gamma, instead, make the curve take more account of how points are arranged locally. The result is usually many small bubbles, each reflecting the influence exerted by local points. The suggested grid search range for this hyperparameter is np.logspace(-3, 3, 7).
  • nu: For regression and classification with NuSVR and NuSVC, this parameter approximates the fraction of training points that are not classified with confidence, that is, misclassified points and correct points inside or on the margin. It should be in the range (0,1], since it is a proportion relative to your training set. In the end, it acts like C, with high proportions enlarging the margin.
  • epsilon: This parameter specifies how much error SVR is going to accept, by defining a band of width epsilon around the true value of each point within which no penalty is applied. The suggested search range is np.insert(np.logspace(-4, 2, 7),0,[0]).
  • penalty, loss, and dual: For LinearSVC, these parameters accept only the ('l1','squared_hinge',False), ('l2','hinge',True), ('l2','squared_hinge',True), and ('l2','squared_hinge',False) combinations. The ('l2','hinge',True) combination is analogous to the SVC(kernel='linear') learner.
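
To see which penalty, loss, and dual combinations LinearSVC accepts, you can simply try to fit each one. The following is a sketch on synthetic data; the is_supported helper is a hypothetical name introduced here, and it relies on scikit-learn raising a ValueError for unsupported combinations:

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import LinearSVC

# Small synthetic dataset (an assumption), used only to trigger the fit
X, y = make_classification(n_samples=200, n_features=10, random_state=101)

def is_supported(penalty, loss, dual):
    """Return True if LinearSVC can be fitted with this combination."""
    try:
        with warnings.catch_warnings():
            # Convergence is not the point here, so silence that warning
            warnings.simplefilter('ignore', ConvergenceWarning)
            LinearSVC(penalty=penalty, loss=loss, dual=dual,
                      max_iter=5000).fit(X, y)
        return True
    except ValueError:
        # Unsupported combinations are rejected when fit is called
        return False

# Try all eight combinations and collect the ones that work
supported = [(penalty, loss, dual)
             for penalty in ('l1', 'l2')
             for loss in ('hinge', 'squared_hinge')
             for dual in (True, False)
             if is_supported(penalty, loss, dual)]
print(supported)
```

When building a grid search for LinearSVC, listing the valid tuples explicitly (for instance, as separate dictionaries in a list of parameter grids) avoids wasting iterations on combinations that would fail.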

As an example, we will load the IJCNN'01 dataset again, and we will try to improve the initial accuracy of 0.91 by looking for better C and gamma values. To save time, we will use the RandomizedSearchCV class, which samples only some of the possible combinations, to increase the accuracy to 0.989 (cross-validation estimate):

In: from sklearn.datasets import load_svmlight_file
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
# Load the data and keep only the first rows to save time
X_train, y_train = load_svmlight_file('ijcnn1.bz2')
first_rows = 2500
X_train, y_train = X_train[:first_rows,:], y_train[:first_rows]
hypothesis = SVC(kernel='rbf', random_state=101)
search_dict = {'C': [0.01, 0.1, 1, 10, 100],
               'gamma': [0.1, 0.01, 0.001, 0.0001]}
# Sample 10 of the 20 possible combinations; the iid argument is
# omitted because recent scikit-learn versions no longer accept it
search_func = RandomizedSearchCV(estimator=hypothesis,
                                 param_distributions=search_dict,
                                 n_iter=10, scoring='accuracy',
                                 n_jobs=-1, refit=True,
                                 cv=5, random_state=101)
search_func.fit(X_train, y_train)
print('Best parameters %s' % search_func.best_params_)
print('Cross validation accuracy: mean = %0.3f' %
      search_func.best_score_)

Out: Best parameters {'C': 100, 'gamma': 0.1}
Cross validation accuracy: mean = 0.989
