Modeling using support vector machines

Support vector machines belong to the family of supervised machine learning algorithms used for both classification and regression. Considering our binary classification problem, unlike logistic regression, the SVM algorithm builds a model around the training data such that the training data points belonging to different classes are separated by a clear gap, and this gap is optimized so that the distance of separation is the maximum possible. The samples lying on the margins are called the support vectors, and the plane running through the middle of the margin that separates the two classes is called the optimal separating hyperplane.
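
To keep the idea of support vectors concrete, here is a minimal, purely illustrative sketch (using R's built-in iris dataset rather than our credit data, so names such as toy.model are just placeholders) that fits a linear SVM on two separable classes and inspects which training points end up as support vectors:

library(e1071)

# two linearly separable iris species form a toy binary classification problem
toy.data <- subset(iris, Species %in% c("setosa", "versicolor"))
toy.data$Species <- factor(toy.data$Species)  # drop the unused factor level

# a linear SVM with the default soft margin cost
toy.model <- svm(Species ~ Petal.Length + Petal.Width, data=toy.data,
                 kernel="linear", cost=1)
toy.model$index  # indices of the training points selected as support vectors
plot(toy.model, toy.data, Petal.Width ~ Petal.Length)  # decision regions and support vectors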

Data points on the wrong side of the margin are weighed down to reduce their influence; this is called the soft margin, as opposed to the hard margin of separation we discussed earlier. SVM classifiers can be simple linear classifiers when the data points are linearly separable. However, if we are dealing with data whose features do not permit a direct linear separation, then we make use of various kernels to achieve it, and these give us non-linear SVM classifiers. The following figure from the official documentation of the svm library in R will help you visualize what an SVM classifier actually looks like:

From the figure, you can clearly see that multiple hyperplanes could separate the data points. However, the criterion for choosing the separating hyperplane is that the distance of separation between the two classes is the maximum, and the support vectors are the representative samples of the two classes, depicted on the margins. Revisiting the issue of non-linear classification, SVM provides several kernels besides the regular linear kernel used for linear classification, including the polynomial kernel, the radial basis function (RBF) kernel, and several others. The main principle behind these non-linear kernel functions is that, even if linear separation is not possible in the original feature space, they enable the separation to happen in a higher dimensional transformed feature space where a hyperplane can separate the classes. An important thing to remember here is the curse of dimensionality: since we may end up working with higher dimensional feature spaces, the model generalization error increases and the predictive power of the model decreases. If we have enough data, it still performs reasonably well. We will be using the RBF kernel in our model, and for this kernel the two important parameters are cost and gamma.
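
To make the roles of these two parameters concrete: gamma controls how quickly the RBF similarity between two points decays with distance, while cost controls how heavily misclassified training points are penalized (a larger cost means a harder margin). The following small sketch simply writes out the RBF kernel formula, K(x, x') = exp(-gamma * ||x - x'||^2), by hand purely for illustration; e1071 computes this internally and you never need to implement it yourself:

# RBF kernel: similarity between two feature vectors x1 and x2
rbf.kernel <- function(x1, x2, gamma) {
  exp(-gamma * sum((x1 - x2)^2))
}

x1 <- c(1, 2); x2 <- c(2, 3)   # squared distance between them is 2
rbf.kernel(x1, x2, gamma=0.5)  # ~0.37, the points still look fairly similar
rbf.kernel(x1, x2, gamma=5)    # ~0.00005, similarity decays very sharply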

We will start by loading the necessary dependencies and preparing the testing data features:

library(e1071) # svm model
library(caret) # model training and optimization
library(kernlab) # svm model for hyperparameters
library(ROCR) # model evaluation
source("performance_plot_utils.R") # plot model metrics
## separate feature and class variables
test.feature.vars <- test.data[,-1]
test.class.var <- test.data[,1]

Once this is done, we build the SVM model using the training data and the RBF kernel on all the training set features:

> formula.init <- "credit.rating ~ ."
> formula.init <- as.formula(formula.init)
> svm.model <- svm(formula=formula.init, data=train.data, 
+                  kernel="radial", cost=100, gamma=1)
> summary(svm.model)

The summary function generates the properties of the model as follows:


Now we use our testing data on this model to make predictions and evaluate the results as follows:

> svm.predictions <- predict(svm.model, test.feature.vars)
> confusionMatrix(data=svm.predictions, reference=test.class.var, positive="1")

This gives us the following confusion matrix, like the one we saw in logistic regression, along with the detailed model performance metrics. We observe that the accuracy is 67.5%, sensitivity is 100%, and specificity is 0%, which means that this is a very aggressive model that simply predicts every customer's rating as good. This model clearly suffers from the majority class problem and we need to improve it.
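
As a quick refresher, sensitivity is the proportion of actual positives (good ratings, class 1) that are predicted correctly, and specificity is the proportion of actual negatives (bad ratings, class 0) that are predicted correctly. Assuming the class labels are the factor levels "0" and "1" as in our data, the following sketch shows how these metrics fall out of the raw prediction counts; it is equivalent to what confusionMatrix reports:

# compute the tracked metrics by hand from the raw counts
cm <- table(predicted=svm.predictions, actual=test.class.var)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # true positives / all actual positives
specificity <- cm["0", "0"] / sum(cm[, "0"])  # true negatives / all actual negatives
accuracy <- sum(diag(cm)) / sum(cm)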


To build a better model, we need some feature selection. We already have the top five best features which we had obtained in the Feature selection section. Nevertheless, we will still run a feature selection algorithm specifically for SVM to see feature importance, as follows:

> formula.init <- "credit.rating ~ ."
> formula.init <- as.formula(formula.init)
> control <- trainControl(method="repeatedcv", number=10, repeats=2)
> model <- train(formula.init, data=train.data, method="svmRadial", 
+                trControl=control)
> importance <- varImp(model, scale=FALSE)
> plot(importance, cex.lab=0.5)

This gives us a plot in which we see that the top five important variables are similar to our top five best features, except that this algorithm ranks age as more important than credit.amount. You can test this by building several models with different feature sets and seeing which one gives the best results; for us, the features selected using random forests gave better results. The variable importance plot is depicted as follows:


We now build a new SVM model based on the top five features that gave us the best results and evaluate its performance on the test data using the following code snippet:

> formula.new <- "credit.rating ~ account.balance + 
                   credit.duration.months + savings + 
                   previous.credit.payment.status + credit.amount"
> formula.new <- as.formula(formula.new)
> svm.model.new <- svm(formula=formula.new, data=train.data, 
+                  kernel="radial", cost=100, gamma=1)
> svm.predictions.new <- predict(svm.model.new, test.feature.vars)
> confusionMatrix(data=svm.predictions.new, 
                  reference=test.class.var, positive="1")

The preceding snippet finally gives us a confusion matrix on the test data, and we observe that the overall accuracy has in fact dropped by 1% to 66.5%. However, the most interesting part is that the model is now able to correctly predict more of the bad ratings, which can be seen from the confusion matrix. The specificity is now 38% compared to 0% earlier and, correspondingly, the sensitivity has gone down from 100% to 80%, which is still good because this model is now actually useful and profitable! You can see from this that feature selection can indeed be extremely powerful. The confusion matrix for the preceding observations is depicted in the following snapshot:


We will definitely select this model and move on to model optimization. We tune the hyperparameters with a grid search over the cost and gamma parameters, as follows:

cost.weights <- c(0.1, 10, 100)
gamma.weights <- c(0.01, 0.25, 0.5, 1)
tuning.results <- tune(svm, formula.new,
                       data=train.data, kernel="radial",
                       ranges=list(cost=cost.weights, gamma=gamma.weights))
print(tuning.results)
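
Besides printing the summary, the object returned by tune also exposes the winning parameter combination and its cross-validation error directly, which is handy if you want to reuse them programmatically:

tuning.results$best.parameters   # data frame holding the best cost and gamma values
tuning.results$best.performance  # cross-validation error for that combination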

The output of print(tuning.results) is as follows:


The grid search plot can be viewed as follows:

> plot(tuning.results, cex.main=0.6, cex.lab=0.8, xaxs="i", yaxs="i")

Output:


The darkest region shows the parameter values which gave the best performance. We now select the best model and evaluate it once again as follows:

> svm.model.best <- tuning.results$best.model
> svm.predictions.best <- predict(svm.model.best,
                                  test.feature.vars)
> confusionMatrix(data=svm.predictions.best, 
                  reference=test.class.var, positive="1")

On observing the confusion matrix results from the following output (henceforth we depict only the metrics we are tracking), we see that the overall accuracy has increased to 71%, sensitivity to 86%, and specificity to 41%, which is excellent compared to the previous model results:


You see how powerful hyperparameter optimizations can be in predictive modeling! We also plot some evaluation curves as follows:

> svm.predictions.best <- predict(svm.model.best, test.feature.vars, decision.values = T)
> svm.prediction.values <- attributes(svm.predictions.best)$decision.values
> predictions <- prediction(svm.prediction.values, test.class.var)
> par(mfrow=c(1,2))
> plot.roc.curve(predictions, title.text="SVM ROC Curve")
> plot.pr.curve(predictions, title.text="SVM Precision/Recall Curve")
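
If you prefer to extract the AUC value programmatically rather than reading it off the plot, the same ROCR prediction object created above can be reused:

# compute the AUC directly from the ROCR prediction object
auc.perf <- performance(predictions, measure="auc")
auc.perf@y.values[[1]]  # should match the value shown on the ROC plot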

We can see how the predictions are plotted in the evaluation space, and the following ROC plot shows that the AUC in this case is 0.69:


Now, let's say we want to optimize the model based on this ROC plot, with the objective of maximizing the AUC. We will try that now, but first we need to re-encode the values of the categorical variables to include some letters, because caret runs into problems when factor levels used as column names consist only of numbers (valid R variable names cannot start with a digit). So basically, if credit.rating has the values 0 and 1, they get transformed to X0 and X1; our categories are still distinct and nothing else changes. We transform our data first with the following code snippet:

> transformed.train <- train.data
> transformed.test <- test.data
> for (variable in categorical.vars){
+   new.train.var <- make.names(train.data[[variable]])
+   transformed.train[[variable]] <- new.train.var
+   new.test.var <- make.names(test.data[[variable]])
+   transformed.test[[variable]] <- new.test.var
+ }
> transformed.train <- to.factors(df=transformed.train, variables=categorical.vars)
> transformed.test <- to.factors(df=transformed.test, variables=categorical.vars)
> transformed.test.feature.vars <- transformed.test[,-1]
> transformed.test.class.var <- transformed.test[,1]
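
You can quickly verify what make.names does to the purely numeric labels, and check the transformed factor levels, with the following illustrative snippet:

> make.names(c("0", "1"))  # returns "X0" "X1"
> levels(transformed.train$credit.rating)  # should now show the X-prefixed levels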

Now we build an AUC optimized model using grid search again, as follows. Note that caret's svmRadial method uses the kernlab implementation of the SVM, in which the RBF kernel parameter is called sigma (playing the role of gamma) and the cost parameter is called C:

> grid <- expand.grid(C=c(1,10,100), sigma=c(0.01, 0.05, 0.1, 0.5, 1))
> ctr <- trainControl(method='cv', number=10, classProbs=TRUE,
                      summaryFunction=twoClassSummary)
> svm.roc.model <- train(formula.init, transformed.train,
+                        method='svmRadial', trControl=ctr, 
+                        tuneGrid=grid, metric="ROC")
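
Once training completes, you can check which sigma and C combination caret selected on the basis of the ROC metric by inspecting the standard bestTune field of the fitted model:

> svm.roc.model$bestTune  # the sigma and C values that maximized the ROC metric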

Our next step is to perform predictions on the test data and evaluate the confusion matrix:

> predictions <- predict(svm.roc.model, 
                         transformed.test.feature.vars)
> confusionMatrix(predictions, transformed.test.class.var, 
                  positive = "X1")

This gives us the following results:


We see now that accuracy has increased further to 72% and specificity has decreased slightly to 40%, but sensitivity has increased to 87%, which is good. We plot the curves once again, as follows:

> svm.predictions <- predict(svm.roc.model, transformed.test.feature.vars, type="prob")
> svm.prediction.values <- svm.predictions[,2]
> predictions <- prediction(svm.prediction.values, test.class.var)
> par(mfrow=c(1,2))
> plot.roc.curve(predictions, title.text="SVM ROC Curve")
> plot.pr.curve(predictions, title.text="SVM Precision/Recall Curve")

This gives us the following plots, similar to the ones from our earlier iterations:


It is quite pleasing to see that the AUC has increased from 0.69 earlier to 0.74 now, which means the AUC-based optimization worked, since it has given better performance than the previous model in all the aspects we have been tracking. Up next, we will look at how to build predictive models using decision trees.
