In this section, we are going to use R's e1071 package to try out the models we've discussed on a real-world dataset. As our first example, we have chosen the QSAR biodegradation data set, which can be found at https://archive.ics.uci.edu/ml/datasets/QSAR+biodegradation. This is a dataset containing 41 numerical variables that describe the molecular composition and properties of 1,055 chemicals. The modeling task is to predict whether a particular chemical will be biodegradable based on these properties. Example properties are the percentages of carbon, nitrogen, and oxygen atoms, as well as the number of heavy atoms in the molecule. These features are highly specialized and numerous enough that a full listing won't be given here; the complete list and further details of the quantities involved can be found on the website. For now, we've downloaded the data into a bdf data frame:
> bdf <- read.table("biodeg.csv", sep = ";", quote = "\"")
> head(bdf, n = 3)
     V1     V2 V3 V4 V5 V6 V7   V8 V9 V10 V11 V12   V13
1 3.919 2.6909  0  0  0  0  0 31.4  2   0   0   0 3.106
2 4.170 2.1144  0  0  0  0  0 30.8  1   1   0   0 2.461
3 3.932 3.2512  0  0  0  0  0 26.7  2   4   0   0 3.279
    V14   V15 V16   V17   V18 V19 V20 V21   V22 V23 V24
1 2.550 9.002   0 0.960 1.142   0   0   0 1.201   0   0
2 1.393 8.723   1 0.989 1.144   0   0   0 1.104   1   0
3 2.585 9.110   0 1.009 1.152   0   0   0 1.092   0   0
  V25 V26   V27    V28 V29 V30   V31 V32 V33 V34 V35
1   0   0 1.932  0.011   0   0 4.489   0   0   0   0
2   0   0 2.214 -0.204   0   0 1.542   0   0   0   0
3   0   0 1.942 -0.008   0   0 4.891   0   0   0   1
    V36   V37 V38   V39 V40 V41 V42
1 2.949 1.591   0 7.253   0   0  RB
2 3.315 1.967   0 7.257   0   0  RB
3 3.076 2.417   0 7.601   0   0  RB
The final column, V42, contains the output variable, which takes the value NRB for chemicals that are not biodegradable and RB for those that are. We'll recode this into the familiar labels of 0 and 1:
> levels(bdf$V42) <- c(0, 1)
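This works because assigning to levels() relabels a factor's levels in their existing (alphabetical) order, so NRB becomes 0 and RB becomes 1. A minimal sketch on a toy factor (not the chapter's data) illustrates the behavior:

```r
# Toy factor: levels are stored alphabetically, so "NRB" comes before "RB"
f <- factor(c("RB", "NRB", "RB"))
levels(f)             # "NRB" "RB"

# Assigning new labels replaces them positionally: NRB -> 0, RB -> 1
levels(f) <- c(0, 1)
f                     # 1 0 1, with levels 0 and 1
```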
Now that we have our data ready, we'll begin, as usual, by splitting them into training and testing sets, with an 80-20 split:
> library(caret)
> set.seed(23419002)
> bdf_sampling_vector <- createDataPartition(bdf$V42, p = 0.80, list = FALSE)
> bdf_train <- bdf[bdf_sampling_vector,]
> bdf_test <- bdf[-bdf_sampling_vector,]
There are a number of packages available in R that implement support vector machines. In this chapter, we'll explore the use of the e1071 package, which provides us with the svm() function. If we examine our training data, we'll quickly notice that the scales of the various features are quite different and that many features are sparse, meaning that they take a zero value for many observations. It is a good idea to scale features as we did with neural networks, especially if we want to work with radial kernels. Fortunately for us, the svm() function has a scale parameter, which is set to TRUE by default. This normalizes the input features so that they have zero mean and unit variance before the model is trained, circumventing the need for us to carry out this preprocessing step manually. The first model that we will investigate will use a linear kernel:
> library(e1071)
> model_lin <- svm(V42 ~ ., data = bdf_train, kernel = "linear", cost = 10)
The call to the svm() function follows the familiar paradigm of first providing a formula, then providing the name of the data frame, and finally, other parameters relevant to the model. In our case, we want to train a model in which the final V42 column is the output column and all other columns are used as features. For this reason, we can just use the simple formula V42 ~ . instead of having to fully enumerate all the other columns. After specifying our data frame, we then specify the type of kernel we will use; in this case, we've opted for a linear kernel. We also specify a value for the cost parameter, which is related to the error budget C in our model:
> model_lin

Call:
svm(formula = V42 ~ ., data = bdf_train, kernel = "linear", cost = 10)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  linear
       cost:  10
      gamma:  0.02439024

Number of Support Vectors:  272
Our model doesn't provide us with much information on its performance beyond the details of the parameters that we specified. One interesting piece of information is the number of data points that were support vectors in our model; in this case, 272. If we use the str() function to examine the structure of the fitted model, however, we will find that it contains a number of useful attributes. For example, the fitted attribute contains the model's predictions on the training data. We'll use this to gauge the quality of the model fit by computing the training accuracy and the confusion matrix:
> mean(bdf_train[,42] == model_lin$fitted)
[1] 0.8887574
> table(actual = bdf_train[,42], predictions = model_lin$fitted)
      predictions
actual   0   1
     0 519  41
     1  53 232
We have a training accuracy of just under 89 percent, which is a decent start. Next, we'll examine the performance on the test data using the predict() function to see whether we can get a test accuracy close to this or whether we have ended up overfitting the data:
> test_predictions <- predict(model_lin, bdf_test[,1:41])
> mean(bdf_test[,42] == test_predictions)
[1] 0.8619048
Our test accuracy is slightly lower, but it is sufficiently close to the training accuracy we obtained earlier for us to be relatively confident that we are not overfitting the training data. Now, we've seen that the cost parameter plays an important role in our model, and that choosing it involves a trade-off between model bias and variance. Consequently, we want to try different values of the cost parameter before settling on a final model. After manually repeating the preceding code for a few values of this parameter, we obtained the following set of results:
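One way to script this manual repetition is sketched below. This is not the book's exact code; it assumes the bdf_train and bdf_test data frames and the e1071 package from earlier in this section:

```r
# Fit a linear-kernel SVM for each cost value and record train/test accuracy
costs <- c(0.01, 0.1, 1, 10, 100, 1000)
linearPerformances <- sapply(costs, function(cost) {
  model <- svm(V42 ~ ., data = bdf_train, kernel = "linear", cost = cost)
  train_acc <- mean(bdf_train[, 42] == model$fitted)
  test_acc  <- mean(bdf_test[, 42] == predict(model, bdf_test[, 1:41]))
  c(training = round(train_acc, 3), test = round(test_acc, 3))
})
colnames(linearPerformances) <- costs
```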
> linearPerformances
          0.01   0.1     1    10   100  1000
training 0.858 0.888 0.883 0.889 0.886 0.886
test     0.886 0.876 0.876 0.862 0.862 0.862
Sometimes, when building a model, we may see a warning informing us that the maximum number of iterations has been reached. If this happens, we should be doubtful of the model produced, as it may indicate that the optimization procedure did not converge on a solution. In such a case, it is best to experiment with a different cost value and/or kernel type.
These results show that for most values of the cost parameter, we are seeing a very similar quality of fit on our training data, roughly 88 percent. Ironically, the best performance on the test data was obtained using the model whose fit on the training data was the worst, with a cost of 0.01. In short, although we have reasonable performance on both our training and test datasets, the low variance in the results shown in the table tells us that we are unlikely to obtain a significant improvement in the quality of fit by tweaking the cost parameter on this particular dataset.
Now let's try using a radial kernel to see whether introducing some nonlinearity can allow us to improve our performance. When we specify a radial kernel, we must also specify a positive gamma parameter, which corresponds to the 1/(2σ²) parameter in the equation of the radial kernel. The role this parameter plays is to control the locality of the similarity computation between the kernel's two vector inputs. A large gamma means that the kernel will produce values close to zero unless the two vectors are very close together; a smaller gamma results in a smoother kernel that takes into account pairs of vectors that are farther apart. Again, this choice boils down to a trade-off between bias and variance, so just as with the cost parameter, we'll have to try out different values of gamma. For now, let's see how we can create a support vector machine model using a radial kernel with a specific configuration:
> model_radial <- svm(V42 ~ ., data = bdf_train, kernel = "radial",
                      cost = 10, gamma = 0.5)
> mean(bdf_train[,42] == model_radial$fitted)
[1] 0.9964497
> test_predictions <- predict(model_radial, bdf_test[,1:41])
> mean(bdf_test[,42] == test_predictions)
[1] 0.8047619
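To make gamma's role concrete before we tune it, we can evaluate the radial kernel by hand. In the form used by e1071, the kernel is K(u, v) = exp(-gamma * ||u - v||²); a minimal sketch on toy vectors (not the chapter's data):

```r
# Radial (RBF) kernel: exp(-gamma * squared Euclidean distance)
rbf_kernel <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))

u <- c(1, 2, 3)
v <- c(2, 2, 4)                  # squared distance ||u - v||^2 is 2
rbf_kernel(u, v, gamma = 0.05)   # ~0.905: the pair still looks similar
rbf_kernel(u, v, gamma = 5)      # ~4.5e-05: the similarity vanishes
```

The same pair of vectors looks "close" under a small gamma and "far" under a large one, which is exactly the locality trade-off described above.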
Note that with these settings, the radial kernel is able to fit the training data much more closely, as indicated by the nearly 100 percent training accuracy; however, the performance on the test dataset is substantially lower than what we obtained on the training data. Consequently, we have a very clear indication that this model is overfitting the data. To get around this problem, we will manually experiment with a few different settings of the gamma and cost parameters to see whether we can improve the fit:
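One way to run this experiment is a small grid search. Again, this is a sketch rather than the book's exact code; it assumes bdf_train, bdf_test, and e1071 from earlier in this section:

```r
# Evaluate every cost/gamma combination with a radial kernel
grid <- expand.grid(cost = c(0.01, 0.1, 1, 10, 100),
                    gamma = c(0.01, 0.05, 0.1, 0.5, 1))
radialPerformances <- apply(grid, 1, function(p) {
  model <- svm(V42 ~ ., data = bdf_train, kernel = "radial",
               cost = p[["cost"]], gamma = p[["gamma"]])
  c(cost = p[["cost"]], gamma = p[["gamma"]],
    training = round(mean(bdf_train[, 42] == model$fitted), 3),
    test = round(mean(bdf_test[, 42] ==
                        predict(model, bdf_test[, 1:41])), 3))
})
```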
> radialPerformances
          [,1]  [,2] [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]
cost      0.01   0.1    1    10   100  0.01   0.1     1    10
gamma     0.01  0.01 0.01  0.01  0.01  0.05  0.05  0.05  0.05
training 0.663 0.824 0.88 0.916 0.951 0.663 0.841 0.918 0.964
test     0.662 0.871 0.89  0.89 0.886 0.662 0.848  0.89  0.89
         [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
cost       100  0.01   0.1     1    10   100  0.01   0.1
gamma     0.05   0.1   0.1   0.1   0.1   0.1   0.5   0.5
training 0.989 0.663 0.815 0.937 0.985 0.995 0.663 0.663
test     0.838 0.662 0.795 0.886 0.867 0.824 0.662 0.662
         [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
cost         1    10   100  0.01   0.1     1    10   100
gamma      0.5   0.5   0.5     1     1     1     1     1
training  0.98 0.996 0.998 0.663 0.663 0.991 0.996 0.999
test      0.79 0.805 0.805 0.662 0.662 0.748 0.757 0.757
As we can see, the combination of the two parameters, cost and gamma, yields a much wider range of results with the radial kernel. From the results we collected, we can see that some combinations, such as cost = 1 and gamma = 0.05, have brought our accuracy up to 89 percent on the test data while still maintaining a comparable performance on the training data. We also see many examples of settings in which the training accuracy is nearly 100 percent but the test accuracy is well below this.
As a result, we conclude that a nonlinear kernel, such as the radial kernel, needs to be used with care in order to avoid overfitting. Nonetheless, radial kernels are very powerful and can be quite effective at modeling a highly nonlinear decision boundary, often allowing us to achieve higher rates of classification accuracy than a linear kernel. At this point in our analysis, we would usually want to settle on particular values for the cost and gamma parameters and then retrain our model using all the available data before deploying it in the real world.
Unfortunately, after using the test set to guide our decision on what parameters to use, it no longer represents an unseen dataset that would enable us to predict the model's accuracy in the real world. One possible solution to this problem is to use a validation set, but this would require us to set aside some of our data, resulting in smaller training and test set sizes.
Cross-validation, which we covered in Chapter 2, Tidying Data and Measuring Performance, should be considered as a practical way out of this dilemma.
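Conveniently, the e1071 package itself offers one route here: its tune() function performs 10-fold cross-validation over a parameter grid by default. A sketch of how this could look (it assumes bdf_train from earlier in this section; the grid shown is illustrative):

```r
# Cross-validated search over cost and gamma (10-fold CV by default)
tuned <- tune(svm, V42 ~ ., data = bdf_train, kernel = "radial",
              ranges = list(cost = c(0.1, 1, 10),
                            gamma = c(0.01, 0.05, 0.1)))
tuned$best.parameters   # cost/gamma with the lowest cross-validated error
tuned$best.model        # an svm model refit on all of bdf_train
```

Because the parameters are chosen by cross-validation within the training set, the test set remains untouched during tuning and can still serve as an estimate of real-world accuracy.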
A very readable book on support vector machines is An Introduction to Support Vector Machines and Other Kernel-based Learning Methods by Nello Cristianini and John Shawe-Taylor. Another good reference, which presents an insightful link between SVMs and a related type of neural network known as a radial basis function network, is Neural Networks and Learning Machines by Simon Haykin, which we also referenced in Chapter 5, Neural Networks.