Feature selection for SVMs

However, all is not lost on feature selection, and I want to take some space to show you a quick way to begin exploring the matter. It will require some trial and error on your part. Once again, the caret package helps out, as it will run a cross-validation on a linear SVM based on the kernlab package.

To do this, we will need to set the random seed, specify the cross-validation method in caret's rfeControl() function, perform recursive feature selection with the rfe() function, and then test how the model performs on the test set. In rfeControl(), you will need to specify the function based on the model being used; there are several different functions available, and here we will need lrFuncs. To see a list of the available functions, your best bet is to explore the documentation with ?rfeControl and ?caretFuncs. The code for this example is as follows:

    > set.seed(123)
    > rfeCNTL <- rfeControl(functions = lrFuncs, method = "cv",
                            number = 10)
    > svm.features <- rfe(train[, 1:7], train[, 8],
                          sizes = c(7, 6, 5, 4),
                          rfeControl = rfeCNTL,
                          method = "svmLinear")
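The train and test data frames are assumed to have been built earlier in the chapter. As a sketch of what that preparation might look like, they can be assembled from the Pima diabetes data that ship with the MASS package (the seed and the 70/30 split here are illustrative assumptions, not the chapter's exact code):

```r
# Illustrative data prep (assumption): build train/test from the
# Pima diabetes data in MASS so the rfe() call above has inputs.
library(MASS)    # provides Pima.tr and Pima.te
library(caret)   # provides rfeControl(), rfe(), lrFuncs

data(Pima.tr)
data(Pima.te)
pima <- rbind(Pima.tr, Pima.te)    # columns 1:7 inputs, column 8 = type

set.seed(502)                      # hypothetical seed for the split
ind <- sample(2, nrow(pima), replace = TRUE, prob = c(0.7, 0.3))
train <- pima[ind == 1, ]
test  <- pima[ind == 2, ]
```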

To create the svm.features object, it was important to specify the inputs, the response factor, the candidate numbers of input features via sizes, and the linear method from kernlab, which is the svmLinear syntax. Other options are available with this method, such as svmPoly, but no method for a sigmoid kernel is available. Calling the object allows us to see how the various feature sizes perform, as follows:

    > svm.features
    Recursive feature selection

    Outer resampling method: Cross-Validated (10 fold)

    Resampling performance over subset size:

     Variables Accuracy  Kappa AccuracySD KappaSD Selected
             4   0.7797 0.4700    0.04969  0.1203
             5   0.7875 0.4865    0.04267  0.1096        *
             6   0.7847 0.4820    0.04760  0.1141
             7   0.7822 0.4768    0.05065  0.1232

    The top 5 variables (out of 5):

Counter-intuitive as it may seem, the five-variable subset performs slightly better on its own than the full set that adds skin and bp. Let's try it out on the test set, remembering that the accuracy of the full model was 76.2 per cent:

    > svm.5 <- svm(type ~ glu + ped + npreg + bmi + age,
                   data = train,
                   kernel = "linear")
    > svm.5.predict <- predict(svm.5, newdata = test[c(1, 2, 5, 6, 7)])
    > table(svm.5.predict, test$type)
    svm.5.predict No Yes
              No  79  21
              Yes 14  33
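As a quick check, the accuracy implied by this confusion matrix can be computed in base R by rebuilding the table's counts (a sketch using the numbers printed above):

```r
# Rebuild the confusion matrix from the printed output and compute
# accuracy as correct predictions divided by total predictions.
cm <- matrix(c(79, 21,
               14, 33),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("No", "Yes"),
                             actual    = c("No", "Yes")))
accuracy <- sum(diag(cm)) / sum(cm)   # (79 + 33) / 147
round(accuracy, 3)                    # 0.762
```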

At (79 + 33) / 147, or 76.2 per cent, this performed no better than the full model, so we can stick with the full model. You can see through trial and error how this technique can play out in order to arrive at a simple assessment of feature importance. If you want to explore other techniques and methods that you can apply here, and for black-box techniques in particular, I recommend that you start by reading the work by Guyon and Elisseeff (2003) on this subject.
