Feature selection for SVMs

However, all is not lost on feature selection, and I want to take some space to show you a quick way to begin exploring the matter. It will require some trial and error on your part. Once again, the caret package helps out, as it will run a cross-validation on a linear SVM based on the kernlab package.

To do this, we will need to set the random seed, specify the cross-validation method in caret's rfeControl() function, perform recursive feature selection with the rfe() function, and then test how the model performs on the test set. In rfeControl(), you will need to specify the function based on the model being used; there are several different functions available, and here we will need lrFuncs. To see a list of the available functions, your best bet is to explore the documentation with ?rfeControl and ?caretFuncs. The code for this example is as follows:

    > set.seed(123)
    > rfeCNTL <- rfeControl(functions = lrFuncs, method = "cv",
                            number = 10)
    > svm.features <- rfe(train[, 1:7], train[, 8],
                          sizes = c(7, 6, 5, 4),
                          rfeControl = rfeCNTL,
                          method = "svmLinear")
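The train and test data frames are assumed to have been built earlier in the chapter. As a sketch of what that preparation might look like, they can be assembled from the Pima diabetes data that ship with the MASS package (the seed and the 70/30 split here are illustrative assumptions, not the chapter's exact code):

```r
# Illustrative data prep (assumption): build train/test from the
# Pima diabetes data in MASS so the rfe() call above has inputs.
library(MASS)    # provides Pima.tr and Pima.te
library(caret)   # provides rfeControl(), rfe(), lrFuncs

data(Pima.tr)
data(Pima.te)
pima <- rbind(Pima.tr, Pima.te)    # columns 1:7 inputs, column 8 = type

set.seed(502)                      # hypothetical seed for the split
ind <- sample(2, nrow(pima), replace = TRUE, prob = c(0.7, 0.3))
train <- pima[ind == 1, ]
test  <- pima[ind == 2, ]
```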

To create the svm.features object, it was important to specify the inputs, the response factor, the candidate numbers of input features via sizes, and the linear method from kernlab, which is the svmLinear syntax. Other options are available with this method, such as svmPoly, but no method for a sigmoid kernel is available. Calling the object allows us to see how the various feature sizes perform, as follows:

    > svm.features
    Recursive feature selection

    Outer resampling method: Cross-Validated (10 fold)

    Resampling performance over subset size:

     Variables Accuracy  Kappa AccuracySD KappaSD Selected
             4   0.7797 0.4700    0.04969  0.1203
             5   0.7875 0.4865    0.04267  0.1096        *
             6   0.7847 0.4820    0.04760  0.1141
             7   0.7822 0.4768    0.05065  0.1232

    The top 5 variables (out of 5):

Counter-intuitive as it may seem, the five-variable subset performs slightly better on its own than the full set that adds skin and bp. Let's try it out on the test set, remembering that the accuracy of the full model was 76.2 per cent:

    > svm.5 <- svm(type ~ glu + ped + npreg + bmi + age,
                   data = train,
                   kernel = "linear")
    > svm.5.predict <- predict(svm.5, newdata = test[c(1, 2, 5, 6, 7)])
    > table(svm.5.predict, test$type)
    svm.5.predict No Yes
              No  79  21
              Yes 14  33
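As a quick check, the accuracy implied by this confusion matrix can be computed in base R by rebuilding the table's counts (a sketch using the numbers printed above):

```r
# Rebuild the confusion matrix from the printed output and compute
# accuracy as correct predictions divided by total predictions.
cm <- matrix(c(79, 21,
               14, 33),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("No", "Yes"),
                             actual    = c("No", "Yes")))
accuracy <- sum(diag(cm)) / sum(cm)   # (79 + 33) / 147
round(accuracy, 3)                    # 0.762
```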

At (79 + 33) / 147, or 76.2 per cent, this performed no better than the full model, so we can stick with the full model. You can see through trial and error how this technique can play out in order to arrive at a simple assessment of feature importance. If you want to explore other techniques and methods that you can apply here, and for black-box techniques in particular, I recommend that you start by reading the work by Guyon and Elisseeff (2003) on this subject.
