Business and data understanding

We are once again going to visit the wine dataset that we used in Chapter 8, Cluster Analysis. If you recall, it consists of 13 numeric features and a response of three possible classes of wine. Our task is to predict those classes. I will include one interesting twist, and that is to artificially increase the number of observations. The reasons are twofold. First, I want to fully demonstrate the resampling capabilities of the mlr package, and second, I wish to cover a synthetic sampling technique. We utilized upsampling in the prior section, so a synthetic technique is in order.

Our first task is to load the package libraries and bring in the data:

    > library(mlr)
    > library(ggplot2)
    > library(HDclassif)
    > library(DMwR)
    > library(reshape2)
    > library(corrplot)
    > data(wine)
    > table(wine$class)

     1  2  3
    59 71 48

We have 178 observations, and the response labels are numeric (1, 2, and 3). Let's more than double the size of our data. The algorithm used in this example is the Synthetic Minority Over-Sampling Technique (SMOTE). In the prior example, we used upsampling, where the minority class was sampled WITH REPLACEMENT until the class size matched the majority. With SMOTE, a random sample of the minority class is taken, the k-nearest neighbors are identified for each of those observations, and new data is randomly generated based on those neighbors. The default number of nearest neighbors in the SMOTE() function from the DMwR package is 5 (k = 5). The other thing you need to consider is the percentage of minority oversampling. For instance, if we want to create a minority class double its current size, we would specify perc.over = 100 in the function. The number of new samples created for each case in the current minority class is perc.over/100, or one new sample for each observation in that example. There is another parameter, perc.under, and it controls the number of majority class observations randomly selected for the new dataset, in proportion to the number of synthetic minority cases generated.

Here is the application of the technique, starting with converting the class to a factor; otherwise, the function will not work:

    > wine$class <- as.factor(wine$class)
    > set.seed(11)
    > df <- SMOTE(class ~ ., wine, perc.over = 300, perc.under = 300)
    > table(df$class)

      1   2   3
    195 237 192
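
To see where these counts come from, here is a quick arithmetic check, based on how the DMwR documentation describes perc.over and perc.under (a sketch for illustration, not part of the original workflow):

    # class 3 is the minority with 48 cases; perc.over = 300 generates
    # 300/100 = 3 synthetic cases per original case
    > 48 + 48 * (300 / 100)
    [1] 192
    # perc.under = 300 randomly selects 3 majority class cases for every
    # synthetic minority case, split here between classes 1 and 2
    > 48 * (300 / 100) * (300 / 100)
    [1] 432
    > 192 + 432
    [1] 624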

Voila! We have created a dataset of 624 observations. Our next endeavor will involve a visualization of a number of the features by class. I am a big fan of boxplots, so let's create boxplots for the first four inputs by class. They have different scales, so putting them into a dataframe scaled to a mean of zero and a standard deviation of one will aid the comparison:

    > wine.scale <- data.frame(scale(wine[, 2:5]))
    > wine.scale$class <- wine$class
    > wine.melt <- melt(wine.scale, id.vars = "class")
    > ggplot(data = wine.melt, aes(x = class, y = value)) +
        geom_boxplot() +
        facet_wrap(~ variable, ncol = 2)

The output of the preceding command is as follows:

Recall from Chapter 3, Logistic Regression and Discriminant Analysis, that a dot on a boxplot is considered an outlier. So, what should we do with them? There are a number of options:

  • Nothing--doing nothing is always an option
  • Delete the outlying observations
  • Truncate the observations either within the current feature or create a separate feature of truncated values
  • Create an indicator variable per feature that captures whether an observation is an outlier

I've always found outliers interesting and usually look at them closely to determine why they occur and what to do with them. We don't have that kind of time here, so let me propose a simple solution, with code, for truncating the outliers. Let's create functions to identify each outlier and reassign a high value (> 99th percentile) to the 75th percentile and a low value (< 1st percentile) to the 25th percentile. You could use the median or some other value, but I've found this approach to work well.

You could put these code excerpts into the same function, but I've done it in this fashion for simplicity and understanding.

These are our outlier functions:

    > outHigh <- function(x) {
        x[x > quantile(x, 0.99)] <- quantile(x, 0.75)
        x
      }

    > outLow <- function(x) {
        x[x < quantile(x, 0.01)] <- quantile(x, 0.25)
        x
      }

Now we apply the functions to the original data and create a new dataframe:

    > wine.trunc <- data.frame(lapply(wine[, -1], outHigh))
    > wine.trunc <- data.frame(lapply(wine.trunc, outLow))
    > wine.trunc$class <- wine$class

A simple comparison of a truncated feature versus the original is in order.  Let's try that with V3:

    > boxplot(wine.trunc$V3 ~ wine.trunc$class)

The output of the preceding command is as follows:
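
If you also want the untruncated feature plotted alongside for a direct visual comparison, a minimal base-graphics sketch (an optional aside, not part of the original workflow) would be:

    # plot the original and truncated V3 side by side
    > par(mfrow = c(1, 2))
    > boxplot(wine$V3 ~ wine$class, main = "original V3")
    > boxplot(wine.trunc$V3 ~ wine.trunc$class, main = "truncated V3")
    > par(mfrow = c(1, 1))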

So that worked out well. Now it's time to look at correlations:

    > c <- cor(wine.trunc[, -14])
    > corrplot.mixed(c, upper = "ellipse")

The output of the preceding command is as follows:

We see that V6 and V7 are the most highly correlated features, and we see a number of correlations above 0.5. In general, this is not a problem with non-linear learning methods, but we will account for it in our GLM by incorporating an L2 penalty (ridge regression).
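
As a preview, one way a ridge penalty can be requested is through mlr's glmnet wrapper. The following is only a minimal sketch under that assumption (the classif.glmnet learner with alpha = 0 for ridge), not the exact code we will build on later:

    # sketch only: an L2-penalized (ridge) classifier set up in mlr
    # via the glmnet wrapper; alpha = 0 corresponds to ridge regression
    > wine.task <- makeClassifTask(id = "wine", data = df, target = "class")
    > ridge.learner <- makeLearner("classif.glmnet", predict.type = "prob",
                           par.vals = list(alpha = 0))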
