Data understanding and preparation

The data for the 532 women comes in two separate data frames. The variables of interest are as follows:

  • npreg: This is the number of pregnancies
  • glu: This is the plasma glucose concentration in an oral glucose tolerance test
  • bp: This is the diastolic blood pressure (mm Hg)
  • skin: This is triceps skin-fold thickness measured in mm
  • bmi: This is the body mass index
  • ped: This is the diabetes pedigree function
  • age: This is the age in years
  • type: This is the diabetes status (the response): Yes or No

The datasets are contained in the R package, MASS. One data frame is named Pima.tr and the other is named Pima.te. Instead of using these as separate train and test sets, we will combine them and create our own, in order to learn how to perform such a task in R.

To begin, let's load the following packages that we will need for the exercise:

    > library(class)    # k-nearest neighbors
    > library(kknn)     # weighted k-nearest neighbors
    > library(e1071)    # SVM
    > library(caret)    # select tuning parameters
    > library(MASS)     # contains the data
    > library(reshape2) # assist in creating boxplots
    > library(ggplot2)  # create boxplots
    > library(kernlab)  # assist with SVM feature selection
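
If any of these packages are missing on your machine, a one-time install along the following lines should take care of it; note that MASS and class ship with the standard R distribution, so they are omitted here:

    > install.packages(c("kknn", "e1071", "caret", "reshape2",
                         "ggplot2", "kernlab"))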

We will now load the datasets and check their structure, ensuring that they are the same, starting with Pima.tr, as follows:

    > data(Pima.tr)
    > str(Pima.tr)
    'data.frame': 200 obs. of 8 variables:
     $ npreg: int 5 7 5 0 0 5 3 1 3 2 ...
     $ glu  : int 86 195 77 165 107 97 83 193 142 128 ...
     $ bp   : int 68 70 82 76 60 76 58 50 80 78 ...
     $ skin : int 28 33 41 43 25 27 31 16 15 37 ...
     $ bmi  : num 30.2 25.1 35.8 47.9 26.4 35.6 34.3 25.9 32.4 43.3 ...
     $ ped  : num 0.364 0.163 0.156 0.259 0.133 ...
     $ age  : int 24 55 35 26 23 52 25 24 63 31 ...
     $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...
    > data(Pima.te)
    > str(Pima.te)
    'data.frame': 332 obs. of 8 variables:
     $ npreg: int 6 1 1 3 2 5 0 1 3 9 ...
     $ glu  : int 148 85 89 78 197 166 118 103 126 119 ...
     $ bp   : int 72 66 66 50 70 72 84 30 88 80 ...
     $ skin : int 35 29 23 32 45 19 47 38 41 35 ...
     $ bmi  : num 33.6 26.6 28.1 31 30.5 25.8 45.8 43.3 39.3 29 ...
     $ ped  : num 0.627 0.351 0.167 0.248 0.158 0.587 0.551 0.183 0.704 0.263 ...
     $ age  : int 50 31 21 26 53 51 31 33 27 29 ...
     $ type : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 1 1 2 ...

Looking at the structures, we can be confident that we can combine the data frames into one. This is very easy to do using the rbind() function, which stands for row binding and appends the data frames by rows. If you instead had the same observations in each frame and wanted to append features, you would bind them by columns using the cbind() function. You simply name the new data frame and use this syntax: new.data <- rbind(data.frame1, data.frame2). Our code thus becomes the following:

    > pima <- rbind(Pima.tr, Pima.te)
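
If the row-versus-column distinction is unclear, here is a minimal, self-contained sketch with two made-up frames, df1 and df2, that you can run on its own:

    > df1 <- data.frame(id = 1:2, x = c("a", "b"))
    > df2 <- data.frame(id = 3:4, x = c("c", "d"))
    > rbind(df1, df2)                    # stacks rows; column names must match
    > cbind(df1, flag = c(TRUE, FALSE))  # appends a column; row counts must match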

As always, double-check the structure. We can see that there are no issues:

    > str(pima)
    'data.frame': 532 obs. of 8 variables:
     $ npreg: int 5 7 5 0 0 5 3 1 3 2 ...
     $ glu  : int 86 195 77 165 107 97 83 193 142 128 ...
     $ bp   : int 68 70 82 76 60 76 58 50 80 78 ...
     $ skin : int 28 33 41 43 25 27 31 16 15 37 ...
     $ bmi  : num 30.2 25.1 35.8 47.9 26.4 35.6 34.3 25.9 32.4 43.3 ...
     $ ped  : num 0.364 0.163 0.156 0.259 0.133 ...
     $ age  : int 24 55 35 26 23 52 25 24 63 31 ...
     $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...

Let's do some exploratory analysis by putting the features into boxplots. For this, we want to use the outcome variable, "type", as our ID variable. As we did with logistic regression, the melt() function will reshape the data into a long format suitable for the boxplots. We will call the new data frame pima.melt, as follows:

    > pima.melt <- melt(pima, id.var = "type")

The boxplot layout from the ggplot2 package is quite effective, so we will use it. In the ggplot() function, we specify the data to use and map the response variable to x and its value to y in aes(). Then, geom_boxplot() draws the boxplots, and facet_wrap() lays them out as a series of panels, one per feature, in two columns:

    > ggplot(data = pima.melt, aes(x = type, y = value)) +
        geom_boxplot() + facet_wrap(~ variable, ncol = 2)

The following is the output of the preceding command:

[Figure: boxplots of each feature by diabetes type, all panels on a common y-axis scale]
This is an interesting plot because it is difficult to discern any dramatic differences between the groups, probably with the exception of glucose (glu). As you may have suspected, the plasma glucose appears to be noticeably higher in the patients currently diagnosed with diabetes. The main problem here is that the plots all share the same y-axis scale. We can fix this, and produce a more meaningful plot, by standardizing the values and then re-plotting. R has a built-in function, scale(), which converts the values to a mean of zero and a standard deviation of one. Let's put the result in a new data frame called pima.scale, converting all of the features and leaving out the type response.

Scaling matters for KNN in particular: the features must be on a common scale, or the distance calculations between neighbors are distorted. A feature measured on a scale of 1 to 100 will dominate one measured on a scale of 1 to 10 simply because of its units. Note that when you scale a data frame, it automatically becomes a matrix, so we use the data.frame() function to convert it back, as follows:

    > pima.scale <- data.frame(scale(pima[, -8]))
    > str(pima.scale)
    'data.frame': 532 obs. of 7 variables:
     $ npreg: num 0.448 1.052 0.448 -1.062 -1.062 ...
     $ glu  : num -1.13 2.386 -1.42 1.418 -0.453 ...
     $ bp   : num -0.285 -0.122 0.852 0.365 -0.935 ...
     $ skin : num -0.112 0.363 1.123 1.313 -0.397 ...
     $ bmi  : num -0.391 -1.132 0.423 2.181 -0.943 ...
     $ ped  : num -0.403 -0.987 -1.007 -0.708 -1.074 ...
     $ age  : num -0.708 2.173 0.315 -0.522 -0.801 ...
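
As an optional sanity check, you can reproduce what scale() does by hand for a single feature using the z-score formula, z = (x - mean(x)) / sd(x); this is only a sketch to build intuition, not a required step:

    > z.glu <- (pima$glu - mean(pima$glu)) / sd(pima$glu)
    > all.equal(pima.scale$glu, z.glu) # should return TRUE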

Now, we will need to include the response in the data frame, as follows:

    > pima.scale$type <- pima$type

Let's just repeat the boxplotting process again with melt() and ggplot():

    > pima.scale.melt <- melt(pima.scale, id.var = "type")
    > ggplot(data = pima.scale.melt, aes(x = type, y = value)) +
        geom_boxplot() + facet_wrap(~ variable, ncol = 2)

The following is the output of the preceding command:

[Figure: boxplots of each standardized feature by diabetes type]
With the features scaled, the plot is easier to read. In addition to glucose, it appears that the other features may differ by type, in particular, age.

Before splitting this into train and test sets, let's have a look at the Pearson correlations with the R function, cor(). This produces a correlation matrix rather than a plot:

    > cor(pima.scale[-8])
                npreg       glu          bp       skin         bmi         ped        age
    npreg 1.000000000 0.1253296 0.204663421 0.09508511 0.008576282 0.007435104 0.64074687
    glu   0.125329647 1.0000000 0.219177950 0.22659042 0.247079294 0.165817411 0.27890711
    bp    0.204663421 0.2191779 1.000000000 0.22607244 0.307356904 0.008047249 0.34693872
    skin  0.095085114 0.2265904 0.226072440 1.00000000 0.647422386 0.118635569 0.16133614
    bmi   0.008576282 0.2470793 0.307356904 0.64742239 1.000000000 0.151107136 0.07343826
    ped   0.007435104 0.1658174 0.008047249 0.11863557 0.151107136 1.000000000 0.07165413
    age   0.640746866 0.2789071 0.346938723 0.16133614 0.073438257 0.071654133 1.00000000
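
Scanning a seven-by-seven matrix by eye is error-prone, so here is a short, optional base R sketch that pulls out the stronger pairs; the 0.5 cutoff is an arbitrary choice for illustration:

    > cors <- cor(pima.scale[-8])
    > high <- which(abs(cors) > 0.5 & upper.tri(cors), arr.ind = TRUE)
    > data.frame(var1 = rownames(cors)[high[, 1]],
                 var2 = colnames(cors)[high[, 2]],
                 r = round(cors[high], 2))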

There are a couple of correlations to point out: npreg/age (0.64) and skin/bmi (0.65). Multicollinearity is generally not a problem with these methods, provided the models are properly trained and the hyperparameters are tuned.

I think we are now ready to create the train and test sets, but before we do so, I recommend that you always check the ratio of Yes to No in the response. It is important to make sure that you get a balanced split in the data, which may be a problem if one of the outcomes is sparse; an imbalance can bias a classifier toward the majority class. There is no hard and fast rule on what constitutes an improper balance, but a good rule of thumb is to aim for a ratio between the outcomes of no worse than 2:1 (He and Ma, 2013):

    > table(pima.scale$type)
     No Yes
    355 177
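
If proportions are easier to judge than counts, the same check can be expressed with prop.table():

    > prop.table(table(pima.scale$type))
           No       Yes
    0.6672932 0.3327068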

The ratio is right at 2:1, so we can create the train and test sets with our usual syntax, using a 70/30 split, in the following way:

    > set.seed(502)
    > ind <- sample(2, nrow(pima.scale), replace = TRUE,
                    prob = c(0.7, 0.3))
    > train <- pima.scale[ind == 1, ]
    > test <- pima.scale[ind == 2, ]
    > str(train)
    'data.frame': 385 obs. of 8 variables:
     $ npreg: num 0.448 0.448 -0.156 -0.76 -0.156 ...
     $ glu  : num -1.42 -0.775 -1.227 2.322 0.676 ...
     $ bp   : num 0.852 0.365 -1.097 -1.747 0.69 ...
     $ skin : num 1.123 -0.207 0.173 -1.253 -1.348 ...
     $ bmi  : num 0.4229 0.3938 0.2049 -1.0159 -0.0712 ...
     $ ped  : num -1.007 -0.363 -0.485 0.441 -0.879 ...
     $ age  : num 0.315 1.894 -0.615 -0.708 2.916 ...
     $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 2 1 1 1 ...
    > str(test)
    'data.frame': 147 obs. of 8 variables:
     $ npreg: num 0.448 1.052 -1.062 -1.062 -0.458 ...
     $ glu  : num -1.13 2.386 1.418 -0.453 0.225 ...
     $ bp   : num -0.285 -0.122 0.365 -0.935 0.528 ...
     $ skin : num -0.112 0.363 1.313 -0.397 0.743 ...
     $ bmi  : num -0.391 -1.132 2.181 -0.943 1.513 ...
     $ ped  : num -0.403 -0.987 -0.708 -1.074 2.093 ...
     $ age  : num -0.7076 2.173 -0.5217 -0.8005 -0.0571 ...
     $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 2 1 1 1 ...
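
Incidentally, since caret is already loaded, a stratified split is another option. This sketch (the names idx, train2, and test2 are just for illustration) preserves the Yes/No ratio in both sets rather than leaving the balance to chance:

    > set.seed(502)
    > idx <- createDataPartition(pima.scale$type, p = 0.7, list = FALSE)
    > train2 <- pima.scale[idx, ]
    > test2 <- pima.scale[-idx, ]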

All seems to be in order, so we can move on to building our predictive models and evaluating them, starting with KNN.
