Data understanding and preparation

The data for the 532 women comes in two separate data frames. The variables of interest are as follows:

  • npreg: This is the number of pregnancies
  • glu: This is the plasma glucose concentration in an oral glucose tolerance test
  • bp: This is the diastolic blood pressure (mm Hg)
  • skin: This is triceps skin-fold thickness measured in mm
  • bmi: This is the body mass index
  • ped: This is the diabetes pedigree function
  • age: This is the age in years
  • type: This is the diabetes status (the response): Yes or No

The datasets are contained in the R package, MASS. One data frame is named Pima.tr and the other is named Pima.te. Instead of using these as separate train and test sets, we will combine them and create our own, in order to learn how to perform such a task in R.

To begin, let's load the following packages that we will need for the exercise:

    > library(class)    # k-nearest neighbors
    > library(kknn)     # weighted k-nearest neighbors
    > library(e1071)    # SVM
    > library(caret)    # select tuning parameters
    > library(MASS)     # contains the data
    > library(reshape2) # assist in creating boxplots
    > library(ggplot2)  # create boxplots
    > library(kernlab)  # assist with SVM feature selection
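
If any of these packages are missing on your machine, a one-time install along the following lines should take care of it; note that MASS and class ship with the standard R distribution, so they are omitted here:

    > install.packages(c("kknn", "e1071", "caret", "reshape2",
                         "ggplot2", "kernlab"))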

We will now load the datasets and check their structure, ensuring that they are the same, starting with Pima.tr, as follows:

    > data(Pima.tr)
    > str(Pima.tr)
    'data.frame': 200 obs. of 8 variables:
     $ npreg: int 5 7 5 0 0 5 3 1 3 2 ...
     $ glu  : int 86 195 77 165 107 97 83 193 142 128 ...
     $ bp   : int 68 70 82 76 60 76 58 50 80 78 ...
     $ skin : int 28 33 41 43 25 27 31 16 15 37 ...
     $ bmi  : num 30.2 25.1 35.8 47.9 26.4 35.6 34.3 25.9 32.4 43.3 ...
     $ ped  : num 0.364 0.163 0.156 0.259 0.133 ...
     $ age  : int 24 55 35 26 23 52 25 24 63 31 ...
     $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...
    > data(Pima.te)
    > str(Pima.te)
    'data.frame': 332 obs. of 8 variables:
     $ npreg: int 6 1 1 3 2 5 0 1 3 9 ...
     $ glu  : int 148 85 89 78 197 166 118 103 126 119 ...
     $ bp   : int 72 66 66 50 70 72 84 30 88 80 ...
     $ skin : int 35 29 23 32 45 19 47 38 41 35 ...
     $ bmi  : num 33.6 26.6 28.1 31 30.5 25.8 45.8 43.3 39.3 29 ...
     $ ped  : num 0.627 0.351 0.167 0.248 0.158 0.587 0.551 0.183 0.704 0.263 ...
     $ age  : int 50 31 21 26 53 51 31 33 27 29 ...
     $ type : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 1 1 2 ...

Looking at the structures, we can be confident that we can combine the data frames into one. This is very easy to do using the rbind() function, which stands for row binding and appends the data frames by rows. If you instead had the same observations in each frame and wanted to append features, you would bind them by columns using the cbind() function. You simply name the new data frame and use this syntax: new.data <- rbind(data.frame1, data.frame2). Our code thus becomes the following:

    > pima <- rbind(Pima.tr, Pima.te)
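
If the row-versus-column distinction is unclear, here is a minimal, self-contained sketch with two made-up frames, df1 and df2, that you can run on its own:

    > df1 <- data.frame(id = 1:2, x = c("a", "b"))
    > df2 <- data.frame(id = 3:4, x = c("c", "d"))
    > rbind(df1, df2)                    # stacks rows; column names must match
    > cbind(df1, flag = c(TRUE, FALSE))  # appends a column; row counts must match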

As always, double-check the structure. We can see that there are no issues:

    > str(pima)
    'data.frame': 532 obs. of 8 variables:
     $ npreg: int 5 7 5 0 0 5 3 1 3 2 ...
     $ glu  : int 86 195 77 165 107 97 83 193 142 128 ...
     $ bp   : int 68 70 82 76 60 76 58 50 80 78 ...
     $ skin : int 28 33 41 43 25 27 31 16 15 37 ...
     $ bmi  : num 30.2 25.1 35.8 47.9 26.4 35.6 34.3 25.9 32.4 43.3 ...
     $ ped  : num 0.364 0.163 0.156 0.259 0.133 ...
     $ age  : int 24 55 35 26 23 52 25 24 63 31 ...
     $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...

Let's do some exploratory analysis by putting the features into boxplots. For this, we want to use the outcome variable, "type", as our ID variable. As we did with logistic regression, the melt() function will reshape the data into a long format suitable for the boxplots. We will call the new data frame pima.melt, as follows:

    > pima.melt <- melt(pima, id.var = "type")

The boxplot layout from the ggplot2 package is quite effective, so we will use it. In the ggplot() function, we specify the data to use and map the response variable to x and its value to y in aes(). Then, geom_boxplot() draws the boxplots, and facet_wrap() lays them out as a series of panels, one per feature, in two columns:

    > ggplot(data = pima.melt, aes(x = type, y = value)) +
        geom_boxplot() + facet_wrap(~ variable, ncol = 2)

The following is the output of the preceding command:

[Figure: boxplots of each feature by diabetes type, all panels on a common y-axis scale]
This is an interesting plot because it is difficult to discern any dramatic differences between the groups, probably with the exception of glucose (glu). As you may have suspected, the plasma glucose appears to be noticeably higher in the patients currently diagnosed with diabetes. The main problem here is that the plots all share the same y-axis scale. We can fix this, and produce a more meaningful plot, by standardizing the values and then re-plotting. R has a built-in function, scale(), which converts the values to a mean of zero and a standard deviation of one. Let's put the result in a new data frame called pima.scale, converting all of the features and leaving out the type response.

Scaling matters for KNN in particular: the features must be on a common scale, or the distance calculations between neighbors are distorted. A feature measured on a scale of 1 to 100 will dominate one measured on a scale of 1 to 10 simply because of its units. Note that when you scale a data frame, it automatically becomes a matrix, so we use the data.frame() function to convert it back, as follows:

    > pima.scale <- data.frame(scale(pima[, -8]))
    > str(pima.scale)
    'data.frame': 532 obs. of 7 variables:
     $ npreg: num 0.448 1.052 0.448 -1.062 -1.062 ...
     $ glu  : num -1.13 2.386 -1.42 1.418 -0.453 ...
     $ bp   : num -0.285 -0.122 0.852 0.365 -0.935 ...
     $ skin : num -0.112 0.363 1.123 1.313 -0.397 ...
     $ bmi  : num -0.391 -1.132 0.423 2.181 -0.943 ...
     $ ped  : num -0.403 -0.987 -1.007 -0.708 -1.074 ...
     $ age  : num -0.708 2.173 0.315 -0.522 -0.801 ...
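
As an optional sanity check, you can reproduce what scale() does by hand for a single feature using the z-score formula, z = (x - mean(x)) / sd(x); this is only a sketch to build intuition, not a required step:

    > z.glu <- (pima$glu - mean(pima$glu)) / sd(pima$glu)
    > all.equal(pima.scale$glu, z.glu) # should return TRUE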

Now, we will need to include the response in the data frame, as follows:

    > pima.scale$type <- pima$type

Let's just repeat the boxplotting process again with melt() and ggplot():

    > pima.scale.melt <- melt(pima.scale, id.var = "type")
    > ggplot(data = pima.scale.melt, aes(x = type, y = value)) +
        geom_boxplot() + facet_wrap(~ variable, ncol = 2)

The following is the output of the preceding command:

[Figure: boxplots of each standardized feature by diabetes type]
With the features scaled, the plot is easier to read. In addition to glucose, it appears that the other features may differ by type, in particular, age.

Before splitting this into train and test sets, let's have a look at the Pearson correlations with the R function, cor(). This produces a correlation matrix rather than a plot:

    > cor(pima.scale[-8])
                npreg       glu          bp       skin         bmi         ped        age
    npreg 1.000000000 0.1253296 0.204663421 0.09508511 0.008576282 0.007435104 0.64074687
    glu   0.125329647 1.0000000 0.219177950 0.22659042 0.247079294 0.165817411 0.27890711
    bp    0.204663421 0.2191779 1.000000000 0.22607244 0.307356904 0.008047249 0.34693872
    skin  0.095085114 0.2265904 0.226072440 1.00000000 0.647422386 0.118635569 0.16133614
    bmi   0.008576282 0.2470793 0.307356904 0.64742239 1.000000000 0.151107136 0.07343826
    ped   0.007435104 0.1658174 0.008047249 0.11863557 0.151107136 1.000000000 0.07165413
    age   0.640746866 0.2789071 0.346938723 0.16133614 0.073438257 0.071654133 1.00000000
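
Scanning a seven-by-seven matrix by eye is error-prone, so here is a short, optional base R sketch that pulls out the stronger pairs; the 0.5 cutoff is an arbitrary choice for illustration:

    > cors <- cor(pima.scale[-8])
    > high <- which(abs(cors) > 0.5 & upper.tri(cors), arr.ind = TRUE)
    > data.frame(var1 = rownames(cors)[high[, 1]],
                 var2 = colnames(cors)[high[, 2]],
                 r = round(cors[high], 2))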

There are a couple of correlations to point out: npreg/age (0.64) and skin/bmi (0.65). Multicollinearity is generally not a problem with these methods, provided the models are properly trained and the hyperparameters are tuned.

I think we are now ready to create the train and test sets, but before we do so, I recommend that you always check the ratio of Yes to No in the response. It is important to make sure that you get a balanced split in the data, which may be a problem if one of the outcomes is sparse; an imbalance can bias a classifier toward the majority class. There is no hard and fast rule on what constitutes an improper balance, but a good rule of thumb is to aim for a ratio between the outcomes of no worse than 2:1 (He and Ma, 2013):

    > table(pima.scale$type)
     No Yes
    355 177
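
If proportions are easier to judge than counts, the same check can be expressed with prop.table():

    > prop.table(table(pima.scale$type))
           No       Yes
    0.6672932 0.3327068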

The ratio is right at 2:1, so we can create the train and test sets with our usual syntax, using a 70/30 split, in the following way:

    > set.seed(502)
    > ind <- sample(2, nrow(pima.scale), replace = TRUE,
                    prob = c(0.7, 0.3))
    > train <- pima.scale[ind == 1, ]
    > test <- pima.scale[ind == 2, ]
    > str(train)
    'data.frame': 385 obs. of 8 variables:
     $ npreg: num 0.448 0.448 -0.156 -0.76 -0.156 ...
     $ glu  : num -1.42 -0.775 -1.227 2.322 0.676 ...
     $ bp   : num 0.852 0.365 -1.097 -1.747 0.69 ...
     $ skin : num 1.123 -0.207 0.173 -1.253 -1.348 ...
     $ bmi  : num 0.4229 0.3938 0.2049 -1.0159 -0.0712 ...
     $ ped  : num -1.007 -0.363 -0.485 0.441 -0.879 ...
     $ age  : num 0.315 1.894 -0.615 -0.708 2.916 ...
     $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 2 1 1 1 ...
    > str(test)
    'data.frame': 147 obs. of 8 variables:
     $ npreg: num 0.448 1.052 -1.062 -1.062 -0.458 ...
     $ glu  : num -1.13 2.386 1.418 -0.453 0.225 ...
     $ bp   : num -0.285 -0.122 0.365 -0.935 0.528 ...
     $ skin : num -0.112 0.363 1.313 -0.397 0.743 ...
     $ bmi  : num -0.391 -1.132 2.181 -0.943 1.513 ...
     $ ped  : num -0.403 -0.987 -0.708 -1.074 2.093 ...
     $ age  : num -0.7076 2.173 -0.5217 -0.8005 -0.0571 ...
     $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 2 1 1 1 ...
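
Incidentally, since caret is already loaded, a stratified split is another option. This sketch (the names idx, train2, and test2 are just for illustration) preserves the Yes/No ratio in both sets rather than leaving the balance to chance:

    > set.seed(502)
    > idx <- createDataPartition(pima.scale$type, p = 0.7, list = FALSE)
    > train2 <- pima.scale[idx, ]
    > test2 <- pima.scale[-idx, ]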

All seems to be in order, so we can move on to building our predictive models and evaluating them, starting with KNN.
