The dataset for this analysis is DNA, taken from the mlbench package. You don't have to install that package, as I've saved the data as a CSV file on GitHub: https://github.com/PacktPublishing/Advanced-Machine-Learning-with-R/blob/master/Data/dna.csv.
Install the packages as needed (caret is called through caret:: later in this section), then load magrittr and read in the data:
> install.packages("caret")
> install.packages("earth")
> install.packages("glmnet")
> install.packages("mlr")
> install.packages("randomForest")
> install.packages("tidyverse")
> library(magrittr)
> dna <- read.csv("dna.csv")
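If you'd rather not reinstall packages you already have, a minimal sketch along the following lines lists only what is missing. The helper name missing_pkgs is made up for illustration, and the install call is left commented out so that nothing is downloaded until you choose to.

```r
# Packages used in this section (caret is called via caret:: further on).
pkgs <- c("caret", "earth", "glmnet", "magrittr", "mlr",
          "randomForest", "tidyverse")

# Hypothetical helper: return the wanted packages that are not yet installed.
missing_pkgs <- function(wanted,
                         installed = rownames(installed.packages())) {
  setdiff(wanted, installed)
}

to_install <- missing_pkgs(pkgs)
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to actually install
}
```

Passing the installed set as an argument keeps the helper easy to check without touching your library.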
The data consists of 3,186 observations, 180 input features coded as binary indicators, and the Class response. The response is a factor with three labels indicating the DNA type: ei, ie, or neither (coded as n). The following is a table of the target labels:
> table(dna$Class)
  ei   ie    n
 767  765 1654
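To gauge class balance, the counts can be turned into proportions. The sketch below types in the counts from the table rather than reading the file, so it runs on its own; with the data loaded, prop.table(table(dna$Class)) should give the same shares.

```r
# Class counts copied from the table above.
class_counts <- c(ei = 767, ie = 765, n = 1654)

# Share of each label in the full dataset.
class_share <- class_counts / sum(class_counts)
print(round(class_share, 3))
```

The n label makes up a little over half of the observations, which is a useful baseline to keep in mind when judging classifier accuracy later.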
This data should be ready for analysis, but let's run some quick checks to verify, starting with missing values:
> na_count <- sapply(dna, function(y) sum(is.na(y)))
> table(na_count)
na_count
0
181
With no missing values, we check for zero-variance features:
> feature_variance <- caret::nearZeroVar(dna[, -181], saveMetrics = TRUE)
> table(feature_variance$zeroVar)
FALSE
180
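For intuition about what that check looks for, here is a minimal base R sketch on a toy data frame (not the DNA data). It is only an illustration of the zero-variance idea, not a replacement for caret::nearZeroVar(), which also reports near-zero variance metrics.

```r
# Toy data frame (not the DNA data): column b never changes.
toy <- data.frame(a = c(0, 1, 0, 1),
                  b = c(1, 1, 1, 1),
                  c = c(0, 0, 0, 1))

# A feature has zero variance when it takes a single distinct value.
is_constant <- vapply(toy, function(x) length(unique(x)) == 1, logical(1))
print(names(toy)[is_constant])
```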
The authors of mlbench transformed the original nucleotide factor features (A, C, G, T) into indicator features. They also de-identified the features, naming them V1 through V180.
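As a rough illustration of turning a nucleotide factor into indicator columns, the sketch below one-hot encodes a made-up sequence with model.matrix(). The sequence is invented, and the exact coding scheme used for the mlbench data may differ from this generic encoding.

```r
# Made-up nucleotides observed at a single position.
nucleotide <- factor(c("A", "C", "G", "T", "G", "A"),
                     levels = c("A", "C", "G", "T"))

# One 0/1 column per level; "- 1" drops the intercept so no level is omitted.
one_hot <- model.matrix(~ nucleotide - 1)
colnames(one_hot) <- levels(nucleotide)
print(one_hot)
```

Each row contains exactly one 1, so the indicator columns that come from the same position are not independent of one another.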
Next, let's check feature correlation:
> high_corr <- caret::findCorrelation(dna[, -181], cutoff = 0.9)
> length(high_corr)
[1] 173
It's a highly correlated dataset. We could run our feature selection methods as we've done in previous chapters, but let's press on with all features and see what happens.
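To see which feature pairs drive a result like this, it can help to inspect the correlation matrix directly. The following is a minimal base R sketch on toy binary features (not the DNA data), so its output is unrelated to the count above. Note also that caret documents the first argument of findCorrelation() as a correlation matrix, so when adapting the check you may want to pass cor(dna[, -181]) and compare; the count may differ.

```r
# Toy binary features (not the DNA data); x2 is a near copy of x1.
x1 <- rep(c(0, 1), times = 100)
x2 <- x1
x2[1:4] <- 1 - x2[1:4]                 # flip a few entries
x3 <- rep(c(0, 0, 1, 1), times = 50)   # unrelated pattern
toy <- data.frame(x1, x2, x3)

# Pairwise correlation matrix of the features.
corr <- cor(toy)

# Use the upper triangle only, so each pair is reported once.
high <- which(abs(corr) > 0.9 & upper.tri(corr), arr.ind = TRUE)
high_pairs <- data.frame(feature_1 = rownames(corr)[high[, "row"]],
                         feature_2 = colnames(corr)[high[, "col"]])
print(high_pairs)
```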
Before doing so, let's get the train and test sets created:
> set.seed(555)
> index <- caret::createDataPartition(y = dna$Class, p = 0.8, list = FALSE)
> train <- dna[index, ]
> test <- dna[-index, ]
This creates an 80/20 split, sampled within each level of Class, and we can move on to building an algorithm.
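For a factor response, createDataPartition() samples within each class so that the label proportions carry over to both sets. The sketch below mimics that idea in base R on toy labels (not the DNA data); it is a simplified stand-in under that assumption, not caret's actual implementation.

```r
set.seed(555)

# Toy labels (not the DNA data) with an imbalanced three-class response.
labels <- factor(rep(c("ei", "ie", "n"), times = c(20, 20, 60)))

# Sample 80% of the row indices within each class.
by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}), use.names = FALSE)

train_labels <- labels[train_idx]
test_labels  <- labels[-train_idx]

print(table(train_labels))
print(table(test_labels))
```

Comparing the two tables is a quick way to confirm that both sets keep roughly the same class mix; the same comparison works on train$Class and test$Class.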