Data understanding

The dataset for analysis here is DNA pulled from mlbench. You don't have to install the package as I've put it in a CSV file and placed it on GitHub: https://github.com/PacktPublishing/Advanced-Machine-Learning-with-R/blob/master/Data/dna.csv.

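Install the packages as needed and load the data. The checks that follow call caret functions through the caret:: prefix, so caret needs to be installed as well:

> library(magrittr)

> install.packages("caret")

> install.packages("earth")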

> install.packages("glmnet")

> install.packages("mlr")

> install.packages("randomForest")

> install.packages("tidyverse")

> dna <- read.csv("dna.csv")
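If you'd rather not reinstall packages you already have, a minimal sketch along these lines checks what is missing first. The helper name missing_packages is made up for this illustration, and the install call is left commented out so nothing is downloaded until you choose to run it:

```r
# Packages used in this section (caret is called via caret:: in the checks).
pkgs <- c("caret", "earth", "glmnet", "mlr", "randomForest", "tidyverse")

# Hypothetical helper: return the names of packages that cannot be loaded
# from the local library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```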

The data consists of 3,186 observations, 180 input features coded as binary indicators, and the Class response. The response is a factor with three labels indicating the DNA type: ei, ie, or neither (coded as n). The following is a table of the target labels:

> table(dna$Class)

  ei  ie    n 
 767 765 1654
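The counts are easier to judge as proportions. As a small sketch, with the counts typed in by hand from the table above rather than read from the file, we can compute each label's share and the accuracy we'd get by always predicting the most common label:

```r
# Class counts copied from the table above.
class_counts <- c(ei = 767, ie = 765, n = 1654)

# Share of each label in the data.
class_share <- class_counts / sum(class_counts)
print(round(class_share, 3))

# Accuracy of a naive model that always predicts the majority class;
# a useful floor when judging the classifiers built later.
baseline_accuracy <- max(class_share)
print(names(which.max(class_share)))
```

With the real data loaded, prop.table(table(dna$Class)) should give the same shares.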

This data should be ready for analysis, but let's run some quick checks to verify, starting with missing values:

> na_count <- sapply(dna, function(y) sum(is.na(y)))

> table(na_count)
na_count
  0 
181
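To see what this check reports when values actually are missing, here is a sketch on a made-up data frame (toy is invented for illustration and is not part of the DNA data). It uses colSums(is.na()), which gives the same per-column counts:

```r
# Toy data frame, made up for illustration, with two missing values.
toy <- data.frame(
  V1 = c(0, 1, NA, 1),
  V2 = c(1, 0, 0, 1),
  Class = factor(c("ei", NA, "n", "ie"))
)

# is.na() on a data frame returns a logical matrix, so colSums()
# counts the missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Names of the columns with at least one missing value.
cols_with_na <- names(na_per_column)[na_per_column > 0]
print(cols_with_na)
```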

With no missing values, we check for zero variance features:

> feature_variance <- caret::nearZeroVar(dna[, -181], saveMetrics = TRUE)

> table(feature_variance$zeroVar)

FALSE 
  180
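For intuition, a zero variance feature is simply a column with a single distinct value. The sketch below flags such columns in base R on a made-up data frame; it is a simplified stand-in for illustration, not caret's implementation:

```r
# Toy indicator features, made up for illustration.
toy <- data.frame(
  V1 = c(0, 1, 0, 1),
  V2 = c(1, 1, 1, 1),  # constant column: zero variance
  V3 = c(0, 0, 1, 0)
)

# A column has zero variance when it holds exactly one distinct value.
is_constant <- vapply(toy, function(x) length(unique(x)) == 1, logical(1))

constant_cols <- names(toy)[is_constant]
print(constant_cols)
```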

One of the things the authors of mlbench did with this data is transform the nucleotide factor features (A, C, G, T) into indicator features. They also de-identified the features, naming them V1 through V180.

Because indicators derived from the same nucleotide position are related to one another, let's check feature correlation:

> high_corr <- caret::findCorrelation(dna[, -181], cutoff = 0.9)

> length(high_corr)
[1] 173
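One caveat: caret's documentation describes the first argument of findCorrelation as a correlation matrix, so this kind of check is usually written with cor() applied first, for example caret::findCorrelation(cor(dna[, -181]), cutoff = 0.9). I haven't rerun it that way here, and the count may well differ from the one printed above. The sketch below shows the idea in base R on made-up data:

```r
# Toy indicator features, made up for illustration.
toy <- data.frame(
  V1 = rep(c(0, 1), times = 50),
  V3 = rep(c(0, 0, 1, 1), times = 25)  # uncorrelated with V1
)
toy$V2 <- toy$V1                        # exact copy of V1
toy <- toy[, c("V1", "V2", "V3")]

# Step 1: pairwise correlations between features.
corr_matrix <- cor(toy)

# Step 2: pairs above the cutoff, using only the upper triangle
# so each pair is counted once and the diagonal is ignored.
high <- which(abs(corr_matrix) > 0.9 & upper.tri(corr_matrix), arr.ind = TRUE)

high_pairs <- data.frame(
  feature_1 = rownames(corr_matrix)[high[, "row"]],
  feature_2 = colnames(corr_matrix)[high[, "col"]],
  stringsAsFactors = FALSE
)
print(high_pairs)
```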

It's a highly correlated dataset. We could run our feature selection methods as we've done in previous chapters, but let's press on with all features and see what happens.

Before doing so, let's get the train and test sets created:

> set.seed(555)

> index <- caret::createDataPartition(y = dna$Class, p = 0.8, list = FALSE)

> train <- dna[index, ]

> test <- dna[-index, ]

This created an 80/20 split for us and we can move on to building an algorithm.
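When y is a factor, createDataPartition samples within each class so that the label proportions are roughly preserved in both sets. A base R sketch of that idea on made-up labels follows; it illustrates stratified sampling and is not caret's actual code:

```r
set.seed(555)

# Toy labels, made up for illustration: 20 ei, 20 ie, 60 n.
labels <- factor(rep(c("ei", "ie", "n"), times = c(20, 20, 60)))

# Sample 80% of the row indices within each class.
train_idx <- unlist(
  lapply(split(seq_along(labels), labels), function(idx) {
    idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
  }),
  use.names = FALSE
)

train_labels <- labels[train_idx]
test_labels <- labels[-train_idx]

# Both sets keep the 20/20/60 class mix of the full data.
print(table(train_labels))
print(table(test_labels))
```

On the real split, comparing prop.table(table(train$Class)) with prop.table(table(test$Class)) is a quick way to confirm the two sets have similar class mixes.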
