Training and testing datasets

Here, we are going to put the numeric features into a dataframe along with the quantitative response. Then, we'll carve this up into train and test sets with an 80/20 split. Finally, we'll scale the data, which PCA needs so that features measured on larger scales don't dominate the components.
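To see why scaling matters for PCA, here is a minimal sketch on made-up data, not the Army survey. The `toy` dataframe and its two columns are assumptions for illustration only; `prcomp()` is the base R PCA function.

```r
# Toy data (not the Army survey): two independent features on very different scales.
set.seed(1812)
toy <- data.frame(
  height_mm = rnorm(200, mean = 1700, sd = 90),  # large variance
  ratio     = rnorm(200, mean = 0.5, sd = 0.05)  # tiny variance
)

# Without scaling, the first component is driven almost entirely by height_mm.
pca_raw <- prcomp(toy, center = TRUE, scale. = FALSE)
var_raw <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)

# With scaling, each feature enters the PCA with unit variance.
pca_scaled <- prcomp(toy, center = TRUE, scale. = TRUE)
var_scaled <- pca_scaled$sdev^2 / sum(pca_scaled$sdev^2)

# Proportion of variance explained by each component, raw versus scaled.
print(round(var_raw, 4))
print(round(var_scaled, 4))
```

On the raw data, the first component should capture nearly all of the variance simply because one feature has much larger units, not because it is more informative.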

Here, I grab those input features, including height in inches, while dropping weight in kilograms. I also include the subjectid:

> army_subset <- armyClean[, c(1:91, 93, 94, 106, 107)]

We've previously used both the dplyr and caret packages to create train and test sets; here I demonstrate the dplyr method:

> set.seed(1812)

> army_subset %>%
    dplyr::sample_frac(.8) -> train

> army_subset %>%
    dplyr::anti_join(train, by = "subjectid") -> test
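It can be worth confirming that the two sets really partition the data. The following is a minimal sketch on a made-up dataframe; `toy`, `toy_train`, and `toy_test` are hypothetical names, and the check assumes that the ID column is unique, which is what the `anti_join()` approach relies on.

```r
library(dplyr)

# Toy stand-in for army_subset: a unique ID plus one feature (values are made up).
set.seed(1812)
toy <- data.frame(subjectid = 1:100, x = rnorm(100))

# Same pattern as above: sample 80 percent, then keep the rows not sampled.
toy_train <- toy %>% dplyr::sample_frac(0.8)
toy_test  <- toy %>% dplyr::anti_join(toy_train, by = "subjectid")

# The two sets should share no IDs, and their sizes should add up to the total.
stopifnot(
  nrow(toy_train) + nrow(toy_test) == nrow(toy),
  length(intersect(toy_train$subjectid, toy_test$subjectid)) == 0
)
print(c(train = nrow(toy_train), test = nrow(toy_test)))
```

If the ID column contained duplicates, `anti_join()` would remove every row sharing a sampled ID, and the sizes would no longer add up. Note also that in dplyr 1.0.0 and later, `sample_frac()` is superseded by `slice_sample(prop = )`, although it still works.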

I mentioned previously that this data had a number of high correlations. Even if you take just the first five features, that becomes clear: 

> DataExplorer::plot_correlation(train[, 2:6])

The output of the preceding code is as follows:

Axilla height and acromial height have a correlation of 0.99. These refer to the height of the armpit and of the point of the shoulder, respectively.
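A plot works well for a handful of features, but with more than 90 inputs a table of the strongest pairs is easier to scan. Here is a minimal sketch using base R's `cor()` on made-up data; the `toy` dataframe and the 0.9 cutoff are assumptions for illustration, not values taken from the Army data.

```r
# Toy data (not the Army survey): x2 is x1 plus small noise, x3 is independent.
set.seed(1812)
x1 <- rnorm(500)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(500, sd = 0.1),
  x3 = rnorm(500)
)

cor_mat <- cor(toy)

# Look only at the upper triangle so that each pair is reported once.
high <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  feature_1   = rownames(cor_mat)[high[, "row"]],
  feature_2   = colnames(cor_mat)[high[, "col"]],
  correlation = cor_mat[high]
)
print(high_pairs)
```

The same pattern should apply to the training features by swapping `toy` for the numeric columns of `train`.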

We need to preserve the y-values for the training data. We also have to scale the data, but only the input features, so we drop the subjectid and the y-values before scaling:

> trainY <- train$Weightlbs

> train_scale <- data.frame(scale(train[, c(-1, -95)]))
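This section doesn't show how the test set gets scaled. One common approach, sketched here as an assumption rather than as the method used later, is to reuse the training means and standard deviations, which `scale()` stores as attributes on its result. The `toy_train` and `toy_test` dataframes are made up for illustration.

```r
# Toy stand-ins for the train and test feature sets (values are made up).
set.seed(1812)
toy_train <- data.frame(a = rnorm(80, 50, 10), b = rnorm(80, 5, 2))
toy_test  <- data.frame(a = rnorm(20, 50, 10), b = rnorm(20, 5, 2))

# scale() keeps the centering and scaling values it used as attributes.
train_scaled_mat <- scale(toy_train)
train_means <- attr(train_scaled_mat, "scaled:center")
train_sds   <- attr(train_scaled_mat, "scaled:scale")

# Reuse the training means and standard deviations on the test set.
toy_test_scaled <- data.frame(
  scale(toy_test, center = train_means, scale = train_sds)
)

# Training columns have mean 0 and sd 1; test columns will only be close to that.
print(round(colMeans(train_scaled_mat), 10))
print(round(apply(train_scaled_mat, 2, sd), 10))
print(round(colMeans(toy_test_scaled), 3))
```

Scaling the test set with its own statistics would let information from the test data leak into preprocessing, so the training values are reused instead.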

With that complete, we can move on to creating principal components and using them in a supervised learning example.
