Data loading and review

To begin with, load magrittr and install the packages used in this section (installation is only needed once per machine):

> library(magrittr)

> install.packages("caret")

> install.packages("DataExplorer")

> install.packages("earth")

> install.packages("ggthemes")

> install.packages("psych")

> install.packages("tidyverse")

> install.packages("sjmisc")

> options(scipen = 999)

Now, read the data into your environment:

> army_ansur <- readRDS("army_ansur.RData")

The feature names are fairly straightforward. The output below shows only the last few of them:

> colnames(army_ansur)
 [93] "wristcircumference"     "wristheight" 
 [95] "Gender"                 "Date" 
 [97] "Installation"           "Component" 
 [99] "Branch"                 "PrimaryMOS" 
[101] "SubjectsBirthLocation"  "SubjectNumericRace" 
[103] "Ethnicity"              "DODRace" 
[105] "Age"                    "Heightin" 
[107] "Weightlbs"              "WritingPreference" 
[109] "SubjectId"

I'm interested in looking at the breakdown of the "Component" and "Gender" columns:

> table(army_ansur$Component)

Army National Guard   Army Reserve   Regular Army 
               2708            220           3140

> table(army_ansur$Gender)

Female   Male 
  1986   4082

If we look at missing values, we can see something of interest. Here is the abbreviated output:

> sapply(army_ansur, function(x) sum(is.na(x)))
        PrimaryMOS   SubjectsBirthLocation   SubjectNumericRace 
                 0                       0                    0 
         Ethnicity                 DODRace                  Age 
              4647                       0                    0 
          Heightin               Weightlbs    WritingPreference 
                 0                       0                    0 
         SubjectId 
              4082
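With more than a hundred columns, it can be easier to print only the columns that actually contain missing values. The following is a minimal sketch of that idea; `toy_df` is a small made-up data frame standing in for `army_ansur`, and its values are illustrative only:

```r
# Hypothetical toy data standing in for army_ansur (values are made up)
toy_df <- data.frame(
  Age = c(25, 31, 40, 22),
  Ethnicity = c(NA, "Mexican", NA, NA),
  SubjectId = c(1L, NA, NA, 4L),
  stringsAsFactors = FALSE
)

# Count missing values per column, as in the text
na_counts <- sapply(toy_df, function(x) sum(is.na(x)))

# Keep only the columns with at least one missing value, largest count first
na_only <- sort(na_counts[na_counts > 0], decreasing = TRUE)
print(na_only)
```

Applied to the real data, the same two lines would replace `toy_df` with `army_ansur`.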

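We have a bunch of missing subject IDs: SubjectId is NA for 4,082 observations. Fine, let's take care of that right now by assigning sequential IDs. Keep in mind that R column names are case-sensitive, so the following writes to a column named subjectid and leaves SubjectId as it is:

> army_ansur$subjectid <- seq_len(nrow(army_ansur))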

Since weight is what we will predict after we build our unsupervised model, let's have a look at it:

> sjmisc::descr(army_ansur$Weightlbs)

## Basic descriptive statistics
 var    type label    n NA.prc  mean   sd    se  md trimmed       range skew
  dd integer    dd 6068      0 174.8 33.69 0.43 173   173.4 321 (0-321) 0.39

Look at the range! We have someone who weighs zero. A plot of this data is in order, I believe:

> ggplot2::ggplot(army_ansur, ggplot2::aes(x = Weightlbs)) + 
    ggplot2::geom_density() +
    ggthemes::theme_wsj()

The output of the preceding code is as follows:

[Figure: density plot of Weightlbs]

So, I would estimate we only have one or two observations of implausible weight values. Indeed, this code will confirm that assumption:

> dplyr::select(army_ansur, Weightlbs) %>%
    dplyr::arrange(Weightlbs)
# A tibble: 6,068 x 1
   Weightlbs
       <int>
 1         0
 2        86
 3        88
 4        90
 5        95
 6        95
 7        95
 8        96
 9        98
10       100
# ... with 6,058 more rows
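Rather than reading through sorted output, the implausible values can also be counted and located directly. This is a sketch on a hypothetical vector, `weight_lbs`, whose values are made up to resemble the first rows above; on the real data it would be `army_ansur$Weightlbs`:

```r
# Hypothetical weights in pounds (illustrative values, not the real data)
weight_lbs <- c(0L, 86L, 88L, 90L, 95L, 95L, 173L, 321L)

# Number of observations with a non-positive weight
n_implausible <- sum(weight_lbs <= 0)

# Row positions of those observations, useful for inspecting them
implausible_rows <- which(weight_lbs <= 0)

print(n_implausible)
print(implausible_rows)
```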

A weight of zero is not plausible, so we remove that observation:

> armyClean <- dplyr::filter(army_ansur, Weightlbs > 0)
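It is worth confirming that the filter dropped exactly the rows intended. The sketch below uses a made-up data frame, `toy`, and base R's `subset()` in place of `dplyr::filter()` so that it runs without extra packages. Like `dplyr::filter()`, `subset()` also drops rows where the condition evaluates to NA:

```r
# Hypothetical toy data (illustrative values only)
toy <- data.frame(Weightlbs = c(0L, 86L, 88L, 173L, 321L))

# Base R stand-in for dplyr::filter(toy, Weightlbs > 0)
toy_clean <- subset(toy, Weightlbs > 0)

# Check how many rows were removed and that no implausible weights remain
rows_dropped <- nrow(toy) - nrow(toy_clean)
stopifnot(rows_dropped == 1, min(toy_clean$Weightlbs) > 0)
```

On the real data, the analogous check would compare `nrow(army_ansur)` with `nrow(armyClean)`.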

We can now transition to bundling our features for PCA and creating training and testing dataframes.
