To begin with, load magrittr, which provides the %>% pipe used later, and install the other packages we need:
> library(magrittr)
> install.packages("caret")
> install.packages("DataExplorer")
> install.packages("earth")
> install.packages("ggthemes")
> install.packages("psych")
> install.packages("tidyverse")
> options(scipen = 999)
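If some of these packages are already on your machine, you may prefer to install only the ones that are missing. The following is a minimal sketch of that idea in base R. The helper name find_missing is my own, not part of any package, and I have added sjmisc to the list only because it is called later through sjmisc::descr(). The install line is left commented out so that nothing is downloaded until you decide to run it.

```r
# Packages used in this section; sjmisc is included because
# sjmisc::descr() is called further down.
pkgs <- c("caret", "DataExplorer", "earth", "ggthemes",
          "psych", "sjmisc", "tidyverse")

# Hypothetical helper: return the packages that cannot be found.
# requireNamespace() returns TRUE/FALSE instead of raising an error
# (note that it loads, but does not attach, any namespace it finds).
find_missing <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

to_install <- find_missing(pkgs)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```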
Now, read the data into your environment:
> army_ansur <- readRDS("army_ansur.RData")
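Note that readRDS() is used here even though the file carries an .RData extension. readRDS() reads a single object written by saveRDS(), whatever the file is called, whereas load() is the counterpart of save(). Here is a small sketch of that round trip, using a made-up two-row data frame and a temporary file rather than the real data:

```r
# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(Gender = c("Female", "Male"),
                  Weightlbs = c(140L, 180L))

# Write a single object with saveRDS(); the extension is arbitrary
path <- tempfile(fileext = ".RData")
saveRDS(toy, path)

# readRDS() returns the object, so we choose the name it is assigned to
restored <- readRDS(path)
print(identical(toy, restored))

# Clean up the temporary file
unlink(path)
```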
The feature names are fairly straightforward. Here, I show only the last few features from the output:
> colnames(army_ansur)
[93] "wristcircumference" "wristheight"
[95] "Gender" "Date"
[97] "Installation" "Component"
[99] "Branch" "PrimaryMOS"
[101] "SubjectsBirthLocation" "SubjectNumericRace"
[103] "Ethnicity" "DODRace"
[105] "Age" "Heightin"
[107] "Weightlbs" "WritingPreference"
[109] "SubjectId"
I'm interested in looking at the breakdown of the "Component" and "Gender" columns:
> table(army_ansur$Component)
Army National Guard        Army Reserve        Regular Army
               2708                 220                3140
> table(army_ansur$Gender)
Female   Male
  1986   4082
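Raw counts are easier to compare as proportions. As a quick sketch, the counts printed above can be typed in as named vectors and passed to prop.table(). On the real data, you would get the same result by wrapping the table() calls directly, for example prop.table(table(army_ansur$Gender)).

```r
# Counts copied from the table() output above
component_counts <- c("Army National Guard" = 2708,
                      "Army Reserve" = 220,
                      "Regular Army" = 3140)
gender_counts <- c(Female = 1986, Male = 4082)

# prop.table() divides each count by the total
component_prop <- prop.table(component_counts)
gender_prop <- prop.table(gender_counts)

# Express as percentages rounded to one decimal place
print(round(100 * component_prop, 1))
print(round(100 * gender_prop, 1))

# Both breakdowns account for every row in the data
print(sum(component_counts) == sum(gender_counts))
```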
If we look at missing values, we can see something of interest. Here is the abbreviated output:
> sapply(army_ansur, function(x) sum(is.na(x)))
           PrimaryMOS SubjectsBirthLocation    SubjectNumericRace
                    0                     0                     0
            Ethnicity               DODRace                   Age
                 4647                     0                     0
             Heightin             Weightlbs     WritingPreference
                    0                     0                     0
            SubjectId
                 4082
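With more than a hundred columns, it can help to print only the features that actually contain missing values. A minimal sketch, using a small made-up data frame in place of army_ansur, relies on colSums(is.na(...)), which gives the same per-column counts as the sapply() call:

```r
# Toy stand-in for army_ansur (hypothetical values, for illustration only)
toy <- data.frame(
  Age       = c(25L, 31L, 40L, 22L),
  Ethnicity = c(NA, "Mexican", NA, NA),
  SubjectId = c(NA, 10027L, NA, 10032L),
  stringsAsFactors = FALSE
)

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE values per column
na_counts <- colSums(is.na(toy))

# Keep only the columns with at least one missing value
print(na_counts[na_counts > 0])
```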
We have a bunch of missing subject IDs. Fine, let's take care of that right now by adding a sequential identifier column, subjectid, with one value per row:
> army_ansur$subjectid <- seq_len(nrow(army_ansur))
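Deriving the sequence from the data rather than typing in the row count keeps the code valid if rows are added or removed later. The sketch below shows the idea on a made-up three-row data frame. It also shows that R is case-sensitive, so a column named subjectid sits alongside any existing SubjectId column instead of overwriting it.

```r
# Toy stand-in for army_ansur (hypothetical values, for illustration only)
toy <- data.frame(SubjectId = c(NA, 10027L, NA),
                  Weightlbs = c(180L, 0L, 150L))

# seq_len(nrow(x)) yields 1, 2, ..., nrow(x) without a hard-coded count
toy$subjectid <- seq_len(nrow(toy))

# Both columns now exist, because names are case-sensitive
print(names(toy))

# For a fixed length, seq(1:n) and seq_len(n) give the same integers
print(identical(seq(1:5), seq_len(5)))
```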
Since weight is what we will predict after we build our unsupervised model, let's have a look at it:
> sjmisc::descr(army_ansur$Weightlbs)
## Basic descriptive statistics
 var    type label    n NA.prc  mean    sd   se  md trimmed       range skew
  dd integer    dd 6068      0 174.8 33.69 0.43 173   173.4 321 (0-321) 0.39
Look at the range! We have someone who weighs zero. A plot of this data is in order, I believe:
> ggplot2::ggplot(army_ansur, ggplot2::aes(x = Weightlbs)) +
    ggplot2::geom_density() +
    ggthemes::theme_wsj()
The output of the preceding code is a density plot of Weightlbs.
So, I would estimate we only have one or two observations of implausible weight values. Indeed, this code will confirm that assumption:
> dplyr::select(army_ansur, Weightlbs) %>%
    dplyr::arrange(Weightlbs)
# A tibble: 6,068 x 1
Weightlbs
<int>
1 0
2 86
3 88
4 90
5 95
6 95
7 95
8 96
9 98
10 100
# ... with 6,058 more rows
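Sorting works, but you can also count the implausible values directly. As a sketch, here are the ten smallest weights from the output above typed in as a vector. On the full data, the equivalent check would be sum(army_ansur$Weightlbs <= 0).

```r
# The ten smallest weights, copied from the sorted output above
weights <- c(0L, 86L, 88L, 90L, 95L, 95L, 95L, 96L, 98L, 100L)

# A comparison yields TRUE/FALSE; sum() counts the TRUE values
n_implausible <- sum(weights <= 0)
print(n_implausible)

# which() reports the positions of the offending values
print(which(weights <= 0))
```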
Removing that observation is important:
> armyClean <- dplyr::filter(army_ansur, Weightlbs > 0)
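It is worth knowing that dplyr::filter() keeps only the rows where the condition is TRUE, so a row with a missing weight would be dropped as well. That does not matter here, since Weightlbs has no missing values, but it is easy to confirm how many rows a filter removes. A base R sketch on a made-up data frame, with the NA handling written out explicitly:

```r
# Toy stand-in for army_ansur (hypothetical values, for illustration only)
toy <- data.frame(Weightlbs = c(0L, 86L, NA, 173L))

# Keep rows with a known, positive weight; drop = FALSE keeps a data frame
keep <- !is.na(toy$Weightlbs) & toy$Weightlbs > 0
clean <- toy[keep, , drop = FALSE]

# Number of rows removed by the filter
print(nrow(toy) - nrow(clean))

# Sanity check: every remaining weight is positive
print(all(clean$Weightlbs > 0))
```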
We can now transition to bundling our features for PCA and creating training and testing dataframes.