To begin, load the essential libraries and register the number of cores to use for parallel processing:
library(doMC)
registerDoMC(cores = 4)
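To confirm that the parallel backend is active before training, you can query the number of registered workers; getDoParWorkers() comes from the foreach package that doMC builds on (a quick check, not part of the original workflow):

```r
library(foreach)
# detectCores() reports how many cores the machine has available
parallel::detectCores()
# number of worker processes registered with the backend
getDoParWorkers()   # returns 4 after registerDoMC(cores = 4)
```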
library(caret)
# set the random seed for reproducibility
set.seed(1234)
# set the working directory where the data is located
setwd("~/Desktop/chapter 15")
# read the data; stringsAsFactors ensures Attrition is read as a factor
# (R >= 4.0 defaults this to FALSE)
mydata <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv", stringsAsFactors = TRUE)
# remove the non-discriminatory features identified during EDA
mydata[c("EmployeeNumber", "Over18", "EmployeeCount", "StandardHours")] <- NULL
# set up 10-fold cross-validation, repeated 10 times
cvcontrol <- trainControl(method = "repeatedcv", number = 10, repeats = 10, allowParallel = TRUE)
# build the bagged CART model; nbagg sets the number of bootstrap bags to 10
train.bagg <- train(Attrition ~ ., data = mydata, method = "treebag", nbagg = 10,
                    trControl = cvcontrol, importance = TRUE)
train.bagg
This will result in the following output:
Bagged CART
1470 samples
30 predictors
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 1324, 1323, 1323, 1322, 1323, 1322, ...
Resampling results:
  Accuracy  Kappa
  0.854478  0.2971994
We can see that the bagged CART model achieves an accuracy of about 85.4%, an improvement over the 84% accuracy obtained earlier with the KNN algorithm.
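Because the model was trained with importance = TRUE, a natural next step is to inspect which features drive the predictions and to generate class predictions with caret's standard helpers. The sketch below continues the same session and assumes train.bagg and mydata are still in memory:

```r
# rank predictors by their contribution to the bagged trees
imp <- varImp(train.bagg)
print(imp)
plot(imp, top = 10)   # visualize the ten most important features

# generate class predictions and tabulate them against the observed labels
pred <- predict(train.bagg, newdata = mydata)
confusionMatrix(pred, mydata$Attrition)
```

Note that predicting on the training data gives an optimistic picture of performance; the repeated cross-validation estimate reported above is the more reliable measure of generalization.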