Understanding the attrition problem and the dataset 

HR analytics helps organizations interpret their people data. It surfaces people-related trends and helps the HR department take the appropriate steps to keep the organization running smoothly and profitably. Attrition is one of the more complex challenges that people managers and HR personnel have to deal with. Interestingly, machine learning models can be deployed to predict potential attrition cases, thereby helping the appropriate HR personnel or people managers take the necessary steps to retain the employee.

In this chapter, we are going to build ML ensembles that predict such potential cases of attrition. The job attrition dataset used for the project is a fictional dataset created by data scientists at IBM. The rsample library ships with this dataset, so we can use it directly from the library.

It is a small dataset of 1,470 records with 31 attributes. The structure of the dataset can be obtained with the following code:

# load the attrition dataset bundled with the rsample library
library(rsample)
data(attrition)
# examine the structure of the dataset
str(attrition)
mydata <- attrition

This will result in the following output:

'data.frame':1470 obs. of  31 variables: 
$ Age : int 41 49 37 33 27 32 59 30 38 36 ...
$ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ....
$ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
$ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
$ Department : Factor w/ 3 levels "Human_Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
$ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
$ Education : Ord.factor w/ 5 levels "Below_College"<..: 2 1 2 4 1 2 3 1 3 3 ...
$ EducationField : Factor w/ 6 levels "Human_Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
$ EnvironmentSatisfaction : Ord.factor w/ 4 levels "Low"<"Medium"<..: 2 3 4 4 1 4 3 4 4 3 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
$ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
$ JobInvolvement : Ord.factor w/ 4 levels "Low"<"Medium"<..: 3 2 2 3 3 3 4 3 2 3 ...
$ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
$ JobRole : Factor w/ 9 levels "Healthcare_Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
$ JobSatisfaction : Ord.factor w/ 4 levels "Low"<"Medium"<..: 4 2 3 3 2 4 1 3 3 3 ...
$ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
$ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
$ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
$ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
$ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
$ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
$ PerformanceRating : Ord.factor w/ 4 levels "Low"<"Good"<"Excellent"<..: 3 4 3 3 3 3 4 4 4 3 ...
$ RelationshipSatisfaction: Ord.factor w/ 4 levels "Low"<"Medium"<..: 1 4 2 3 4 3 1 2 2 2 ...
$ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
$ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
$ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
$ WorkLifeBalance : Ord.factor w/ 4 levels "Bad"<"Good"<"Better"<..: 1 3 3 3 3 2 2 3 3 2 ...
$ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
$ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
$ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
$ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...

To view the distribution of the Attrition target variable in the dataset, run the following code:

table(mydata$Attrition) 

This will result in the following output:

  No  Yes 
1233  237 

Out of the 1,470 observations in the dataset, we have 1,233 samples (83.88%) that are non-attrition cases and 237 attrition cases (16.12%). Clearly, we are dealing with a class-imbalanced dataset.
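The same split can be expressed as proportions with prop.table(), a quick sketch using the mydata data frame created earlier:

```r
# class balance as proportions rather than raw counts
round(prop.table(table(mydata$Attrition)), 4)
#     No    Yes 
# 0.8388 0.1612
```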

We will now visualize the highly correlated variables in the data through the corrplot library using the following code:

# considering only the numeric variables in the dataset; selecting by
# type avoids hardcoded column indices that do not match this
# 31-column data frame
numeric_mydata <- mydata[, sapply(mydata, is.numeric)]
# converting the "No"/"Yes" target variable into numeric;
# as.numeric() yields 1 and 2, so we subtract 1 to get 0 and 1
numeric_Attrition = as.numeric(mydata$Attrition) - 1
# create a new data frame with the numeric columns and the numeric target
numeric_mydata = cbind(numeric_mydata, numeric_Attrition)
# loading the required library
library(corrplot)
# creating the correlation plot
M <- cor(numeric_mydata)
corrplot(M, method="circle")

This will result in the following output:

In the preceding screenshot, it may be observed that darker and larger blue dots in the cells indicate a strong correlation between the variables in the corresponding rows and columns. High correlation between independent variables indicates the existence of redundant features in the data; this problem is termed multicollinearity. If we were to fit a regression model, we would need to treat the highly correlated variables through techniques such as removing the redundant features, or by applying principal component analysis or partial least squares regression, which intuitively cut down the redundant features.
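If we were to go down the regression route, one lightweight way to act on this finding is caret's findCorrelation(), which flags columns whose pairwise correlation exceeds a cutoff. This is a sketch only, assuming the caret package is installed and M is the correlation matrix computed previously; the rest of the chapter does not use this step:

```r
# sketch: flag numeric columns with pairwise correlation above 0.7
# (assumes the caret package is installed; M is the matrix from cor())
library(caret)
high_corr <- findCorrelation(M, cutoff = 0.7, names = TRUE)
high_corr  # candidate columns to drop before fitting a regression model
```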

We infer from the output that the following variable pairs are highly correlated, and these variables need to be treated if we are to build a regression-based model:

JobLevel-MonthlyIncome
JobLevel-TotalWorkingYears
MonthlyIncome-TotalWorkingYears
PercentSalaryHike-PerformanceRating
YearsAtCompany-YearsInCurrentRole
YearsAtCompany-YearsWithCurrManager
YearsWithCurrManager-YearsInCurrentRole

Now, let's plot the various independent variables against the dependent Attrition variable in order to understand the influence of each independent variable on the target:

### OverTime vs Attrition
# the ggplot2 library is required for the plots in this section
library(ggplot2)
l <- ggplot(mydata, aes(OverTime, fill = Attrition))
l <- l + geom_histogram(stat = "count")

tapply(as.numeric(mydata$Attrition) - 1, mydata$OverTime, mean)

No Yes
0.104364326375712 0.305288461538462

Let's run the following command to get a graph view:

print(l) 

The preceding command generates the following output:

In the preceding output, it can be observed that employees who work overtime are more prone to attrition than those who do not work overtime.
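Since the OverTime groups are of very different sizes, a proportion bar chart makes the rate comparison more direct. This is an optional sketch, assuming ggplot2 is loaded:

```r
# stack the bars to full height so each bar shows the attrition
# proportion within its OverTime group
ggplot(mydata, aes(OverTime, fill = Attrition)) +
  geom_bar(position = "fill") +
  ylab("Proportion")
```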

Let's now calculate attrition by marital status by executing the following commands:

### MaritalStatus vs Attrition
l <- ggplot(mydata, aes(MaritalStatus,fill = Attrition))
l <- l + geom_histogram(stat="count")

tapply(as.numeric(mydata$Attrition) - 1 ,mydata$MaritalStatus,mean)
Divorced 0.100917431192661
Married 0.12481426448737
Single 0.25531914893617

Let's run the following command to get a graph view:

print(l) 

The preceding command generates the following output:

In the preceding output, it can be observed that employees who are single show the highest attrition.

### JobRole vs Attrition
l <- ggplot(mydata, aes(JobRole,fill = Attrition))
l <- l + geom_histogram(stat="count")

tapply(as.numeric(mydata$Attrition) - 1 ,mydata$JobRole,mean)

Healthcare_Representative Human_Resources
0.06870229 0.23076923
Laboratory_Technician Manager
0.23938224 0.04901961
Manufacturing_Director Research_Director
0.06896552 0.02500000
Research_Scientist Sales_Executive
0.16095890 0.17484663
Sales_Representative
0.39759036
The overall attrition rate, for comparison, can be obtained as follows:

mean(as.numeric(mydata$Attrition) - 1)
[1] 0.161224489795918

Execute the following command to get a graphical representation for the same:

print(l)

Take a look at the following output generated by running the preceding command:

In the preceding output, it can be observed that the lab technicians, sales representatives, and employees working in human resources job roles have more attrition than other organizational roles.

Let's execute the following commands to check the impact of an employee's gender on attrition:

###Gender vs Attrition 
l <- ggplot(mydata, aes(Gender,fill = Attrition))
l <- l + geom_histogram(stat="count")

tapply(as.numeric(mydata$Attrition) - 1 ,mydata$Gender,mean)

Female 0.147959183673469
Male 0.170068027210884

Run the following command to get a graphical representation for the same:

print(l)

This will result in the following output:

In the preceding output, you can see that the gender of an employee does not have a notable impact on attrition; in other words, attrition is observed to be roughly the same across genders.

Let's calculate the attrition of employees from various education fields by executing the following:

### EducationField vs Attrition
l <- ggplot(mydata, aes(EducationField, fill = Attrition))
l <- l + geom_histogram(stat="count")

tapply(as.numeric(mydata$Attrition) - 1 ,mydata$EducationField,mean)

Human_Resources Life_Sciences Marketing
0.2592593 0.1468647 0.2201258
Medical Other Technical_Degree
0.1357759 0.1341463 0.2424242

Let's execute the following command to get a graphical representation:

print(l)

This will result in the following output:

Looking at the preceding graph, we can conclude that employees with a technical degree or a degree in human resources are observed to have more attrition. Take a look at the following code:

###Department vs Attrition 
l <- ggplot(mydata, aes(Department,fill = Attrition))
l <- l + geom_histogram(stat="count")

tapply(as.numeric(mydata$Attrition) - 1 ,mydata$Department,mean)
Human_Resources Research_Development Sales
0.1904762 0.1383975 0.2062780

Let's execute the following command to view department-wise attrition graphically:

print(l) 

This will result in the following output:

Looking at the preceding graph, we can conclude that the Research & Development department has less attrition than the Sales and Human Resources departments. Take a look at the following code:

###BusinessTravel vs Attrition 
l <- ggplot(mydata, aes(BusinessTravel,fill = Attrition))
l <- l + geom_histogram(stat="count")

tapply(as.numeric(mydata$Attrition) - 1 ,mydata$BusinessTravel,mean)
Non-Travel Travel_Frequently Travel_Rarely
0.0800000 0.2490975 0.1495686

Execute the following command to get a graphical representation for the same:

print(l) 

This will result in the following output:

Looking at the preceding graph, we can conclude that employees who travel frequently are more prone to attrition than employees with a non-travel status or those who travel rarely.
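The repeated tapply() calls above can be wrapped in a small helper that returns the attrition rate for any factor column, sorted from highest to lowest. Note that attrition_rate is a hypothetical helper name introduced here for illustration only; it is not used elsewhere in the chapter:

```r
# attrition rate per level of any factor column, highest first
attrition_rate <- function(df, column) {
  rates <- tapply(as.numeric(df$Attrition) - 1, df[[column]], mean)
  sort(round(rates, 4), decreasing = TRUE)
}
attrition_rate(mydata, "BusinessTravel")
```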

Let's now examine overtime, age, and marital status together by executing the following commands:

### x=Overtime, y= Age, z = MaritalStatus , t = Attrition 
ggplot(mydata, aes(OverTime, Age)) +
facet_grid(.~MaritalStatus) +
geom_jitter(aes(color = Attrition),alpha = 0.4) +
ggtitle("x=Overtime, y= Age, z = MaritalStatus , t = Attrition") +
theme_light()

This will result in the following output:

Looking at the preceding graph, we can conclude that employees who are young (age < 35) and single, but work overtime, are more prone to attrition.

### MonthlyIncome vs. Age, by  color = Attrition 
ggplot(mydata, aes(MonthlyIncome, Age, color = Attrition)) +
geom_jitter() +
ggtitle("MonthlyIncome vs. Age, by color = Attrition ") +
theme_light()

This will result in the following output:

Looking at the preceding graph, we can conclude that attrition is higher among young employees (age < 30), and that most attrition is observed among employees earning less than $7,500 a month.

Although we have learned a number of important details about the data at hand, there is much more that could be explored. However, in order to move on to the next step, we will stop the EDA here. It should be noted that, in a real-world situation, data would rarely be as clean as this attrition dataset. For example, there could be missing values in the data, in which case we would perform missing-value imputation. Fortunately, we have an impeccable dataset that is ready for model creation without any additional cleansing or preprocessing.
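Rather than taking the claim of a clean dataset on faith, it is worth a quick check for missing values; a minimal sketch:

```r
# count missing values per column; every entry is zero for this dataset
colSums(is.na(mydata))
# total missing values across the whole data frame
sum(is.na(mydata))
```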
