HR attrition data example

In this section, we will use IBM Watson's HR Attrition data, shared on Kaggle under an open source license at https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset (the data has been utilized in the book after taking prior permission from the data administrator), to predict whether employees will attrite based on the independent explanatory variables:

>>> import pandas as pd 
>>> hrattr_data = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv") 
 
>>> print (hrattr_data.head()) 

There are 1,470 observations and 35 variables in this data; the top five rows are shown here for a quick glance at the variables:

The following code converts the Yes/No categories of the target into 1 and 0 for modeling purposes. scikit-learn does not fit models on character/categorical variables directly, so dummy coding is required before such variables can be used in models:

>>> hrattr_data['Attrition_ind'] = 0 
>>> hrattr_data.loc[hrattr_data['Attrition'] =='Yes', 'Attrition_ind'] = 1 
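The same flag can also be derived with a single vectorized comparison. A minimal sketch on a hypothetical toy frame (the Yes/No coding matches the dataset's `Attrition` column):

```python
import pandas as pd

# Toy frame mimicking the Yes/No coding of the Attrition column
toy = pd.DataFrame({'Attrition': ['Yes', 'No', 'No', 'Yes']})

# Vectorized equivalent of the loc-based assignment:
# compare against 'Yes' and cast the boolean result to 0/1
toy['Attrition_ind'] = (toy['Attrition'] == 'Yes').astype(int)
```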

Dummy variables are created for all seven categorical variables (shown here in alphabetical order): Business Travel, Department, Education Field, Gender, Job Role, Marital Status, and Overtime. Four variables have been excluded from the analysis because they do not change across observations: Employee Count, Employee Number, Over18, and Standard Hours:

>>> dummy_busnstrvl = pd.get_dummies(hrattr_data['BusinessTravel'], prefix='busns_trvl') 
>>> dummy_dept = pd.get_dummies(hrattr_data['Department'], prefix='dept') 
>>> dummy_edufield = pd.get_dummies(hrattr_data['EducationField'], prefix='edufield') 
>>> dummy_gender = pd.get_dummies(hrattr_data['Gender'], prefix='gend') 
>>> dummy_jobrole = pd.get_dummies(hrattr_data['JobRole'], prefix='jobrole') 
>>> dummy_maritstat = pd.get_dummies(hrattr_data['MaritalStatus'], prefix='maritalstat')  
>>> dummy_overtime = pd.get_dummies(hrattr_data['OverTime'], prefix='overtime')  
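As an aside, the seven separate `get_dummies` calls can be collapsed into one by passing a `columns=` list, which also leaves the remaining continuous columns in place. A minimal sketch on a hypothetical toy frame (column names shortened for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'OverTime': ['Yes', 'No', 'Yes'],
    'Age': [34, 41, 29],
})

# get_dummies with columns= encodes only the listed variables and
# keeps the other columns untouched, so no later concat is needed
toy_encoded = pd.get_dummies(toy, columns=['Gender', 'OverTime'])
```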

Continuous variables are separated and will be combined with the created dummy variables later:

>>> continuous_columns = ['Age','DailyRate','DistanceFromHome', 'Education', 'EnvironmentSatisfaction','HourlyRate','JobInvolvement','JobLevel','JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked','PercentSalaryHike',  'PerformanceRating', 'RelationshipSatisfaction','StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear','WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion','YearsWithCurrManager'] 
 
>>> hrattr_continuous = hrattr_data[continuous_columns] 

In the following step, the dummy variables derived from the categorical variables are combined with the continuous variables:

>>> hrattr_data_new = pd.concat([dummy_busnstrvl, dummy_dept, dummy_edufield, dummy_gender, dummy_jobrole, dummy_maritstat, dummy_overtime, hrattr_continuous, hrattr_data['Attrition_ind']],axis=1) 
Here, we have not removed one redundant derived dummy variable per categorical variable. Multi-collinearity does not cause the problems in decision trees that it does in logistic or linear regression, so we can simply use all the derived variables throughout the rest of the chapter, as every model here uses decision trees as the underlying learner, even after ensembling.
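If the same features were later fed to a logistic or linear regression, where multi-collinearity does matter, one dummy per variable could be dropped with the `drop_first=True` argument of `get_dummies`. A minimal sketch on a hypothetical toy column:

```python
import pandas as pd

toy = pd.DataFrame(
    {'MaritalStatus': ['Single', 'Married', 'Divorced', 'Single']})

# Full encoding yields k dummies per variable; drop_first=True drops
# the first level alphabetically ('Divorced'), leaving k-1 dummies
full = pd.get_dummies(toy['MaritalStatus'], prefix='maritalstat')
reduced = pd.get_dummies(toy['MaritalStatus'], prefix='maritalstat',
                         drop_first=True)
```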

Once the basic data has been prepared, it needs to be split 70-30 for training and testing purposes:

# Train and Test split 
>>> from sklearn.model_selection import train_test_split 
>>> x_train, x_test, y_train, y_test = train_test_split(hrattr_data_new.drop(['Attrition_ind'], axis=1), hrattr_data_new['Attrition_ind'], train_size=0.7, random_state=42) 
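Attrition is an imbalanced target (roughly 16 percent positives in this dataset), so it can be worth preserving the class ratio in both partitions via the `stratify` argument of `train_test_split`. A minimal sketch on hypothetical toy data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 zeros and 20 ones
X = pd.DataFrame({'feat': range(100)})
y = pd.Series([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both train and test
x_tr, x_ts, y_tr, y_ts = train_test_split(
    X, y, train_size=0.7, random_state=42, stratify=y)
```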

R Code for Data Preprocessing on HR Attrition Data:

hrattr_data = read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
str(hrattr_data); summary(hrattr_data)
hrattr_data$Attrition_ind = 0
hrattr_data$Attrition_ind[hrattr_data$Attrition == "Yes"] = 1
hrattr_data$Attrition_ind = as.factor(hrattr_data$Attrition_ind)

remove_cols = c("EmployeeCount", "EmployeeNumber", "Over18", "StandardHours", "Attrition")
hrattr_data_new = hrattr_data[, !(names(hrattr_data) %in% remove_cols)]

set.seed(123)
numrow = nrow(hrattr_data_new)
trnind = sample(1:numrow, size = as.integer(0.7 * numrow))
train_data = hrattr_data_new[trnind, ]
test_data = hrattr_data_new[-trnind, ]

# Code for calculating precision, recall for 0 and 1 categories and
# at overall level, which will be used in all the classifiers in
# later sections
frac_trzero = (table(train_data$Attrition_ind)[[1]]) / nrow(train_data)
frac_trone = (table(train_data$Attrition_ind)[[2]]) / nrow(train_data)

frac_tszero = (table(test_data$Attrition_ind)[[1]]) / nrow(test_data)
frac_tsone = (table(test_data$Attrition_ind)[[2]]) / nrow(test_data)

prec_zero <- function(act, pred) { tble = table(act, pred)
  return(round(tble[1,1] / (tble[1,1] + tble[2,1]), 4)) }

prec_one <- function(act, pred) { tble = table(act, pred)
  return(round(tble[2,2] / (tble[2,2] + tble[1,2]), 4)) }

recl_zero <- function(act, pred) { tble = table(act, pred)
  return(round(tble[1,1] / (tble[1,1] + tble[1,2]), 4)) }

recl_one <- function(act, pred) { tble = table(act, pred)
  return(round(tble[2,2] / (tble[2,2] + tble[2,1]), 4)) }

accrcy <- function(act, pred) { tble = table(act, pred)
  return(round((tble[1,1] + tble[2,2]) / sum(tble), 4)) }
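For reference, the per-class precision, recall, and accuracy computed by the R helper functions above have direct Python counterparts in scikit-learn's metric functions; the `pos_label` argument selects the class, mirroring the `prec_zero`/`prec_one` and `recl_zero`/`recl_one` pairs. A minimal sketch on hypothetical toy labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy actual vs. predicted labels
act = [0, 0, 1, 1, 0, 1, 0, 0]
pred = [0, 1, 1, 0, 0, 1, 0, 0]

# pos_label picks the class whose precision/recall is reported
prec_zero = precision_score(act, pred, pos_label=0)
prec_one = precision_score(act, pred, pos_label=1)
recl_zero = recall_score(act, pred, pos_label=0)
recl_one = recall_score(act, pred, pos_label=1)
accrcy = accuracy_score(act, pred)
```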