Random forest classifier

Random forests improve on bagging with a small but important tweak: they de-correlate the individual trees. In bagging, we build a number of decision trees on bootstrapped samples of the training data, but every tree is allowed to consider all the variables at each split. As a result, the order of candidate variables chosen for splitting remains more or less the same across the individual trees, which makes the trees highly correlated with one another. Variance reduction by aggregation does not work effectively when the individual entities being averaged are correlated.
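To see why de-correlation matters, recall the standard result for the variance of an average of B identically distributed tree predictions, each with variance sigma^2 and pairwise correlation rho (a brief aside; the notation here is ours, not the book's):

\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right) = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}

Increasing B shrinks only the second term; the first term, rho * sigma^2, persists no matter how many trees we grow, and that is exactly the term random forests attack by lowering the correlation rho between trees.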

In a random forest, bootstrapping (repeated sampling with replacement) draws observations from the training data just as in bagging; the difference is that, in addition to the randomly sampled rows, each tree also works with only a few randomly selected predictors/columns out of all the predictors (m predictors out of the total p predictors).

The rule of thumb for randomly selecting m variables out of the total p variables is m = sqrt(p) for classification and m = p/3 for regression problems; this random selection avoids correlation among the individual trees. By doing so, a significant improvement in accuracy can be achieved. This ability makes random forest one of the favorite algorithms of the data science community, as a winning recipe across various competitions and also for solving practical problems in various industries.
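To make the row and column sampling concrete, here is a minimal, hypothetical sketch in plain NumPy (the sizes n and p are made up, not taken from the HR attrition data) of drawing one bootstrap sample of rows together with a random subset of m = sqrt(p) columns:

# Illustrative only: bootstrap rows and a random subset of m = sqrt(p) columns
>>> import numpy as np
>>> rng = np.random.RandomState(42)
>>> n, p = 100, 9                                     # hypothetical number of rows and predictors
>>> m = int(np.sqrt(p))                               # rule of thumb for classification (m = 3 here)
>>> row_idx = rng.choice(n, size=n, replace=True)     # bootstrap sample of rows (with replacement)
>>> col_idx = rng.choice(p, size=m, replace=False)    # random subset of m columns (without replacement)
>>> print("rows drawn:", row_idx[:10], "columns drawn:", col_idx)

Note that scikit-learn actually re-samples the m columns at every split rather than once per tree, but the idea of restricting each split to a random subset of predictors is the same.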

In the following diagram, different colors represent different bootstrap samples. In the first sample, the 1st, 3rd, 4th, and 7th columns are selected, whereas in the second bootstrap sample, the 2nd, 3rd, 4th, and 5th columns are selected. In this way, any columns can be selected at random, whether or not they are adjacent to each other. Though the rules of thumb of sqrt(p) or p/3 are given, readers are encouraged to tune the number of predictors to be selected:

The sample plot shows how the test error changes as the number of selected predictors is varied; it is apparent that the m = sqrt(p) scenario gives better performance on test data than m = p (which we can call the bagging scenario):
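A quick way to reproduce this comparison yourself is to fit the same forest twice, once with max_features='sqrt' and once with all features (the bagging-like setting). The snippet below is a minimal sketch on a synthetic dataset (make_classification and the variable names here are assumptions for illustration, not the book's HR attrition data):

# Compare m = sqrt(p) against m = p (bagging-like) on held-out data
>>> from sklearn.datasets import make_classification
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import train_test_split
>>> X, y = make_classification(n_samples=2000, n_features=30, n_informative=10, random_state=42)
>>> X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
>>> rf_sqrt = RandomForestClassifier(n_estimators=500, max_features='sqrt', random_state=42).fit(X_tr, y_tr)
>>> rf_all = RandomForestClassifier(n_estimators=500, max_features=None, random_state=42).fit(X_tr, y_tr)
>>> print("Test accuracy, m = sqrt(p):", round(rf_sqrt.score(X_te, y_te), 3))
>>> print("Test accuracy, m = p (bagging):", round(rf_all.score(X_te, y_te), 3))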

The random forest classifier from the scikit-learn package has been used here for illustration purposes:

# Random Forest Classifier 
>>> import pandas as pd 
>>> from sklearn.ensemble import RandomForestClassifier 
>>> from sklearn.metrics import accuracy_score, classification_report 

The parameters used in the random forest are as follows: n_estimators, the number of individual decision trees, is set to 5000; max_features is set to auto, which for a classifier means sqrt(p) predictors are considered at each split (we have a straightforward classification problem here); and min_samples_leaf specifies the minimum number of observations required in a terminal node:

>>> rf_fit = RandomForestClassifier(n_estimators=5000, criterion="gini", max_depth=5, min_samples_split=2, bootstrap=True, max_features='auto', random_state=42, min_samples_leaf=1, class_weight={0: 0.3, 1: 0.7}) 
>>> rf_fit.fit(x_train, y_train) 
 
>>> print ("\nRandom Forest - Train Confusion Matrix\n\n", pd.crosstab(y_train, rf_fit.predict(x_train), rownames=["Actual"], colnames=["Predicted"])) 
>>> print ("\nRandom Forest - Train accuracy", round(accuracy_score(y_train, rf_fit.predict(x_train)), 3)) 
>>> print ("\nRandom Forest - Train Classification Report\n", classification_report(y_train, rf_fit.predict(x_train))) 

>>> print ("\n\nRandom Forest - Test Confusion Matrix\n\n", pd.crosstab(y_test, rf_fit.predict(x_test), rownames=["Actual"], colnames=["Predicted"])) 
>>> print ("\nRandom Forest - Test accuracy", round(accuracy_score(y_test, rf_fit.predict(x_test)), 3)) 
>>> print ("\nRandom Forest - Test Classification Report\n", classification_report(y_test, rf_fit.predict(x_test))) 

The random forest classifier produced 87.8% test accuracy compared with 87.3% for bagging, and it also identifies 14 actually attrited employees, whereas bagging identified only 13 attrited employees:
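If you want to read that count off programmatically rather than from the printed crosstab, a minimal sketch (assuming the attrited class is labelled 1, as in the class_weight setting above) is:

# Count correctly identified attrited employees (true positives for class 1)
>>> test_cm = pd.crosstab(y_test, rf_fit.predict(x_test), rownames=["Actual"], colnames=["Predicted"]) 
>>> print("Attrited employees correctly identified:", test_cm.loc[1, 1]) 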

# Plot of Variable importance by mean decrease in gini 
>>> model_ranks = pd.Series(rf_fit.feature_importances_,index=x_train.columns, name='Importance').sort_values(ascending=False, inplace=False) 
>>> model_ranks.index.name = 'Variables' 
>>> top_features = model_ranks.iloc[:31].sort_values(ascending=True,inplace=False) 
>>> import matplotlib.pyplot as plt 
>>> plt.figure(figsize=(20,10)) 
>>> ax = top_features.plot(kind='barh') 
>>> _ = ax.set_title("Variable Importance Plot") 
>>> _ = ax.set_xlabel('Mean decrease in Gini') 
>>> _ = ax.set_yticklabels(top_features.index, fontsize=13) 
>>> plt.show() 
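For readers who prefer numbers to a plot, the same ranking can also be printed directly; this is a small optional sketch reusing the model_ranks series built above:

# Print the ten most important variables with their importance scores
>>> print(model_ranks.head(10).round(4)) 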

From the variable importance plot, the monthly income variable appears to be the most significant, followed by overtime, total working years, stock option levels, years at the company, and so on. This gives us some insight into the major contributing factors that determine whether an employee will remain with the company or leave the organization.

R Code for Random Forest Classifier Applied on HR Attrition Data:

# Random Forest
# Note: accrcy, prec_zero, recl_zero, prec_one, recl_one and the frac_* weights
# are helper functions/values defined earlier in the chapter
library(randomForest)
set.seed(43)
rf_fit = randomForest(Attrition_ind ~ ., data = train_data, mtry = 6, maxnodes = 64, classwt = c(0.3, 0.7), ntree = 5000, nodesize = 1)
tr_y_pred = predict(rf_fit, newdata = train_data, type = "response")
ts_y_pred = predict(rf_fit, newdata = test_data, type = "response")
tr_y_act = train_data$Attrition_ind; ts_y_act = test_data$Attrition_ind

tr_tble = table(tr_y_act, tr_y_pred)
print(paste("Train Confusion Matrix"))
print(tr_tble)
tr_acc = accrcy(tr_y_act, tr_y_pred)
trprec_zero = prec_zero(tr_y_act, tr_y_pred); trrecl_zero = recl_zero(tr_y_act, tr_y_pred)
trprec_one = prec_one(tr_y_act, tr_y_pred); trrecl_one = recl_one(tr_y_act, tr_y_pred)
trprec_ovll = trprec_zero * frac_trzero + trprec_one * frac_trone
trrecl_ovll = trrecl_zero * frac_trzero + trrecl_one * frac_trone

print(paste("Random Forest Train accuracy:", tr_acc))
print(paste("Random Forest - Train Classification Report"))
print(paste("Zero_Precision", trprec_zero, "Zero_Recall", trrecl_zero))
print(paste("One_Precision", trprec_one, "One_Recall", trrecl_one))
print(paste("Overall_Precision", round(trprec_ovll, 4), "Overall_Recall", round(trrecl_ovll, 4)))

ts_tble = table(ts_y_act, ts_y_pred)
print(paste("Test Confusion Matrix"))
print(ts_tble)
ts_acc = accrcy(ts_y_act, ts_y_pred)
tsprec_zero = prec_zero(ts_y_act, ts_y_pred); tsrecl_zero = recl_zero(ts_y_act, ts_y_pred)
tsprec_one = prec_one(ts_y_act, ts_y_pred); tsrecl_one = recl_one(ts_y_act, ts_y_pred)
tsprec_ovll = tsprec_zero * frac_tszero + tsprec_one * frac_tsone
tsrecl_ovll = tsrecl_zero * frac_tszero + tsrecl_one * frac_tsone

print(paste("Random Forest Test accuracy:", ts_acc))
print(paste("Random Forest - Test Classification Report"))
print(paste("Zero_Precision", tsprec_zero, "Zero_Recall", tsrecl_zero))
print(paste("One_Precision", tsprec_one, "One_Recall", tsrecl_one))
print(paste("Overall_Precision", round(tsprec_ovll, 4), "Overall_Recall", round(tsrecl_ovll, 4)))