Bagging classifier

As we have already discussed, decision trees suffer from high variance: if we split the training data into two random halves and fit a separate decision tree on each half, the rules obtained would be very different, whereas low-variance, high-bias models such as linear or logistic regression would produce similar results across both samples. Bagging stands for bootstrap aggregation (to be precise, repeated sampling with replacement followed by aggregation of the results), which is a general-purpose methodology for reducing the variance of models; in this case, the models are decision trees.

Aggregation reduces variance: if we have n independent observations x1, x2, ..., xn, each with variance σ², then the variance of their mean x̄ is σ²/n, which shows that averaging a set of observations reduces variance. Here, we reduce variance by drawing many samples from the training data (also known as bootstrapping), building a separate decision tree on each sample, and averaging the predictions for regression (or taking the mode of the predictions for classification) in order to obtain a single aggregated model that keeps the low bias of the individual trees while having much lower variance:
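As a quick sanity check of the σ²/n argument, the following minimal simulation (not part of the HR example; all names here are purely illustrative) compares the empirical variance of the mean of n observations against the theoretical value:

>>> # Variance of a single observation vs. variance of the mean of n observations
>>> import numpy as np
>>> np.random.seed(42)
>>> sigma, n = 2.0, 25
>>> # 100,000 repeated samples of size n, averaged along each row
>>> sample_means = np.random.normal(0, sigma, size=(100000, n)).mean(axis=1)
>>> print("Variance of one observation     :", sigma**2)
>>> print("Variance of the mean (empirical):", round(sample_means.var(), 3))
>>> print("Variance of the mean (theory)   :", sigma**2 / n)

The empirical variance of the mean comes out very close to σ²/n, which is exactly the reduction that bagging exploits by averaging many bootstrapped trees.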

In the bagging procedure, rows are sampled while all the columns/variables are kept (in a random forest, by contrast, both rows and columns are sampled, which we will cover in the next section), and an individual tree is fit on each sample. In the following diagram, two colors (pink and blue) represent two samples; for each sample, a subset of rows is drawn, but all the columns (variables) are selected every time. One issue caused by selecting all the columns is that most of the trees tell the same story: the most important variable appears in the first split of nearly every tree, so the trees are not de-correlated and we may not get the full benefit of variance reduction. This issue is avoided in random forest (covered in the next section of this chapter), in which both rows and columns are sampled:
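To make the row-sampling idea concrete, here is a minimal hand-rolled sketch of bagging (illustrative only; it assumes x_train, y_train, and x_test are NumPy arrays with 0/1 labels, and the helper name manual_bagging_predict is hypothetical):

>>> # Manual bagging sketch: each tree sees a bootstrap sample of the rows,
>>> # but every column is kept; the final prediction is a majority vote
>>> import numpy as np
>>> from sklearn.tree import DecisionTreeClassifier
>>> def manual_bagging_predict(x_train, y_train, x_test, n_trees=100, seed=42):
...     rng = np.random.RandomState(seed)
...     votes = np.zeros((n_trees, len(x_test)))
...     for i in range(n_trees):
...         # sample row indices with replacement; all columns are retained
...         idx = rng.choice(len(x_train), size=len(x_train), replace=True)
...         tree = DecisionTreeClassifier(max_depth=5, random_state=seed)
...         tree.fit(x_train[idx], y_train[idx])
...         votes[i] = tree.predict(x_test)
...     # majority vote across trees (the mode for a 0/1 classification problem)
...     return (votes.mean(axis=0) >= 0.5).astype(int)

In practice we do not roll this by hand; scikit-learn's BaggingClassifier, used below on the HR data, performs the same bootstrap-and-vote procedure far more efficiently.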

In the following code, the same HR attrition data has been used to fit the bagging classifier, so that the results can be compared apples to apples with the decision tree:

# Bagging Classifier
>>> import pandas as pd
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.metrics import accuracy_score, classification_report

The base classifier used here is a decision tree with the same parameter settings that we used in the decision tree example:

>>> dt_fit = DecisionTreeClassifier(criterion="gini", max_depth=5, min_samples_split=2,
...              min_samples_leaf=1, random_state=42, class_weight={0: 0.3, 1: 0.7})

The parameters used in bagging are n_estimators, the number of individual decision trees, set here to 5,000, and max_samples and max_features, set to 0.67 and 1.0 respectively, which means each tree is fit on roughly two-thirds of the observations but on all of the features. For further details, please refer to the scikit-learn manual at http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html:

>>> bag_fit = BaggingClassifier(base_estimator=dt_fit, n_estimators=5000,
...              max_samples=0.67, max_features=1.0, bootstrap=True,
...              bootstrap_features=False, n_jobs=-1, random_state=42)
 
>>> bag_fit.fit(x_train, y_train) 
 
>>> print ("
Bagging - Train Confusion Matrix

",pd.crosstab(y_train, bag_fit.predict(x_train),rownames = ["Actuall"],colnames = ["Predicted"]))       
>>> print ("
Bagging- Train accuracy",round(accuracy_score(y_train, bag_fit.predict(x_train)),3))  
>>> print ("
Bagging  - Train Classification Report
",classification_report(y_train, bag_fit.predict(x_train))) 
 
>>> print ("

Bagging - Test Confusion Matrix

",pd.crosstab(y_test, bag_fit.predict(x_test),rownames = ["Actuall"],colnames = ["Predicted"]))       
>>> print ("
Bagging - Test accuracy",round(accuracy_score(y_test, bag_fit.predict(x_test)),3)) 
>>> print ("
Bagging - Test Classification Report
",classification_report(y_test, bag_fit.predict(x_test)))

Analyzing the results from bagging, the test accuracy obtained is 87.3%, whereas for the decision tree it was 84.6%. Comparing the number of actual attrited employees identified, bagging finds 13, whereas the decision tree found 12, and the number of 0s classified as 1 drops significantly to 8, compared with 19 for the decision tree. Overall, bagging improves performance over a single tree:

R Code for Bagging Classifier Applied on HR Attrition Data:

# Bagging Classifier - using the randomForest package, but with all variables selected
library(randomForest)
set.seed(43)
rf_fit = randomForest(Attrition_ind ~ ., data = train_data, mtry = 30, maxnodes = 64,
                      classwt = c(0.3, 0.7), ntree = 5000, nodesize = 1)
tr_y_pred = predict(rf_fit, newdata = train_data, type = "response")
ts_y_pred = predict(rf_fit, newdata = test_data, type = "response")
tr_y_act = train_data$Attrition_ind; ts_y_act = test_data$Attrition_ind

tr_tble = table(tr_y_act, tr_y_pred)
print(paste("Train Confusion Matrix"))
print(tr_tble)
tr_acc = accrcy(tr_y_act, tr_y_pred)
trprec_zero = prec_zero(tr_y_act, tr_y_pred); trrecl_zero = recl_zero(tr_y_act, tr_y_pred)
trprec_one = prec_one(tr_y_act, tr_y_pred)
trrecl_one = recl_one(tr_y_act, tr_y_pred)
trprec_ovll = trprec_zero * frac_trzero + trprec_one * frac_trone
trrecl_ovll = trrecl_zero * frac_trzero + trrecl_one * frac_trone
print(paste("Bagging Train accuracy:", tr_acc))
print(paste("Bagging - Train Classification Report"))
print(paste("Zero_Precision", trprec_zero, "Zero_Recall", trrecl_zero))
print(paste("One_Precision", trprec_one, "One_Recall", trrecl_one))
print(paste("Overall_Precision", round(trprec_ovll, 4), "Overall_Recall",
            round(trrecl_ovll, 4)))

ts_tble = table(ts_y_act, ts_y_pred)
print(paste("Test Confusion Matrix"))
print(ts_tble)
ts_acc = accrcy(ts_y_act, ts_y_pred)
tsprec_zero = prec_zero(ts_y_act, ts_y_pred); tsrecl_zero = recl_zero(ts_y_act, ts_y_pred)
tsprec_one = prec_one(ts_y_act, ts_y_pred)
tsrecl_one = recl_one(ts_y_act, ts_y_pred)
tsprec_ovll = tsprec_zero * frac_tszero + tsprec_one * frac_tsone
tsrecl_ovll = tsrecl_zero * frac_tszero + tsrecl_one * frac_tsone
print(paste("Bagging Test accuracy:", ts_acc))
print(paste("Bagging - Test Classification Report"))
print(paste("Zero_Precision", tsprec_zero, "Zero_Recall", tsrecl_zero))
print(paste("One_Precision", tsprec_one, "One_Recall", tsrecl_one))
print(paste("Overall_Precision", round(tsprec_ovll, 4), "Overall_Recall",
            round(tsrecl_ovll, 4)))