Bagging

Bootstrap aggregation, or bagging, is the earliest ensemble technique to be widely adopted by the ML-practicing community. Bagging involves creating multiple different models from a single dataset. To get an understanding of bagging, we first need to understand an important statistical technique called bootstrapping.

Bootstrapping involves creating multiple random subsets of a dataset. The same data sample may get picked in more than one subset; this is termed bootstrapping with replacement. The advantage of this approach is that it reduces the standard error we would otherwise incur when estimating a quantity from the single original dataset alone. This technique can be better explained with an example.

Assume you have a small dataset of 1,000 samples. Based on these samples, you are asked to compute the average of the population that the sample represents. Now, a direct way of doing it is through the following formula:

$$\bar{x} = \frac{1}{1000}\sum_{i=1}^{1000} x_i$$

As this is a small sample, we may have an error in estimating the population average. This error can be reduced by adopting bootstrap sampling with replacement. In this technique, we create 10 subsets of the dataset, each containing 100 items. A data item may be randomly represented multiple times in a subset, and there is no restriction on the number of times an item can appear within a subset or across subsets. We then take the average of the samples in each subset, so we end up with 10 different averages. Using these collected averages, we estimate the average of the population with the following formula:

$$\bar{x}_{\text{boot}} = \frac{1}{10}\sum_{j=1}^{10} \bar{x}_j$$

where $\bar{x}_j$ is the average of the j-th bootstrap subset.

Now, we have a better estimate of the average, as we have used the small sample to randomly generate multiple samples that are representative of the original population.
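To make the mechanics concrete, here is a minimal sketch of the example above in base R. The simulated data (via rnorm), the seed, and the specific numbers are illustrative assumptions and not part of the original example.

```r
set.seed(42)

# Simulated stand-in for the small dataset of 1,000 samples
population_sample <- rnorm(1000, mean = 50, sd = 10)

# Direct estimate: the plain sample mean over all 1,000 observations
direct_mean <- mean(population_sample)

# Bootstrap estimate: 10 subsets of 100 items each, sampled with replacement
subset_means <- replicate(10, mean(sample(population_sample, size = 100, replace = TRUE)))

# Average of the 10 subset averages estimates the population average
bootstrap_mean <- mean(subset_means)

direct_mean
bootstrap_mean
```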

In bagging, the actual training dataset is split into multiple bags through bootstrap sampling with replacement. Assuming that we end up with n bags, applying an ML algorithm to each of these bags gives us n different models, each focused on one bag. When it comes to making predictions on new, unseen data, each of these n models makes independent predictions. The final prediction for an observation is arrived at by combining the predictions made by all n models for that observation. In the case of classification, voting is adopted and the majority class is taken as the final prediction. For regression, the average of the predictions from all models is taken as the final prediction.
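The bag-then-combine procedure can be sketched in a few lines of R. The sketch below bags rpart decision trees by hand on the built-in iris dataset and combines their class predictions by majority vote; the dataset, the seed, and the number of bags are illustrative assumptions, not the book's actual setup.

```r
library(rpart)
set.seed(42)

# Train/test split on the built-in iris dataset (illustrative choice)
train_idx <- sample(nrow(iris), 120)
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

n_bags <- 25

# One decision tree per bootstrap bag (sampling with replacement)
models <- lapply(seq_len(n_bags), function(i) {
  bag_idx <- sample(nrow(train_set), replace = TRUE)
  rpart(Species ~ ., data = train_set[bag_idx, ], method = "class")
})

# Each of the n models predicts independently on the unseen data
preds <- sapply(models, function(m) {
  as.character(predict(m, newdata = test_set, type = "class"))
})

# Classification: majority vote across the models gives the final prediction
final_pred <- apply(preds, 1, function(votes) names(which.max(table(votes))))

mean(final_pred == test_set$Species)  # accuracy of the bagged ensemble
```

For a regression problem, the only change to the combining step would be averaging the numeric predictions across the models instead of voting.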

Decision-tree-based algorithms, such as classification and regression trees (CART), are unstable learners: a small change in the training dataset can heavily alter the model that is created, and a changed model means changed predictions. Bagging is a very effective technique for handling this high sensitivity to data changes. Because we build multiple decision tree models on subsets of the dataset and arrive at a final prediction by combining the predictions of each model, the effect of changes in the data is largely nullified.

One intuitive concern with building multiple models on subsets of the data is overfitting, since each tree is grown deep without applying any pruning on the nodes and therefore fits its bag very closely. In practice this is handled by the aggregation itself: the deep, unpruned trees have low bias and high variance, and averaging (or voting over) many of them cancels out that variance, so the ensemble does not overfit the way a single deep tree would.

A downside of bagging is that it takes longer to build the models compared to building a model with a stand-alone ML algorithm. This is obvious because multiple models get built in bagging, as opposed to one single model, and building these multiple models takes time.

Now, let's implement the R code to build a bagging ensemble and compare its performance with that of KNN. We will then explore the working mechanics of the bagging methodology.

The caret library provides a framework to implement bagging with any stand-alone ML algorithm. ldaBag, plsBag, nbBag, treeBag, ctreeBag, svmBag, and nnetBag are some of the example bagging functions provided in caret. In this section, we will implement bagging with three different caret methods: treebag, svmBag, and nbBag.
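As a preview of how these methods are invoked, here is a minimal sketch of bagging through caret. The iris dataset, the cross-validation settings, and the number of bags B are illustrative assumptions rather than the actual configuration used in this section; the treebag method additionally requires the ipred package, and nbBag requires klaR.

```r
library(caret)
set.seed(42)

ctrl <- trainControl(method = "cv", number = 5)

# Bagged CART through the built-in "treebag" method
treebag_fit <- train(Species ~ ., data = iris,
                     method = "treebag",
                     trControl = ctrl)

# Bagging an arbitrary learner via caret's bag() interface;
# here nbBag wraps a naive Bayes learner
nb_bag_fit <- train(iris[, -5], iris$Species,
                    method = "bag", B = 10,
                    bagControl = bagControl(fit = nbBag$fit,
                                            predict = nbBag$pred,
                                            aggregate = nbBag$aggregate),
                    trControl = ctrl)

treebag_fit
nb_bag_fit
```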
