same underlying algorithm on each bootstrap sample, and obtaining the final prediction
by averaging the results across the fits. Bagging can lead to significant model performance
gains when used with weak or unstable algorithms such as classification or regression trees.
The bootstrap samples are drawn with replacement, so each bootstrap sample of size n
contains approximately 63.2% of the unique training observations, with the remaining
draws being duplicates. Therefore, in bagging, each model is fit using only a subset of the
original training observations. The drawback of taking a simple
average of the output from the subset fits is that the predictions from each of the fits
are weighted equally, regardless of the individual quality of each fit. The performance of a
bagged fit can be much better than that of a nonbagged algorithm, but a simple
average is not necessarily the optimal combination of a set of base learners.
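As a concrete illustration, the following minimal sketch implements bagging with regression trees; the use of scikit-learn and the function names are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_fit(X, y, n_estimators=100, random_state=0):
    """Fit a bagged ensemble of regression trees."""
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    fits = []
    for _ in range(n_estimators):
        # Bootstrap sample of size n drawn with replacement; roughly
        # 63.2% of the unique observations appear in each sample.
        idx = rng.integers(0, n, size=n)
        fits.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return fits

def bagged_predict(fits, X):
    # Simple average: every fit is weighted equally, regardless of quality.
    return np.mean([f.predict(X) for f in fits], axis=0)
```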
An average mixture (AVGM) procedure for fitting the parameter of a parametric model
has been studied by Zhang et al. [16]. AVGM partitions the full available dataset into disjoint
subsets, estimates the parameter within each subset, and finally combines the estimates by
simple averaging. Under certain conditions on the population risk, the AVGM can achieve
better efficiency than training a parametric model on the full data. A subsampled average
mixture (SAVGM) procedure, an extension of AVGM, is proposed in [16] and is shown to
provide substantial performance benefits over AVGM. As with AVGM, SAVGM partitions
the full data into subsets and estimates the parameter within each subset. However,
SAVGM also takes a single subsample from each partition, reestimates the parameter on the
subsample, and combines the two estimates into a so-called subsample-corrected estimate.
The final parameter estimate is obtained by simple averaging of the subsample-corrected
estimates from each partition. Both procedures have theoretical backing; however, the
results rely on the use of parametric models.
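To make the two procedures concrete, the following minimal sketch uses an ordinary least-squares estimator as the illustrative parametric model; the subsample-corrected combination (theta1 - r * theta2)/(1 - r) follows the correction proposed in [16], with r the subsampling ratio.

```python
import numpy as np

def fit_theta(X, y):
    # Illustrative parametric estimator: ordinary least squares.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def savgm(X, y, m=10, r=0.5, seed=0):
    """AVGM/SAVGM sketch: partition, estimate per partition, average."""
    rng = np.random.default_rng(seed)
    # Partition the full data into m disjoint subsets.
    parts = np.array_split(rng.permutation(len(y)), m)
    corrected = []
    for idx in parts:
        theta1 = fit_theta(X[idx], y[idx])  # full-partition estimate
        sub = rng.choice(idx, size=int(r * len(idx)), replace=False)
        theta2 = fit_theta(X[sub], y[sub])  # subsample re-estimate
        # Subsample-corrected estimate, per the correction in [16].
        corrected.append((theta1 - r * theta2) / (1 - r))
    # Final estimate: simple average of the corrected estimates.
    # (Dropping the correction and averaging theta1 alone gives AVGM.)
    return np.mean(corrected, axis=0)
```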
An ensemble method for large-scale classification that trains algorithms on subsets of
observations and combines the resulting classifiers linearly was implemented and discussed
in the case study of [8] at Twitter, Inc.
While not a subset method, boosting, formulated in [4], is an example of an ensemble
method that takes the quality of each fit in the ensemble into account. Boosting iterates
the process of training a weak learner on the full dataset, then reweighting observations, with
higher weights given to poorly classified observations from the previous iteration. Because
all observations are iteratively reweighted, every observation is needed at each iteration,
so boosting is not a subset method. Boosting is also a sequential algorithm,
and hence cannot be parallelized.
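As an illustration of the reweighting loop, the following is a minimal sketch in the style of the AdaBoost algorithm, assuming labels coded as -1/+1 and scikit-learn decision stumps; the details are illustrative rather than the exact procedure of [4].

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=50):
    """Minimal AdaBoost-style loop; y must be coded as -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)  # uniform initial weights
    fits, alphas = [], []
    for _ in range(n_rounds):
        # Each round trains on the FULL dataset with updated weights,
        # so every observation is needed at every iteration.
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        # Reweight: poorly classified observations get larger weight.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        fits.append(stump)
        alphas.append(alpha)
    return fits, alphas

def adaboost_predict(fits, alphas, X):
    return np.sign(sum(a * f.predict(X) for a, f in zip(alphas, fits)))
```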
Another nonsubset ensemble method that accounts for the quality of each
fit is the Super Learner algorithm of [14], which generalizes and establishes the theory
for stacking procedures developed by Wolpert [15] and extended by Breiman [2]. Super
Learner learns the optimal weighted combination of a library of candidate base learning
algorithms using cross-validation and a second-level metalearning algorithm.
Super Learner generalizes stacking by allowing for general loss functions and hence a broader
range of estimator combinations.
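The following minimal sketch shows the cross-validated combination step for regression; the choice of non-negative least squares as the metalearner is one common option, assumed here for illustration rather than prescribed by [14].

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.base import clone
from sklearn.model_selection import KFold

def super_learner_weights(X, y, base_learners, n_folds=5):
    """Sketch of the Super Learner combination step for regression."""
    # Level-one matrix: one column of cross-validated (out-of-fold)
    # predictions per candidate base learner.
    Z = np.zeros((len(y), len(base_learners)))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train, test in kf.split(X):
        for j, learner in enumerate(base_learners):
            fit = clone(learner).fit(X[train], y[train])
            Z[test, j] = fit.predict(X[test])
    # Metalearner: non-negative least squares, normalized so the
    # learned weights form a convex combination.
    w, _ = nnls(Z, y)
    return w / w.sum() if w.sum() > 0 else w
```

To predict on new data, each base learner is refit on the full training set and the learned weights are applied to their predictions.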
The Subsemble algorithm, proposed in [12], combines results from fitting
the same underlying algorithm on different subsets of observations. Subsemble is a form of
supervised stacking [2,15] and is similar in nature to the Super Learner algorithm, with
the distinction that base learner fits are trained on subsets of the data instead of the full
training set. Subsemble can also accommodate multiple base learning algorithms, with each
algorithm being fit on each subset. The approach has many benefits and differs from other
ensemble methods in a variety of ways.
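Before detailing these differences, the following minimal sketch illustrates Subsemble's cross-validated construction of the level-one data, under illustrative assumptions (random assignment of observations to subsets and folds, and a single scikit-learn base learner); it is a sketch of the scheme described in [12], not a complete implementation.

```python
import numpy as np
from sklearn.base import clone

def subsemble_level_one(X, y, learner, J=3, V=5, seed=0):
    """Sketch of Subsemble's cross-validated level-one construction.

    Column j holds predictions from the subset-j fit trained with
    fold v held out, evaluated on the fold-v observations.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    subset = rng.integers(0, J, size=n)  # disjoint subset assignment
    fold = rng.integers(0, V, size=n)    # cross-validation folds
    Z = np.zeros((n, J))
    for j in range(J):
        for v in range(V):
            # Train on subset j, excluding fold v; predict fold v.
            train = (subset == j) & (fold != v)
            fit = clone(learner).fit(X[train], y[train])
            Z[fold == v, j] = fit.predict(X[fold == v])
    return Z
```

A metalearner, such as the non-negative least squares fit sketched for Super Learner above, is then trained on (Z, y) to learn the weighted combination of subset-specific fits.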
First, any type of underlying algorithm, parametric or nonparametric, can be used.
Instead of simply averaging subset-specific fits, Subsemble differentiates fit quality across
the subsets and learns a weighted combination of the subset-specific fits. To evaluate fit