computational flexibility, in that the training task can be scaled to any size by changing the
number, or size, of the subsets. This allows the user to effectively flatten the training process
into a task that is compatible with available computational resources. If parallelization is
used effectively, all subset-specific fits can be trained at the same time, drastically increasing
the speed of the training process. Because the subsets are typically much smaller than the original training set, this approach also reduces the memory requirements of each node in the cluster.
The computational flexibility and speed of the Subsemble algorithm offer a unique solution
to scaling ensemble learning to big data problems.
In the subsemble package, the J subsets can be created by the software at random, or
the subsets can be explicitly specified by the user. Given L base learning algorithms and J
subsets, a total of L × J subset-specific fits will be trained and included in the Subsemble
(by default). This construction allows each base learning algorithm to see each subset of
the training data, so in this sense, there is a similarity to ensembles trained on the full
data. To distinguish the variations on this theme, this type of ensemble construction is
referred to as a cross-product Subsemble. The subsemble package also implements what are
called divisor Subsembles, a structure that can be created if the number of unique base
learning algorithms is a divisor of the number of subsets. In this case, there are only J total
subset-specific fits that make up the ensemble, and each subset-specific fit sees only approximately n/J observations from the full training set (assuming that the subsets are of equal size). For example, if L = 2 and J = 10, then each of the two base learning algorithms would be used to train five subset-specific fits and would see only 50% of the original training observations in total. This type of Subsemble allows for quicker training, but will typically result in
less accurate models. Therefore, the cross-product method is the default Subsemble type in
the software.
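To make this concrete, the following R sketch fits both Subsemble variants with the subsemble package. It assumes the package's documented subsemble() interface, SuperLearner-style wrapper names, and the multiType option of learnControl; the simulated data are purely illustrative.

library(subsemble)

# Simulated training data (illustrative only)
set.seed(1)
n <- 1000
x <- data.frame(matrix(rnorm(n * 10), ncol = 10))
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))

# Cross-product Subsemble: L = 2 learners x J = 4 subsets = 8 subset-specific fits
fit_cp <- subsemble(x = x, y = y, family = binomial(),
                    learner = c("SL.glm", "SL.randomForest"),
                    metalearner = "SL.glm",
                    subsets = 4,
                    learnControl = list(multiType = "crossprod"))

# Divisor Subsemble: L = 2 divides J = 4, so only J = 4 fits are trained,
# and each learner sees roughly half of the training observations in total
fit_div <- subsemble(x = x, y = y, family = binomial(),
                     learner = c("SL.glm", "SL.randomForest"),
                     metalearner = "SL.glm",
                     subsets = 4,
                     learnControl = list(multiType = "divisor"))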
An algorithm called the Supervised Regression Tree Subsemble, or SRT Subsemble [35], is also on the development road map for the subsemble package. SRT Subsemble extends the regular Subsemble algorithm by providing a means of learning the optimal number and constituency of the subsets. This method incurs an additional computational cost, but can yield better model performance for the Subsemble.
19.3.3 H2O Ensemble
The H2O Ensemble software contains an implementation of the Super Learner ensemble
algorithm that is built on the distributed, open source, Java-based machine learning platform
for big data, H2O. H2O Ensemble is currently implemented as a stand-alone R package called
h2oEnsemble that makes use of the h2o package, the R interface to the H2O platform.
The h2o package supports a handful of powerful supervised machine learning algorithms, all of which can be used as base learners for the ensemble. These include a high-performance method for deep learning, which allows the user to create ensembles of deep neural nets or to combine the power of deep neural nets with other algorithms, such as Random Forest or Gradient Boosting Machines (GBMs) [12].
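As a sketch of this workflow, the R code below builds an H2O Ensemble whose base learners include deep learning, Random Forest, a GBM, and a GLM. The wrapper and function names follow the h2oEnsemble documentation, but the file path and response column are hypothetical placeholders.

library(h2oEnsemble)          # also loads the h2o package
h2o.init(nthreads = -1)       # start a local H2O cluster on all available cores

# Import training data as an H2OFrame (hypothetical path and column names)
train <- h2o.importFile("train.csv")
y <- "response"
x <- setdiff(names(train), y)
train[, y] <- as.factor(train[, y])   # binary classification

# Base learner library: deep neural nets combined with other algorithms
learner <- c("h2o.deeplearning.wrapper", "h2o.randomForest.wrapper",
             "h2o.gbm.wrapper", "h2o.glm.wrapper")

fit <- h2o.ensemble(x = x, y = y, training_frame = train,
                    family = "binomial",
                    learner = learner,
                    metalearner = "h2o.glm.wrapper",
                    cvControl = list(V = 5))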
Because the H2O machine learning platform was designed with big data in mind, each
of the H2O base learning algorithms is scalable to very large training sets and enables
parallelism across multiple nodes and cores. The H2O platform comprises a distributed in-memory parallel computing architecture and can seamlessly use datasets stored in the Hadoop Distributed File System (HDFS), Amazon's S3 cloud storage, and NoSQL and SQL databases, in addition to CSV files stored locally or in distributed filesystems. The
H2O Ensemble project aims to match the scalability of the H2O algorithms, so although
the ensemble uses R as its main user interface, most of the computations are performed in
Java via H2O in a distributed, scalable fashion.
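To illustrate the data-source flexibility mentioned above, the same h2o.importFile call can read from local disk, HDFS, or S3; the URIs below are hypothetical placeholders.

# One import function, multiple storage backends (placeholder paths)
local_frame <- h2o.importFile("/data/train.csv")
hdfs_frame  <- h2o.importFile("hdfs://namenode:8020/data/train.csv")
s3_frame    <- h2o.importFile("s3n://my-bucket/data/train.csv")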