3.3.1 Division Approaches
Divisions are constructed by either conditioning-variable division or replicate division.
Replicate division creates partitions using random sampling of cases without replace-
ment, and is useful for many analytic recombination methods that will be touched upon in
Section 3.3.2.
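As a rough base-R illustration (the data frame, variable names, and choice of k = 10 subsets below are hypothetical), a replicate division can be emulated by randomly assigning cases to subsets without replacement:

```r
# Sketch: replicate division of a data frame into k random subsets
# (hypothetical data; real D&R software would store the subsets in a back end)
set.seed(42)
cases <- data.frame(x = rnorm(1e5), y = rbinom(1e5, 1, 0.5))

k <- 10
# permute a balanced vector of subset labels so each case lands in exactly one subset
labels <- sample(rep_len(seq_len(k), nrow(cases)))
replicate_division <- split(cases, labels)

length(replicate_division)  # k subsets of roughly equal size
```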
Very often the data are embarrassingly divisible, meaning that there are natural ways to
break the data up based on the subject matter, leading to a partitioning based on one or
more of the variables in the data. This constitutes a conditioning-variable division. As an
example, suppose we have 25 years of 90 daily financial variables for 100 banks in the United
States. If we wish to study the behavior of individual banks and then make comparisons
across banks, we would partition the data by bank. If we are interested in how all banks
behave together over the course of each year, we could partition by year. Other aspects
such as geography and type or size of bank might also be valid candidates for a division
specification.
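In base-R terms, a conditioning-variable division amounts to splitting the data on the values of the conditioning variable or variables. The sketch below uses a small hypothetical stand-in for the bank data; the column names and sizes are illustrative only.

```r
# Sketch: conditioning-variable division of daily bank records
# (hypothetical 'bankdata' with columns 'bank', 'date', and one financial variable)
bankdata <- data.frame(
  bank  = rep(sprintf("bank%03d", 1:100), each = 500),
  date  = rep(seq(as.Date("1990-01-01"), by = "day", length.out = 500), times = 100),
  asset = rnorm(50000)
)

# one subset per bank, each holding that bank's full time series
by_bank <- split(bankdata, bankdata$bank)

# the same data divided a different way, conditioning on year instead
by_year <- split(bankdata, format(bankdata$date, "%Y"))
```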
A critical consideration when specifying a division is to obtain subsets that are small
enough to be manageable when loaded into memory, so that each subset can be handled by a
single process in an environment such as R. Sometimes, a division driven by subject matter can
lead to subsets that are too large. In this case, some creativity on the part of the analyst
must be applied to further break down the subsets.
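One simple way to break such subsets down further, sketched here with the same kind of hypothetical bank data, is to condition on an additional variable, for example bank and year jointly:

```r
# Sketch: if per-bank subsets are still too large, condition on bank and year jointly
# (hypothetical data; in practice the extra variable is chosen by the analyst)
bankdata <- data.frame(
  bank  = rep(c("bankA", "bankB"), each = 730),
  date  = rep(seq(as.Date("2000-01-01"), by = "day", length.out = 730), times = 2),
  asset = rnorm(1460)
)

by_bank_year <- split(bankdata,
                      list(bankdata$bank, format(bankdata$date, "%Y")),
                      drop = TRUE)
length(by_bank_year)  # one (smaller) subset per bank-year combination
```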
The persistence of a division is important. Division is an expensive operation, as it can
require shuffling a large amount of data around on a cluster. A given partitioning of the data
is typically reused many times while we are iterating over different analytical methods. For
example, after partitioning financial data by bank, we will probably apply many different
analytical and visual methods to that partitioning scheme until we have a model we are
happy with. We do not want to incur the cost of division each time we want to try a new
method.
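As a minimal sketch of persisting a division (assuming plain RDS files on local disk stand in for whatever storage back end is actually used), the subsets can be written out once after division and reloaded for each subsequent method:

```r
# Sketch: write each subset of a division to disk once, reuse it across methods
# (local RDS files are a stand-in for a distributed storage back end)
persist_division <- function(division, dir) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  for (key in names(division)) {
    saveRDS(division[[key]], file.path(dir, paste0(key, ".rds")))
  }
}

load_subset <- function(dir, key) {
  readRDS(file.path(dir, paste0(key, ".rds")))
}

# e.g., persist_division(by_bank, "divisions/by_bank") immediately after dividing,
# then load_subset("divisions/by_bank", "bank001") in any later analysis pass
```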
Keeping multiple persistent copies of data formatted in different ways for different anal-
ysis purposes is a common practice with small data, and for a good reason. Having the
appropriate data structure for a given analysis task is critical, and the complexity of the
data often means that these structures will be very different depending on the task (e.g.,
not always tabular). Thus, it is not generally sufficient to simply have a single table that is
indexed in different ways for different analysis tasks. The notion of creating multiple copies
of a large dataset may be alarming to a database engineer, but it should not surprise a
statistical practitioner, for whom keeping different copies of the data for different purposes
is already standard practice with small datasets.
3.3.2 Recombination Approaches
Just as there are different ways to divide the data, there are also different ways to recombine
them, as outlined in Figure 3.1. Typically for conditioning-variable division, a recombination
is a collation or aggregation of the results of an analytic method applied to each subset. These
results are often small enough to investigate on a single workstation or may serve as input for further
D&R operations.
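A minimal sketch of this apply-and-collate pattern in base R, assuming a list of per-bank data-frame subsets like the hypothetical by_bank division above: a small analytic summary is computed for each subset, and the per-subset results are row-bound into one small data frame.

```r
# Sketch: apply an analytic method to each subset, then recombine by collation
# (hypothetical per-bank subsets; the per-subset "method" is a toy summary)
by_bank <- split(
  data.frame(bank = rep(c("bankA", "bankB"), each = 100), asset = rnorm(200)),
  rep(c("bankA", "bankB"), each = 100)
)

per_subset <- lapply(names(by_bank), function(key) {
  d <- by_bank[[key]]
  data.frame(bank = key, mean_asset = mean(d$asset), n = nrow(d))
})

# collation: the recombined result is small enough to explore on one workstation
collated <- do.call(rbind, per_subset)
```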
With replicate division, the goal is usually to approximate an overall model fit to the
entire dataset. For example, consider a D&R logistic regression where the data are randomly
partitioned, we apply R’s glm() function to each subset independently, and then we average
the model coefficients. The result of a recombination may be an approximation of the exact
result had we been able to process the data as a whole, as in this example, but a potentially