3
Divide and Recombine: Approach for Detailed Analysis and Visualization of Large Complex Data
Ryan Hafen
CONTENTS
3.1 Introduction
3.2 Context: Deep Analysis of Large Complex Data
    3.2.1 Deep Analysis
    3.2.2 Large Complex Data
    3.2.3 What is Needed for Analysis of Large Complex Data?
3.3 Divide and Recombine
    3.3.1 Division Approaches
    3.3.2 Recombination Approaches
    3.3.3 Data Structures and Computation
    3.3.4 Research in D&R
3.4 Trelliscope
    3.4.1 Trellis Display/Small Multiples
    3.4.2 Scaling Trellis Display
    3.4.3 Trelliscope
3.5 Tessera: Computational Environment for D&R
    3.5.1 Front End
    3.5.2 Back Ends
        3.5.2.1 Small Scale: In-Memory Storage with R MapReduce
        3.5.2.2 Medium Scale: Local Disk Storage with Multicore R MapReduce
        3.5.2.3 Large Scale: HDFS and Hadoop MapReduce via RHIPE
        3.5.2.4 Large Scale in Memory: Spark
3.6 Discussion
References
3.1 Introduction
The amount of data being captured and stored is ever increasing, and the need to make
sense of it poses great statistical challenges in methodology, theory, and computation. In this
chapter, we present a framework for statistical analysis and visualization of large complex
data: divide and recombine (D&R).
In D&R, a large dataset is broken into pieces in a meaningful way, statistical or
visual methods are applied to each subset in an embarrassingly parallel fashion, and the
results of these computations are recombined in a manner that yields a statistically valid
result. We introduce D&R in Section 3.3 and discuss various division and recombination
schemes.
D&R provides the foundation for Trelliscope, an approach to detailed visualization of
large complex data. Trelliscope is a multipanel display system based on the concepts of
Trellis display. In Trellis display, data are broken into subsets, a visualization method is
applied to each subset, and the resulting panels are arranged in a grid, facilitating meaningful visual comparison between panels. Trelliscope extends Trellis by providing a multipanel
display system that can handle a very large number of panels and provides a paradigm for
effectively viewing the panels. Trelliscope is introduced in Section 3.4.
In Section 3.5, we present Tessera, an ongoing open source project whose goal is to provide a computational framework for D&R and Trelliscope. Tessera
provides an R interface that flexibly ties to scalable back ends such as Hadoop or Spark.
The analyst programs entirely in R, large distributed data objects (DDOs) are represented
as native R objects, and D&R and Trelliscope operations are made available through simple
R commands.
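To give a concrete flavor of this interface, the following is a rough sketch assuming the divide/recombine interface of the datadr package, Tessera's R front end (discussed further in Section 3.5); the data frame bankDF and its columns are hypothetical.

    # Sketch of driving D&R from R with the (assumed) datadr interface.
    # bankDF is a hypothetical data frame of daily records for many banks.
    library(datadr)

    bankDdf <- ddf(bankDF)                   # wrap as a distributed data frame (a DDO)
    byBank  <- divide(bankDdf, by = "bank")  # divide: one subset per bank

    # apply: a per-subset summary, computed independently on each subset
    meanAssets <- addTransform(byBank, function(x)
      data.frame(mean_assets = mean(x$assets, na.rm = TRUE)))

    # recombine: bind the per-bank results into a single data frame
    result <- recombine(meanAssets, combRbind)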
3.2 Context: Deep Analysis of Large Complex Data
There are many domains that touch data, and hence several definitions of the terms data analysis, visualization, and big data. It is useful therefore to first set the proper context for
the approaches we present in this chapter. Doing so will identify the attributes necessary
for an appropriate methodology and computational environment.
3.2.1 Deep Analysis
The term analysis can mean many things. Often, the term is used for tasks such as computing summaries and presenting them in a report, running a database query, or processing data through a set of predetermined analytical or machine learning routines. While these are useful, there is in them an inherent notion of knowing a priori what is the right thing
to be done to the data. However, data most often do not come with a model. The type
of analysis we strive to address is that which we have most often encountered when faced
with large complex datasets—analysis where we do not know what to do with the data
and we need to find the most appropriate mathematical way to represent the phenomena
generating the data. This type of analysis is very exploratory in nature. There is a lot of
trial and error involved. We iterate between hypothesizing, fitting, and validating models.
In this context, it is natural that analysis involves a great deal of visualization, which is one
of the best ways to drive this iterative process, from generating new ideas to assessing the
validity of hypothesized models, to presenting results. We call this type of analysis deep
analysis.
While almost always useful in scientific disciplines, deep exploratory analysis and model
building is not always the right approach. When the goal is pure classification or prediction
accuracy, we may not care as much about understanding the data as we do about simply
choosing the algorithm with the best performance. But even in these cases, a more open-ended approach that includes exploration and visualization can yield vast improvements. For instance, one might choose the best performer from a collection of algorithms that are all poor performers because none of them suits the data, a lack of suitability that is often best uncovered through exploration. Or consider an
analyst with domain expertise who might be able to provide insights based on explorations
that vastly improve the quality of the data or help the analyst look at the data from a new
perspective. In the words of the father of exploratory data analysis, John Tukey:

    Restricting one's self to planned analysis, failing to accompany it with exploration, loses sight of the most interesting results too frequently to be comfortable. [17]
This discussion of deep analysis is nothing new to the statistical practitioner, to whom it may feel a bit belabored. But in the domain of big data, the practice of deep analysis severely lags behind other analytical approaches and is often ignored, and hence it deserves
attention.
3.2.2 Large Complex Data
Another term that pervades the industry is big data. As with the term analysis, this also can mean a lot of things. We tend to use the term large complex data to describe data that pose
the most pressing problems for deep analysis. Large complex data can have any or all of
the following attributes: a large number of records, many variables, complex data structures
that are not readily put into a tabular form, or intricate patterns and dependencies that
require complex models and methods of analysis.
Size alone may not be an issue if the data are not complex. For example, in the case
of tabular i.i.d. data with a very large number of rows and a small number of variables,
analyzing a small sample of the data will probably suffice. It is the complexity that poses
more of a problem, regardless of size.
When data are complex in either their structure or the phenomena generating them, we need
to analyze the data in detail. Summaries or samples will generally not suffice. For instance,
take the case of analyzing computer network traffic for thousands of computers in a large
enterprise. Because of the large number of actors in a computer network, many of which
are influenced by human behavior, there are so many different kinds of activity to observe and model that downsampling or summarizing will surely result in lost information. We must address the fact that we need statistical approaches to deep analysis that can handle large volumes of complex data.
3.2.3 What is Needed for Analysis of Large Complex Data?
Now that we have provided some context, it is useful to discuss what is required to effectively analyze large complex data in practice. These requirements provide the basis for the
approaches proposed in the remainder of the chapter.
By our definition of deep analysis, many requirements are readily apparent. First, due to
the possibility of having several candidate models or hypotheses, we must have at our fingertips a library of the thousands of statistical, machine learning, and visualization methods.
Second, due to the need for efficient iteration through the specification of different models
or visualizations, we must also have access to a high-level interactive statistical computing
software environment in which simple commands can execute complex algorithms or data
operations and in which we can flexibly handle data of different structures.
There are many environments that accommodate these requirements for small datasets,
one of the most prominent being R, which is the language of choice for our implementation
and discussions in this chapter. We cannot afford to lose the expressiveness of the high-level
computing environment when dealing with large data. We would like to be able to handle
data and drive the analysis from a high-level environment while transparently harnessing
distributed storage and computing frameworks. With big data, we need a statistical methodology that will provide access to the thousands of methods available in a language such as
R without the need to reimplement them. Our proposed approach is D&R, described in
Section 3.3.
3.3 Divide and Recombine
D&R is a statistical framework for data analysis based on the popular split-apply-combine
paradigm [20]. It is suited for situations where the number of cases far exceeds the number of variables. In D&R, cases are partitioned into manageable subsets in a meaningful way
for the analysis task at hand, analytic methods (e.g., fitting a model) are applied to each
subset independently, and the results are recombined (e.g., averaging the model coefficients
from each subset) to yield a statistically valid—although not always exact—result. The key
to D&R is that by computing independently on small subsets, we can scalably leverage all
of the statistical methods already available in an environment like R.
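As a toy illustration of this pattern in plain R (serial and in-memory, with a hypothetical data frame df), the divide, apply, and recombine steps look as follows; Tessera, described in Section 3.5, performs the same steps against distributed data.

    # Toy D&R in base R: divide, apply, recombine.
    # df is a hypothetical data frame with columns 'group' and 'y'.
    subsets <- split(df, df$group)              # divide into subsets
    applied <- lapply(subsets, function(s) {    # apply a method to each subset
      data.frame(n = nrow(s), mean_y = mean(s$y))
    })
    result <- do.call(rbind, applied)           # recombine the per-subset outputs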
Figure 3.1 illustrates D&R. A large dataset is partitioned into subsets
where each subset is small enough to be manageable when loaded into memory in a single
process in an environment such as R. Subsets are persistent, and can be stored across
multiple disks and nodes in a cluster. After partitioning the data, we apply an analytic
method in parallel to each individual subset and merge the results of these computations
in the recombination step. A recombination can be an aggregation of analytic outputs to
provide a statistical model result. It can yield a new (perhaps smaller) dataset to be used
for further analysis, or it can even be a visual display, which we will discuss in Section 3.4.
In the remainder of this section, we provide the necessary background for D&R, but we
point readers to [3,6] for more details.
FIGURE 3.1
Diagram of the D&R statistical and computational framework: the data are divided into subsets, one analytic method of an analysis thread is applied to each subset, and the subset outputs are recombined, yielding a statistic recombination (a result), an analytic recombination (new data for an analysis subthread), or a visualization recombination (visual displays).
3.3.1 Division Approaches
Divisions are constructed by either conditioning-variable division or replicate division.
Replicate division creates partitions using random sampling of cases without replacement, and is useful for many analytic recombination methods that will be touched upon in
Section 3.3.2.
Very often the data are embarrassingly divisible, meaning that there are natural ways to
break the data up based on the subject matter, leading to a partitioning based on one or
more of the variables in the data. This constitutes a conditioning-variable division. As an
example, suppose we have 25 years of 90 daily financial variables for 100 banks in the United
States. If we wish to study the behavior of individual banks and then make comparisons
across banks, we would partition the data by bank. If we are interested in how all banks
behave together over the course of each year, we could partition by year. Other aspects
such as geography and type or size of bank might also be valid candidates for a division
specification.
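The two division types can be sketched in plain R as follows (bankDF and its columns are hypothetical; in Tessera the same divisions are specified against distributed data).

    # Conditioning-variable division: one subset per bank
    by_bank <- split(bankDF, bankDF$bank)

    # ...or one subset per year, to study all banks together within a year
    by_year <- split(bankDF, bankDF$year)

    # Replicate division: random partitioning of cases (without replacement)
    # into k roughly equal-sized subsets
    k <- 100
    rand_key <- sample(rep(seq_len(k), length.out = nrow(bankDF)))
    replicate_div <- split(bankDF, rand_key)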
A critical consideration when specifying a division is to obtain subsets that are small
enough to be manageable when loaded into memory, so that they can be processed in a
single process in an environment like R. Sometimes, a division driven by subject matter can
lead to subsets that are too large. In this case, some creativity on the part of the analyst
must be applied to further break down the subsets.
The persistence of a division is important. Division is an expensive operation, as it can
require shuffling a large amount of data around on a cluster. A given partitioning of the data
is typically reused many times while we are iterating over different analytical methods. For
example, after partitioning financial data by bank, we will probably apply many different
analytical and visual methods to that partitioning scheme until we have a model we are
happy with. We do not want to incur the cost of division each time we want to try a new
method.
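A minimal sketch of persisting a division with base R, so that later analyses reuse it rather than re-divide (the paths, the by_bank object from the sketch above, and the per-bank model are illustrative; Tessera's back ends handle this persistence on distributed storage):

    # Write each subset out once; the division is then reused by later analyses.
    dir.create("by_bank", showWarnings = FALSE)
    invisible(lapply(names(by_bank), function(k) {
      saveRDS(by_bank[[k]], file.path("by_bank", paste0(k, ".rds")))
    }))

    # A later analysis reads the persisted subsets instead of re-dividing
    subset_files <- list.files("by_bank", full.names = TRUE)
    fits <- lapply(subset_files, function(f) {
      s <- readRDS(f)
      lm(assets ~ deposits, data = s)    # hypothetical per-bank model
    })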
Keeping multiple persistent copies of data formatted in different ways for different analysis purposes is a common practice with small data, and for a good reason. Having the
appropriate data structure for a given analysis task is critical, and the complexity of the
data often means that these structures will be very different depending on the task (e.g.,
not always tabular). Thus, it is not generally sufficient to simply have a single table that is
indexed in different ways for different analysis tasks. The notion of possibly creating multiple copies of a large dataset may be alarming to a database engineer, but should not be
surprising to a statistical practitioner, as it is a standard practice with small datasets to
have different copies of the data for different purposes.
3.3.2 Recombination Approaches
Just as there are different ways to divide the data, there are also different ways to recombine
them, as outlined in Figure 3.1. Typically for conditioning-variable division, a recombination
is a collation or aggregation of an analytic method applied to each subset. The results often
are small enough to investigate on a single workstation or may serve as the input for further
D&R operations.
With replicate division, the goal is usually to approximate an overall model fit to the
entire dataset. For example, consider a D&R logistic regression where the data are randomly
partitioned, we apply R’s glm() method to each subset independently, and then we average
the model coefficients. The result of a recombination may be an approximation of the exact result we would have obtained had we been able to process the data as a whole, as in this example, but it is a potentially acceptable tradeoff for the ability to compute at scale.
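A minimal base R sketch of this replicate-division logistic regression follows (the data frame df, its columns, and the simple unweighted coefficient average are illustrative; more refined weighting schemes are possible):

    # Replicate division: randomly partition the cases into k subsets.
    # df is a hypothetical data frame with binary response y and predictors x1, x2.
    k <- 50
    parts <- split(df, sample(rep(seq_len(k), length.out = nrow(df))))

    # Apply: fit a logistic regression independently on each subset
    fits <- lapply(parts, function(s)
      coef(glm(y ~ x1 + x2, data = s, family = binomial())))

    # Recombine: average the coefficient vectors across subsets
    coef_est <- Reduce(`+`, fits) / length(fits)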