3
Divide and Recombine: Approach for Detailed Analysis and Visualization of Large Complex Data
Ryan Hafen
CONTENTS
3.1 Introduction
3.2 Context: Deep Analysis of Large Complex Data
    3.2.1 Deep Analysis
    3.2.2 Large Complex Data
    3.2.3 What is Needed for Analysis of Large Complex Data?
3.3 Divide and Recombine
    3.3.1 Division Approaches
    3.3.2 Recombination Approaches
    3.3.3 Data Structures and Computation
    3.3.4 Research in D&R
3.4 Trelliscope
    3.4.1 Trellis Display/Small Multiples
    3.4.2 Scaling Trellis Display
    3.4.3 Trelliscope
3.5 Tessera: Computational Environment for D&R
    3.5.1 Front End
    3.5.2 Back Ends
        3.5.2.1 Small Scale: In-Memory Storage with R MapReduce
        3.5.2.2 Medium Scale: Local Disk Storage with Multicore R MapReduce
        3.5.2.3 Large Scale: HDFS and Hadoop MapReduce via RHIPE
        3.5.2.4 Large Scale in Memory: Spark
3.6 Discussion
References
3.1 Introduction
The amount of data being captured and stored is ever increasing, and the need to make
sense of it poses great statistical challenges in methodology, theory, and computation. In this
chapter, we present a framework for statistical analysis and visualization of large complex
data: divide and recombine (D&R).
In D&R, a large dataset is broken into pieces in a meaningful way, statistical or
visual methods are applied to each subset in an embarrassingly parallel fashion, and the
results of these computations are recombined in a manner that yields a statistically valid
result. We introduce D&R in Section 3.3 and discuss various division and recombination
schemes.
D&R provides the foundation for Trelliscope, an approach to detailed visualization of
large complex data. Trelliscope is a multipanel display system based on the concepts of
Trellis display. In Trellis display, data are broken into subsets, a visualization method is
applied to each subset, and the resulting panels are arranged in a grid, facilitating meaningful visual comparison between panels. Trelliscope extends Trellis by providing a multipanel
display system that can handle a very large number of panels and provides a paradigm for
effectively viewing the panels. Trelliscope is introduced in Section 3.4.
In Section 3.5, we present Tessera, an ongoing open source project whose goal is to provide a computational framework for D&R and Trelliscope. Tessera
provides an R interface that flexibly ties to scalable back ends such as Hadoop or Spark.
The analyst programs entirely in R, large distributed data objects (DDOs) are represented
as native R objects, and D&R and Trelliscope operations are made available through simple
R commands.
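To give a concrete flavor of this interface, the following is a rough sketch assuming the divide/recombine interface of the datadr package, Tessera's R front end (discussed further in Section 3.5); the data frame bankDF and its columns are hypothetical.

    # Sketch of driving D&R from R with the (assumed) datadr interface.
    # bankDF is a hypothetical data frame of daily records for many banks.
    library(datadr)

    bankDdf <- ddf(bankDF)                   # wrap as a distributed data frame (a DDO)
    byBank  <- divide(bankDdf, by = "bank")  # divide: one subset per bank

    # apply: a per-subset summary, computed independently on each subset
    meanAssets <- addTransform(byBank, function(x)
      data.frame(mean_assets = mean(x$assets, na.rm = TRUE)))

    # recombine: bind the per-bank results into a single data frame
    result <- recombine(meanAssets, combRbind)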
3.2 Context: Deep Analysis of Large Complex Data
There are many domains that touch data, and hence several definitions of the terms data analysis, visualization, and big data. It is useful therefore to first set the proper context for
the approaches we present in this chapter. Doing so will identify the attributes necessary
for an appropriate methodology and computational environment.
3.2.1 Deep Analysis
The term analysis can mean many things. Often, the term is used for tasks such as computing summaries and presenting them in a report, running a database query, or processing data through a set of predetermined analytical or machine learning routines. While these are useful, there is in them an inherent notion of knowing a priori what is the right thing
to be done to the data. However, data most often do not come with a model. The type
of analysis we strive to address is that which we have most often encountered when faced
with large complex datasets—analysis where we do not know what to do with the data
and we need to find the most appropriate mathematical way to represent the phenomena
generating the data. This type of analysis is very exploratory in nature. There is a lot of
trial and error involved. We iterate between hypothesizing, fitting, and validating models.
In this context, it is natural that analysis involves a great deal of visualization, which is one
of the best ways to drive this iterative process, from generating new ideas to assessing the
validity of hypothesized models, to presenting results. We call this type of analysis deep
analysis.
While almost always useful in scientific disciplines, deep exploratory analysis and model
building is not always the right approach. When the goal is pure classification or prediction
accuracy, we may not care as much about understanding the data as we do about simply
choosing the algorithm with the best performance. But even in these cases, a more open-ended approach that includes exploration and visualization can yield vast improvements. For instance, one might choose the best performer from a collection of algorithms that are all poor performers because none of them suits the data, a lack of suitability that is often best uncovered through exploration. Or consider an
analyst with domain expertise who might be able to provide insights based on explorations
that vastly improve the quality of the data or help the analyst look at the data from a new
perspective. In the words of the father of exploratory data analysis, John Tukey:

    Restricting one's self to planned analysis, failing to accompany it with exploration, loses sight of the most interesting results too frequently to be comfortable. [17]
This discussion of deep analysis is nothing new to the statistical practitioner, to whom it may feel a bit belabored. But in the domain of big data, the practice of deep analysis severely lags behind other analytical approaches and is often ignored, and hence it deserves
attention.
3.2.2 Large Complex Data
Another term that pervades the industry is big data. As with the term analysis, this also can mean a lot of things. We tend to use the term large complex data to describe data that pose
the most pressing problems for deep analysis. Large complex data can have any or all of
the following attributes: a large number of records, many variables, complex data structures
that are not readily put into a tabular form, or intricate patterns and dependencies that
require complex models and methods of analysis.
Size alone may not be an issue if the data are not complex. For example, in the case
of tabular i.i.d. data with a very large number of rows and a small number of variables,
analyzing a small sample of the data will probably suffice. It is the complexity that poses
more of a problem, regardless of size.
When data are complex in either their structure or the phenomena generating them, we need
to analyze the data in detail. Summaries or samples will generally not suffice. For instance,
take the case of analyzing computer network traffic for thousands of computers in a large
enterprise. Because of the large number of actors in a computer network, many of which
are influenced by human behavior, there are so many different kinds of activity to observe and model that downsampling or summarizing will surely result in lost information. We must address the fact that we need statistical approaches to deep analysis that can handle large volumes of complex data.
3.2.3 What is Needed for Analysis of Large Complex Data?
Now that we have provided some context, it is useful to discuss what is required to effectively analyze large complex data in practice. These requirements provide the basis for the
approaches proposed in the remainder of the chapter.
By our definition of deep analysis, many requirements are readily apparent. First, due to
the possibility of having several candidate models or hypotheses, we must have at our fingertips a library of the thousands of statistical, machine learning, and visualization methods.
Second, due to the need for efficient iteration through the specification of different models
or visualizations, we must also have access to a high-level interactive statistical computing
software environment in which simple commands can execute complex algorithms or data
operations and in which we can flexibly handle data of different structures.
There are many environments that accommodate these requirements for small datasets,
one of the most prominent being R, which is the language of choice for our implementation
and discussions in this chapter. We cannot afford to lose the expressiveness of the high-level
computing environment when dealing with large data. We would like to be able to handle
data and drive the analysis from a high-level environment while transparently harnessing
distributed storage and computing frameworks. With big data, we need a statistical methodology that will provide access to the thousands of methods available in a language such as
R without the need to reimplement them. Our proposed approach is D&R, described in
Section 3.3.
3.3 Divide and Recombine
D&R is a statistical framework for data analysis based on the popular split-apply-combine
paradigm [20]. It is suited for situations where the number of cases far exceeds the number of variables. In D&R, cases are partitioned into manageable subsets in a meaningful way
for the analysis task at hand, analytic methods (e.g., fitting a model) are applied to each
subset independently, and the results are recombined (e.g., averaging the model coefficients
from each subset) to yield a statistically valid—although not always exact—result. The key
to D&R is that by computing independently on small subsets, we can scalably leverage all
of the statistical methods already available in an environment like R.
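As a toy illustration of this pattern in plain R (serial and in-memory, with a hypothetical data frame df), the divide, apply, and recombine steps look as follows; Tessera, described in Section 3.5, performs the same steps against distributed data.

    # Toy D&R in base R: divide, apply, recombine.
    # df is a hypothetical data frame with columns 'group' and 'y'.
    subsets <- split(df, df$group)              # divide into subsets
    applied <- lapply(subsets, function(s) {    # apply a method to each subset
      data.frame(n = nrow(s), mean_y = mean(s$y))
    })
    result <- do.call(rbind, applied)           # recombine the per-subset outputs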
Figure 3.1 illustrates D&R. A large dataset is partitioned into subsets
where each subset is small enough to be manageable when loaded into memory in a single
process in an environment such as R. Subsets are persistent, and can be stored across
multiple disks and nodes in a cluster. After partitioning the data, we apply an analytic
method in parallel to each individual subset and merge the results of these computations
in the recombination step. A recombination can be an aggregation of analytic outputs to
provide a statistical model result. It can yield a new (perhaps smaller) dataset to be used
for further analysis, or it can even be a visual display, which we will discuss in Section 3.4.
In the remainder of this section, we provide the necessary background for D&R, but we
point readers to [3,6] for more details.
FIGURE 3.1
Diagram of the D&R statistical and computational framework: the data are divided into subsets, one analytic method of an analysis thread is applied to each subset, and the subset outputs are recombined, yielding a statistic recombination (a result), an analytic recombination (new data for an analysis subthread), or a visualization recombination (visual displays).
3.3.1 Division Approaches
Divisions are constructed by either conditioning-variable division or replicate division.
Replicate division creates partitions using random sampling of cases without replacement, and is useful for many analytic recombination methods that will be touched upon in
Section 3.3.2.
Very often the data are embarrassingly divisible, meaning that there are natural ways to
break the data up based on the subject matter, leading to a partitioning based on one or
more of the variables in the data. This constitutes a conditioning-variable division. As an
example, suppose we have 25 years of 90 daily financial variables for 100 banks in the United
States. If we wish to study the behavior of individual banks and then make comparisons
across banks, we would partition the data by bank. If we are interested in how all banks
behave together over the course of each year, we could partition by year. Other aspects
such as geography and type or size of bank might also be valid candidates for a division
specification.
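The two division types can be sketched in plain R as follows (bankDF and its columns are hypothetical; in Tessera the same divisions are specified against distributed data).

    # Conditioning-variable division: one subset per bank
    by_bank <- split(bankDF, bankDF$bank)

    # ...or one subset per year, to study all banks together within a year
    by_year <- split(bankDF, bankDF$year)

    # Replicate division: random partitioning of cases (without replacement)
    # into k roughly equal-sized subsets
    k <- 100
    rand_key <- sample(rep(seq_len(k), length.out = nrow(bankDF)))
    replicate_div <- split(bankDF, rand_key)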
A critical consideration when specifying a division is to obtain subsets that are small
enough to be manageable when loaded into memory, so that they can be processed in a
single process in an environment like R. Sometimes, a division driven by subject matter can
lead to subsets that are too large. In this case, some creativity on the part of the analyst
must be applied to further break down the subsets.
The persistence of a division is important. Division is an expensive operation, as it can
require shuffling a large amount of data around on a cluster. A given partitioning of the data
is typically reused many times while we are iterating over different analytical methods. For
example, after partitioning financial data by bank, we will probably apply many different
analytical and visual methods to that partitioning scheme until we have a model we are
happy with. We do not want to incur the cost of division each time we want to try a new
method.
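A minimal sketch of persisting a division with base R, so that later analyses reuse it rather than re-divide (the paths, the by_bank object from the sketch above, and the per-bank model are illustrative; Tessera's back ends handle this persistence on distributed storage):

    # Write each subset out once; the division is then reused by later analyses.
    dir.create("by_bank", showWarnings = FALSE)
    invisible(lapply(names(by_bank), function(k) {
      saveRDS(by_bank[[k]], file.path("by_bank", paste0(k, ".rds")))
    }))

    # A later analysis reads the persisted subsets instead of re-dividing
    subset_files <- list.files("by_bank", full.names = TRUE)
    fits <- lapply(subset_files, function(f) {
      s <- readRDS(f)
      lm(assets ~ deposits, data = s)    # hypothetical per-bank model
    })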
Keeping multiple persistent copies of data formatted in different ways for different analysis purposes is a common practice with small data, and for a good reason. Having the
appropriate data structure for a given analysis task is critical, and the complexity of the
data often means that these structures will be very different depending on the task (e.g.,
not always tabular). Thus, it is not generally sufficient to simply have a single table that is
indexed in different ways for different analysis tasks. The notion of possibly creating multiple copies of a large dataset may be alarming to a database engineer, but should not be
surprising to a statistical practitioner, as it is a standard practice with small datasets to
have different copies of the data for different purposes.
3.3.2 Recombination Approaches
Just as there are different ways to divide the data, there are also different ways to recombine
them, as outlined in Figure 3.1. Typically for conditioning-variable division, a recombination
is a collation or aggregation of an analytic method applied to each subset. The results often
are small enough to investigate on a single workstation or may serve as the input for further
D&R operations.
With replicate division, the goal is usually to approximate an overall model fit to the
entire dataset. For example, consider a D&R logistic regression where the data are randomly
partitioned, we apply R’s glm() method to each subset independently, and then we average
the model coefficients. The result of a recombination may be an approximation of the exact result we would have obtained had we been able to process the data as a whole, as in this example, but it is a potentially acceptable tradeoff for the ability to compute at scale.
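A minimal base R sketch of this replicate-division logistic regression follows (the data frame df, its columns, and the simple unweighted coefficient average are illustrative; more refined weighting schemes are possible):

    # Replicate division: randomly partition the cases into k subsets.
    # df is a hypothetical data frame with binary response y and predictors x1, x2.
    k <- 50
    parts <- split(df, sample(rep(seq_len(k), length.out = nrow(df))))

    # Apply: fit a logistic regression independently on each subset
    fits <- lapply(parts, function(s)
      coef(glm(y ~ x1 + x2, data = s, family = binomial())))

    # Recombine: average the coefficient vectors across subsets
    coef_est <- Reduce(`+`, fits) / length(fits)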