step for imMens computes all possible sums where the All placeholder appears in at least n − 4 of the columns. In other words, imMens's partial data cubes are at most four
dimensional. There is a very good reason for this particular choice: imMens was built to
enable extremely fast brushing as a form of interactive exploration [4]. In this mode, a user
moves his or her mouse over one particular two-dimensional projection of a dataset, and the position of the mouse implies a restriction on the values of interest: one restricted dimension for the X position and another for the Y position. Now, for every other plot in a brushed multidimensional histogram, we want to
compute the sum of events in any particular bin. Those are, potentially, two extra dimensions
for a total of four. Because most of the interactive exploration in scatterplots happens with
one active brush, the representation in imMens matches this access pattern precisely. When higher levels of aggregation are necessary (for storage reasons, imMens sometimes only computes three-dimensional tables), imMens uses the GPU's parallel power to perform this aggregation quickly. As a result, after the data tiles are loaded on the client, imMens can
sustain interaction rates of 50 frames per second for datasets with hundreds of millions to
billions of events.
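To make the dimension counting above concrete, here is a minimal sketch, in Python with NumPy, of how a dense four-dimensional count cube answers brushed queries: the brushed (X, Y) bin indexes two dimensions, and the remaining two dimensions form the linked histogram. The dimension names, bin counts, and synthetic events are illustrative assumptions, not imMens's actual data tile format.

```python
# A minimal sketch of a dense, binned four-dimensional count cube in the
# spirit of imMens's data tiles.  Dimension names, bin counts, and the
# synthetic events are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_events = 1_000_000
shape = (64, 64, 24, 7)          # e.g. (lon_bin, lat_bin, hour_bin, day_bin)
events = np.column_stack([rng.integers(0, s, n_events) for s in shape])

# One pass over the events builds the dense partial cube; queries never
# touch the raw data again.
cube = np.zeros(shape, dtype=np.int64)
np.add.at(cube, tuple(events.T), 1)

def linked_histogram(cube, brush_xy=None):
    """2-D histogram over dimensions (2, 3), optionally restricted to one
    brushed (x, y) bin of dimensions (0, 1)."""
    if brush_xy is None:
        return cube.sum(axis=(0, 1))   # unbrushed view: marginal over (0, 1)
    x0, y0 = brush_xy
    return cube[x0, y0]                # brushed view: a constant-size slice

print(linked_histogram(cube, brush_xy=(10, 20)).shape)   # (24, 7)
```

Because the brushed query reduces to reading a constant-size slice of the precomputed cube, its cost does not depend on the number of underlying events, which is what makes such high interaction rates possible.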
Another defining characteristic of imMens’s storage is that its binning scheme is dense:
bins in which no events are stored take as much space as bins in which many events are
stored. This caps the effective resolution of imMens's binning scheme and is a significant
limitation. Many datasets of interest have features at multiple scales: in time series analysis,
for example, features can happen at week-long scales or at hour-long scales.
Nanocubes, by contrast, uses a sparse, multilevel scheme for its address space, essentially
computing a nested sequence of 2^d-ary trees (binary trees for single-dimensional values,
quad-trees for spatial values, etc.) [15]. In order to avoid the obvious exponential blowup,
nanocubes reuses large portions of the allocated data structures: the algorithm detects when further refinements of a query do not change the output set and shares those portions
across the data structure. As a simplified example, consider that every query for “rich men
in Seattle” will include “former CEO of Microsoft” Bill Gates. It makes no sense, then, to
store separate results for “rich men in Seattle AND former CEOs of Microsoft” and “rich
men in Seattle”: the two result sets are identical. The construction algorithm for nanocubes
is essentially a vast generalization of this kind of rule and results in large storage gains.
Specifically, datasets that would take petabytes of storage to precompute in dense schemes
can be stored in a nanocube in a few tens of gigabytes.
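The following toy sketch illustrates the sharing rule described above using the Bill Gates example. The dictionary-based store, the predicates, and the tiny event table are made up for illustration and do not resemble the actual pointer-sharing implementation in nanocubes [15]; they only convey the idea that identical result sets are stored once.

```python
# A toy illustration of the storage-sharing rule described above: if two
# query refinements select exactly the same set of events, store their
# aggregate only once and point both at it.  (Nanocubes implements this
# as pointer sharing inside its nested trees [15]; this dictionary-based
# cache is only meant to convey the idea.)
events = [
    {"city": "Seattle", "rich": True,  "ex_microsoft_ceo": True},   # Bill Gates
    {"city": "Seattle", "rich": False, "ex_microsoft_ceo": False},
    {"city": "Medina",  "rich": True,  "ex_microsoft_ceo": False},
]

store = {}          # frozenset of event ids -> stored aggregate (here, a count)
links = {}          # query description -> key into `store`

def register(query_name, predicate):
    ids = frozenset(i for i, e in enumerate(events) if predicate(e))
    if ids not in store:                 # only pay for genuinely new result sets
        store[ids] = len(ids)
    links[query_name] = ids

register("rich in Seattle",
         lambda e: e["rich"] and e["city"] == "Seattle")
register("rich in Seattle AND ex-Microsoft CEO",
         lambda e: e["rich"] and e["city"] == "Seattle" and e["ex_microsoft_ceo"])

# Both queries resolve to the same stored aggregate: one entry, two links.
assert links["rich in Seattle"] == links["rich in Seattle AND ex-Microsoft CEO"]
print(len(store), "stored result sets for", len(links), "queries")
```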
Both nanocubes and imMens expose their application programming interface (API) via web-
based visualization. Compared to imMens, the main disadvantage of nanocubes is that it
requires a special server process to answer its queries. imMens, however, preprocesses the
dataset into data tile files, which are served via a regular web server; all of the subsequent
computation happens on the client side. For deployment purposes, this is a very favorable
setup.
In addition, both nanocubes and imMens (in their currently available implementations)
share a limitation in the format of the data stored in their multivariate bins: these tech-
niques only store event counts. Wickham’s bigvis, by contrast, can store (slightly) more sophisticated statistics of the events in its bins [24]. As a result, bigvis allows users to
easily create multivariate histogram plots where each bin stores a sample mean (or sample
variance, or other such simple statistics). This is a significant improvement in flexibility, par-
ticularly during exploratory analysis, where better statistics might help us find important
patterns in the data.
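As an illustration of what slightly richer bins can look like, the sketch below keeps a count, a sum, and a sum of squares per bin, from which a per-bin sample mean and variance can be read off. The fixed-width binning and the function names are assumptions made for this example, not bigvis's API.

```python
# A sketch of bins that store simple sufficient statistics (count, sum,
# sum of squares) rather than counts alone, in the spirit of bigvis [24].
# The fixed-width binning and the names are illustrative assumptions.
from collections import defaultdict

def bin_index(x, origin, width):
    return int((x - origin) // width)

def condense(xs, zs, origin=0.0, width=1.0):
    """Group values z by bins of x; each bin keeps (n, sum, sum_sq)."""
    bins = defaultdict(lambda: [0, 0.0, 0.0])
    for x, z in zip(xs, zs):
        b = bins[bin_index(x, origin, width)]
        b[0] += 1
        b[1] += z
        b[2] += z * z
    return bins

def bin_mean(b):
    n, s, _ = b
    return s / n

def bin_variance(b):
    n, s, ss = b
    return ss / n - (s / n) ** 2          # population variance within the bin

bins = condense([0.2, 0.7, 1.5, 1.9], [10.0, 14.0, 3.0, 5.0])
print({k: (bin_mean(v), bin_variance(v)) for k, v in bins.items()})
```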
At the same time, bigvis does not employ any precomputation strategy. It relies on
the speed of current multicore processors in desktops and workstations, and every time
the user changes the axes used for plotting, bigvis recreates the addressing scheme and
rescans the data. Unlike the one for imMens and nanocubes, the addressing scheme in
bigvis is exceedingly simple, and thus scanning through the dataset is relatively fast. In
addition, the richer statistics stored in bigvis’s bins enable it to provide statistical modeling
features completely absent in nanocubes and imMens. It is possible, for example, to do
linear regressions and LOESS curve fits over bigvis’s data structures.
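To show how modeling can run directly on such bins, here is a hedged sketch of a weighted least-squares line fit that treats each bin's center, mean response, and count as one weighted observation. It is only an illustration of the idea, not how bigvis implements its regressions or LOESS fits.

```python
# A sketch of fitting a weighted least-squares line to binned data: each
# bin contributes its center, its mean response, and its count as a
# weight.  This illustrates modeling on top of bins like the ones above;
# it is not bigvis's implementation.
import numpy as np

def fit_line_to_bins(centers, means, counts):
    X = np.column_stack([np.ones_like(centers), centers])   # intercept + slope
    W = np.asarray(counts, dtype=float)
    # Solve the weighted normal equations (X^T W X) beta = X^T W y.
    XtWX = X.T @ (W[:, None] * X)
    XtWy = X.T @ (W * np.asarray(means, dtype=float))
    return np.linalg.solve(XtWX, XtWy)

centers = np.array([0.5, 1.5, 2.5, 3.5])
means   = np.array([1.1, 2.9, 5.2, 7.1])
counts  = np.array([100, 80, 60, 10])
intercept, slope = fit_line_to_bins(centers, means, counts)
print(f"y ~= {intercept:.2f} + {slope:.2f} x")
```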
It should be noted that data cubes do allow for these more general statistics. Specifi-
cally, their data can live in any commutative monoid, and there has been some recent work
exploring connections between algebraic structures and allowable statistical and machine
learning algorithms [13]. Besides enabling nanocubes and imMens to perform more sophis-
ticated analyses, these connections provide an exciting avenue for future work. We envision
a world where exploratory statistics, in general, are assumed to have access to a data cube like those of imMens or nanocubes. It is plausible that techniques that require repeated scanning
through the dataset (EM, clustering, stochastic gradient descent) can benefit from judicious
preaggregation.
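A minimal sketch of the commutative-monoid view follows: any per-cell payload with an identity element and an associative, commutative combine operation can be preaggregated and merged in any order. The MeanMonoid class below is a made-up example for illustration, not code from [13] or from any of the systems discussed above.

```python
# A minimal sketch of the "cells live in a commutative monoid" idea: any
# payload with an identity element and an associative, commutative
# `combine` can be pre-aggregated in chunks and merged in any order.
from dataclasses import dataclass

@dataclass(frozen=True)
class MeanMonoid:
    count: int = 0          # the default instance is the identity element
    total: float = 0.0

    def combine(self, other: "MeanMonoid") -> "MeanMonoid":
        return MeanMonoid(self.count + other.count, self.total + other.total)

    @staticmethod
    def of(x: float) -> "MeanMonoid":
        return MeanMonoid(1, x)

    @property
    def mean(self) -> float:
        return self.total / self.count

# Partial aggregates for two chunks of the data can be combined later,
# in any order, and still give the exact global mean.
chunk_a = MeanMonoid()
for x in [1.0, 2.0, 3.0]:
    chunk_a = chunk_a.combine(MeanMonoid.of(x))
chunk_b = MeanMonoid()
for x in [10.0, 20.0]:
    chunk_b = chunk_b.combine(MeanMonoid.of(x))
print(chunk_a.combine(chunk_b).mean)   # 7.2, the exact mean of all five values
```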
5.3 Sampling
Besides preaggregation, the other general class of strategies for low-latency visualization is sampling. The idea again is simple: if we are interested in understanding the
behavior of a large population (in this case, our dataset), then it should be enough for us to
study a small subset of this population, intuitively for the same reason that it is enough, in
many statistical analyses, to study a sample instead of the entire population.
In the context of general interactive query engines, BlinkDB is the best known and cur-
rent state of the art [1]. BlinkDB provides a typical SQL querying interface enriched with time-bound or error-bound clauses. Its engine then builds a multidimensional stratified sample that is updated in a streaming fashion until either of the bounds is satisfied. The main
complication lies in building indices capable of efficiently updating pieces of the stratified
sample, using a generalization of Hellerstein et al.’s online aggregation algorithms [12] that takes into account the additional knowledge available from the requested bounds. Fisher
et al.’s experiment with analysts in sampleAction [8] suggests that incremental results and
confidence bounds are typically interpreted correctly. This provides evidence that the faster
cycle “query, examine results, think something new, repeat” is not happening at the expense
of a significantly worse data analysis capacity.
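The sketch below illustrates the general mechanism behind bounded-time and bounded-error answers: stream a random sample, maintain a running estimate with a normal-approximation confidence half-width, and stop when either bound is met. It is a simplification for illustration, not BlinkDB's engine, its SQL syntax, or Hellerstein et al.'s algorithm.

```python
# A sketch of the stopping rule behind bounded-error / bounded-time
# approximate answers: stream a sample, keep a running mean and a
# normal-approximation confidence half-width, and stop when either the
# requested error or the time budget is hit.
import math, random, time

def approximate_mean(draw, error_bound=None, time_bound_s=None, z=1.96,
                     min_samples=100):
    assert error_bound is not None or time_bound_s is not None
    n, s, ss = 0, 0.0, 0.0
    start = time.monotonic()
    while True:
        x = draw()
        n += 1; s += x; ss += x * x
        if n >= min_samples:
            mean = s / n
            var = max(ss / n - mean * mean, 0.0)
            half_width = z * math.sqrt(var / n)
            if error_bound is not None and half_width <= error_bound:
                return mean, half_width, n
            if time_bound_s is not None and time.monotonic() - start >= time_bound_s:
                return mean, half_width, n

# e.g. estimate the mean of a synthetic column to within +/- 0.05
est, hw, n = approximate_mean(lambda: random.gauss(10.0, 2.0), error_bound=0.05)
print(f"{est:.3f} +/- {hw:.3f} after {n} samples")
```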
Another significant advantage of sampling engines, as opposed to aggregation engines,
is that it is much simpler to design a sampling engine that (at some level) has full row-level
access to the database. This usually means that the results returned by sampling engines can be richer than those of aggregation engines. This retained flexibility means
potentially richer visualizations, but also simpler integration with existing visualization sys-
tems. If your application requirements include having (eventual, when the sampling engine
converges) access to full-resolution elements of a dataset, then a sampling engine is probably
a safer choice.
Nevertheless, sampling engines present one serious shortcoming for interactive visualiza-
tion purposes that must be kept in mind. Simply put, in the current state of the art, it is
not straightforward to generate high-confidence, low-error samples for many visualizations
of interest. For example, in a two-dimensional heatmap to be presented on a typical laptop
screen, the database engine would have to run about one million independent stratified
sampling runs (one for each pixel). It is not clear that the approaches in the literature scale
well to a large query set of this kind. If the system instead provides simple incremental sampling without replacement (as sampleAction does [8]), then the risk is that
this sampling will never actually produce acceptable error rates. This happens because the
heatmap being presented might have a colormap that portrays a wide dynamic range of
possible values. Consider, for example, a logarithmically increasing density plot: the range
of cardinalities in each pixel can easily go from tens to hundreds of thousands. The first
example from Liu et al.’s paper showcases this problem very elegantly: a naive sample
from a geographical dataset destroys much of the high-resolution structure that makes a
visualization valuable in the first place.
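A back-of-the-envelope calculation makes the dynamic-range problem concrete: under uniform sampling, the relative standard error of a pixel's estimated count grows as its true count shrinks, so the dim pixels that a logarithmic colormap is designed to reveal demand sample sizes approaching, or exceeding, the full dataset. The counts below are illustrative, not taken from any of the cited experiments.

```python
# A back-of-the-envelope check of the dynamic-range problem: under
# uniform sampling, a pixel whose true count is c out of N events has a
# relative standard error of roughly sqrt((1 - c/N) / (n * c/N)) when n
# rows are sampled, so low-count pixels need enormous samples.
def samples_needed(c, N, rel_error):
    p = c / N
    return (1 - p) / (p * rel_error ** 2)    # n such that SE(c_hat)/c <= rel_error

N = 100_000_000                              # 1e8 events in the dataset
for c in (100_000, 1_000, 10):               # bright, medium, and dim pixels
    n = samples_needed(c, N, rel_error=0.1)  # 10% relative error
    print(f"true count {c:>7}: need ~{n:,.0f} sampled rows")
# The dim pixel needs ~1e9 samples: more rows than the dataset contains.
```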
5.4 Graphical Considerations
The final class of techniques we will discuss exploits the peculiarities of the human visual
system to produce faster results. In other words, these techniques settle for approximate
results, but approximate in a perceptual sense. This is done in an attempt to save work that
would not have produced any perceptible change in the final image. This is an attractive
proposition in principle, but in practice it requires tight integration between the visualiza-
tion and the querying system, because even a change in the visualization that is typically
considered innocuous, such as a change in the color scale or the transformation of one of
the axes of the plot (from linear to logarithmic, for example), can cause the computational
trade-offs behind the perceptibility approximations to change wildly.
The techniques discussed in this (admittedly shorter) section are the closest to current
research and, as such, do not appear to be readily available in any open-source software
package. Nevertheless, they point to important problems in current systems and in the
direction of where we can expect future advances to happen.
The first system we highlight is Battle et al.’s Scalr, which introduces knowledge about
the perceptually relevant variables directly into the query optimizer of a database engine [25]. This allows Scalr to transform an exact query into an approximate one by a variety of different means, including both aggregation and sampling as discussed above. Scalr is implemented on top of SciDB [22], whose focus on array data provides attractive
operators for the manipulation of multiresolution constructs that tend to be relevant in
visualization settings such as images and heatmaps.
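As a rough illustration of the kind of rewrite such a system can apply (using plain NumPy rather than SciDB's or Scalr's actual operators), the sketch below aggregates a data-resolution array down to roughly the number of pixels the heatmap will occupy on screen before it is shipped to the client.

```python
# A sketch of the resolution-reduction idea described above: before
# answering, sum-aggregate a 2-D array into blocks so the result fits
# the on-screen heatmap.  This mimics the spirit of the optimization;
# it does not use SciDB's or Scalr's operators.
import numpy as np

def reduce_to_screen(array, screen_shape):
    """Sum-aggregate a 2-D array into blocks so the result fits the screen."""
    fy = max(1, array.shape[0] // screen_shape[0])
    fx = max(1, array.shape[1] // screen_shape[1])
    ny, nx = array.shape[0] // fy, array.shape[1] // fx
    trimmed = array[:ny * fy, :nx * fx]
    return trimmed.reshape(ny, fy, nx, fx).sum(axis=(1, 3))

full_res = np.random.poisson(2.0, size=(4096, 4096))     # data-resolution heatmap
on_screen = reduce_to_screen(full_res, (512, 512))        # what the display can show
print(full_res.shape, "->", on_screen.shape)              # (4096, 4096) -> (512, 512)
```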
The second technique we highlight is Kim et al.’s sampling with ordering guarantees [14].
This is an application of the general strategy of sampling, but done under a specific per-
ceptual constraint. The authors assume that a visualization will be consumed mostly for
the ordering relationships it depicts. In other words, they assume that the bulk of the vi-
sual information to be processed in a bar chart is the relative ordering of the bar heights.
With that assumption in mind, they design an online streaming update algorithm that
very quickly converges (about 25× faster than other sampling strategies, with a total time below 1 s on 10^10 entries in the dataset) to the perceptually correct result. Whether or
not ordering is a realistic perceptual component to be preserved can be set aside for now:
the fundamental innovation here is that knowledge about perceptual properties is driving
the algorithm. In principle, this technique should be generalizable to other settings, and it
shows significant promise for the future of big data visualization.
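To convey how a perceptual goal can drive the sampling itself, here is a deliberately naive sketch: sample every bar's group until the normal-approximation confidence intervals of the estimated heights no longer overlap, at which point the displayed ordering is stable. This stopping rule is a stand-in for illustration only and is not Kim et al.'s algorithm [14].

```python
# A deliberately naive sketch of sampling until only the *ordering* of
# the bars is trustworthy.  Growing every group's sample until the
# confidence intervals are disjoint is a stand-in for illustration; it
# is not Kim et al.'s algorithm [14].
import math
import random

def order_bars(draws, z=3.0, batch=200, max_rounds=1000):
    stats = {g: [0, 0.0, 0.0] for g in draws}      # n, sum, sum of squares
    ordered = list(draws)
    for _ in range(max_rounds):
        intervals = {}
        for g, draw in draws.items():
            st = stats[g]
            for _ in range(batch):                  # take another batch of samples
                x = draw()
                st[0] += 1; st[1] += x; st[2] += x * x
            n, s, ss = st
            mean = s / n
            hw = z * math.sqrt(max(ss / n - mean * mean, 0.0) / n)
            intervals[g] = (mean - hw, mean + hw)
        ordered = sorted(intervals, key=lambda g: intervals[g][0])
        disjoint = all(intervals[a][1] < intervals[b][0]
                       for a, b in zip(ordered, ordered[1:]))
        if disjoint:                                # ordering is now stable
            break
    return ordered

bars = {"A": lambda: random.gauss(3.0, 1.0),
        "B": lambda: random.gauss(3.5, 1.0),
        "C": lambda: random.gauss(5.0, 1.0)}
print(order_bars(bars))   # expected: ['A', 'B', 'C']
```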
5.5 Outlook and Open Issues
Let us take a step back and look at the area more broadly, now that we know the main techniques at our disposal. Binning strategies attack the problem by observing
that screen resolution is finite, and that careful precomputation can be sufficiently efficient
and rich. Sampling strategies, by contrast, try to spend data-accessing effort only where
the result uncertainty is largest, quickly converging to an approximation of the result in a
statistical sense. Notice, then, that both classes of techniques are actually trying to build
effective approximations of the exact result without requiring a full scan of the data.
We believe that the overall strategy of approximate query processing is the best hope for
interactive data analysis. Specifically, we believe that perceptual approximations are going
to become a central part of the design of algorithms for interactive visualization. We seek an
infrastructure where richer mathematical descriptions of perceptual characteristics—take,
for example, Harrison et al.’s recent generalization of Weber’s law for perception of corre-
lation [11] or Demiralp et al.’s crowdsourcing of perceptual kernels [6]—can be translated
into potential efficiency gains by expressing the perceptual landscape of the visualizations in
the computing language itself.
In essence, we are advocating here for a concerted effort at pushing the visualization
concerns into the back end (be it an analysis back end or a database back end), rather than relegating them to front ends. This will certainly bring its own set of problems, not
the least of which being that separating the back end from the front end is currently done
for very good engineering reasons: modularity, separation of concerns, etc. Still, the current
situation of state-of-the-art visualization APIs provides a good illustration of the problems
we face. We enjoy a large degree of modularity and independence: the de facto front end for
visualizations has become the web browser, by means of modern APIs such as d3 [5] and
HTML5 features such as scalable vector graphics [18]. Web browsers, by design, provide
automatic distributed operation, and data back ends vary widely in sophistication. At the
same time, these libraries fail to scale properly when datasets grow, eventually requiring the
very solutions we discussed earlier in this chapter, special purpose as they are. So although
we look forward to one day having general, modular, and scalable visualization back ends,
our best bet for the near-term future seems to be on a tightly integrated front-end–back-end
design based on perceptual considerations of the final visual artifact being displayed.
We also look forward to a deeper consideration of perceptual limitations of the outputs
of data analysis algorithms themselves. As a naive example, if the output of a clustering
algorithm will be displayed on a screen, then we do not need to know the values of cluster
centers to a resolution finer than that of the screen. The same argument holds for many
other potentially costly operations such as model fitting and supervised learning algorithms.
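For concreteness, here is a sketch of that naive example: Lloyd's iterations for k-means that stop as soon as no center moves farther than the data range covered by a single screen pixel. The pixel size and the synthetic data are assumptions made purely for illustration.

```python
# A sketch of the naive example above: stop k-means (Lloyd's iterations)
# once no center moves farther than the data range spanned by one
# on-screen pixel, since finer convergence would be invisible in the plot.
import numpy as np

def kmeans_to_pixel_precision(points, k, pixel_size, rng, max_iter=100):
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        shift = np.linalg.norm(new_centers - centers, axis=1).max()
        centers = new_centers
        if shift < pixel_size:        # further refinement would not move a pixel
            break
    return centers

rng = np.random.default_rng(1)
pts = np.concatenate([rng.normal(loc, 0.3, size=(500, 2)) for loc in ([0, 0], [5, 5])])
# Say the scatterplot maps a data range of ~10 units onto ~1000 pixels.
print(kmeans_to_pixel_precision(pts, k=2, pixel_size=10 / 1000, rng=rng))
```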
References
1. Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and
Ion Stoica. BlinkDB: Queries with bounded errors and bounded response times on very
large data. In Proceedings of the 8th ACM European Conference on Computer Systems,
pp. 29–42, Prague, Czech Republic, 2013.
2. James Ahrens, Sébastien Jourdain, Patrick O’Leary, John Patchett, David H. Rogers,
and Mark Petersen. An image-based approach to extreme scale in situ visualization and
analysis. In Proceedings of the International Conference for High Performance Comput-
ing, Networking, Storage and Analysis, pp. 424–434, Piscataway, NJ, 2014.
3. Francis J. Anscombe. Graphs in statistical analysis. The American Statistician,
27(1):17–21, 1973.
4. Richard A. Becker and William S. Cleveland. Brushing scatterplots. Technometrics,
29(2):127–142, 1987.
5. Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. D³: Data-driven documents.
IEEE Transactions on Visualization and Computer Graphics, 17(12):2301–2309, 2011.
6. Cagatay Demiralp, Michael Bernstein, and Jeffrey Heer. Learning perceptual kernels
for visualization design. IEEE Transactions on Visualization and Computer Graphics,
20(12):1933–1942, 2014.
7. Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Chapman
& Hall/CRC Press, Boca Raton, FL, 1994.
8. Danyel Fisher, Igor Popov, Steven Drucker, and M.C. Schraefel. Trust me, I’m partially
right: Incremental visualization lets analysts explore large datasets faster. In Proceedings
of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1673–1682,
ACM, New York, 2012.
9. Parke Godfrey, Jarek Gryz, and Piotr Lasek. Interactive visualization of large data sets.
Technical Report EECS-2015-03, York University, Toronto, Canada, 2015.
10. Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali
Venkatrao, Frank Pellow, and Hamid Pirahesh. Data cube: A relational aggregation
operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge
Discovery, 1(1):29–53, 1997.
11. Lane Harrison, Fumeng Yang, Steven Franconeri, and Remco Chang. Ranking visu-
alizations of correlation using Weber’s law. IEEE Transactions on Visualization and
Computer Graphics, 20(12):1943–1952, 2014.
12. Joseph M. Hellerstein, Peter J. Haas, and Helen J. Wang. Online aggregation. ACM
SIGMOD Record, 26(2):171–182, 1997.
13. Michael Izbicki. Algebraic classifiers: A generic approach to fast cross-validation, online
training, and parallel training. In Proceedings of the 30th International Conference on
Machine Learning, pp. 648–656, Atlanta, GA, 2013.
14. Albert Kim, Eric Blais, Aditya Parameswaran, Piotr Indyk, Samuel Madden, and Ronitt
Rubinfeld. Rapid sampling for visualizations with ordering guarantees. ArXiv e-prints,
December 2014.
15. Lauro Lins, James T. Klosowski, and Carlos Scheidegger. Nanocubes for real-time explo-
ration of spatiotemporal datasets. IEEE Transactions on Visualization and Computer
Graphics, 19(12):2456–2465, 2013.
16. Zhicheng Liu and Jeffrey Heer. The effects of interactive latency on exploratory
visual analysis. IEEE Transactions on Visualization and Computer Graphics, 20(12):
2122–2131, December 2014.
17. Zhicheng Liu, Biye Jiang, and Jeffrey Heer. imMens: Real-time visual querying of big
data. In Proceedings of the 15th Eurographics Conference on Visualization, Computer
Graphics Forum, vol. 32, pp. 421–430, John Wiley & Sons, Chichester, 2013.
18. Mozilla Developer Network. SVG: Scalable Vector Graphics. https://developer.mozilla.
org/en-US/docs/Web/SVG.
19. Thomas Porter and Tom Duff. Compositing digital images. SIGGRAPH Computer
Graphics, 18(3):253–259, January 1984.